




Information Extraction

The goal of information extraction (IE) is to transform text into a structured format (e.g. database records) according to its content.
- E.g. heterogeneous researchers' homepages are transformed into database records containing name, position, institution, research interests, projects, etc.
- E.g. terrorism news articles are transformed into records including kind of [...], date, instigator, [...]al damages, etc.

What is Information Extraction?

- Information retrieval: you have an information need, but what you get back isn't information but documents, which you hope have the information.
- Information extraction is one approach to going further for a special case:
  - There's some relation you're interested in.
  - Your query is for elements of that relation.
  - A limited form of natural language [...]

Information Extraction of Seminar Announcements

Information Extraction as an Annotation [...]

Extracting key data from competitors' [...]

Extracting Corporate Data

- Data extracted [...] source web page. Color highlights indicate type of [...] (e.g., red = [...]).
- E.g., information need: Who is the CEO of MarketSoft? Andrew [...]
- Commercial [...]

Product [...] (textual)

Example: digital [...]
- Image Capture Device: 1.68 million pixel 1/2-inch CCD
- Image Capture [...]: Total Pixels Approx. 3.34 [...]
- Imaging sensor: Total Pixels: Approx. 2.11 million; [...] 1,688 (H) x 1,248 (V)
- Total Pixels: Approx. 3,340,000 (2,140 [H] x 1,560 [V]); Effective Pixels: Approx. 3,240,000 (2,088 [H] x 1,550 [V]); Recording Pixels: Approx. 3,145,000 (2,048 [H] x 1,536 [V])
- These all came off the same [...]. And this is a very [...]

Always evaluate performance on independent, manually-annotated test data not used during system [...].
- Measure for each [...]:
  - Total number of correct extractions in the solution template: N
  - Total number of slot/value pairs extracted by the system: E
  - Number of extracted slot/value pairs that are correct (i.e. in the solution template): C
- Compute average value of metrics adapted from [...]:
  - Recall = C / N
  - Precision = C / E
  - F-Measure = harmonic mean of recall and precision

Entity [...]
Relation extraction (Relational [...])

Three generations of IE [...]

- Hand-Built Systems – Knowledge Engineering
  - Rules written by [...]
  - Require experts who understand both the systems and [...]
  - [...]tive guess-test-tweak-repeat [...]
- [...]
  - Rules discovered automatically using predefined templates, using methods like ILP
  - Require huge, labeled corpora (effort is just [...])
- Machine Learning (Sequence) Models [1997–]
  - One decodes a statistical model that classifies the words [of] the text, using HMMs, random fields or statistical [...]
  - Learning usually supervised; may be partially [...]

Named Entity [...]

Clinton decided to send special envoy Mickey Kantor to the special Asian economic meeting in Singapore this week. Ms. Xuemei Peng, trade minister from [...], and Mr. Hideto Suzuki from Japan's Ministry of Trade and Industry will also attend. Singapore, who is hosting the meeting, will probably be represented by its foreign and economic ministers. The Australian representative, Mr. Langford, will not attend, though no reason has been given. The parties hope to reach a framework for currency [...]

Extracted Named Entities [...]:
- Mickey Kantor
- Ms. Xuemei Peng
- Mr. Hideto Suzuki
- Mr. Langford
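The recall, precision and F-measure defined earlier in this section can be sketched in a few lines. This is a minimal illustration only: the function name `evaluate` is hypothetical, the solution set reuses the four extracted names listed above, and the "system output" is invented for the example (it misses Mr. Langford and wrongly tags Singapore).

```python
def evaluate(solution, extracted):
    """Score extracted slot/value pairs against a solution template.

    N = pairs in the solution, E = pairs the system extracted,
    C = extracted pairs that are also in the solution.
    """
    solution, extracted = set(solution), set(extracted)
    n, e = len(solution), len(extracted)
    c = len(solution & extracted)
    recall = c / n if n else 0.0
    precision = c / e if e else 0.0
    total = recall + precision
    f_measure = 2 * recall * precision / total if total else 0.0
    return recall, precision, f_measure


# Solution template: the four names listed in the example above.
solution = {
    ("person", "Mickey Kantor"),
    ("person", "Ms. Xuemei Peng"),
    ("person", "Mr. Hideto Suzuki"),
    ("person", "Mr. Langford"),
}

# Hypothetical system output (an assumption made for illustration only).
extracted = {
    ("person", "Mickey Kantor"),
    ("person", "Ms. Xuemei Peng"),
    ("person", "Mr. Hideto Suzuki"),
    ("person", "Singapore"),
}

recall, precision, f_measure = evaluate(solution, extracted)
print(recall, precision, f_measure)
```

With three correct pairs out of four in the solution and four extracted, recall and precision are both 3/4, and so is their harmonic mean.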
Finite State Acceptor [...]

- Exact-match links, e.g. "[...]" matching only [...]
- e.g. "?" matches "100" or "[...]" or [...]
- e.g. CAP matches any capitalized word
- e.g. if HON-LIST := (Mr, Ms, ...), it would match any of those words in the list

Finite State Acceptor

- CAP matches any capitalized word
- HON-LIST: honorifics (Mr, Ms, ...)
- A Finite State Transducer (FST), e.g. "YES <firstname [...]"

Finite State [...]
- HON := [...]; FirstName := [...]; LastName := [...]
- CAP matches any capitalized word
- HON-LIST := honorifics (Mr, Ms, ...)
- If <right-context> <? "not" ("attend" | [...])> Then entity.role = [...]
- If <left-context> <("meet" | "meeting") ("in" | [...])> Then entity.role = [...]

"John Snell reporting for Wall Street. Today Flexicon Inc. announced a tender offer for [...]. for $30 per share, representing a 30% premium over Friday's closing price. Flexicon expects to acquire Supplyhouse by Q4 2001 without problems from federal regulators"

Pattern:
[acquirer <[...]>] [acquiree <l-acquiree-FSM> <[...]>] [share-price <money-FSM> <r-stock-FSM>] [date <l-event-date-FSM> <date-FSM>]

Result:
[acquirer "Flexicon Inc."] [acquiree "Supplyhouse [...]"] [share-price "30 USD"] [date "Q4 [...]"]

If we think of things from the database point of view:
- We want to be able to [...] database-style [...]
- But we have data in some horrid textual [...] management system that doesn't allow such [...]
- We need to "wrap" the data in a component [that] understands database-style [...]
- Hence the term "wrappers"

Many people have "wrapped" many web [...]
- Commonly something like a Perl script
- Often easy to do as a one-off
- But hand-coding wrappers in Perl isn't very [...]
- Sites are numerous, and their surface structure [...] rapidly (around 10% failures each [...])

Amazon Book [...]

HTML source (fragments): <b> [...] <a href=" [...] <img [...] width=90 height=140 [...] <font face=verdana,arial,helvetica size=-[...] <span [...] <b>List Price:</b> <span [...] <b>Our Price:</b> <font [...] <b>You Save:</b> <font color=#990000><b>$2.99</b> [...] <p>

Extracted Book [...]
- Title: The Age of Spiritual Machines: When Computers Exceed Human In[...]
- Author: Ray Kurzweil
- List-Price: $14.95
- Price: $11.96
- [...]

Template [...]
- Slots in template typically filled by a substring from the [...]
- Some slots may have a fixed set of pre-specified possible fillers that may not occur in the text itself.
  - Terrorist act: threatened, attempted, [...]
  - Job type: clerical, service, custodial, [...]
  - [...] type: SEC [...]
- Some slots may allow multiple [...]
  - Programming [...]
- [...]s may allow multiple [...] templates [...]

Wrapper toolkits
- Wrapper toolkits: specialized programming environments for writing & debugging wrappers by hand
- [...] World Wide Web Wrapper [...]
- Java Extraction & Dissemination [of] Information
- Junglee [...]

Task: Wrapper [...]
- Learning wrappers is wrapper [...]
- Sometimes, the relations are [...]
  - Web pages generated by a [...]
- Wrapper induction is usually [...] regular relations which can be expressed by the structure of the [...]: the item in bold in the 3rd column of the table is the [...]
- Wrapper induction techniques can also [...]: if there is a page about a research project X and there is a link near the word 'people' to a page that is about a [...] Y, then Y is a member of the project [e.g., Tom [...]]

Simple Extraction [...]
- Specify an item to extract for a slot using a regular expression pattern.
  - Price pattern: [...]
- May require preceding (pre-filler) pattern to identify proper context.
  - Amazon list [...]: pre-filler pattern: "<b>List Price:</b> <span [...]"
- May require succeeding (post-filler) pattern to identify the end of the filler.
  - Amazon list [...]: pre-filler pattern: "<b>List Price:</b> <span [...]"

[...] Template [...]
- Extract slots in order, starting the search for the filler of the n+1 slot where the filler for the nth slot ended. Assumes slots always in a fixed order.
  - [...] List [...] …
- Make patterns specific enough to identify each filler always starting from the beginning of the [...].

Wrapper [...]: highly regular [...]; relatively simple extraction patterns; [...] learning algorithm
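The pre-filler / filler / post-filler idea, together with in-order slot extraction, might look roughly like the sketch below. The HTML snippet and all three patterns per slot are assumptions written for illustration (the patterns in the original slides are truncated and are not reproduced here); only the two prices come from the extracted book record above. `extract_slot` and `extract_template` are hypothetical helper names.

```python
import re


def extract_slot(text, pre, filler, post="", start=0):
    """Find `filler` between a pre-filler and a post-filler pattern.

    Returns (filler_text, end_position_of_filler), or (None, start)
    when nothing matches at or after `start`.
    """
    pattern = re.compile(pre + "(" + filler + ")" + post)
    match = pattern.search(text, start)
    if match is None:
        return None, start
    return match.group(1), match.end(1)


def extract_template(text, slot_patterns):
    """Fill slots in a fixed order; each search resumes where the
    previous filler ended."""
    template, pos = {}, 0
    for slot, (pre, filler, post) in slot_patterns:
        template[slot], pos = extract_slot(text, pre, filler, post, pos)
    return template


# Assumed page fragment, loosely modelled on the List Price / Our Price markup.
html = (
    '<b>List Price:</b> <span class="listprice">$14.95</span><br>'
    '<b>Our Price:</b> <font color="#990000">$11.96</font><br>'
)

PRICE = r"\$\d+(?:\.\d{2})?"  # assumed filler pattern for a dollar amount
slot_patterns = [
    ("list_price", (r"<b>List Price:</b>\s*<span[^>]*>", PRICE, r"</span>")),
    ("price", (r"<b>Our Price:</b>\s*<font[^>]*>", PRICE, r"</font>")),
]

book = extract_template(html, slot_patterns)
print(book)
```

Because the search for each slot starts where the previous filler ended, this sketch inherits the fixed-order assumption noted above; passing `start=0` for every slot would instead require each pattern to be specific enough on its own.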
Writing accurate patterns for each slot for each [...] (e.g. each web site) requires laborious software engineering.
Alternative is to use machine learning:
- Build a training set [of documents] paired with human-produced filled extraction templates.
- Learn extraction patterns for each slot using an appropriate machine learning algorithm.

Wrapper [...]: delimiter-based [...]
- Use <B>, </B>, <I>, </I> for [...]

Learning LR [...]: labeled [...]
[...] <B>Egypt</B> [...]
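l1, r1, …, lK, [...]
Example: Find 4 [...]: <B>, </B>, <I>, [...] as l1, r1, l2, r2

LR: Finding [...]
<B>Congo</B> [...] <B>Egypt</B> [...] <B>Belize</B> [...] <B>Spain</B> [...]

LR: Finding l1, l2 and [...]
<B>Congo</B> [...] <B>Egypt</B> [...] <B>Belize</B> [...] <B>Spain</B> [...]

A problem with LR [...]: distracting text in head and [...]
<B>Congo</B> [...] <B>Egypt</B> [...] <B>Belize</B> [...]

One (of many) solutions: ignore page's head and [...]
<B>Congo</B> [...] <B>Egypt</B> [...] <B>Belize</B> [...]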
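To make the delimiter-based idea concrete, here is a hedged sketch of executing an LR wrapper and a head/tail-aware variant. It is not the slides' own pseudocode: the function names, the sample page, the numeric codes inside `<I>…</I>`, and the choice of `<P>` and `<HR>` as head and tail delimiters are all assumptions made for illustration; only the `<B>country</B>` markup and the l1, r1, l2, r2 delimiters come from the text above.

```python
def run_lr(page, delimiters):
    """Execute an LR wrapper: for each attribute k, scan forward to the
    left delimiter l_k and take everything up to the right delimiter r_k."""
    tuples, pos = [], 0
    while True:
        row = []
        for left, right in delimiters:
            start = page.find(left, pos)
            if start == -1:
                return tuples  # no further (complete) tuple on the page
            start += len(left)
            end = page.find(right, start)
            if end == -1:
                return tuples
            row.append(page[start:end])
            pos = end + len(right)
        tuples.append(tuple(row))


def run_hlrt(page, head, tail, delimiters):
    """Same as run_lr, but ignore the text before `head` and after `tail`."""
    h = page.find(head)
    body_start = h + len(head) if h != -1 else 0
    t = page.find(tail, body_start)
    body_end = t if t != -1 else len(page)
    return run_lr(page[body_start:body_end], delimiters)


# Assumed sample page: a bold heading in the head and bold text in the tail
# act as distractors for the <B>...</B> delimiters.
page = (
    "<HTML><TITLE>Some Country Codes</TITLE><BODY>"
    "<B>Some Country Codes</B><P>"
    "<B>Congo</B> <I>242</I><BR>"
    "<B>Egypt</B> <I>20</I><BR>"
    "<B>Belize</B> <I>501</I><BR>"
    "<B>Spain</B> <I>34</I><BR>"
    "<HR><B>End</B></BODY></HTML>"
)
lr_delims = [("<B>", "</B>"), ("<I>", "</I>")]

lr_result = run_lr(page, lr_delims)
hlrt_result = run_hlrt(page, "<P>", "<HR>", lr_delims)
print(lr_result)
print(hlrt_result)
```

On this page the plain LR wrapper pairs the bold heading with the first code and never sees "Congo", which is the distracting-text problem described above; restricting the scan to the region between the head and tail delimiters recovers all four rows.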
LR and HLRT wrappers are extremely [...]
- Though applicable to many tabular [...]

Recent wrapper induction research has explored more expressive wrapper classes:
- Disjunctive [...]
- Multiple attribute [...]
- Missing [...]
- Multiple-valued [...]
- Hierarchically nested [...]
- Wrapper verification and [...]
- Boosted wrapper [...]

Wrapper induction is only ideal for rigidly-structured machine-generated HTML… …or is [...]
Can we use simple patterns to extract from natural language [...]?
- … Name: Dr. Jeffrey D. Hermes …
- … Who: Professor Manfred Paul …
- ... will be given by Dr. R. J. Pangborn …
- … Ms. Scott will be speaking …
- … Karen Shriver, Dept. of …
- … Maria Klawe, University of [...]

Natural Language Processing-based Information Extraction

- If extracting from automatically generated web pages, simple regex patterns usually work.
- If extracting from more natural, unstructured, human-written text, some NLP may help.
  - Part-of-speech (POS) tagging: mark each word as a noun, verb, preposition, [...]
  - Syntactic parsing: identify phrases: NP, VP, [...]
  - Semantic word categories (e.g. from [...])
- Extraction patterns can use POS or phrase [...]
  - Crime [...]: Prefiller: [POS: V, Hypernym: [...]] [...]

MUC: the NLP genesis of [...]

- DARPA funded significant efforts in IE in the early to mid 1990's.
- Message Understanding Conference (MUC) was an annual [...] competition where results were presented.
- Focused on extracting information from news [...]:
  - Terrorist [...]
  - Industrial joint [...]
  - [...] management [...]
- Information extraction is of particular [...] to the intelligence [...]

Example of IE from FASTUS

Bridgestone Sports Co. said Friday it had set up a joint venture [...] with a local concern and a Japanese trading house to produce golf clubs to be [...]d to Japan. The joint venture, Bridgestone Sports [...] Co., capitalized at 20 million new [...] dollars, will start production in January 1990 with production of 20,000 iron and "metal wood" clubs a month.

Extracted template (fragments):
- [...]: "iron and 'metal wood' [...]"
- Start [...]: [...]
- Relationship: TIE-[...]
- [...]: "a local concern", "a Japanese trading [...]"
- Joint [...]: "Bridgestone Sports [...]"
Based on finite state automata (FSA) [...]

1. Complex [...]
   Phrases shown in the figure: "a Japanese trading [...]", "had set [...]", "production of 20,000 iron and metal wood [...]", "[set [...]"
2. Basic [...]: simple noun groups, verb groups and [...]
3. Complex [...]: complex noun groups and verb [...]
4. [...]: patterns for events of interest to the [...]; basic templates are to be [...]
5. Merging [...]: templates from different parts of the texts [are] merged if they provide information about [the] same entity or [...]

Rule-based Extraction [...]

- Determining which person holds what office in what [...]
  - [person], [office] of [...]: "Vuk Draskovic, leader of the Serbian Renewal [...]"
  - [org] (named, appointed, etc.) [person] P [...]: "NATO appointed Wesley Clark as Commander in [...]"
- Determining where an organization is [...]
  - [org] in [...]: "NATO headquarters in [...]"
  - [org] [loc] (division, branch, headquarters, [...]): "KFOR Kosovo [...]"

Learning for [...]

- Writing accurate patterns for each slot for each [...] (e.g. each web site) requires laborious software engineering.
- Alternative is to use machine learning:
  - Build a training set of documents paired with human-produced filled extraction templates.
  - Learn extraction patterns for each slot using an appropriate machine learning algorithm.

Statistical generative [...]

- Sequence models are statistical models of whole token sequences that effectively label subsequences.
- Best known case is generative Hidden Markov Models (HMMs).
- Well-understood underlying statistical models make it easy to use a wide range of tools from statistical decision [...]
- Portable, broad coverage, robust, good [...]
- Range of features and patterns usable may be [...]
- Not necessarily as good for complex multi-slot [...]

Name Extraction via [...]
Applying HMMs to [...]

- [...] generated by a stochastic process modelled by an HMM
  - Token [...]
  - State: "[...]nation" for a given [...]
    - 'Background' state emits tokens like 'the', 'said', …
    - 'Money' state emits tokens like 'million', 'euro', …
    - 'Organization' state emits tokens like 'university', '[...]', …
- Extraction: via the Viterbi algorithm, a dynamic programming technique for efficiently computing the most likely sequence of [states] that generated [...]

HMM [...]

- HMM = probabilistic FSA
- HMM =
  - states s1, s2, … (special start state s1, special end state sn)
  - token alphabet a1, a2, …
  - state transition probs P(si | sj)
  - token emission probs P(ai | sj)
- Widely used in many language processing tasks, e.g., speech recognition [Lee, 1989], POS tagging [Kupiec, 1992], topic detection [Yamron et al, ...]

HMM for research papers: transitions [Seymore et al., 99]

HMM for research papers: emissions [Seymore et al., 99]

- Sample emissions shown in the figure: "ICML", "to appear", "university of california", "dartmouth college", [...]
- Trained on 2 million words of BibTeX data from the [...]
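The Viterbi decoding step described above can be sketched as follows. This is an illustrative toy, not the model from [Seymore et al., 99]: the three state names and the example tokens follow the text, but every probability value is an invented assumption, and a start distribution replaces the special start/end states for brevity. The sketch also assumes every token is in the alphabet.

```python
import math


def viterbi(tokens, states, start_p, trans_p, emit_p):
    """Return the most likely state sequence for `tokens` (log-space DP).

    best[t][s] holds (log-probability of the best path ending in state s
    at position t, previous state on that path).
    """
    best = [{
        s: (math.log(start_p[s]) + math.log(emit_p[s][tokens[0]]), None)
        for s in states
    }]
    for tok in tokens[1:]:
        column = {}
        for s in states:
            prev, score = max(
                ((p, best[-1][p][0] + math.log(trans_p[p][s])) for p in states),
                key=lambda pair: pair[1],
            )
            column[s] = (score + math.log(emit_p[s][tok]), prev)
        best.append(column)

    # Backtrace from the best final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return path[::-1]


# All numbers below are made-up illustrative values.
states = ["background", "money", "organization"]
start_p = {"background": 0.8, "money": 0.1, "organization": 0.1}
trans_p = {
    "background": {"background": 0.8, "money": 0.1, "organization": 0.1},
    "money": {"background": 0.5, "money": 0.4, "organization": 0.1},
    "organization": {"background": 0.5, "money": 0.1, "organization": 0.4},
}
emit_p = {
    "background": {"the": 0.5, "said": 0.4, "million": 0.03, "euro": 0.03, "university": 0.04},
    "money": {"the": 0.05, "said": 0.05, "million": 0.45, "euro": 0.4, "university": 0.05},
    "organization": {"the": 0.1, "said": 0.05, "million": 0.025, "euro": 0.025, "university": 0.8},
}

tokens = ["the", "university", "said", "million", "euro"]
labels = viterbi(tokens, states, start_p, trans_p, emit_p)
print(list(zip(tokens, labels)))
```

Extraction then amounts to reading off the tokens labelled with a non-background state; working in log space avoids numerical underflow on longer token sequences.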
Learning [...]

Good news: if training data tokens are tagged with their generating states, then simple frequency ratios are a maximum-likelihood estimate of transition/emission [probabilities]. Easy. (Use smoothing to avoid zero prob[...]
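As a rough sketch of the "simple frequency ratios" idea, the snippet below counts transitions and emissions in state-tagged sequences and applies additive (add-alpha) smoothing. Additive smoothing is one assumed choice among several, since the slide does not say which method it has in mind; the function name `estimate_hmm` and the tiny tagged corpus are invented for illustration.

```python
from collections import Counter


def estimate_hmm(tagged_sequences, states, alphabet, alpha=1.0):
    """Estimate transition and emission probabilities from sequences of
    (token, state) pairs, using add-alpha smoothing so that unseen
    events keep a non-zero probability."""
    trans_counts, emit_counts = Counter(), Counter()
    from_counts, state_counts = Counter(), Counter()
    for seq in tagged_sequences:
        for i, (token, state) in enumerate(seq):
            emit_counts[(state, token)] += 1
            state_counts[state] += 1
            if i + 1 < len(seq):
                trans_counts[(state, seq[i + 1][1])] += 1
                from_counts[state] += 1

    trans_p = {
        s: {t: (trans_counts[(s, t)] + alpha) / (from_counts[s] + alpha * len(states))
            for t in states}
        for s in states
    }
    emit_p = {
        s: {a: (emit_counts[(s, a)] + alpha) / (state_counts[s] + alpha * len(alphabet))
            for a in alphabet}
        for s in states
    }
    return trans_p, emit_p


# Tiny invented training set of state-tagged tokens.
training = [
    [("the", "background"), ("university", "organization"), ("said", "background")],
    [("million", "money"), ("euro", "money")],
]
states = ["background", "money", "organization"]
alphabet = ["the", "said", "million", "euro", "university"]

trans_p, emit_p = estimate_hmm(training, states, alphabet)
print(trans_p["background"])
print(emit_p["money"])
```

With `alpha=0` the estimates reduce to plain frequency ratios, but any state with no observed outgoing transitions would then cause a division by zero, which is one practical reason to smooth.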