Mining Techniques (Text Mining) 10: Information Extraction

Information Extraction

The goal of information extraction (IE) is to transform text into a structured format (e.g. database records) according to its content.
E.g. heterogeneous researchers' home pages are transformed into database records containing name, position, institution, research interests, projects, etc.
E.g. terrorism news articles are transformed into records including kind of [...], date, instigator, [...] damages, etc.

What is Information Extraction?

Information retrieval: you have an information need, but what you get back isn't information but documents, which you hope have the information.
Information extraction: it is one approach to going further for a special case:
  There's some relation you're interested in.
  Your query is for elements of that relation.
  A limited form of natural language [...]

[Figure: Information Extraction of Seminar Announcements]

Information Extraction As An Annotation [...]

Extracting key data from competitors' [...]

Extracting Corporate Data
Data extracted. Source web page. Color highlights indicate type of [...] (e.g., red = [...]).
E.g., information need: Who is the CEO of MarketSoft? Andrew [...]

Commercial [...]

Product [...] textual [...]
Example: digital [...]
  Image Capture Device: 1.68 million pixel 1/2-inch CCD
  Image Capture [...]: Total Pixels Approx. 3.34 [...]
  Image [...]: Imaging sensor, Total Pixels: Approx. 2.11 million [...] 1,688 (H) x 1,248 (V)
  Total Pixels: Approx. 3,340,000 (2,140 [H] x 1,560 [V])
  Effective Pixels: Approx. 3,240,000 (2,088 [H] x 1,550 [V])
  Recording Pixels: Approx. 3,145,000 (2,048 [H] x 1,536 [V])
These all came off the same [...]
And this is a very [...]

Always evaluate performance on independent, manually-annotated test data not used during system [...]
Measure for each [...]:
  Total number of correct extractions in the solution template: N
  Total number of slot/value pairs extracted by the system: E
  Number of extracted slot/value pairs that are correct (i.e. in the solution template): C
Compute average value of metrics adapted from [...]:
  Recall = C / N
  Precision = C / E
  F-Measure = harmonic mean of recall and precision = 2 * Precision * Recall / (Precision + Recall)

Entity [...]
Relation extraction (Relational [...])

Three generations of IE systems

Hand-Built Systems: Knowledge Engineering
  Rules written by hand.
  Require experts who understand both the systems and [...]
  Iterative guess-test-tweak-repeat [...]

[...]
  Rules discovered automatically using predefined templates, using methods like ILP.
  Require huge, labeled corpora (effort is just [...]

Machine Learning (Sequence) Models [1997– ...
  One decodes a statistical model that classifies the words of the text, using HMMs, random fields or statistical [...]
  Learning usually supervised; may be partially [...]

Named Entity [...]

Clinton decided to send special envoy Mickey Kantor to the special Asian economic meeting in Singapore this week. Ms. Xuemei Peng, trade minister from [...], and Mr. Hideto Suzuki from Japan's Ministry of Trade and Industry will also attend. Singapore, who is hosting the meeting, will probably be represented by its foreign and economic ministers. The Australian representative, Mr. Langford, will not attend, though no reason has been given. The parties hope to reach a framework for currency [...]

Extracted Named Entities [...]:
  Mickey Kantor
  Ms. Xuemei Peng
  Mr. Hideto Suzuki
  Mr. Langford
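
The evaluation measures described earlier in this section (recall, precision and F-measure over slot/value pairs) can be sketched in a few lines. This is a minimal illustration, not the scoring program used by any particular evaluation: the function name `evaluate` and the representation of a template as a set of `(slot, value)` tuples are assumptions made for the example.

```python
def evaluate(solution, extracted):
    """Score extracted slot/value pairs against a manually filled solution template.

    Both arguments are iterables of (slot, value) tuples (an assumed representation).
    Returns (precision, recall, f_measure).
    """
    solution, extracted = set(solution), set(extracted)
    n = len(solution)              # correct extractions in the solution template
    e = len(extracted)             # slot/value pairs extracted by the system
    c = len(solution & extracted)  # extracted pairs that are correct
    recall = c / n if n else 0.0
    precision = c / e if e else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F-measure: harmonic mean of precision and recall
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


if __name__ == "__main__":
    gold = {("title", "A"), ("author", "B"), ("price", "$1"), ("date", "X")}
    system = {("title", "A"), ("author", "B"), ("price", "$2")}
    print(evaluate(gold, system))
```

In practice these values would be averaged over all test documents, as the text says.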

Finite State Acceptor (FSA)

Exact-match links, e.g. "[...]" matching only [...]
[...] e.g. "?" matches "100" or "[...]" or [...]
e.g. CAP matches any capitalized [...]
e.g. if HON-LIST := (Mr, Ms, [...]), it would match any of those words in the [...]

Finite State Acceptor

CAP matches any capitalized [...]
HON-LIST: honorifics (Mr, Ms, [...])

A Finite State Transducer (FST)
e.g. "YES <firstname [...]

Finite State [...]
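
A finite state acceptor over tokens, with links labelled by tests such as CAP and HON-LIST, might be sketched as follows. This is an illustrative assumption rather than the machine drawn on the slide: the three-state layout, the contents of `HON_LIST` beyond Mr and Ms, and the helper names are all hypothetical.

```python
HON_LIST = {"Mr", "Ms", "Dr"}  # "Dr" is an assumption; the slide's list is truncated


def is_hon(token):
    """HON-LIST link: matches any honorific in the list (trailing period ignored)."""
    return token.rstrip(".") in HON_LIST


def is_cap(token):
    """CAP link: matches any token starting with a capital letter."""
    return token[:1].isupper()


# state -> list of (link test, next state); state 2 is the only accepting state
TRANSITIONS = {
    0: [(is_hon, 1)],  # honorific
    1: [(is_cap, 2)],  # first capitalized name token
    2: [(is_cap, 2)],  # further capitalized name tokens
}
ACCEPT = {2}


def run(tokens, start):
    """Run the acceptor from tokens[start]; return the end of the longest accepted span."""
    state, last_accept, i = 0, None, start
    while i < len(tokens):
        for test, nxt in TRANSITIONS.get(state, []):
            if test(tokens[i]):
                state = nxt
                break
        else:
            break  # no link matches the current token
        i += 1
        if state in ACCEPT:
            last_accept = i
    return last_accept


def find_names(tokens):
    """Scan a token list and collect every longest span the acceptor accepts."""
    names, i = [], 0
    while i < len(tokens):
        end = run(tokens, i)
        if end is None:
            i += 1
        else:
            names.append(" ".join(tokens[i:end]))
            i = end
    return names


if __name__ == "__main__":
    text = "Ms. Xuemei Peng and Mr. Hideto Suzuki will also attend"
    print(find_names(text.split()))
```

Requiring an honorific keeps sentence-initial capitalized words from being accepted, at the cost of missing names that appear without a title.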

HON := [...]   FirstName := [...]   LastName := [...]
CAP matches any capitalized [...]
HON-LIST := honorifics (Mr, Ms, [...])

If <right-context> < ? "not" ("attend" | [...]
Then entity.role = [...]
If <left-context> < ("meet" | "meeting") ("in" | [...]
Then entity.role = [...]

"John Snell reporting for Wall Street. Today Flexicon Inc. announced a tender offer for [...] for $30 per share, representing a 30% premium over Friday's closing price. Flexicon expects to acquire Supplyhouse by Q4 2001 without problems from federal regulators"

Pattern:
  [acquirer < [...]
  [acquiree <l-acquiree-FSM> < [...]
  [share-price <money-FSM> <r-stock-FSM>]
  [date <l-event-date-FSM> <date-FSM>]
Result:
  [acquirer "Flexicon Inc."]
  [acquiree "Supplyhouse [...]
  [share-price "30 USD"]
  [date "Q4 [...]

If we think of things from the database point of view:
  We want to be able to [...] database-style [...]
  But we have data in some horrid textual [...] management system that doesn't allow such [...]
  We need to "wrap" the data in a component [that] understands database-style [...]
Hence the term [...]

Many people have "wrapped" many web [...]
  Commonly something like a Perl [...]
  Often easy to do as a one-[...]
But hand-coding wrappers in Perl isn't very [...]
  Sites are numerous, and their surface structure [...] rapidly (around 10% failures each [...]

Amazon Book [...] (HTML source fragment)
  <b> [...] aref=" [...] <img [...] "width=90 height=140 [...] <font face=verdana,arial,helvetica size=-[...] <span [...] <span [...] <b>List Price:</b> <span [...] <b>Our Price: [...] <font [...] <b>You Save:</b> <font color=#990000><b>$2.99</b> [...] <p>

Extracted Book [...]
  Title: The Age of Spiritual Machines: When Computers Exceed Human Intelligence
  Author: Ray Kurzweil
  List-Price: $14.95
  Price: $11.96
  [...]

Template [...]
  Slots in a template are typically filled by a substring from the document.
  Some slots may have a fixed set of pre-specified possible fillers that may not occur in the text itself.
    Terrorist act: threatened, attempted, [...]
    Job type: clerical, service, custodial, [...]
    [...] type: SEC [...]
  Some slots may allow multiple [...]
    Programming [...]
  Some [...] may allow multiple [...] templates [...]

Wrapper tool-kits
  Wrapper toolkits: specialized programming environments for writing & debugging wrappers by hand.
  [...] World Wide Web Wrapper [...]
  Java Extraction & Dissemination [...] Information [...]
  Junglee [...]

Task: Wrapper [...]
  Learning wrappers is wrapper [...]
  Sometimes, the relations are [...]
    Web pages generated by a [...]
  Wrapper induction is usually [...] regular relations which can be expressed by the structure of the document:
    the item in bold in the 3rd column of the table is the [...]
  Wrapper induction techniques can also [...]
    If there is a page about a research project X and there is a link near the word 'people' to a page that is about a [...] Y, then Y is a member of the project [...] [e.g., Tom [...]

Simple Extraction [...]
  Specify an item to extract for a slot using a regular expression pattern.
    Price pattern: [...]
  May require preceding (pre-filler) pattern to identify proper context.
    Amazon list [...]:
      Pre-filler pattern: "<b>List Price:</b> <span [...]
  May require succeeding (post-filler) pattern to identify the end of the filler.
    Amazon list [...]:
      Pre-filler pattern: "<b>List Price:</b> <span [...]

[...] Template [...]
  Extract slots in order, starting the search for the filler of the n+1 slot where the filler for the nth slot ended. Assumes slots always in a fixed order.
    [...] List [...] …
  Make patterns specific enough to identify each filler always starting from the beginning of the document.

Wrapper [...]
  Highly regular [...]
  Relatively simple extraction patterns
  [...] learning algorithm
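
The pre-filler / filler / post-filler idea, combined with in-order slot extraction, can be illustrated with ordinary regular expressions. The sketch below is hedged accordingly: the slides' actual patterns are truncated in this copy, so the regular expressions, the `class=listprice` attribute, the slot names and the sample page are all assumptions made for the example.

```python
import re

# (slot name, pre-filler pattern, filler pattern, post-filler pattern), in page order.
# All patterns are illustrative guesses, not the patterns from the slides.
SLOTS = [
    ("list_price", r"<b>List Price:</b>\s*<span[^>]*>", r"\$\d+(?:\.\d{2})?", r"</span>"),
    ("price", r"<b>Our Price:</b>\s*<font[^>]*>", r"\$\d+(?:\.\d{2})?", r"</font>"),
]

# Hypothetical page fragment, loosely modelled on the Amazon example
SAMPLE_PAGE = (
    "<b>List Price:</b> <span class=listprice>$14.95</span><br>"
    "<b>Our Price:</b> <font color=#990000>$11.96</font><br>"
    "<b>You Save:</b> <font color=#990000><b>$2.99</b></font>"
)


def extract(page, slots=SLOTS):
    """Fill slots in order; the search for slot n+1 starts where filler n ended."""
    record, pos = {}, 0
    for name, pre, filler, post in slots:
        pattern = re.compile(f"{pre}({filler}){post}")
        match = pattern.search(page, pos)
        if match is None:
            continue  # slot not found; leave it unfilled
        record[name] = match.group(1)
        pos = match.end(1)
    return record


if __name__ == "__main__":
    print(extract(SAMPLE_PAGE))
```

Because each search resumes after the previous filler, this version assumes the slots always appear in a fixed order, which is the same assumption the text states.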

Writing accurate patterns for each slot for each [...] (e.g. each web site) requires laborious software [...]
Alternative is to use machine [...]:
  Build a training set of documents paired with human-produced filled extraction templates.
  Learn extraction patterns for each slot using an appropriate machine learning algorithm.

Wrapper [...]: delimiter-based [...]
  Use <B>, </B>, <I>, </I> for [...]

Learning LR [...]
  labeled [...]
  [Figure: labeled example pages, including <B>Egypt</B>]

LR wrapper: ⟨l1, r1, …, lK, rK⟩
Example: find 4 strings ⟨<B>, </B>, <I>, </I>⟩ = ⟨l1, r1, l2, r2⟩

LR: Finding [...]
  <B>Congo</B> <B>Egypt</B> <B>Belize</B> <B>Spain</B>

LR: Finding l1, l2 and [...]
  <B>Congo</B> <B>Egypt</B> <B>Belize</B> <B>Spain</B>

A problem with LR [...]
  Distracting text in head and [...]
  <B>Congo</B> <B>Egypt</B> <B>Belize</B>

One (of many) solutions: [...]
  Ignore page's head and [...]
  <B>Congo</B> <B>Egypt</B> <B>Belize</B>
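
How an LR wrapper is executed, and why distracting text in the page head causes trouble, can be shown with plain string search. This is a sketch under stated assumptions: the functions `exec_lr` and `exec_hlrt`, the sample page, the `<I>` values, and the choice of `<P>` and `<HR>` as head and tail delimiters are all hypothetical and not taken from the slides.

```python
def exec_lr(page, delimiters):
    """Apply an LR wrapper: delimiters is [(l1, r1), ..., (lK, rK)].

    Repeatedly scans for l1 ... r1, l2 ... r2, and so on, collecting the text
    between each pair, until a left or right delimiter can no longer be found.
    """
    rows, pos = [], 0
    while True:
        row = []
        for left, right in delimiters:
            start = page.find(left, pos)
            if start == -1:
                return rows
            start += len(left)
            end = page.find(right, start)
            if end == -1:
                return rows
            row.append(page[start:end])
            pos = end + len(right)
        rows.append(tuple(row))


def exec_hlrt(page, head, tail, delimiters):
    """Apply the LR delimiters only between the head and tail delimiters."""
    start = page.find(head)
    if start == -1:
        return []
    start += len(head)
    end = page.find(tail, start)
    if end == -1:
        end = len(page)
    return exec_lr(page[start:end], delimiters)


# Hypothetical page: the bold heading in the head and the bold text in the
# tail are the kind of distracting text an LR wrapper cannot skip.
PAGE = (
    "<HTML><BODY><B>Some Country Codes</B><P>"
    "<B>Congo</B> <I>242</I><BR><B>Egypt</B> <I>20</I><BR>"
    "<B>Belize</B> <I>501</I><BR><B>Spain</B> <I>34</I><BR>"
    "<HR><B>End</B></BODY></HTML>"
)
LR = [("<B>", "</B>"), ("<I>", "</I>")]

if __name__ == "__main__":
    print(exec_lr(PAGE, LR))                    # first row is wrong
    print(exec_hlrt(PAGE, "<P>", "<HR>", LR))   # head and tail ignored
```

On this page the plain LR wrapper pairs the bold heading with the first italic value and loses "Congo", while the head/tail variant recovers all four rows.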

LR and HLRT wrappers are extremely [...]
  Though applicable to many tabular [...]
Recent wrapper induction research has explored more expressive wrapper classes:
  Disjunctive [...]
  Multiple attribute [...]
  Missing [...]
  Multiple-valued [...]
  Hierarchically nested [...]
  Wrapper verification and [...]

Boosted wrapper [...]
Wrapper induction is only ideal for rigidly-structured machine-generated HTML…
…or is [...]
Can we use simple patterns to extract from natural language [...]
  … Name: Dr. Jeffrey D. Hermes …
  … Who: Professor Manfred Paul …
  ... will be given by Dr. R. J. Pangborn …
  … Ms. Scott will be speaking …
  … Karen Shriver, Dept. of [...]
  … Maria Klawe, University of [...]

Natural Language Processing-based Information Extraction

If extracting from automatically generated web pages, simple regex patterns usually work.
If extracting from more natural, unstructured, human-written text, some NLP may help.
  Part-of-speech (POS) tagging: mark each word as a noun, verb, preposition, [...]
  Syntactic parsing: identify phrases: NP, VP, [...]
  Semantic word categories (e.g. from [...]
Extraction patterns can use POS or phrase [...]
  Crime [...]: Prefiller: [POS: V, Hypernym: [...]

MUC: the NLP genesis of IE
  DARPA funded significant efforts in IE in the early to mid 1990's.
  Message Understanding Conference (MUC) was an annual [...] competition where results were presented.
  Focused on extracting information from news [...]:
    Terrorist [...]
    Industrial joint [...]
    [...] management [...]
  Information extraction is of particular [...] to the intelligence [...]

Example of IE from FASTUS

Bridgestone Sports Co. said Friday it had set up a joint venture [...] with a local concern and a Japanese trading house to produce golf clubs to be [...] to Japan. The joint venture, Bridgestone Sports [...] Co., capitalized at 20 million new [...] dollars, will start production in January 1990 with production of 20,000 iron and "metal wood" clubs a month.

  [...]: "iron and 'metal wood' [...]
  Start [...]
  Relationship: TIE-[...]
  [...]: "a local concern", "a Japanese trading [...]
  Joint [...]: "Bridgestone Sports [...]

Based on finite state automata (FSA) [...]

  Complex [...]
    [...] set [...]
  2. Basic [...]: simple noun groups, verb groups and [...]
    a Japanese trading [...]
    had set [...]
    production of 20,000 iron and metal wood [...]
  Complex [...]: complex noun groups and verb [...]
  [...]: patterns for events of interest to the [...]
    Basic templates are to be [...]
  Merging [...]: templates from different parts of the texts [are] merged if they provide information about [the] same entity or [...]

Rule-based Extraction [...]

Determining which [...] holds what office in what [...]
  [...], [office] of [...]
    Vuk Draskovic, leader of the Serbian Renewal [...]
  [org] (named, appointed, etc.) [...] P [...]
    NATO appointed Wesley Clark as Commander in [...]
Determining where an organization is [...]
  [org] in [...]
    NATO headquarters in [...]
  [org] [loc] (division, branch, headquarters, [...]
    KFOR Kosovo [...]

Learning for [...]

Writing accurate patterns for each slot for each [...] (e.g. each web site) requires laborious software [...]
Alternative is to use machine [...]:
  Build a training set of documents paired with human-produced filled extraction templates.
  Learn extraction patterns for each slot using an appropriate machine learning algorithm.

Statistical generative [...]

Sequence models are statistical models of whole token sequences that effectively label subsequences.
Best known case is generative Hidden Markov Models (HMMs).
[...]
  Well-understood underlying statistical models make it easy to use a wide range of tools from statistical decision [...]
  Portable, broad coverage, robust, good [...]
[...]
  Range of features and patterns usable may be [...]
  Not necessarily as good for complex multi-slot [...]

Name Extraction via [...]

Applying HMMs to [...]

  Document: generated by a stochastic process modelled by an HMM
  Token: [...]
  State: "[...]nation" for a given [...]
    'Background' state emits tokens like 'the', 'said', …
    'Money' state emits tokens like 'million', 'euro', …
    'Organization' state emits tokens like 'university', '[...]', …
  Extraction: via the Viterbi algorithm, a dynamic programming technique for efficiently computing the most likely sequence of [states] that generated [a document].

HMM [...]

  HMM = probabilistic FSA
  HMM =
    states s1, s2, … (special start state s1, special end state sn)
    token alphabet a1, a2, …
    state transition probs P(si | sj)
    token emission probs P(ai | sj)
  Widely used in many language processing tasks, e.g., speech recognition [Lee, 1989], POS tagging [Kupiec, 1992], topic detection [Yamron et al, [...]

HMM for research papers: transitions [Seymore et al., 99]

HMM for research papers: emissions [Seymore et al., 99]
  ICML [...] to appear [...]

  university of california, dartmouth college [...]
  Trained on 2 million words of BibTeX data from the [...]
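
Extraction with an HMM, as described above, amounts to running the Viterbi algorithm over the tokens of a document. The sketch below uses the three example states from the text; every probability in it, the `FLOOR` value for unseen tokens and the function name `viterbi` are made-up assumptions for illustration, not parameters from any trained model.

```python
import math

STATES = ["background", "money", "organization"]

# Illustrative, hand-set parameters (assumptions, not learned values)
START = {"background": 0.8, "money": 0.1, "organization": 0.1}
TRANS = {
    "background": {"background": 0.6, "money": 0.2, "organization": 0.2},
    "money": {"background": 0.4, "money": 0.5, "organization": 0.1},
    "organization": {"background": 0.5, "money": 0.1, "organization": 0.4},
}
EMIT = {
    "background": {"the": 0.4, "said": 0.3, "paid": 0.2},
    "money": {"million": 0.5, "euro": 0.4},
    "organization": {"university": 0.5, "company": 0.4},
}
FLOOR = 1e-6  # probability given to tokens a state has never emitted


def viterbi(tokens):
    """Return the most likely state sequence for tokens (log-space dynamic programming)."""
    if not tokens:
        return []

    def emit(state, token):
        return math.log(EMIT[state].get(token, FLOOR))

    # best[s] = log-probability of the best path that ends in state s
    best = {s: math.log(START[s]) + emit(s, tokens[0]) for s in STATES}
    backpointers = []
    for token in tokens[1:]:
        prev, best, pointer = best, {}, {}
        for s in STATES:
            p = max(STATES, key=lambda q: prev[q] + math.log(TRANS[q][s]))
            best[s] = prev[p] + math.log(TRANS[p][s]) + emit(s, token)
            pointer[s] = p
        backpointers.append(pointer)

    # Follow back-pointers from the best final state
    state = max(STATES, key=best.get)
    path = [state]
    for pointer in reversed(backpointers):
        state = pointer[state]
        path.append(state)
    return path[::-1]


if __name__ == "__main__":
    tokens = ["the", "university", "paid", "million", "euro"]
    for token, state in zip(tokens, viterbi(tokens)):
        print(f"{token}\t{state}")
```

Tokens labelled with a non-background state form the extracted fields; working in log space avoids numeric underflow on long documents.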

Learning [...]

Good news: if training data tokens are tagged with their generating states, then simple frequency ratios are a maximum-likelihood estimate of transition/emission [probabilities]. Easy. (Use smoothing to avoid zero prob[...]
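
Estimating HMM parameters from state-tagged tokens by frequency ratios, with smoothing, might look like the following. It is a minimal sketch: additive (add-alpha) smoothing is one assumed choice among many, and the `"<start>"` pseudo-state, the function name `train` and the toy data are hypothetical.

```python
from collections import Counter, defaultdict


def train(tagged_sequences, alpha=1.0):
    """Estimate transition and emission probabilities from tagged token sequences.

    tagged_sequences: list of sequences of (token, state) pairs.
    alpha: additive smoothing constant (alpha=0 gives plain frequency ratios).
    Returns (p_trans, p_emit) as nested dicts: p_trans[prev][state], p_emit[state][token].
    """
    trans, emit = defaultdict(Counter), defaultdict(Counter)
    states, vocab = set(), set()
    for sequence in tagged_sequences:
        prev = "<start>"  # assumed pseudo-state marking the start of a document
        for token, state in sequence:
            trans[prev][state] += 1
            emit[state][token] += 1
            states.add(state)
            vocab.add(token)
            prev = state

    def smooth(counts, outcomes):
        # Frequency ratio with alpha added to every count, so no outcome gets zero
        total = sum(counts.values()) + alpha * len(outcomes)
        return {o: (counts[o] + alpha) / total for o in outcomes}

    p_trans = {s: smooth(trans[s], states) for s in states | {"<start>"}}
    p_emit = {s: smooth(emit[s], vocab) for s in states}
    return p_trans, p_emit


if __name__ == "__main__":
    data = [
        [("the", "background"), ("university", "organization")],
        [("the", "background"), ("million", "money")],
    ]
    p_trans, p_emit = train(data)
    print(p_trans["<start>"])
    print(p_emit["background"])
```

With `alpha=1.0` an unseen emission such as 'million' from the background state still receives a small non-zero probability, which is the purpose of smoothing mentioned in the text.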
