信息检索六tfidf汇总PPT课件_第1页
信息检索六tfidf汇总PPT课件_第2页
信息检索六tfidf汇总PPT课件_第3页
信息检索六tfidf汇总PPT课件_第4页
信息检索六tfidf汇总PPT课件_第5页
已阅读5页,还剩44页未读 继续免费阅读

下载本文档

版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领

文档简介

-,1,湖南大学计算机与通信学院刘钰峰,互联网信息搜索六tfidfandvectorspaces,-,2,回顾,1、中文分词2、词典压缩3、postinglist压缩4、tfidf,-,3,Scoringdocuments,Howdoweconstructanindex?Whatstrategiescanweusewithlimitedmainmemory?,-,4,Scoring,WewishtoreturninorderthedocumentsmostlikelytobeusefultothesearcherHowcanwerankorderthedocsinthecorpuswithrespecttoaquery?Assignascoresayin0,1foreachdoconeachqueryBeginwithaperfectworldnospammersNobodystuffingkeywordsintoadoctomakeitmatchqueriesMoreon“adversarialIR”underwebsearch,-,5,Linearzonecombinations,Firstgenerationofscoringmethods:usealinearcombinationofBooleans:E.g.,Score=0.6*+0.3*+0.05*+0.05*Eachexpressionsuchastakesonavaluein0,1.Thentheoverallscoreisin0,1.,Forthisexamplethescorescanonlytakeonafinitesetofvalueswhatarethey?,-,6,Exercise,OnthequerybillORrightssupposethatweretrievethefollowingdocsfromthevariouszoneindexes:,bill,rights,bill,rights,bill,rights,Author,Title,Body,1,5,2,8,3,3,5,9,2,5,1,5,8,3,9,9,Computethescoreforeachdocbasedontheweightings0.6,0.3,0.1,-,7,Generalidea,Wearegivenaweightvectorwhosecomponentssumupto1.Thereisaweightforeachzone/field.GivenaBooleanquery,weassignascoretoeachdocbyaddinguptheweightedcontributionsofthezones/fields.TypicallyuserswanttoseetheKhighest-scoringdocs.,-,8,Indexsupportforzonecombinations,InthesimplestversionwehaveaseparateinvertedindexforeachzoneVariant:haveasingleindexwithaseparatedictionaryentryforeachtermandzoneE.g.,bill.author,bill.title,bill.body,1,2,5,8,3,2,5,1,9,Ofcourse,compresszonenameslikeauthor/title/body.,-,9,Zonecombinationsindex,Theaboveschemeisstillwasteful:eachtermispotentiallyreplicatedforeachzoneInaslightlybetterscheme,weencodethezoneinthepostings:Atquerytime,accumulatecontributionstothetotalscoreofadocumentfromthevariouspostings,e.g.,bill,1.author,1.body,2.author,2.body,3.title,Asbefore,thezonenamesgetcompressed.,-,10,bill,1.author,1.body,2.author,2.body,3.title,rights,3.title,3.body,5.title,5.body,Scoreaccumulation,AswewalkthepostingsforthequerybillORrights,weaccumulatescoresforeachdocinalinearmergeasbefore.Note:wegetbothbillandrightsintheTitlefieldofdoc3,butscoreitnohigher.Shouldwegivemoreweighttomorehits?,1235,-,11,-,12,Term-documentcountmatrices,Considerthenumberofoccurrencesofaterminadocument:BagofwordsmodelDocumentisavector:acolumnbelow,-,13,Bagofwordsviewofadoc,ThusthedocJohnisquickerthanMary.isindistinguishablefromthedocMaryisquickerthanJohn.,Whichoftheindexesdiscussedsofardistinguishthesetwodocs?,-,14,Countsvs.frequencies,WARNING:InalotofIRliterature,“frequency”isusedtomean“count”ThustermfrequencyinIRliteratureisusedtomeannumberofoccurrencesinadocNotdividedbydocumentlength(whichwouldactuallymakeitafrequency)WewillconformtothismisnomerInsayingtermfrequencywemeanthenumberofoccurrencesofaterminadocument.,-,15,Termfrequencytf,LongdocsarefavoredbecausetheyremorelikelytocontainquerytermsCanfixthistosomeextentbynormalizingfordocumentlengthButisrawtftherightmeasure?,-,16,Documentfrequency,Butdocumentfrequency(df)maybebetter:df=numberofdocsinthecorpuscontainingthetermWordcfdfferrari1042217insurance104403997Document/collectionfrequencyweightingisonlypossibleinknown(static)collection.Sohowdowemakeuseofdf?,-,17,tfxidftermweights,tfxidfmeasurecombines:termfrequency(tf)orwf,somemeasureoftermdensityinadocinversedocumentfrequency(idf)measureofinformativenessofaterm:itsrarityacrossthewholecorpuscouldjustberawcountofnumberofdocumentsthetermoccursin(idfi=1/dfi)butbyfarthemostcommonlyusedversionis:SeeKishorePapineni,NAACL2,2002fortheoreticaljustification,-,18,Summary:tfxidf(ortf.idf),Assignatf.idfweighttoeachtermiineachdocumentdIncreaseswiththenumberofoccurrenceswithinadocIncreaseswiththerarityofthetermacrossthewholecorpus,-,19,再论TF,-,20,Real-valuedterm-documentmatrices,Function(scaling)ofcountofawordinadocument:BagofwordsmodelEachisavectorinvHerelog-scaledtf.idf,Notecanbe1!,-,21,Documentsasvectors,Eachdocjcannowbeviewedasavectorofwfidfvalues,onecomponentforeachtermSowehaveavectorspacetermsareaxesdocsliveinthisspaceevenwithstemming,mayhave20,000+dimensions(Thecorpusofdocumentsgivesusamatrix,whichwecouldalsoviewasavectorspaceinwhichwordslivetransposabledata),-,22,Whyturndocsintovectors?,Firstapplication:Query-by-exampleGivenadocd,findothers“like”it.Nowthatdisavector,findvectors(docs)“near”it.,-,23,Intuition,Postulate:Documentsthatare“closetogether”inthevectorspacetalkaboutthesamethings.,t1,d2,d1,d3,d4,d5,t3,t2,-,24,Cosinesimilarity,Distancebetweenvectorsd1andd2capturedbythecosineoftheanglexbetweenthem.Notethisissimilarity,notdistanceNotriangleinequalityforsimilarity.,-,25,Cosinesimilarity,Avectorcanbenormalized(givenalengthof1)bydividingeachofitscomponentsbyitslengthhereweusetheL2normThismapsvectorsontotheunitsphere:Then,Longerdocumentsdontgetmoreweight,-,26,Cosinesimilarity,CosineofanglebetweentwovectorsThedenominatorinvolvesthelengthsofthevectors.,Normalization,-,27,Normalizedvectors,Fornormalizedvectors,thecosineissimplythedotproduct:,-,28,Example,Docs:AustensSenseandSensibility,PrideandPrejudice;BrontesWutheringHeights.tfweightscos(SAS,PAP)=.996x.993+.087x.120+.017x0.0=0.999cos(SAS,WH)=.996x.847+.087x.466+.017x.254=0.889,-,29,Queriesinthevectorspacemodel,Centralidea:thequeryasavector:WeregardthequeryasshortdocumentWereturnthedocumentsrankedbytheclosenessoftheirvectorstothequery,alsorepresentedasavector.Notethatdqisverysparse!,-,30,Summary:Whatsthepointofusingvectorspaces?,Awell-formedalgebraicspaceforretrievalKey:Ausersquerycanbeviewedasa(very)shortdocument.Querybecomesavectorinthesamespaceasthedocs.Canmeasureeachdocsproximitytoit.Naturalmeasureofscores/rankingnolongerBoolean.QueriesareexpressedasbagsofwordsOthersimilaritymeasures:see/strehl/diss/node52.htmlforasurvey,-,31,Digression:spammingindices,Thiswasallinventedbeforethedayswhenpeoplewereinthebusinessofspammingwebsearchengines.Consider:Indexingasensiblepassivedocumentcollectionvs.Anactivedocumentcollection,wherepeople(andindeed,servicecompanies)areshapingdocumentsinordertomaximizescoresVectorspacesimilaritymaynotbeasusefulinthiscontext.,-,32,Interaction:vectorsandphrases,Scoringphrasesdoesntfitnaturallyintothevectorspaceworld:“tangerinetrees”“marmaladeskies”Positionalindexesdontcalculateorstoretf.idfinformationfor“tangerinetrees”BiwordindexestreatcertainphrasesastermsForthese,wecanpre-computetf.idf.TheoreticalproblemofcorrelateddimensionsProblem:wecannotexpectend-userformulatingqueriestoknowwhatphrasesareindexedWecanuseapositionalindextoboostorensurephraseoccurrence,-,33,VectorsandBooleanqueries,VectorsandBooleanqueriesreallydontworktogetherverywellInthespaceofterms,vectorproximityselectsbyspheres:e.g.,alldocshavingcosinesimilarity0.5tothequeryBooleanqueriesontheotherhand,selectby(hyper-)rectanglesandtheirunions/intersectionsRoundpeg-squarehole,-,34,Vectorspacesandotheroperators,Vectorspacequeriesareaptforno-syntax,bag-of-wordsqueriesCleanmetaphorforsimilar-documentqueriesNotagoodcombinationwithBoolean,wild-card,positionalqueryoperatorsBut,-,35,Querylanguagevs.scoring,Mayallowuseracertainquerylanguage,sayFreetextbasicqueriesPhrase,wildcardetc.inAdvancedQueries.Forscoring(oblivioustouser)mayusealloftheabove,e.g.forafreetextqueryHighest-rankedhitshavequeryasaphraseNext,docsthathaveallquerytermsneareachotherThen,docsthathavesomequeryterms,orallofthemspreadout,withtfxidfweightsforscoring,-,36,Efficientcosineranking,Findthekdocsinthecorpus“nearest”tothequeryklargestquery-doccosines.Efficientranking:Computingasinglecosineefficiently.Choosingtheklargestcosinevaluesefficiently.Canwedothiswithoutcomputingallncosines?n=numberofdocumentsincollection,-,37,Efficientcosineranking,Whatweredoingineffect:solvingthek-nearestneighborproblemforaqueryvectorIngeneral,wedonotknowhowtodothisefficientlyforhigh-dimensionalspacesButitissolvableforshortqueries,andstandardindexesareoptimizedtodothis,-,38,Useheapforselectingtopk,BinarytreeinwhicheachnodesvaluethevaluesofchildrenTakes2noperationstoconstruct,theneachofklogn“winners”readoffin2lognsteps.Forn=1M,k=100,thisisabout10%ofthecostofsorting.,1,.9,.3,.8,.3,.1,.1,-,39,Computingasinglecosine,Foreverytermi,witheachdocj,storetermfrequencytfij.Sometradeoffsonwhethertostoretermcount,termweight,orweightedbyidfi.Atquerytime,useanarrayofaccumulatorsAjtoaccumulatecomponent-wisesumIfyoureindexing5billiondocuments(websearch)anarrayofaccumulatorsisinfeasible,Ideas?,-,40,Computingtheklargestcosines:selectionvs.sorting,Typicallywewanttoretrievethetopkdocs(inthecosinerankingforthequery)nottototallyorderalldocsinthecorpusCanwepickoffdocswithkhighestcosines?,-,41,Canweavoidallthiscomputation?,Yes,butmayoccasionallygetananswerwrongadocnotinthetopkmaycreepintotheanswer.,-,42,Limitingtheaccumulators:Bestmcandidates,Preprocess:Pre-compute,foreachterm,itsmnearestdocs.(Treateachtermasa1-termquery.)lotsofpreprocessing.Result:“preferredlist”foreachterm.Search:Forat-termquery,taketheunionoftheirtpreferredlistscallthissetS,where|S|mt.Computecosinesfromt

温馨提示

  • 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
  • 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
  • 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
  • 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
  • 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
  • 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
  • 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。

评论

0/150

提交评论