信息检索六tfidf汇总PPT课件

上传人：优*** IP属地：广东上传时间：2020-05-23 格式：PPT 页数：49 大小：442.50KB 积分：35 举报 版权申诉

已阅读5页，还剩44页未读，继续免费阅读

版权说明：本文档由用户提供并上传，收益归属内容提供方，若内容存在侵权，请进行举报或认领

文档简介

-,1,湖南大学计算机与通信学院刘钰峰,互联网信息搜索六tfidfandvectorspaces,-,2,回顾,1、中文分词2、词典压缩3、postinglist压缩4、tfidf,-,3,Scoringdocuments,Howdoweconstructanindex?Whatstrategiescanweusewithlimitedmainmemory?,-,4,Scoring,WewishtoreturninorderthedocumentsmostlikelytobeusefultothesearcherHowcanwerankorderthedocsinthecorpuswithrespecttoaquery?Assignascoresayin0,1foreachdoconeachqueryBeginwithaperfectworldnospammersNobodystuffingkeywordsintoadoctomakeitmatchqueriesMoreon“adversarialIR”underwebsearch,-,5,Linearzonecombinations,Firstgenerationofscoringmethods:usealinearcombinationofBooleans:E.g.,Score=0.6*+0.3*+0.05*+0.05*Eachexpressionsuchastakesonavaluein0,1.Thentheoverallscoreisin0,1.,Forthisexamplethescorescanonlytakeonafinitesetofvalueswhatarethey?,-,6,Exercise,OnthequerybillORrightssupposethatweretrievethefollowingdocsfromthevariouszoneindexes:,bill,rights,bill,rights,bill,rights,Author,Title,Body,1,5,2,8,3,3,5,9,2,5,1,5,8,3,9,9,Computethescoreforeachdocbasedontheweightings0.6,0.3,0.1,-,7,Generalidea,Wearegivenaweightvectorwhosecomponentssumupto1.Thereisaweightforeachzone/field.GivenaBooleanquery,weassignascoretoeachdocbyaddinguptheweightedcontributionsofthezones/fields.TypicallyuserswanttoseetheKhighest-scoringdocs.,-,8,Indexsupportforzonecombinations,InthesimplestversionwehaveaseparateinvertedindexforeachzoneVariant:haveasingleindexwithaseparatedictionaryentryforeachtermandzoneE.g.,bill.author,bill.title,bill.body,1,2,5,8,3,2,5,1,9,Ofcourse,compresszonenameslikeauthor/title/body.,-,9,Zonecombinationsindex,Theaboveschemeisstillwasteful:eachtermispotentiallyreplicatedforeachzoneInaslightlybetterscheme,weencodethezoneinthepostings:Atquerytime,accumulatecontributionstothetotalscoreofadocumentfromthevariouspostings,e.g.,bill,1.author,1.body,2.author,2.body,3.title,Asbefore,thezonenamesgetcompressed.,-,10,bill,1.author,1.body,2.author,2.body,3.title,rights,3.title,3.body,5.title,5.body,Scoreaccumulation,AswewalkthepostingsforthequerybillORrights,weaccumulatescoresforeachdocinalinearmergeasbefore.Note:wegetbothbillandrightsintheTitlefieldofdoc3,butscoreitnohigher.Shouldwegivemoreweighttomorehits?,1235,-,11,-,12,Term-documentcountmatrices,Considerthenumberofoccurrencesofaterminadocument:BagofwordsmodelDocumentisavector:acolumnbelow,-,13,Bagofwordsviewofadoc,ThusthedocJohnisquickerthanMary.isindistinguishablefromthedocMaryisquickerthanJohn.,Whichoftheindexesdiscussedsofardistinguishthesetwodocs?,-,14,Countsvs.frequencies,WARNING:InalotofIRliterature,“frequency”isusedtomean“count”ThustermfrequencyinIRliteratureisusedtomeannumberofoccurrencesinadocNotdividedbydocumentlength(whichwouldactuallymakeitafrequency)WewillconformtothismisnomerInsayingtermfrequencywemeanthenumberofoccurrencesofaterminadocument.,-,15,Termfrequencytf,LongdocsarefavoredbecausetheyremorelikelytocontainquerytermsCanfixthistosomeextentbynormalizingfordocumentlengthButisrawtftherightmeasure?,-,16,Documentfrequency,Butdocumentfrequency(df)maybebetter:df=numberofdocsinthecorpuscontainingthetermWordcfdfferrari1042217insurance104403997Document/collectionfrequencyweightingisonlypossibleinknown(static)collection.Sohowdowemakeuseofdf?,-,17,tfxidftermweights,tfxidfmeasurecombines:termfrequency(tf)orwf,somemeasureoftermdensityinadocinversedocumentfrequency(idf)measureofinformativenessofaterm:itsrarityacrossthewholecorpuscouldjustberawcountofnumberofdocumentsthetermoccursin(idfi=1/dfi)butbyfarthemostcommonlyusedversionis:SeeKishorePapineni,NAACL2,2002fortheoreticaljustification,-,18,Summary:tfxidf(ortf.idf),Assignatf.idfweighttoeachtermiineachdocumentdIncreaseswiththenumberofoccurrenceswithinadocIncreaseswiththerarityofthetermacrossthewholecorpus,-,19,再论TF,-,20,Real-valuedterm-documentmatrices,Function(scaling)ofcountofawordinadocument:BagofwordsmodelEachisavectorinvHerelog-scaledtf.idf,Notecanbe1!,-,21,Documentsasvectors,Eachdocjcannowbeviewedasavectorofwfidfvalues,onecomponentforeachtermSowehaveavectorspacetermsareaxesdocsliveinthisspaceevenwithstemming,mayhave20,000+dimensions(Thecorpusofdocumentsgivesusamatrix,whichwecouldalsoviewasavectorspaceinwhichwordslivetransposabledata),-,22,Whyturndocsintovectors?,Firstapplication:Query-by-exampleGivenadocd,findothers“like”it.Nowthatdisavector,findvectors(docs)“near”it.,-,23,Intuition,Postulate:Documentsthatare“closetogether”inthevectorspacetalkaboutthesamethings.,t1,d2,d1,d3,d4,d5,t3,t2,-,24,Cosinesimilarity,Distancebetweenvectorsd1andd2capturedbythecosineoftheanglexbetweenthem.Notethisissimilarity,notdistanceNotriangleinequalityforsimilarity.,-,25,Cosinesimilarity,Avectorcanbenormalized(givenalengthof1)bydividingeachofitscomponentsbyitslengthhereweusetheL2normThismapsvectorsontotheunitsphere:Then,Longerdocumentsdontgetmoreweight,-,26,Cosinesimilarity,CosineofanglebetweentwovectorsThedenominatorinvolvesthelengthsofthevectors.,Normalization,-,27,Normalizedvectors,Fornormalizedvectors,thecosineissimplythedotproduct:,-,28,Example,Docs:AustensSenseandSensibility,PrideandPrejudice;BrontesWutheringHeights.tfweightscos(SAS,PAP)=.996x.993+.087x.120+.017x0.0=0.999cos(SAS,WH)=.996x.847+.087x.466+.017x.254=0.889,-,29,Queriesinthevectorspacemodel,Centralidea:thequeryasavector:WeregardthequeryasshortdocumentWereturnthedocumentsrankedbytheclosenessoftheirvectorstothequery,alsorepresentedasavector.Notethatdqisverysparse!,-,30,Summary:Whatsthepointofusingvectorspaces?,Awell-formedalgebraicspaceforretrievalKey:Ausersquerycanbeviewedasa(very)shortdocument.Querybecomesavectorinthesamespaceasthedocs.Canmeasureeachdocsproximitytoit.Naturalmeasureofscores/rankingnolongerBoolean.QueriesareexpressedasbagsofwordsOthersimilaritymeasures:see/strehl/diss/node52.htmlforasurvey,-,31,Digression:spammingindices,Thiswasallinventedbeforethedayswhenpeoplewereinthebusinessofspammingwebsearchengines.Consider:Indexingasensiblepassivedocumentcollectionvs.Anactivedocumentcollection,wherepeople(andindeed,servicecompanies)areshapingdocumentsinordertomaximizescoresVectorspacesimilaritymaynotbeasusefulinthiscontext.,-,32,Interaction:vectorsandphrases,Scoringphrasesdoesntfitnaturallyintothevectorspaceworld:“tangerinetrees”“marmaladeskies”Positionalindexesdontcalculateorstoretf.idfinformationfor“tangerinetrees”BiwordindexestreatcertainphrasesastermsForthese,wecanpre-computetf.idf.TheoreticalproblemofcorrelateddimensionsProblem:wecannotexpectend-userformulatingqueriestoknowwhatphrasesareindexedWecanuseapositionalindextoboostorensurephraseoccurrence,-,33,VectorsandBooleanqueries,VectorsandBooleanqueriesreallydontworktogetherverywellInthespaceofterms,vectorproximityselectsbyspheres:e.g.,alldocshavingcosinesimilarity0.5tothequeryBooleanqueriesontheotherhand,selectby(hyper-)rectanglesandtheirunions/intersectionsRoundpeg-squarehole,-,34,Vectorspacesandotheroperators,Vectorspacequeriesareaptforno-syntax,bag-of-wordsqueriesCleanmetaphorforsimilar-documentqueriesNotagoodcombinationwithBoolean,wild-card,positionalqueryoperatorsBut,-,35,Querylanguagevs.scoring,Mayallowuseracertainquerylanguage,sayFreetextbasicqueriesPhrase,wildcardetc.inAdvancedQueries.Forscoring(oblivioustouser)mayusealloftheabove,e.g.forafreetextqueryHighest-rankedhitshavequeryasaphraseNext,docsthathaveallquerytermsneareachotherThen,docsthathavesomequeryterms,orallofthemspreadout,withtfxidfweightsforscoring,-,36,Efficientcosineranking,Findthekdocsinthecorpus“nearest”tothequeryklargestquery-doccosines.Efficientranking:Computingasinglecosineefficiently.Choosingtheklargestcosinevaluesefficiently.Canwedothiswithoutcomputingallncosines?n=numberofdocumentsincollection,-,37,Efficientcosineranking,Whatweredoingineffect:solvingthek-nearestneighborproblemforaqueryvectorIngeneral,wedonotknowhowtodothisefficientlyforhigh-dimensionalspacesButitissolvableforshortqueries,andstandardindexesareoptimizedtodothis,-,38,Useheapforselectingtopk,BinarytreeinwhicheachnodesvaluethevaluesofchildrenTakes2noperationstoconstruct,theneachofklogn“winners”readoffin2lognsteps.Forn=1M,k=100,thisisabout10%ofthecostofsorting.,1,.9,.3,.8,.3,.1,.1,-,39,Computingasinglecosine,Foreverytermi,witheachdocj,storetermfrequencytfij.Sometradeoffsonwhethertostoretermcount,termweight,orweightedbyidfi.Atquerytime,useanarrayofaccumulatorsAjtoaccumulatecomponent-wisesumIfyoureindexing5billiondocuments(websearch)anarrayofaccumulatorsisinfeasible,Ideas?,-,40,Computingtheklargestcosines:selectionvs.sorting,Typicallywewanttoretrievethetopkdocs(inthecosinerankingforthequery)nottototallyorderalldocsinthecorpusCanwepickoffdocswithkhighestcosines?,-,41,Canweavoidallthiscomputation?,Yes,butmayoccasionallygetananswerwrongadocnotinthetopkmaycreepintotheanswer.,-,42,Limitingtheaccumulators:Bestmcandidates,Preprocess:Pre-compute,foreachterm,itsmnearestdocs.(Treateachtermasa1-termquery.)lotsofpreprocessing.Result:“preferredlist”foreachterm.Search:Forat-termquery,taketheunionoftheirtpreferredlistscallthissetS,where|S|mt.Computecosinesfromt

人人文库> 全部分类> 教育资料 > 课件下载

温馨提示

1. 本站所有资源如无特殊说明，都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
2. 本站的文档不包含任何第三方提供的附件图纸等，如果需要附件，请联系上传者。文件的所有权益归上传用户所有。
3. 本站RAR压缩包中若带图纸，网页内容里面会有图纸预览，若没有图纸预览就没有图纸。
4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
5. 人人文库网仅提供信息存储空间，仅对用户上传内容的表现方式做保护处理，对用户上传分享的文档内容本身不做任何修改或编辑，并不能对任何下载内容负责。
6. 下载文件中如有侵权或不适当内容，请与我们联系，我们立即纠正。
7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。

信息检索六tfidf汇总PPT课件

文档简介

温馨提示

最新文档

评论

信息检索六tfidf汇总PPT课件

文档简介

温馨提示

最新文档

评论

相关文档