已阅读5页,还剩44页未读, 继续免费阅读
版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领
文档简介
-,1,湖南大学计算机与通信学院刘钰峰,互联网信息搜索六tfidfandvectorspaces,-,2,回顾,1、中文分词2、词典压缩3、postinglist压缩4、tfidf,-,3,Scoringdocuments,Howdoweconstructanindex?Whatstrategiescanweusewithlimitedmainmemory?,-,4,Scoring,WewishtoreturninorderthedocumentsmostlikelytobeusefultothesearcherHowcanwerankorderthedocsinthecorpuswithrespecttoaquery?Assignascoresayin0,1foreachdoconeachqueryBeginwithaperfectworldnospammersNobodystuffingkeywordsintoadoctomakeitmatchqueriesMoreon“adversarialIR”underwebsearch,-,5,Linearzonecombinations,Firstgenerationofscoringmethods:usealinearcombinationofBooleans:E.g.,Score=0.6*+0.3*+0.05*+0.05*Eachexpressionsuchastakesonavaluein0,1.Thentheoverallscoreisin0,1.,Forthisexamplethescorescanonlytakeonafinitesetofvalueswhatarethey?,-,6,Exercise,OnthequerybillORrightssupposethatweretrievethefollowingdocsfromthevariouszoneindexes:,bill,rights,bill,rights,bill,rights,Author,Title,Body,1,5,2,8,3,3,5,9,2,5,1,5,8,3,9,9,Computethescoreforeachdocbasedontheweightings0.6,0.3,0.1,-,7,Generalidea,Wearegivenaweightvectorwhosecomponentssumupto1.Thereisaweightforeachzone/field.GivenaBooleanquery,weassignascoretoeachdocbyaddinguptheweightedcontributionsofthezones/fields.TypicallyuserswanttoseetheKhighest-scoringdocs.,-,8,Indexsupportforzonecombinations,InthesimplestversionwehaveaseparateinvertedindexforeachzoneVariant:haveasingleindexwithaseparatedictionaryentryforeachtermandzoneE.g.,bill.author,bill.title,bill.body,1,2,5,8,3,2,5,1,9,Ofcourse,compresszonenameslikeauthor/title/body.,-,9,Zonecombinationsindex,Theaboveschemeisstillwasteful:eachtermispotentiallyreplicatedforeachzoneInaslightlybetterscheme,weencodethezoneinthepostings:Atquerytime,accumulatecontributionstothetotalscoreofadocumentfromthevariouspostings,e.g.,bill,1.author,1.body,2.author,2.body,3.title,Asbefore,thezonenamesgetcompressed.,-,10,bill,1.author,1.body,2.author,2.body,3.title,rights,3.title,3.body,5.title,5.body,Scoreaccumulation,AswewalkthepostingsforthequerybillORrights,weaccumulatescoresforeachdocinalinearmergeasbefore.Note:wegetbothbillandrightsintheTitlefieldofdoc3,butscoreitnohigher.Shouldwegivemoreweighttomorehits?,1235,-,11,-,12,Term-documentcountmatrices,Considerthenumberofoccurrencesofaterminadocument:BagofwordsmodelDocumentisavector:acolumnbelow,-,13,Bagofwordsviewofadoc,ThusthedocJohnisquickerthanMary.isindistinguishablefromthedocMaryisquickerthanJohn.,Whichoftheindexesdiscussedsofardistinguishthesetwodocs?,-,14,Countsvs.frequencies,WARNING:InalotofIRliterature,“frequency”isusedtomean“count”ThustermfrequencyinIRliteratureisusedtomeannumberofoccurrencesinadocNotdividedbydocumentlength(whichwouldactuallymakeitafrequency)WewillconformtothismisnomerInsayingtermfrequencywemeanthenumberofoccurrencesofaterminadocument.,-,15,Termfrequencytf,LongdocsarefavoredbecausetheyremorelikelytocontainquerytermsCanfixthistosomeextentbynormalizingfordocumentlengthButisrawtftherightmeasure?,-,16,Documentfrequency,Butdocumentfrequency(df)maybebetter:df=numberofdocsinthecorpuscontainingthetermWordcfdfferrari1042217insurance104403997Document/collectionfrequencyweightingisonlypossibleinknown(static)collection.Sohowdowemakeuseofdf?,-,17,tfxidftermweights,tfxidfmeasurecombines:termfrequency(tf)orwf,somemeasureoftermdensityinadocinversedocumentfrequency(idf)measureofinformativenessofaterm:itsrarityacrossthewholecorpuscouldjustberawcountofnumberofdocumentsthetermoccursin(idfi=1/dfi)butbyfarthemostcommonlyusedversionis:SeeKishorePapineni,NAACL2,2002fortheoreticaljustification,-,18,Summary:tfxidf(ortf.idf),Assignatf.idfweighttoeachtermiineachdocumentdIncreaseswiththenumberofoccurrenceswithinadocIncreaseswiththerarityofthetermacrossthewholecorpus,-,19,再论TF,-,20,Real-valuedterm-documentmatrices,Function(scaling)ofcountofawordinadocument:BagofwordsmodelEachisavectorinvHerelog-scaledtf.idf,Notecanbe1!,-,21,Documentsasvectors,Eachdocjcannowbeviewedasavectorofwfidfvalues,onecomponentforeachtermSowehaveavectorspacetermsareaxesdocsliveinthisspaceevenwithstemming,mayhave20,000+dimensions(Thecorpusofdocumentsgivesusamatrix,whichwecouldalsoviewasavectorspaceinwhichwordslivetransposabledata),-,22,Whyturndocsintovectors?,Firstapplication:Query-by-exampleGivenadocd,findothers“like”it.Nowthatdisavector,findvectors(docs)“near”it.,-,23,Intuition,Postulate:Documentsthatare“closetogether”inthevectorspacetalkaboutthesamethings.,t1,d2,d1,d3,d4,d5,t3,t2,-,24,Cosinesimilarity,Distancebetweenvectorsd1andd2capturedbythecosineoftheanglexbetweenthem.Notethisissimilarity,notdistanceNotriangleinequalityforsimilarity.,-,25,Cosinesimilarity,Avectorcanbenormalized(givenalengthof1)bydividingeachofitscomponentsbyitslengthhereweusetheL2normThismapsvectorsontotheunitsphere:Then,Longerdocumentsdontgetmoreweight,-,26,Cosinesimilarity,CosineofanglebetweentwovectorsThedenominatorinvolvesthelengthsofthevectors.,Normalization,-,27,Normalizedvectors,Fornormalizedvectors,thecosineissimplythedotproduct:,-,28,Example,Docs:AustensSenseandSensibility,PrideandPrejudice;BrontesWutheringHeights.tfweightscos(SAS,PAP)=.996x.993+.087x.120+.017x0.0=0.999cos(SAS,WH)=.996x.847+.087x.466+.017x.254=0.889,-,29,Queriesinthevectorspacemodel,Centralidea:thequeryasavector:WeregardthequeryasshortdocumentWereturnthedocumentsrankedbytheclosenessoftheirvectorstothequery,alsorepresentedasavector.Notethatdqisverysparse!,-,30,Summary:Whatsthepointofusingvectorspaces?,Awell-formedalgebraicspaceforretrievalKey:Ausersquerycanbeviewedasa(very)shortdocument.Querybecomesavectorinthesamespaceasthedocs.Canmeasureeachdocsproximitytoit.Naturalmeasureofscores/rankingnolongerBoolean.QueriesareexpressedasbagsofwordsOthersimilaritymeasures:see/strehl/diss/node52.htmlforasurvey,-,31,Digression:spammingindices,Thiswasallinventedbeforethedayswhenpeoplewereinthebusinessofspammingwebsearchengines.Consider:Indexingasensiblepassivedocumentcollectionvs.Anactivedocumentcollection,wherepeople(andindeed,servicecompanies)areshapingdocumentsinordertomaximizescoresVectorspacesimilaritymaynotbeasusefulinthiscontext.,-,32,Interaction:vectorsandphrases,Scoringphrasesdoesntfitnaturallyintothevectorspaceworld:“tangerinetrees”“marmaladeskies”Positionalindexesdontcalculateorstoretf.idfinformationfor“tangerinetrees”BiwordindexestreatcertainphrasesastermsForthese,wecanpre-computetf.idf.TheoreticalproblemofcorrelateddimensionsProblem:wecannotexpectend-userformulatingqueriestoknowwhatphrasesareindexedWecanuseapositionalindextoboostorensurephraseoccurrence,-,33,VectorsandBooleanqueries,VectorsandBooleanqueriesreallydontworktogetherverywellInthespaceofterms,vectorproximityselectsbyspheres:e.g.,alldocshavingcosinesimilarity0.5tothequeryBooleanqueriesontheotherhand,selectby(hyper-)rectanglesandtheirunions/intersectionsRoundpeg-squarehole,-,34,Vectorspacesandotheroperators,Vectorspacequeriesareaptforno-syntax,bag-of-wordsqueriesCleanmetaphorforsimilar-documentqueriesNotagoodcombinationwithBoolean,wild-card,positionalqueryoperatorsBut,-,35,Querylanguagevs.scoring,Mayallowuseracertainquerylanguage,sayFreetextbasicqueriesPhrase,wildcardetc.inAdvancedQueries.Forscoring(oblivioustouser)mayusealloftheabove,e.g.forafreetextqueryHighest-rankedhitshavequeryasaphraseNext,docsthathaveallquerytermsneareachotherThen,docsthathavesomequeryterms,orallofthemspreadout,withtfxidfweightsforscoring,-,36,Efficientcosineranking,Findthekdocsinthecorpus“nearest”tothequeryklargestquery-doccosines.Efficientranking:Computingasinglecosineefficiently.Choosingtheklargestcosinevaluesefficiently.Canwedothiswithoutcomputingallncosines?n=numberofdocumentsincollection,-,37,Efficientcosineranking,Whatweredoingineffect:solvingthek-nearestneighborproblemforaqueryvectorIngeneral,wedonotknowhowtodothisefficientlyforhigh-dimensionalspacesButitissolvableforshortqueries,andstandardindexesareoptimizedtodothis,-,38,Useheapforselectingtopk,BinarytreeinwhicheachnodesvaluethevaluesofchildrenTakes2noperationstoconstruct,theneachofklogn“winners”readoffin2lognsteps.Forn=1M,k=100,thisisabout10%ofthecostofsorting.,1,.9,.3,.8,.3,.1,.1,-,39,Computingasinglecosine,Foreverytermi,witheachdocj,storetermfrequencytfij.Sometradeoffsonwhethertostoretermcount,termweight,orweightedbyidfi.Atquerytime,useanarrayofaccumulatorsAjtoaccumulatecomponent-wisesumIfyoureindexing5billiondocuments(websearch)anarrayofaccumulatorsisinfeasible,Ideas?,-,40,Computingtheklargestcosines:selectionvs.sorting,Typicallywewanttoretrievethetopkdocs(inthecosinerankingforthequery)nottototallyorderalldocsinthecorpusCanwepickoffdocswithkhighestcosines?,-,41,Canweavoidallthiscomputation?,Yes,butmayoccasionallygetananswerwrongadocnotinthetopkmaycreepintotheanswer.,-,42,Limitingtheaccumulators:Bestmcandidates,Preprocess:Pre-compute,foreachterm,itsmnearestdocs.(Treateachtermasa1-termquery.)lotsofpreprocessing.Result:“preferredlist”foreachterm.Search:Forat-termquery,taketheunionoftheirtpreferredlistscallthissetS,where|S|mt.Computecosinesfromt
温馨提示
- 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
- 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
- 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
- 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
- 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
- 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
- 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。
最新文档
- 设备健康管理平台-第2篇-洞察与解读
- 2025建筑外部装修设计合同范本
- 2025年建筑设备租赁协议合同范本
- 2025年精通商务英语口语轻松签订合同
- 色彩静物色块训练
- 2025版胃食道反流病症状分析及药物治疗护理详解
- 摄影数据价值挖掘-洞察与解读
- 2025临时产权车辆买卖合同书范文
- 大班健康活动我是小小营养师
- 关于西餐厅的介绍
- 高三数学知识点归纳笔记
- EPC总承包项目通用技术标模板
- 【中阮曲目艺术赏析】
- 轮机概论-大连海事大学
- 初中生必须掌握的3500字带拼音
- 大学生心理健康教育常见困扰与自我调适知到章节答案智慧树2023年浙江师范大学
- 广西民族大学624生物化学2007-2010,2012-2015,2017-2018,20-22年考研初试真题
- 室内燃气管道安装与验收标准
- 行政区域代码表Excel
- 题型06 函数的性质之周期性及蛙跳函数(解析版)
- 土壤质地分类
评论
0/150
提交评论