大规模数据处理/云计算 (Large-Scale Data Processing / Cloud Computing)
Lecture 6: Text Retrieval Algorithms
彭波 (Bo Peng), 北京大学信息科学技术学院 (School of EECS, Peking University), 7/21/2011
Course page: /course/cs402/
This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License. See /licenses/by-nc-sa/3.0/us/ for details.
Original slides by Jimmy Lin, University of Maryland. Course development (课程建设): SEWMGroup.

Today's Agenda
- Introduction to information retrieval
- Basics of indexing and retrieval
- Inverted indexing in MapReduce
- Retrieval at scale

First, Nomenclature
- Information retrieval (IR): the focus here is textual information (= text/document retrieval); other possibilities include image, video, music, ...
- What do we search? Generically, "collections"; less frequently, "corpora"
- What do we find? Generically, "documents", even though we may be referring to web pages, PDFs, PowerPoint slides, paragraphs, etc.

The Central Problem in Search
- Searcher and author each start from concepts: the searcher expresses them as query terms, the author as document terms
- Do these represent the same concepts? Compare "tragic love story" with "fateful star-crossed romance"

Abstract IR Architecture
- Offline: document acquisition (e.g., web crawling), a representation function that turns documents into document representations, and an index built over them
- Online: the query goes through a representation function to produce a query representation; a comparison function matches it against the index to produce hits

How do we represent text?
- Remember: computers don't "understand" anything!
- "Bag of words": treat all the words in a document as index terms; assign a "weight" to each term based on "importance" (or, in the simplest case, presence/absence of the word); disregard order, structure, meaning, etc. of the words
- Simple, yet effective!
- Assumptions: term occurrence is independent; document relevance is independent; "words" are well-defined

What's a word?
- 天主教教宗若望保祿二世因感冒再度住進醫院。這是他今年第二度因同樣的病因住院。 (Chinese, written without spaces: "Pope John Paul II was admitted to hospital again because of a cold; this is his second hospitalization this year for the same illness.")
- 日米連合台頭中国対処前副長官提言 (a Japanese headline, likewise unsegmented)
- Tokenization and word segmentation are far from trivial outside English.

Sample Document
McDonald's slims down spuds
Fast-food chain to reduce certain types of fat in its french fries with new cooking oil.
NEW YORK (CNN/Money) - McDonald's Corp. is cutting the amount of "bad" fat in its french fries nearly in half, the fast-food chain said Tuesday as it moves to make all its fried menu items healthier.
But does that mean the popular shoestring fries won't taste the same? The company says no. "It's a win-win for our customers because they are getting the same great french-fry taste along with an even healthier nutrition profile," said Mike Roberts, president of McDonald's USA.
But others are not so sure. McDonald's will not specifically discuss the kind of oil it plans to use, but at least one nutrition expert says playing with the formula could mean a different taste.
Shares of Oak Brook, Ill.-based McDonald's (MCD: down $0.54 to $23.22, Research, Estimates) were lower Tuesday afternoon. It was unclear Tuesday whether competitors Burger King and Wendy's International (WEN: down $0.80 to $34.91, Research, Estimates) would follow suit.
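The bag-of-words step just described can be sketched in a few lines of Python. This is a minimal illustration under the simplest assumptions (case folding and regex tokenization only); a real pipeline would add stopword removal and stemming, and the function name is mine, not from the slides:

```python
import re
from collections import Counter

def bag_of_words(text):
    """Case-fold, tokenize on runs of word characters, and count occurrences."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(tokens)

print(bag_of_words("one fish, two fish"))  # Counter({'fish': 2, 'one': 1, 'two': 1})
```

Note that order and structure are discarded: only the term-to-count mapping survives, which is exactly the representation the inverted index is built from.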
Neither company could immediately be reached for comment.

"Bag of Words" (term counts for the sample document)
14 McDonald's; 12 fat; 11 fries; 8 new; 7 french; 6 company, said, nutrition; 5 food, oil, percent, reduce, taste, Tuesday

Counting Words
- Documents are turned into bags of words (case folding, tokenization, stopword removal, stemming), and the bags of words into an inverted index
- What is thrown away: syntax, semantics, word knowledge, etc.

Boolean Retrieval
- Users express queries as a Boolean expression: AND, OR, NOT, which can be arbitrarily nested
- Retrieval is based on the notion of sets: any given query divides the collection into two sets, retrieved and not retrieved
- Pure Boolean systems do not define an ordering of the results

Inverted Index: Boolean Retrieval
Collection: Doc 1 "one fish, two fish"; Doc 2 "red fish, blue fish"; Doc 3 "cat in the hat"; Doc 4 "green eggs and ham"

term   postings (docnos)
blue   2
cat    3
egg    4
fish   1, 2
green  4
ham    4
hat    3
one    1
red    2
two    1

Boolean Retrieval (execution)
- To execute a Boolean query: build the query syntax tree; for each clause, look up the postings; traverse the postings and apply the Boolean operators
- Example: (blue AND fish) OR ham
- Efficiency analysis: postings traversal is linear (assuming sorted postings); start with the shortest postings list first

Strengths and Weaknesses
- Strengths: precise, if you know the right strategies; precise, if you have an idea of what you're looking for; implementations are fast and efficient
- Weaknesses: users must learn Boolean logic; Boolean logic is insufficient to capture the richness of language; no control over the size of the result set, either too many hits or none; when do you stop reading? All documents in the result set are considered "equally good".
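The Boolean evaluation just outlined, looking up postings and merging them, reduces to linear merges of sorted docno lists. A minimal sketch (the toy index and helper names are my own, not from the slides):

```python
def postings_and(p1, p2):
    """Intersect two sorted postings lists in O(len(p1) + len(p2))."""
    i, j, out = 0, 0, []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out

def postings_or(p1, p2):
    """Union of two sorted postings lists, result sorted."""
    return sorted(set(p1) | set(p2))

# Toy index for: Doc 1 "one fish, two fish", Doc 2 "red fish, blue fish",
# Doc 3 "cat in the hat", Doc 4 "green eggs and ham"
index = {"blue": [2], "fish": [1, 2], "ham": [4]}

# (blue AND fish) OR ham
print(postings_or(postings_and(index["blue"], index["fish"]), index["ham"]))  # [2, 4]
```

Starting with the shortest postings list matters because the intersection can never grow beyond the shorter input, so the merge terminates as soon as that list is exhausted.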
- What about partial matches? Documents that "don't quite match" the query may be useful also.

Ranked Retrieval
- Order documents by how likely they are to be relevant to the information need: estimate relevance(q, d_i), sort documents by relevance, display the sorted results
- User model: present hits one screen at a time, best results first; at any point, users can decide to stop looking
- How do we estimate relevance?

Vector Space Model
- Assumption: documents that are "close together" in vector space "talk about" the same things
- Therefore, retrieve documents based on how close the document is to the query (i.e., similarity as "closeness")

Similarity Metric
- Use the "angle" between the vectors (cosine similarity):
  sim(d_j, d_k) = (d_j · d_k) / (|d_j| |d_k|)
- Or, more generally, inner products:
  sim(d_j, d_k) = d_j · d_k = Σ_i (w_ij × w_ik)

Term Weighting
- Term weights consist of two components: local (how important is the term in this document?) and global (how important is the term in the collection?)
- The intuition: terms that appear often in a document should get high weights; terms that appear in many documents should get low weights
- This is captured mathematically by term frequency (local) and inverse document frequency (global)

TF.IDF Term Weighting
  w_ij = tf_ij × log(N / df_i)
where w_ij is the weight assigned to term i in document j; tf_ij is the number of occurrences of term i in document j; N is the number of documents in the entire collection; and df_i is the number of documents containing term i.

Inverted Index: TF.IDF
Postings now carry (docno, tf) pairs, and each term records its document frequency:

term   df  postings (docno, tf)
blue   1   (2,1)
cat    1   (3,1)
egg    1   (4,1)
fish   2   (1,2), (2,2)
green  1   (4,1)
ham    1   (4,1)
hat    1   (3,1)
one    1   (1,1)
red    1   (2,1)
two    1   (1,1)

Positional Indexes
- Store term positions in the postings
- Supports richer queries (e.g., proximity)
- Naturally leads to larger indexes
- Example: fish now appears as (1, 2, [2,4]) and (2, 2, [2,4]): docno, tf, and the positions at which "fish" occurs

Retrieval in a Nutshell
- Look up the postings lists corresponding to the query terms
- Traverse the postings for each query term
- Store partial query-document scores in accumulators
- Select the top k results to return

Retrieval: Document-at-a-Time
- Evaluate documents one at a time (score all query terms)
- Accumulators (e.g., a priority queue): is the document's score in the top k? Yes: insert the document score, extract-min if the queue grows too large. No: do nothing.
- Tradeoffs: small memory footprint (good); must read through all postings (bad), but skipping is possible; more disk seeks (bad), but blocking is possible

Retrieval: Query-at-a-Time
- Evaluate documents one query term at a time, usually starting from the rarest term (often with tf-sorted postings)
- Accumulators (e.g., a hash): Score_q(doc n) is accumulated as each term's postings are traversed
- Tradeoffs: early-termination heuristics (good); large memory footprint (bad), but filtering heuristics are possible

Quiz
- What is an inverted index?
- What is Boolean retrieval?
- What is the vector space model?
- Build an index for the page below and describe the text-processing steps involved.
- Write out how the two terms "blue" and "fish" are represented in the inverted index with positional information.
- On the inverted index built for this collection, execute ranked retrieval for the query "blue fish" and describe the computation step by step.
(Collection: Doc 1 "one fish, two fish"; Doc 2 "red fish, blue fish"; Doc 3 "cat in the hat"; Doc 4 "green eggs and ham")

MapReduce it?
- The indexing problem: scalability is critical; must be relatively fast, but need not be real time; fundamentally a batch operation; incremental updates may or may not be important; for the web, crawling is a challenge in itself. Perfect for MapReduce!
- The retrieval problem: must have sub-second response time; for the web, only need relatively few results. Uh... not so good.

Indexing: Performance Analysis
- How is it done on a single machine? Fundamentally, a large sorting problem: the terms usually fit in memory; the postings usually don't
- How can it be done with MapReduce?
- First, let's characterize the problem size: size of the vocabulary, and size of the postings

Vocabulary Size: Heaps' Law
  M = k T^b
- M is the vocabulary size; T is the collection size (number of tokens); k and b are constants
- Typically, k is between 30 and 100, and b is between 0.4 and 0.6
- Heaps' Law is linear in log-log space: vocabulary size grows unbounded!
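Heaps' Law as stated above is easy to check numerically. Using the RCV1 constants the deck quotes (k = 44, b = 0.49), the predicted vocabulary after roughly a million tokens lands very close to the 38,323 figure reported on the next slide:

```python
def heaps(T, k=44, b=0.49):
    """Heaps' law: predicted vocabulary size M = k * T^b."""
    return k * T ** b

# Prediction for the first 1,000,020 tokens of Reuters-RCV1
print(round(heaps(1_000_020)))  # ~38,322 distinct terms predicted
```

The exponent b < 1 means vocabulary grows without bound but ever more slowly, which is why terms fit in memory while postings do not.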
Heaps' Law for RCV1
- Reuters-RCV1 collection: 806,791 newswire documents (Aug 20, 1996 to Aug 19, 1997)
- k = 44, b = 0.49
- For the first 1,000,020 tokens: predicted vocabulary = 38,323 terms; actual = 38,365
(Manning, Raghavan, Schütze, Introduction to Information Retrieval, 2008)

Postings Size: Zipf's Law
- Zipf's Law is (also) linear in log-log space; it is a specific case of the power-law distributions
  cf_i = c / i
  where cf_i is the collection frequency of the i-th most common term and c is a constant
- In other words: a few elements occur very frequently; many elements occur very infrequently

Zipf's Law for RCV1
- Reuters-RCV1 collection: 806,791 newswire documents (Aug 20, 1996 to Aug 19, 1997)
- The fit isn't that good... but it is good enough!
(Manning, Raghavan, Schütze, Introduction to Information Retrieval, 2008)

Power Laws Are Everywhere!
(Figure from: Newman, M. E. J. (2005) "Power laws, Pareto distributions and Zipf's law." Contemporary Physics 46:323-351.)

MapReduce: Index Construction
- Map over all documents: emit the term as key and (docno, tf) as value; emit other information as necessary (e.g., term position)
- Sort/shuffle: group the postings by term
- Reduce: gather and sort the postings (e.g., by docno or tf); write the postings to disk
- MapReduce does all the heavy lifting!

Inverted Indexing with MapReduce
- Map: Doc 1 "one fish, two fish" emits (one, (1,1)), (two, (1,1)), (fish, (1,2)); Doc 2 "red fish, blue fish" emits (red, (2,1)), (blue, (2,1)), (fish, (2,2)); Doc 3 "cat in the hat" emits (cat, (3,1)), (hat, (3,1))
- Shuffle and sort: aggregate values by keys
- Reduce: fish → (1,2), (2,2); one → (1,1); two → (1,1); red → (2,1); blue → (2,1); cat → (3,1); hat → (3,1)

Positional Indexes (with MapReduce)
- The same pipeline with positions carried along: Doc 1 emits (fish, (1, 2, [2,4])), etc.; after the shuffle and reduce, fish → (1, 2, [2,4]), (2, 2, [2,4])

Inverted Indexing: Pseudo-Code
- What's the problem?
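The map/shuffle/reduce flow described above can be simulated in plain Python to make the pseudo-code concrete. This is a single-process sketch of the baseline design (term as key, (docno, tf) as value); the helper names and the naive tokenizer are mine:

```python
from collections import Counter, defaultdict

def map_doc(docno, text):
    """Mapper: emit (term, (docno, tf)) for each distinct term in the document."""
    tf = Counter(text.lower().replace(",", "").split())
    for term, count in tf.items():
        yield term, (docno, count)

def reduce_postings(term, values):
    """Reducer: sort the buffered postings by docno and emit the final list."""
    return term, sorted(values)

docs = {1: "one fish, two fish", 2: "red fish, blue fish",
        3: "cat in the hat", 4: "green eggs and ham"}

# Shuffle/sort: group mapper output by key (term)
groups = defaultdict(list)
for docno, text in docs.items():
    for term, posting in map_doc(docno, text):
        groups[term].append(posting)

index = dict(reduce_postings(t, vs) for t, vs in groups.items())
print(index["fish"])  # [(1, 2), (2, 2)]
```

Note that this reducer buffers and sorts each term's complete postings list in memory, which is precisely the weakness the next slide turns into a scalability question.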
Scalability Bottleneck
- Initial implementation: terms as keys, postings as values
- The reducers must buffer all postings associated with a key in order to sort them
- What if we run out of memory to buffer the postings? Uh oh!

Another Try
- Before: (key) fish, (values) (1,2), (9,1), (21,3), (34,1), (35,2), (80,3)
- After: (keys) (fish,1), (fish,9), (fish,21), (fish,34), (fish,35), (fish,80), (values) 2, 1, 3, 1, 2, 3
- How is this different? Let the framework do the sorting; the term frequency is implicitly stored; directly write the postings to disk!
- Where have we seen this before?

Postings Encoding
- Conceptually: fish → (1,2), (9,1), (21,3), (34,1), (35,2), (80,3)
- In practice: don't encode docnos, encode gaps (or d-gaps): fish → (1,2), (8,1), (12,3), (13,1), (1,2), (45,3)
- But it's not obvious that this saves space...

Overview of Index Compression
- Byte-aligned vs. bit-aligned codes
- Non-parameterized bit-aligned: unary codes, γ codes, δ codes
- Parameterized bit-aligned: Golomb codes (local Bernoulli model)
- Want more detail? Read Managing Gigabytes by Witten, Moffat, and Bell!

Unary Codes
- x >= 1 is coded as x - 1 one bits, followed by 1 zero bit
- 3 = 110; 4 = 1110
- Great for small numbers... horrible for large numbers: overly biased toward very small gaps
- Watch out! Slightly different definitions appear in different textbooks

γ Codes
- x >= 1 is coded in two parts: length and offset
- Start with x in binary and remove the highest-order bit: that is the offset
- The length is the number of binary digits, encoded in unary
- Concatenate length + offset
- Example: 9 in binary is 1001; offset = 001; length = 4, in unary = 1110; γ code = 1110:001
- Analysis: the offset takes floor(log2 x) bits; the length takes floor(log2 x) + 1 bits; total = 2 floor(log2 x) + 1 bits

δ Codes
- Similar to γ codes, except that the length is encoded in a γ code
- Example: 9 in binary is 1001; offset = 001; length = 4, in γ code = 11000; δ code = 11000:001
- γ codes are more compact for smaller numbers; δ codes are more compact for larger numbers

Golomb Codes
- For x >= 1 and parameter b: encode q + 1 in unary, where q = floor((x - 1) / b); then encode r in truncated binary, where r = x - qb - 1, using floor(log2 b) or ceil(log2 b) bits
- Example: b = 3 gives r in {0, 1, 2}, coded as 0, 10, 11; b = 6 gives r in {0, 1, 2, 3, 4, 5}, coded as 00, 01, 100, 101, 110, 111
- x = 9, b = 3: q = 2, r = 2, code = 110:11
- x = 9, b = 6: q = 1, r = 2, code = 10:100
- Optimal b is approximately 0.69 (N/df): a different b for every term!

Comparison of Coding Schemes
(Witten, Moffat, Bell, Managing Gigabytes, 1999)

 x  Unary       γ         δ          Golomb b=3  Golomb b=6
 1  0           0         0          0:0         0:00
 2  10          10:0      100:0      0:10        0:01
 3  110         10:1      100:1      0:11        0:100
 4  1110        110:00    101:00     10:0        0:101
 5  11110       110:01    101:01     10:10       0:110
 6  111110      110:10    101:10     10:11       0:111
 7  1111110     110:11    101:11     110:0       10:00
 8  11111110    1110:000  11000:000  110:10      10:01
 9  111111110   1110:001  11000:001  110:11      10:100
10  1111111110  1110:010  11000:010  1110:0      10:101

Index Compression: Performance
Comparison of index size in bits per pointer (Witten, Moffat, Bell, Managing Gigabytes, 1999):

        Bible  TREC
Unary   262    1918
Binary  15     20
γ       6.51   6.63
δ       6.23   6.38
Golomb  6.09   5.84

Bible: King James version of the Bible, 31,101 verses (4.3 MB). TREC: TREC disks 1+2, 741,856 docs (2070 MB). Golomb coding is the recommended best practice.

Chicken and Egg?
- We would like the reducer to write postings such as (fish,1) → 2, (fish,9) → 1, ... directly to disk as they arrive
- But wait! How do we set the Golomb parameter b? Recall: the optimal b is approximately 0.69 (N/df)
- We need the df to set b, but we don't know the df until we've seen all the postings!
- Sound familiar?

Getting the df
- In the mapper: emit "special" key-value pairs to keep track of the df
- In the reducer: make sure the "special" key-value pairs come first; process them to determine the df (and hence b) before processing that term's postings
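The unary, γ, δ, and Golomb constructions described above are short enough to implement directly. The sketch below reproduces the deck's worked examples for x = 9; the ":" separators shown on the slides are purely visual and are omitted from the emitted bitstrings:

```python
from math import ceil, log2

def unary(x):
    """x >= 1: (x - 1) one bits followed by a zero bit."""
    return "1" * (x - 1) + "0"

def gamma(x):
    """Offset = binary(x) minus its leading bit; length = bit count, in unary."""
    offset = bin(x)[3:]                  # drop '0b' and the highest-order bit
    return unary(len(offset) + 1) + offset

def delta(x):
    """Like gamma, but the length is itself gamma-coded."""
    offset = bin(x)[3:]
    return gamma(len(offset) + 1) + offset

def golomb(x, b):
    """q + 1 in unary (q = (x-1)//b), then r = x - q*b - 1 in truncated binary.

    Assumes b >= 2. Truncated binary spends k-1 bits on the first t = 2^k - b
    remainders and k bits on the rest, where k = ceil(log2(b)).
    """
    q, r = divmod(x - 1, b)
    k = ceil(log2(b))
    t = 2 ** k - b
    tail = format(r, f"0{k-1}b") if r < t else format(r + t, f"0{k}b")
    return unary(q + 1) + tail

print(gamma(9), delta(9), golomb(9, 3), golomb(9, 6))
# 1110001 11000001 11011 10100
```

The printed codes match the slides' examples (γ = 1110:001, δ = 11000:001, Golomb b=3 = 110:11, Golomb b=6 = 10:100), and each row of the comparison table can be regenerated the same way.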