大规模数据处理/云计算 (Large-Scale Data Processing / Cloud Computing)
Lecture 6: Text Retrieval Algorithms
彭波 (Bo Peng), 北京大学信息科学技术学院 (School of EECS, Peking University), 7/21/2011
Course page: /course/cs402/
This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License. See /licenses/by-nc-sa/3.0/us/ for details.
Original slides by Jimmy Lin, University of Maryland. Course development (课程建设): SEWMGroup.

Today's Agenda
- Introduction to information retrieval
- Basics of indexing and retrieval
- Inverted indexing in MapReduce
- Retrieval at scale

First, Nomenclature
- Information retrieval (IR): the focus here is textual information (= text/document retrieval); other possibilities include image, video, music, ...
- What do we search? Generically, "collections"; less frequently, "corpora"
- What do we find? Generically, "documents", even though we may be referring to web pages, PDFs, PowerPoint slides, paragraphs, etc.

The Central Problem in Search
- Searcher and author each start from concepts: the searcher expresses them as query terms, the author as document terms
- Do these represent the same concepts? Compare "tragic love story" with "fateful star-crossed romance"

Abstract IR Architecture
- Offline: document acquisition (e.g., web crawling), a representation function that turns documents into document representations, and an index built over them
- Online: the query goes through a representation function to produce a query representation; a comparison function matches it against the index to produce hits

How do we represent text?
- Remember: computers don't "understand" anything!
- "Bag of words": treat all the words in a document as index terms; assign a "weight" to each term based on "importance" (or, in the simplest case, presence/absence of the word); disregard order, structure, meaning, etc. of the words
- Simple, yet effective!
- Assumptions: term occurrence is independent; document relevance is independent; "words" are well-defined

What's a word?
- 天主教教宗若望保祿二世因感冒再度住進醫院。這是他今年第二度因同樣的病因住院。 (Chinese, written without spaces: "Pope John Paul II was admitted to hospital again because of a cold; this is his second hospitalization this year for the same illness.")
- 日米連合台頭中国対処前副長官提言 (a Japanese headline, likewise unsegmented)
- Tokenization and word segmentation are far from trivial outside English.

Sample Document
McDonald's slims down spuds
Fast-food chain to reduce certain types of fat in its french fries with new cooking oil.
NEW YORK (CNN/Money) - McDonald's Corp. is cutting the amount of "bad" fat in its french fries nearly in half, the fast-food chain said Tuesday as it moves to make all its fried menu items healthier.
But does that mean the popular shoestring fries won't taste the same? The company says no. "It's a win-win for our customers because they are getting the same great french-fry taste along with an even healthier nutrition profile," said Mike Roberts, president of McDonald's USA.
But others are not so sure. McDonald's will not specifically discuss the kind of oil it plans to use, but at least one nutrition expert says playing with the formula could mean a different taste.
Shares of Oak Brook, Ill.-based McDonald's (MCD: down $0.54 to $23.22, Research, Estimates) were lower Tuesday afternoon. It was unclear Tuesday whether competitors Burger King and Wendy's International (WEN: down $0.80 to $34.91, Research, Estimates) would follow suit.
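The bag-of-words step just described can be sketched in a few lines of Python. This is a minimal illustration under the simplest assumptions (case folding and regex tokenization only); a real pipeline would add stopword removal and stemming, and the function name is mine, not from the slides:

```python
import re
from collections import Counter

def bag_of_words(text):
    """Case-fold, tokenize on runs of word characters, and count occurrences."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(tokens)

print(bag_of_words("one fish, two fish"))  # Counter({'fish': 2, 'one': 1, 'two': 1})
```

Note that order and structure are discarded: only the term-to-count mapping survives, which is exactly the representation the inverted index is built from.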
Neither company could immediately be reached for comment.

"Bag of Words" (term counts for the sample document)
14 McDonald's; 12 fat; 11 fries; 8 new; 7 french; 6 company, said, nutrition; 5 food, oil, percent, reduce, taste, Tuesday

Counting Words
- Documents are turned into bags of words (case folding, tokenization, stopword removal, stemming), and the bags of words into an inverted index
- What is thrown away: syntax, semantics, word knowledge, etc.

Boolean Retrieval
- Users express queries as a Boolean expression: AND, OR, NOT, which can be arbitrarily nested
- Retrieval is based on the notion of sets: any given query divides the collection into two sets, retrieved and not retrieved
- Pure Boolean systems do not define an ordering of the results

Inverted Index: Boolean Retrieval
Collection: Doc 1 "one fish, two fish"; Doc 2 "red fish, blue fish"; Doc 3 "cat in the hat"; Doc 4 "green eggs and ham"

term   postings (docnos)
blue   2
cat    3
egg    4
fish   1, 2
green  4
ham    4
hat    3
one    1
red    2
two    1

Boolean Retrieval (execution)
- To execute a Boolean query: build the query syntax tree; for each clause, look up the postings; traverse the postings and apply the Boolean operators
- Example: (blue AND fish) OR ham
- Efficiency analysis: postings traversal is linear (assuming sorted postings); start with the shortest postings list first

Strengths and Weaknesses
- Strengths: precise, if you know the right strategies; precise, if you have an idea of what you're looking for; implementations are fast and efficient
- Weaknesses: users must learn Boolean logic; Boolean logic is insufficient to capture the richness of language; no control over the size of the result set, either too many hits or none; when do you stop reading? All documents in the result set are considered "equally good".
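The Boolean evaluation just outlined, looking up postings and merging them, reduces to linear merges of sorted docno lists. A minimal sketch (the toy index and helper names are my own, not from the slides):

```python
def postings_and(p1, p2):
    """Intersect two sorted postings lists in O(len(p1) + len(p2))."""
    i, j, out = 0, 0, []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out

def postings_or(p1, p2):
    """Union of two sorted postings lists, result sorted."""
    return sorted(set(p1) | set(p2))

# Toy index for: Doc 1 "one fish, two fish", Doc 2 "red fish, blue fish",
# Doc 3 "cat in the hat", Doc 4 "green eggs and ham"
index = {"blue": [2], "fish": [1, 2], "ham": [4]}

# (blue AND fish) OR ham
print(postings_or(postings_and(index["blue"], index["fish"]), index["ham"]))  # [2, 4]
```

Starting with the shortest postings list matters because the intersection can never grow beyond the shorter input, so the merge terminates as soon as that list is exhausted.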
- What about partial matches? Documents that "don't quite match" the query may be useful also.

Ranked Retrieval
- Order documents by how likely they are to be relevant to the information need: estimate relevance(q, d_i), sort documents by relevance, display the sorted results
- User model: present hits one screen at a time, best results first; at any point, users can decide to stop looking
- How do we estimate relevance?

Vector Space Model
- Assumption: documents that are "close together" in vector space "talk about" the same things
- Therefore, retrieve documents based on how close the document is to the query (i.e., similarity as "closeness")

Similarity Metric
- Use the "angle" between the vectors (cosine similarity):
  sim(d_j, d_k) = (d_j · d_k) / (|d_j| |d_k|)
- Or, more generally, inner products:
  sim(d_j, d_k) = d_j · d_k = Σ_i (w_ij × w_ik)

Term Weighting
- Term weights consist of two components: local (how important is the term in this document?) and global (how important is the term in the collection?)
- The intuition: terms that appear often in a document should get high weights; terms that appear in many documents should get low weights
- This is captured mathematically by term frequency (local) and inverse document frequency (global)

TF.IDF Term Weighting
  w_ij = tf_ij × log(N / df_i)
where w_ij is the weight assigned to term i in document j; tf_ij is the number of occurrences of term i in document j; N is the number of documents in the entire collection; and df_i is the number of documents containing term i.

Inverted Index: TF.IDF
Postings now carry (docno, tf) pairs, and each term records its document frequency:

term   df  postings (docno, tf)
blue   1   (2,1)
cat    1   (3,1)
egg    1   (4,1)
fish   2   (1,2), (2,2)
green  1   (4,1)
ham    1   (4,1)
hat    1   (3,1)
one    1   (1,1)
red    1   (2,1)
two    1   (1,1)

Positional Indexes
- Store term positions in the postings
- Supports richer queries (e.g., proximity)
- Naturally leads to larger indexes
- Example: fish now appears as (1, 2, [2,4]) and (2, 2, [2,4]): docno, tf, and the positions at which "fish" occurs

Retrieval in a Nutshell
- Look up the postings lists corresponding to the query terms
- Traverse the postings for each query term
- Store partial query-document scores in accumulators
- Select the top k results to return

Retrieval: Document-at-a-Time
- Evaluate documents one at a time (score all query terms)
- Accumulators (e.g., a priority queue): is the document's score in the top k? Yes: insert the document score, extract-min if the queue grows too large. No: do nothing.
- Tradeoffs: small memory footprint (good); must read through all postings (bad), but skipping is possible; more disk seeks (bad), but blocking is possible

Retrieval: Query-at-a-Time
- Evaluate documents one query term at a time, usually starting from the rarest term (often with tf-sorted postings)
- Accumulators (e.g., a hash): Score_q(doc n) is accumulated as each term's postings are traversed
- Tradeoffs: early-termination heuristics (good); large memory footprint (bad), but filtering heuristics are possible

Quiz
- What is an inverted index?
- What is Boolean retrieval?
- What is the vector space model?
- Build an index for the page below and describe the text-processing steps involved.
- Write out how the two terms "blue" and "fish" are represented in the inverted index with positional information.
- On the inverted index built for this collection, execute ranked retrieval for the query "blue fish" and describe the computation step by step.
(Collection: Doc 1 "one fish, two fish"; Doc 2 "red fish, blue fish"; Doc 3 "cat in the hat"; Doc 4 "green eggs and ham")

MapReduce it?
- The indexing problem: scalability is critical; must be relatively fast, but need not be real time; fundamentally a batch operation; incremental updates may or may not be important; for the web, crawling is a challenge in itself. Perfect for MapReduce!
- The retrieval problem: must have sub-second response time; for the web, only need relatively few results. Uh... not so good.

Indexing: Performance Analysis
- How is it done on a single machine? Fundamentally, a large sorting problem: the terms usually fit in memory; the postings usually don't
- How can it be done with MapReduce?
- First, let's characterize the problem size: size of the vocabulary, and size of the postings

Vocabulary Size: Heaps' Law
  M = k T^b
- M is the vocabulary size; T is the collection size (number of tokens); k and b are constants
- Typically, k is between 30 and 100, and b is between 0.4 and 0.6
- Heaps' Law is linear in log-log space: vocabulary size grows unbounded!
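Heaps' Law as stated above is easy to check numerically. Using the RCV1 constants the deck quotes (k = 44, b = 0.49), the predicted vocabulary after roughly a million tokens lands very close to the 38,323 figure reported on the next slide:

```python
def heaps(T, k=44, b=0.49):
    """Heaps' law: predicted vocabulary size M = k * T^b."""
    return k * T ** b

# Prediction for the first 1,000,020 tokens of Reuters-RCV1
print(round(heaps(1_000_020)))  # ~38,322 distinct terms predicted
```

The exponent b < 1 means vocabulary grows without bound but ever more slowly, which is why terms fit in memory while postings do not.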
Heaps' Law for RCV1
- Reuters-RCV1 collection: 806,791 newswire documents (Aug 20, 1996 to Aug 19, 1997)
- k = 44, b = 0.49
- For the first 1,000,020 tokens: predicted vocabulary = 38,323 terms; actual = 38,365
(Manning, Raghavan, Schütze, Introduction to Information Retrieval, 2008)

Postings Size: Zipf's Law
- Zipf's Law is (also) linear in log-log space; it is a specific case of the power-law distributions
  cf_i = c / i
  where cf_i is the collection frequency of the i-th most common term and c is a constant
- In other words: a few elements occur very frequently; many elements occur very infrequently

Zipf's Law for RCV1
- Reuters-RCV1 collection: 806,791 newswire documents (Aug 20, 1996 to Aug 19, 1997)
- The fit isn't that good... but it is good enough!
(Manning, Raghavan, Schütze, Introduction to Information Retrieval, 2008)

Power Laws Are Everywhere!
(Figure from: Newman, M. E. J. (2005) "Power laws, Pareto distributions and Zipf's law." Contemporary Physics 46:323-351.)

MapReduce: Index Construction
- Map over all documents: emit the term as key and (docno, tf) as value; emit other information as necessary (e.g., term position)
- Sort/shuffle: group the postings by term
- Reduce: gather and sort the postings (e.g., by docno or tf); write the postings to disk
- MapReduce does all the heavy lifting!

Inverted Indexing with MapReduce
- Map: Doc 1 "one fish, two fish" emits (one, (1,1)), (two, (1,1)), (fish, (1,2)); Doc 2 "red fish, blue fish" emits (red, (2,1)), (blue, (2,1)), (fish, (2,2)); Doc 3 "cat in the hat" emits (cat, (3,1)), (hat, (3,1))
- Shuffle and sort: aggregate values by keys
- Reduce: fish → (1,2), (2,2); one → (1,1); two → (1,1); red → (2,1); blue → (2,1); cat → (3,1); hat → (3,1)

Positional Indexes (with MapReduce)
- The same pipeline with positions carried along: Doc 1 emits (fish, (1, 2, [2,4])), etc.; after the shuffle and reduce, fish → (1, 2, [2,4]), (2, 2, [2,4])

Inverted Indexing: Pseudo-Code
- What's the problem?
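The map/shuffle/reduce flow described above can be simulated in plain Python to make the pseudo-code concrete. This is a single-process sketch of the baseline design (term as key, (docno, tf) as value); the helper names and the naive tokenizer are mine:

```python
from collections import Counter, defaultdict

def map_doc(docno, text):
    """Mapper: emit (term, (docno, tf)) for each distinct term in the document."""
    tf = Counter(text.lower().replace(",", "").split())
    for term, count in tf.items():
        yield term, (docno, count)

def reduce_postings(term, values):
    """Reducer: sort the buffered postings by docno and emit the final list."""
    return term, sorted(values)

docs = {1: "one fish, two fish", 2: "red fish, blue fish",
        3: "cat in the hat", 4: "green eggs and ham"}

# Shuffle/sort: group mapper output by key (term)
groups = defaultdict(list)
for docno, text in docs.items():
    for term, posting in map_doc(docno, text):
        groups[term].append(posting)

index = dict(reduce_postings(t, vs) for t, vs in groups.items())
print(index["fish"])  # [(1, 2), (2, 2)]
```

Note that this reducer buffers and sorts each term's complete postings list in memory, which is precisely the weakness the next slide turns into a scalability question.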
Scalability Bottleneck
- Initial implementation: terms as keys, postings as values
- The reducers must buffer all postings associated with a key in order to sort them
- What if we run out of memory to buffer the postings? Uh oh!

Another Try
- Before: (key) fish, (values) (1,2), (9,1), (21,3), (34,1), (35,2), (80,3)
- After: (keys) (fish,1), (fish,9), (fish,21), (fish,34), (fish,35), (fish,80), (values) 2, 1, 3, 1, 2, 3
- How is this different? Let the framework do the sorting; the term frequency is implicitly stored; directly write the postings to disk!
- Where have we seen this before?

Postings Encoding
- Conceptually: fish → (1,2), (9,1), (21,3), (34,1), (35,2), (80,3)
- In practice: don't encode docnos, encode gaps (or d-gaps): fish → (1,2), (8,1), (12,3), (13,1), (1,2), (45,3)
- But it's not obvious that this saves space...

Overview of Index Compression
- Byte-aligned vs. bit-aligned codes
- Non-parameterized bit-aligned: unary codes, γ codes, δ codes
- Parameterized bit-aligned: Golomb codes (local Bernoulli model)
- Want more detail? Read Managing Gigabytes by Witten, Moffat, and Bell!

Unary Codes
- x >= 1 is coded as x - 1 one bits, followed by 1 zero bit
- 3 = 110; 4 = 1110
- Great for small numbers... horrible for large numbers: overly biased toward very small gaps
- Watch out! Slightly different definitions appear in different textbooks

γ Codes
- x >= 1 is coded in two parts: length and offset
- Start with x in binary and remove the highest-order bit: that is the offset
- The length is the number of binary digits, encoded in unary
- Concatenate length + offset
- Example: 9 in binary is 1001; offset = 001; length = 4, in unary = 1110; γ code = 1110:001
- Analysis: the offset takes floor(log2 x) bits; the length takes floor(log2 x) + 1 bits; total = 2 floor(log2 x) + 1 bits

δ Codes
- Similar to γ codes, except that the length is encoded in a γ code
- Example: 9 in binary is 1001; offset = 001; length = 4, in γ code = 11000; δ code = 11000:001
- γ codes are more compact for smaller numbers; δ codes are more compact for larger numbers

Golomb Codes
- For x >= 1 and parameter b: encode q + 1 in unary, where q = floor((x - 1) / b); then encode r in truncated binary, where r = x - qb - 1, using floor(log2 b) or ceil(log2 b) bits
- Example: b = 3 gives r in {0, 1, 2}, coded as 0, 10, 11; b = 6 gives r in {0, 1, 2, 3, 4, 5}, coded as 00, 01, 100, 101, 110, 111
- x = 9, b = 3: q = 2, r = 2, code = 110:11
- x = 9, b = 6: q = 1, r = 2, code = 10:100
- Optimal b is approximately 0.69 (N/df): a different b for every term!

Comparison of Coding Schemes
(Witten, Moffat, Bell, Managing Gigabytes, 1999)

 x  Unary       γ         δ          Golomb b=3  Golomb b=6
 1  0           0         0          0:0         0:00
 2  10          10:0      100:0      0:10        0:01
 3  110         10:1      100:1      0:11        0:100
 4  1110        110:00    101:00     10:0        0:101
 5  11110       110:01    101:01     10:10       0:110
 6  111110      110:10    101:10     10:11       0:111
 7  1111110     110:11    101:11     110:0       10:00
 8  11111110    1110:000  11000:000  110:10      10:01
 9  111111110   1110:001  11000:001  110:11      10:100
10  1111111110  1110:010  11000:010  1110:0      10:101

Index Compression: Performance
Comparison of index size in bits per pointer (Witten, Moffat, Bell, Managing Gigabytes, 1999):

        Bible  TREC
Unary   262    1918
Binary  15     20
γ       6.51   6.63
δ       6.23   6.38
Golomb  6.09   5.84

Bible: King James version of the Bible, 31,101 verses (4.3 MB). TREC: TREC disks 1+2, 741,856 docs (2070 MB). Golomb coding is the recommended best practice.

Chicken and Egg?
- We would like the reducer to write postings such as (fish,1) → 2, (fish,9) → 1, ... directly to disk as they arrive
- But wait! How do we set the Golomb parameter b? Recall: the optimal b is approximately 0.69 (N/df)
- We need the df to set b, but we don't know the df until we've seen all the postings!
- Sound familiar?

Getting the df
- In the mapper: emit "special" key-value pairs to keep track of the df
- In the reducer: make sure the "special" key-value pairs come first; process them to determine the df (and hence b) before processing that term's postings
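The unary, γ, δ, and Golomb constructions described above are short enough to implement directly. The sketch below reproduces the deck's worked examples for x = 9; the ":" separators shown on the slides are purely visual and are omitted from the emitted bitstrings:

```python
from math import ceil, log2

def unary(x):
    """x >= 1: (x - 1) one bits followed by a zero bit."""
    return "1" * (x - 1) + "0"

def gamma(x):
    """Offset = binary(x) minus its leading bit; length = bit count, in unary."""
    offset = bin(x)[3:]                  # drop '0b' and the highest-order bit
    return unary(len(offset) + 1) + offset

def delta(x):
    """Like gamma, but the length is itself gamma-coded."""
    offset = bin(x)[3:]
    return gamma(len(offset) + 1) + offset

def golomb(x, b):
    """q + 1 in unary (q = (x-1)//b), then r = x - q*b - 1 in truncated binary.

    Assumes b >= 2. Truncated binary spends k-1 bits on the first t = 2^k - b
    remainders and k bits on the rest, where k = ceil(log2(b)).
    """
    q, r = divmod(x - 1, b)
    k = ceil(log2(b))
    t = 2 ** k - b
    tail = format(r, f"0{k-1}b") if r < t else format(r + t, f"0{k}b")
    return unary(q + 1) + tail

print(gamma(9), delta(9), golomb(9, 3), golomb(9, 6))
# 1110001 11000001 11011 10100
```

The printed codes match the slides' examples (γ = 1110:001, δ = 11000:001, Golomb b=3 = 110:11, Golomb b=6 = 10:100), and each row of the comparison table can be regenerated the same way.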