




文档简介
Improving Search Quality of the Google Search Appliance by Huy Nguyen Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of Master of Engineering in Computer Science at the MASSACHUSETTS INSTITUTE OF TECHNOLOGY May 2009 Huy Nguyen MMIX All rights reserved The author hereby grants to MIT permission to reproduce and distribute publicly paper and electronic copies of this thesis document in whole or in part Author MASSACHUSETTS INSTITE OF TECHNOLOGY JUL 2 0 2009 I LIBRARIES Department of Electrical Engineering and Computer Science May 22 2009 Certified by David Elworthy VI ompany Thesis Supervisor SThesis Supervisor Certified by q Regina Barzilay ssociate Professor S esis Supervisor Accepted by Arthur C Smith Chairman Department Committee on Graduate Students ARCHIVES Improving Search Quality of the Google Search Appliance by Huy Nguyen Submitted to the Department of Electrical Engineering and Computer Science on May 22 2009 in partial fulfillment of the requirements for the degree of Master of Engineering in Computer Science Abstract In this thesis we describe various experiments on the ranking function of the Google Search Appliance to improve search quality An evolutionary computation framework is implemented and applied to optimize various parameter settings of the ranking func tion We evaluate the importance of IDF in the ranking function and achieve small improvements in performance We also examine many ways to combining the query independent and query dependent scores Lastly we perform various experiments with signals based on the positions of the query terms in the document Thesis Supervisor David Elworthy Title VI A Company Thesis Supervisor Thesis Supervisor Regina Barzilay Title Associate Professor Acknowledgments I am deeply grateful to David Elworthy and Professor Regina Barzilay for providing me with directions and ideas throughout my work in information retrieval I also want to thank Liviu Panait Michael Parker Sibabrata Ray Emily Rocke Nikola Jevtic and the rest of the enterprise search team at Google who have been very helpful and supportive during my entire time at Google Contents 1 Introduction 1 1 Ranking in web and enterprise search 1 2 Our contribution 1 3 Structure of the thesis 2 Related Work 2 1 A web search system 2 1 1 The crawler 2 1 2 The indexer 2 1 3 The query processor 2 2 Document ranking 2 3 Page dependent signals 2 3 1 Weighted hit counts 2 3 2 Phrases 2 3 3 Term position signals 2 4 Page independent signals 2 4 1 In degree 2 4 2 HITS 2 4 3 Pagerank 2 5 Ranking models 2 5 1 Vector space model 2 5 2 Probabilistic ranking model 2 5 3 Language mixture model 5 15 15 16 17 18 19 19 20 21 21 22 22 23 24 26 26 28 29 o o o o o 2 6 Combination of different signals 31 2 7 Evaluation 32 2 7 1 Navigational queries 32 2 7 2 Informational queries 33 2 7 3 Evaluation metrics 33 2 7 4 TREC corpora and web track evaluation 34 3 Methodology 37 3 1 Document feature extraction 38 3 2 Nelder Mead algorithm 38 3 3 Evolutionary strategies 39 3 4 Implementation 40 3 5 Summary 40 4 Importance of IDF 41 4 1 Summary 43 5 Combination of query independent and query dependent measures 45 5 1 Method 45 5 2 Results 46 6 Signals based on term positions 51 6 1 Chronological term rank 51 6 2 Length of the shortest snippet 53 6 3 Optimal re ranking 53 6 4 Summary 54 7 Conclusion and future work 59 A Results of chronological term rank and snippet length experiments 61 List of Figures 4 1 Change in performance of system using query expansion 42 4 2 Change in performance of system not using query expansion 42 5 1 Two queries whose relevant documents are in different parameter ranges Blue dots are irrelevant documents red dots are relevant documents the vertical axis is the query independent score the horizontal axis is the query dependent score the blue curve is the optimal function combining the two scores 47 5 2 Two queries whose relevant documents are in the same parameter range 47 5 3 Two queries whose relevant documents have high query independent score 48 5 4 A typical query 49 5 5 A home page finding query where the query independent score is much more important than the query dependent score 49 6 1 Formulas for computing the contribution of CTR to the relevance score 52 6 2 Formulas for computing the contribution of the snippet length to the relevance score 52 6 3 Performance of the system with contribution from CTR using the func tion c 1 L The horizontal axis is c and the vertical axis is the metric score 55 6 4 Performance of the system with contribution from the snippet length using the function C 56 snippet 2 56 6 5 Re ranking with CTR on system without IDF 57 6 6 Re ranking with CTR on system with IDF 57 6 7 Re ranking with snippet length on system without IDF 58 6 8 Re ranking with snippet length on system with IDF 58 A 1 Performance of the system with contribution from CTR using the func tion o 62 log tr 2 A 2 Performance of the system with contribution from CTR using the func tion c 63 A 3 Performance of the system with contribution from CTR using the func tion cDLN 64 tr 1 A 4 Performance of the system with contribution from the snippet length using the function c log snippet 2 2 65 A 5 Performance of the system with contribution from the snippet length using the function c DLN 66 snippet 1 A 6 Performance of the system with contribution from the snippet length using the function clog DLN 67 log snippet 2 67 A 7 Performance of the system with contribution from the snippet length using the function c 1 snippa 68 DLN 6 Chapter 1 Introduction 1 1 Ranking in web and enterprise search In the past few decades there has been an explosion in the amount of content available on the World Wide Web WWW Navigating on the web and finding necessary information have become extremely important tasks as the web grows and the number of web pages a person needs to use quickly grows out of the manageable range of a person s memory Starting from directory services provided by companies such as Yahoo the task was handed over to automated search engines by Google Yahoo Microsoft etc As people get used to search tools available on the web they also look for similar ways to find information in private repositories such as companies private documents In some sense these repositories are similar to the public web as companies put their documents on internal websites with a link structure among documents just like on the web In some other ways however private repositories are much more heterogeneous than the web with a lot of documents stored in databases and in local files without any link between them The link structure of private corpora is different from that of the web in many respects Unlike in the web there is no artificial effort to boost the ranking of certain documents so there are much fewer spammy links in these corpora On the other hand in web search links from a different website usually give a better indication of popularity than links from the same website In private corpora such distinction usually does not exist as documents are spread out on many different machines in some arbitrary manner and sometimes on site links are actually equally important as off site links These and other structural differences make searching in an enterprise setting an interesting problem which requires re evaluation of the contributions and relative importance of link based signals and document based signals and the ways to combine them into a final score for the ranking function Most importantly the way the signals are combined should work well across many different types of corpora with different link structure characteristics and sizes As companies put more and more documents into searchable formats an interest ing challenge arises The private corpora have grown quickly beyond the scope one single machine can handle and require being divided into multiple parts to be handled by multiple machines This is known as the federated search problem where mul tiple machines serve multiple parts of the same corpus with the ability to combine the search results quickly and accurately To reduce communication cost improve latency and ease maintenance and extensibility it is desirable for each machine to be able to answer queries while not knowing about the rest of the corpus There fore signals based on the whole corpus becomes less preferable to alternatives with similar performance but easier to maintain One such signal is the inverse document frequency IDF which is based on the number of document in the corpus containing each search term When documents are spread out on many machines IDF can be skewed toward certain parts of the corpus so the system incorrectly favors results from those particular parts of the corpus Generally a search ranking function computes the scores based on the web struc ture of the whole corpus and the matches between the content of the documents and the query terms For the matches between the content of the documents and the query terms existing systems overwhelmingly use the bag of words model with the frequency of occurrences as the sole measure of relevancy Recently Troy and Zhang 37 came up with a new signal called Chronological Term Rank CTR based on the position of the occurrences in the document and achieved some improvements on a simple system tested on several TREC tasks This is a promising approach as it is based on information orthogonal to the traditional term frequencies and it is worth investigating further on more sophisticated systems on many different query types 1 2 Our contribution In this thesis we attempt to address the three problems described in the previous section for the ranking function of the Google Search Appliance by re evaluating the contributions of various signals and the ways to combine them in the ranking function Intuitively almost all signals by themselves have some discriminatory information on the relevancy of a document against a query However it is non trivial to combine them fairly with respect to their contributions and prevent noisy signals from ham pering the accurate ones For this purpose we have implemented an evolutionary strategies framework that can evaluate many parameter settings in parallel and given a particular ranking function optimize the parameter settings for that function Effect of IDF on the ranking function We evaluate the impact of the IDF on the ranking function by comparing the performance of the system with and without IDF on various TREC corpora and also several side by side experiments While the IDF is a good way to compare the relative importance of the query terms it can be noisy especially when the corpus is split into multiple separate parts e g in federated search Additionally the contribution from the IDF is reduced for the AND queries where all query terms must be present for a document to be considered the system does not have to choose between documents with only occurrences of the one query term and documents with only occurrences of another query term The contribution from the IDF can also be achieved partially with a list of stop words i e words providing little discriminatory information such as an and in etc It turns out that the contribution from IDF is marginal at best on medium size corpora of a million documents and somewhat harmful on small corpora We achieve a modest improvement on a small task while maintaining the same level of performance on bigger corpora Methods for combining different signals We look at the relative importance of the web based score and the content based score and the ways to combine these scores into a final score We explore the use of impact transformation 2 in combining them as well as various variations with cutoffs and varying behavior on different parameter ranges Interestingly simple linear combination functions work almost as well as more complicated quadratic ones with varying behavior on different parameter ranges even though there are a small number of queries where quadratic functions work slightly better Impact of position based signals We study the impact of position based signals including the CTR and the length of the shortest snippet containing all the search terms Previously the impacts of these signals were studied on a simple system where the way the occurrences are formatted is not taken into account i e an occurrence in a big font is considered the same as an occurrence in a small font In this thesis we explore the contributions of these signals in the context of AND queries on the Google Search Appliance which has a sophisticated weighting scheme taking into account the formatting of the occurrences Our experiment results are mixed with improved performance on some tasks but worse performance on other tasks We also compute upper bounds on the contributions of these signals These bounds indicate that while they might not help in general it is possible that they can provide some indication of relevance for home page queries 1 3 Structure of the thesis In chapter 2 we review the structure of a web search system and provide a broad overview of the major approaches to the web search problem Our discussion is mostly around the ranking component as it is arguably the heart of the system and also the focus of our research We then describe the various signals going into the ranking function and the three major approaches of combining those signals into a final score Finally we discuss our corpora the evaluation metrics and the relative importance of the metrics on different corpora Chapter 3 presents our system for running experiments and selecting weights using evolutionary strategies We describe our setup for parallelized evaluation and our use of evolutionary strategies to optimize the parameter settings Chapter 4 investigates the impact of IDF in the topicality score In this chapter we compare the performance of two settings one optimized for combining with IDF and one optimized for not using IDF on our evaluation corpora Besides the overall metrics we also look at individual queries where one setting performs better than the other and vice versa and the reason why it happens Chapter 5 looks at the use of impact transformation and various combining func tions for combining link structure measures and content based scores We start by describing our experiment and evaluation setup We then describe our findings by comparing the behavior of the scores and the optimal combining function in different parameter ranges Chapter 6 studies the effect of position based signals including the first hit position and the length of the shortest snippet containing all the query terms We present our experiment results using these signals with and without using IDF Then we discuss the optimal reranking using these signals which is an upper bound for the improvement gain of any method combining the existing score and the position based signals In chapter 7 we discuss the main ideas and contributions of the thesis and direc tions for future research 14 Chapter 2 Related Work In this chapter we describe the organization of a generic web search system and the related works in search ranking Section 2 1 describes the components of a search system the crawler the indexer and the query processor Then we give an overview of search ranking and the various signals for ranking in section 2 2 In section 2 3 we go over several ranking signals based on the content of the documents In section 2 4 we describe the signals based on the links between documents Then in section 2 5 we go over three main models in search ranking the vector space model the probabilistic ranking model and the language mixture model Section 2 6 describes the ways to combine different signals into a single final score Finally in section 2 7 we describe our evaluation metrics and our evaluation corpora and query sets 2 1 A web search system A web search system is a system for organizing and retrieving information using text based queries The user provides the system with a query consisting of several words and the system has to return a list of documents ordered by how relevant they are with respect to the query In big corpora of documents there could be thousands or even hundred thousands of documents matching a particular query Therefore the order in which the documents are presented to the user is crucial to the usefulness of the system The part producing the ordering is the main focus of this thesis Firstly we look at the general organization of a web search system and in partic ular the Google Search Appliance Generally a search system is a complex system with many components divided into three parts a crawler an indexer and a query processor The crawler is used to retrieve all documents in the corpora and prepare them to be indexed Usually the crawler is given several starting points and then follows the links on the web pages to find the rest of the documents In an enterprise setting the crawler also needs to be able to extract documents from databases and files on hard drives In the Google Search Appliance there is a connector framework allowing the crawler to read documents from databases and files The indexer reor ganizes information extracted from the documents to produce search indices which allows finding relevant documents quickly The query processor takes queries from users and finds matching documents using the indices produced by the indexer It also computes the relevancy of the matching documents and presents to the users the documents sorted in decreasing relevancy order In the subsequent sections we will explore these components in more details 2 1 1 The crawler The crawler explores the documents on the web through web links It is given a set of seed pages and gather the rest of the documents by following the hyperlinks in the web graph It usually maintains a priority queue of pages which det
温馨提示
- 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
- 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
- 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
- 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
- 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
- 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
- 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。
最新文档
- GB/T 46213-2025废旧粘结钕铁硼磁体再制造技术规范
- 化工传递安全正能量课件
- 养护公路安全培训内容记录课件
- 钢管外架承包协议书与钢管外架租赁承包合同8篇
- 冒号的用法及举例课件
- 初级安全培训课件
- 内镜在异物取出术课件
- 俱乐部商业思维营销方案(3篇)
- 货找人型营销方案(3篇)
- 化学实验教师安全培训课件
- 华为信息安全管理培训课件
- 诗经整本书阅读课件
- (2025年标准)预售小麦协议书
- 2025年院感测试题及答案
- 养生店国庆节活动方案
- GB/T 33467-2016全自动吹瓶灌装旋盖一体机通用技术要求
- GB/T 20481-2006气象干旱等级
- 2023年石家庄水务投资集团有限责任公司招聘笔试模拟试题及答案解析
- 2020牛津译林版高中英语新教材选修第一册全册课文翻译及单词表
- 我国运动员在奥林匹克运动会取得的辉煌成绩课件
- 专升本高等数学的讲义80页PPT课件
评论
0/150
提交评论