信息检索10 西北工业大学 (Information Retrieval 10, Northwestern Polytechnical University)

Chen Qun, School of Computer, NWPU, 01/2017
Information Retrieval: Theory and Practice
IR on the Web

Today
- Web crawling and search issues
- Web crawling
- Web search engines and algorithms

Searching the Web
- Web directories versus search engines
- Some statistics about Web searching
- Challenges for Web searching
- Search engines: crawling, indexing, querying

Directories vs. Search Engines
- Directories: hand-selected sites; search over the contents of the descriptions of the pages; organized in advance into categories
- Search engines: all pages in all sites; search over the contents of the pages themselves; organized after the query by relevance rankings or other scores

Search Engines vs. Internal Engines
- Ten years ago, HotBot, GoTo, Yahoo and Microsoft were all powered by Inktomi
- Today, Google is the search engine behind many other search services

Searches per Day (current)
- We don't have exact numbers for Google, but it is estimated at about 400 million searches per day
- They index over 20 billion web pages

Challenges for Web Searching: Data
- Distributed data
- Volatile data / "freshness": 40% of the web changes every month
- Exponential growth
- Unstructured and redundant data: 30% of web pages are near-duplicates
- Multiple formats
- Hidden data
- Page links

Challenges for Web Searching: Users
- Many different kinds of users
- Users unfamiliar with search engine interfaces (e.g., does the query "apples oranges" mean the same thing on all of the search engines?)
- Users unfamiliar with the logical view of the data (e.g., is a search for "Oranges" the same thing as a search for "oranges"?)

Web Search Queries
- Web search queries are SHORT: 2.9 words on average (Aug 2009); the average has increased, from 1.7 (1997) and 2.4 (2004)
- User expectations: many say "the first item shown should be what I want to see!" This works if the user has the most popular/common notion in mind

Standard Web Search Engine Architecture
- Crawl the web, check for duplicates, and store the documents
- Create an inverted index from the stored documents
- Search engine servers take the user's query, look up matching DocIds in the inverted index, and show the results to the user

Web Crawlers
- How do the web search engines get all of the items they index? More precisely:
  - Put a set of known sites on a queue
  - Repeat the following until the queue is empty:
    - Take the first page off the queue
    - If this page has not yet been processed:
      - Record the information found on this page (positions of words, links going out, etc.)
      - Add each link on the current page to the queue
      - Record that this page has been processed
- In what order should the links be followed?

Page Visit Order
- Sites are complex graphs, not just trees
[Figure: example link structure with pages spread across six sites, including cross-site links and cycles]

Web Crawling Issues
- Keep-out signs: a file called robots.txt tells the crawler which directories are off limits
- Freshness: figure out which pages change often, and re-crawl these often
- Duplicates, virtual hosts, etc.: convert page contents with a hash function and compare new pages against the hash table
- Lots of problems: server unavailable, incorrect HTML, missing links, infinite loops
- Web crawling is difficult to do robustly!

Indexes for Web Search Engines
- Inverted indexes are still used, even though the web is so huge
- Most current web search systems partition the indexes across different machines; each machine handles a different part of the data (Google uses thousands of PC-class processors and keeps most things in main memory)
- Other systems duplicate the data across many machines, and queries are distributed among the machines
- Most do a combination of these

Search Engine Querying
- In this example, the data for the pages is partitioned across machines; additionally, each partition is allocated multiple machines to handle the queries
- Each row can handle 120 queries per second; each column can handle 7M pages
- To handle more queries, add another row
- From a description of the FAST search engine, by Knut Risvik

Cascading Allocation of CPUs
- A variation on this that produces a cost saving:
  - Put high-quality/common pages on many machines; put lower-quality/less common pages on fewer machines
  - A query goes to the high-quality machines first; if no hits are found there, it goes to the other machines

Google
- Google maintains (probably) the world's largest Linux cluster (tens of millions of servers)
- These are partitioned between index servers and page servers: index servers resolve the queries (massively parallel processing); page servers deliver the results of the queries
- Over 20 billion web pages are indexed and served by Google

Search Engine Indexes
- Starting points for users include:
  - Manually compiled lists (directories)
  - Page "popularity": frequently visited pages (in general), and frequently visited pages as a result of a query
  - Link "co-citation": which sites are linked to by other sites?

Starting Points: What Is Really Being Used?
- Today's search engines combine these methods in various ways
- Integration of directories: today most web search engines integrate categories into the results listings (Lycos, MSN, Google)
- Link analysis: Google uses it; others are also using it; the words on the links seem to be especially useful
- Page popularity: many use DirectHit's popularity rankings

Web Page Ranking
- Varies by search engine; pretty messy in many cases; details usually proprietary and fluctuating
- Combining subsets of: term frequencies, term proximities, term position (title, top of page, etc.), term characteristics (boldface, capitalized, etc.), link analysis information, category information, popularity information

Ranking: Hearst '96
- Proximity search can help get high-precision results for queries of more than one term
- Combine Boolean matching and passage-level proximity
- Shows significant improvements when retrieving the top 5, 10, 20, 30 documents
- Results reproduced by Mitra et al. '98; Google uses something similar

Use of Web Links: Algorithms
- Query-independent page quality (global analysis): PageRank (Google) simulates a random walk across the web and computes the "score" of a page as the probability of reaching that page
- Query-dependent page quality (local analysis): HITS (Hyperlink-Induced Topic Search) focuses on broad-topic queries that are likely to be answered with too many pages
- The more a page is pointed to by other pages, the more popular the page is; popular pages are more likely to include relevant information than non-popular pages

Ranking: Link Analysis
- Assumptions: if the pages pointing to this page are good, then this is also a good page; the words on the links pointing to this page are useful indicators of what this page is about
- References: Page et al. '98, Kleinberg '98
- Why does this work? The official Toyota site will be linked to by lots of other official (or high-quality) sites; the best Toyota fan-club site probably also has many links pointing to it; less high-quality sites do not have as many high-quality sites linking to them

PageRank (1)
- Designed by Brin and Page at Stanford University; used to implement Google
- Main idea: a page has a high rank if the sum of the ranks of its in-links is high
- An in-link of page p is a link from some page to page p; an out-link of page p is a link from page p to some page
- A high-PageRank page has many in-links, or a few highly ranked in-links
- Retrieval: use a cosine product (content, feature, term weight) combined with the PageRank value

PageRank (2)
- Random surfer model: the user navigates randomly
- Initially the surfer is at a random page
- At each step the surfer proceeds either to a randomly chosen Web page, with probability d, called the "damping factor" (e.g., probability of a random jump = 0.2), or to a randomly chosen page linked from the current page, with probability 1-d (e.g., probability of following a random out-link = 0.8)
- The process is modelled by a Markov chain
- The PageRank PR of a page a is the probability that the surfer is at page a at a given time:
  PR(a) = k·d + k·(1-d) · Σ_{i=1..n} PR(a_i)/C(a_i)
  where d is set by the system, a is the page pointed to by the pages a_i (i = 1..n), k is a normalisation factor, and C(a_i) is the number of out-links of a_i

PageRank Example
[Figure: example link graph with pages T1 through T7, X1, X2, and A]
- Note: these are not real PageRanks, since they include values >= 1
- Suppose d = 0.2; run round 1

PageRank
- Similar to calculations used in scientific citation analysis (e.g., Garfield et al.) and social network analysis (e.g., Wasserman et al.)
- Similar to other work on ranking (e.g., the hubs and authorities of Kleinberg et al.)
- How is Amazon similar to Google in terms of the basic insights and techniques of PageRank?
- How could PageRank be applied to other problems and domains?

HITS: Hyperlink-Induced Topic Search
- Originated with Kleinberg, 1997; also referred to as "the connectivity analysis approach"
- Broad-topic queries produce large sets of retrieved results: the abundance problem (too many relevant documents)
- A new type of quality measure is needed, to distinguish the most "authoritative" pages and give a high-quality response to a broad query
- For a given topic, HITS identifies good authorities (pages that contain relevant information: good sources of content) and good hubs (pages that point to useful pages: good sources of links)

HITS (2)
- Intuition: authority comes from in-links; being a good hub comes from out-links; better authority comes from in-links from better hubs
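The queue-based crawl loop on the Web Crawlers slide, together with the hash-table duplicate check from Web Crawling Issues, can be sketched as follows. This is a minimal illustration, not a production crawler: `fetch_page` and `extract_links` are hypothetical callables standing in for a real HTTP fetch and HTML link extraction, and a real crawler would also honour robots.txt, timeouts, and politeness delays.

```python
import hashlib
from collections import deque

def crawl(seed_urls, fetch_page, extract_links, max_pages=1000):
    """Breadth-first crawl: take a page off the queue, record it,
    enqueue its out-links, and skip anything already processed."""
    queue = deque(seed_urls)
    processed = set()      # URLs already handled
    seen_hashes = set()    # content hashes, for duplicate detection
    index = {}             # url -> page text (stand-in for a real index)

    while queue and len(index) < max_pages:
        url = queue.popleft()
        if url in processed:           # "record that this page has been processed"
            continue
        processed.add(url)
        text = fetch_page(url)         # may fail: server unavailable, bad HTML...
        if text is None:
            continue
        digest = hashlib.sha256(text.encode()).hexdigest()
        if digest in seen_hashes:      # near-identical content already stored
            continue
        seen_hashes.add(digest)
        index[url] = text
        queue.extend(extract_links(text))   # FIFO queue => breadth-first order
    return index
```

Using a FIFO queue gives breadth-first page-visit order; swapping in a priority queue keyed by estimated page quality or change frequency answers the slide's "in what order should the links be followed?" question differently.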
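The document-partitioned indexing scheme from the Indexes and Querying slides can be sketched in miniature: each "machine" is just a dictionary here, and `doc_id % num_machines` is an assumed assignment rule for illustration, not how any particular engine allocates documents.

```python
from collections import defaultdict

def build_partitions(docs, num_machines):
    """Document-partitioned inverted index: each 'machine' holds a full
    term -> doc-id index, but only for its own subset of the documents."""
    partitions = [defaultdict(set) for _ in range(num_machines)]
    for doc_id, text in docs.items():
        index = partitions[doc_id % num_machines]  # assumed assignment rule
        for term in text.lower().split():
            index[term].add(doc_id)
    return partitions

def search(partitions, term):
    """Fan the query out to every machine and merge the answers,
    as in the row/column layout described for the FAST engine."""
    hits = set()
    for index in partitions:
        hits |= index.get(term, set())
    return sorted(hits)
```

Replicating each partition across several machines (the "rows" in the slide) would then raise query throughput without changing this lookup logic.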
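The PR(a) formula can be turned into a small power-iteration sketch. Following the slide's convention, d is the probability of a random jump (e.g. 0.2) and 1-d the probability of following an out-link; the slide's normalisation factor k is realised here by rescaling the scores to sum to 1 after each pass. The graph encoding and iteration count are illustrative choices.

```python
def pagerank(links, d=0.2, iterations=50):
    """links: dict mapping each page to the list of pages it links to.
    Returns PR for every page, with d = probability of a random jump
    (the slide's damping factor) and 1-d = probability of following a link."""
    pages = list(links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}  # surfer starts at a random page
    for _ in range(iterations):
        new = {}
        for p in pages:
            # sum of PR(a_i)/C(a_i) over the pages a_i that link to p
            inflow = sum(pr[q] / len(links[q]) for q in pages if p in links[q])
            new[p] = d / n + (1 - d) * inflow
        total = sum(new.values())     # the slide's normalisation factor k
        pr = {p: v / total for p, v in new.items()}
    return pr
```

Note that many textbooks write the damping factor the other way around (d as the probability of following a link); this sketch keeps the slide's convention.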
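The mutual-reinforcement intuition in HITS (2) can be sketched as an alternating update: authority scores come from the hub scores of in-linking pages, hub scores come from the authority scores of linked-to pages, and each vector is normalised per round. This assumes the base-set graph has already been assembled from the query results; building that base set is a separate step not shown here.

```python
import math

def hits(links, iterations=50):
    """links: dict mapping each page in the base set to the pages it points to.
    Returns (authority, hub) scores: a good authority is pointed to by good
    hubs, and a good hub points to good authorities."""
    pages = list(links)
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # authority update: sum the hub scores of the pages linking in
        auth = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
        norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        auth = {p: v / norm for p, v in auth.items()}
        # hub update: sum the authority scores of the pages linked to
        hub = {p: sum(auth[q] for q in links[p] if q in auth) for p in pages}
        norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        hub = {p: v / norm for p, v in hub.items()}
    return auth, hub
```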
