An Introduction to IR, 14th Course
Chapter 20: Web crawling and indexes

Today's lecture
- Crawling
- Connectivity servers

Basic crawler operation
- Begin with known "seed" pages
- Fetch and parse them
- Extract the URLs they point to
- Place the extracted URLs on a queue
- Fetch each URL on the queue and repeat
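The basic crawler operation above can be sketched as a simple breadth-first loop. This is a minimal, single-threaded sketch with none of the refinements discussed later in the lecture (no politeness, robots.txt, URL filtering, or content-seen check); the `fetch` callable and the seed URLs are placeholders supplied by the caller.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects href attributes from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_urls(html, base_url):
    """Parse a page and return the absolute URLs it points to."""
    parser = LinkExtractor()
    parser.feed(html)
    return [urljoin(base_url, href) for href in parser.links]

def crawl(seeds, fetch, limit=100):
    """Basic crawler operation: seed the queue, then fetch, parse,
    extract, enqueue, and repeat.  `fetch(url)` returns HTML text."""
    frontier = deque(seeds)   # URLs waiting to be fetched
    seen = set(seeds)         # naive duplicate-URL elimination
    while frontier and limit > 0:
        url = frontier.popleft()
        limit -= 1
        for link in extract_urls(fetch(url), url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return seen
```

The `seen` set already hints at why the simple picture breaks down at scale: on the real web this set, the frontier, and the fetching all have to be distributed, as the following slides discuss.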
It's so easy to get a single web page
- HTTP/0.9 and prior:
    % telnet www.zzz.yyy 80
    GET /FilePath\r\n
- HTTP/1.1:
    % telnet www.zzz.yyy 80
    GET %s HTTP/1.1\r\n
    Host: %s\r\n\r\n

Crawling picture
- Seed pages; URLs crawled and parsed; the URL frontier; the unseen Web

Simple picture: complications
- Web crawling isn't feasible with one machine
  - All of the above steps must be distributed
- Even non-malicious pages pose challenges
  - Latency/bandwidth to remote servers vary
  - Webmasters' stipulations: how "deep" should you crawl a site's URL hierarchy?
  - Site mirrors and duplicate pages
- Malicious pages
  - Spam pages
  - Spider traps, including dynamically generated ones
- Politeness: don't hit a server too often
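The telnet session shown above for fetching a single page translates directly into a few lines of Python over a raw socket. A sketch, reusing the slide's placeholder host `www.zzz.yyy`:

```python
import socket

def build_request(host, path="/"):
    """The HTTP/1.1 request from the slide, assembled by hand."""
    return (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "Connection: close\r\n\r\n"   # ask the server to close when done
    )

def fetch(host, path="/", port=80):
    """Send the request over a raw socket; return the raw response bytes."""
    with socket.create_connection((host, port)) as sock:
        sock.sendall(build_request(host, path).encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:      # server closed the connection
                break
            chunks.append(data)
    return b"".join(chunks)
```

A real crawler would of course use an HTTP client library (redirects, chunked encoding, TLS), but the point of the slide stands: fetching one page is the easy part.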
What any crawler must do
- Be polite: respect implicit and explicit politeness considerations for a website
  - Only crawl pages you're allowed to
  - Respect robots.txt (more on this shortly)
- Be robust: be immune to spider traps and other malicious behavior from web servers

What any crawler should do
- Be capable of distributed operation: designed to run on multiple distributed machines
- Be scalable: designed to increase the crawl rate by adding more machines
- Performance/efficiency: permit full use of available processing and network resources
- Fetch pages of "higher quality" first
- Continuous operation: continue fetching fresh copies of previously fetched pages
- Extensible: adapt to new data formats and protocols

Updated crawling picture
- Seed pages feed the URL frontier; crawling threads pull URLs from the frontier; URLs are crawled and parsed; the rest is the unseen Web

URL frontier
- Can include multiple pages from the same host
- Must avoid trying to fetch them all at the same time
- Must try to keep all crawling threads busy

Explicit and implicit politeness
- Explicit politeness: specifications from webmasters on what portions of a site can be crawled (robots.txt)
- Implicit politeness: even with no specification, avoid hitting any site too often

Robots.txt
- Protocol for giving spiders ("robots") limited access to a website, originally from 1994 (/wc/norobots.html)
- The website announces its request on what can (and cannot) be crawled
- For a URL, create a file URL/robots.txt; this file specifies access restrictions

Robots.txt example
- No robot should visit any URL starting with /yoursite/temp/, except the robot called "searchengine":

    User-agent: *
    Disallow: /yoursite/temp/

    User-agent: searchengine
    Disallow:

Processing steps in crawling
- Pick a URL from the frontier
- Fetch the document at the URL
- Parse the document; extract links from it to other docs (URLs)
- Check if the document has content already seen; if not, add it to the indexes
- For each extracted URL:
  - Ensure it passes certain URL filter tests (e.g., only crawl .edu, obey robots.txt, etc.)
  - Check if it is already in the frontier (duplicate URL elimination)

Basic crawl architecture
- WWW → Fetch (with DNS) → Parse → Content seen? (doc fingerprints) → URL filter (robots filters) → Dup URL elim (URL set) → URL frontier

DNS (Domain Name Server)
- A lookup service on the internet: given a URL, retrieve its IP address
- The service is provided by a distributed set of servers, so lookup latencies can be high (even seconds)
- Common OS implementations of DNS lookup are blocking: only one outstanding request at a time
- Solutions:
  - DNS caching
  - A batch DNS resolver that collects requests and sends them out together

Parsing: URL normalization
- When a fetched document is parsed, some of the extracted links are relative URLs
- E.g., at en.wikipedia.org/wiki/Main_Page we have a relative link to /wiki/Wikipedia:General_disclaimer, which is the same as the absolute URL en.wikipedia.org/wiki/Wikipedia:General_disclaimer
- Must expand such relative URLs

Content seen?
- Duplication is widespread on the web
- If the page just fetched is already in the index, do not process it further
- This is verified using document fingerprints or shingles

Filters and robots.txt
- Filters: regular expressions for URLs to be crawled (or not)
- Once a robots.txt file is fetched from a site, there is no need to fetch it repeatedly; doing so burns bandwidth and hits the web server
- Cache robots.txt files

Duplicate URL elimination
- For a non-continuous (one-shot) crawl, test whether an extracted and filtered URL has already been passed to the frontier
- For a continuous crawl, see the details of the frontier implementation
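Respecting and caching robots.txt, as recommended above, is available out of the box in Python's standard library. A sketch with a per-host cache; to keep it self-contained, it parses the slide's example file from a string where a real crawler would fetch http://host/robots.txt once per host:

```python
from urllib.robotparser import RobotFileParser

# The example robots.txt from the slide: every robot is barred from
# /yoursite/temp/, except the robot called "searchengine".
ROBOTS_TXT = """\
User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow:
"""

_cache = {}  # host -> parsed robots.txt rules (fetched once per host)

def rules_for(host):
    """Return cached rules for a host, parsing on first request."""
    if host not in _cache:
        rp = RobotFileParser()
        # A real crawler would call rp.set_url(...) and rp.read() here;
        # we parse the slide's example file instead.
        rp.parse(ROBOTS_TXT.splitlines())
        _cache[host] = rp
    return _cache[host]

rp = rules_for("www.zzz.yyy")
print(rp.can_fetch("somebot", "/yoursite/temp/page.html"))       # False
print(rp.can_fetch("searchengine", "/yoursite/temp/page.html"))  # True
```

The cache is exactly the "cache robots.txt files" point from the slide: parsing is cheap, but refetching the file on every extracted URL would hammer the very servers politeness is meant to protect.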
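URL normalization as described above is likewise handled by the standard library's `urljoin`. A small sketch, reusing the Wikipedia example from the slide (dropping `#fragment` parts as well, since fragments never reach the server):

```python
from urllib.parse import urljoin, urldefrag

def normalize(base_url, href):
    """Expand a relative link found on `base_url` into an absolute URL
    and strip any #fragment."""
    absolute, _fragment = urldefrag(urljoin(base_url, href))
    return absolute

base = "http://en.wikipedia.org/wiki/Main_Page"
print(normalize(base, "/wiki/Wikipedia:General_disclaimer"))
# http://en.wikipedia.org/wiki/Wikipedia:General_disclaimer
```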
Distributing the crawler
- Run multiple crawl threads, under different processes, potentially at different nodes
- Geographically distributed nodes
- Partition the hosts being crawled among the nodes; a hash function is used for the partition
- How do these nodes communicate?

Communication between nodes
- The output of the URL filter at each node is sent to the duplicate URL eliminator at all nodes
- Per-node pipeline: WWW → Fetch (DNS) → Parse → Content seen? → URL filter → Host splitter (to/from other nodes) → Dup URL elim → URL frontier
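The host splitter above can be as simple as hashing the host name modulo the number of nodes. A sketch; the cluster size and function names are invented for illustration:

```python
import hashlib
from urllib.parse import urlsplit

NUM_NODES = 4  # assumed size of the crawler cluster

def node_for(url):
    """Route a URL to the node responsible for its host.  A stable hash
    (not Python's per-process salted hash()) keeps the mapping
    consistent across processes and restarts."""
    host = urlsplit(url).netloc.lower()
    digest = hashlib.sha1(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_NODES

# All URLs on one host land on the same node, so politeness toward
# that host can be enforced locally, without cross-node coordination.
assert node_for("http://example.com/a") == node_for("http://example.com/b")
```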
URL frontier: two main considerations
- Politeness: do not hit a web server too frequently
- Freshness: crawl some pages more often than others (e.g., pages, such as news sites, whose content changes often)
- These goals may conflict with each other (e.g., a simple priority queue fails: many links out of a page go to its own site, creating a burst of accesses to that site)

Politeness: challenges
- Even if we restrict only one thread to fetch from a host, we can hit it repeatedly
- Common heuristic: insert a time gap between successive requests to a host that is >> the time for the most recent fetch from that host

URL frontier: Mercator scheme
- URLs flow in from the top into the frontier
- A prioritizer feeds K front queues
- A biased front queue selector and a back queue router feed B back queues, with a single host on each
- A back queue selector serves each crawl thread requesting a URL
- Front queues manage prioritization; back queues enforce politeness; each queue is FIFO

Front queues
- The prioritizer assigns to each URL an integer priority between 1 and K and appends the URL to the corresponding queue
- Heuristics for assigning priority: refresh rate sampled from previous crawls; application-specific (e.g., "crawl news sites more often")

Biased front queue selector
- When a back queue requests a URL (in a sequence to be described), it picks a front queue from which to pull a URL
- This choice can be round robin biased toward queues of higher priority, or some more sophisticated variant; it can be randomized

Back queues
- Fed by the back queue router; drained by the back queue selector (queues 1..B), which uses a heap
Back queue invariants
- Each back queue is kept non-empty while the crawl is in progress
- Each back queue contains only URLs from a single host
- Maintain a table from hosts to back queues

Back queue heap
- One entry for each back queue
- The entry is the earliest time t_e at which the host corresponding to that back queue can be hit again
- This earliest time is determined from the last access to that host, plus any time-buffer heuristic we choose

Back queue processing
- A crawler thread seeking a URL to crawl:
  - Extracts the root of the heap
  - Fetches the URL at the head of the corresponding back queue q (looked up from the table)
  - Checks if queue q is now empty; if so, pulls a URL v from the front queues
    - If there is already a back queue z for v's host, appends v to z and pulls another URL from the front queues; repeat
    - Else adds v to q
  - When q is non-empty, creates a new heap entry for it

Number of back queues B
- Keep all threads busy while respecting politeness
- Mercator recommendation: three times as many back queues as crawler threads
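The back-queue machinery above can be sketched with `heapq` and per-host FIFO queues. This is a deliberately simplified, single-process sketch: the K front queues are collapsed into one iterator of already-prioritized URLs, the time buffer is a fixed `GAP` constant rather than a heuristic, and class and variable names are invented for illustration.

```python
import heapq
import time
from collections import deque
from urllib.parse import urlsplit

GAP = 2.0  # seconds between hits on one host (simplified heuristic)

class BackQueues:
    def __init__(self, front_queue):
        self.front = front_queue  # iterator of prioritized URLs
        self.by_host = {}         # host -> deque of URLs (one host per queue)
        self.heap = []            # (earliest next-hit time t_e, host)

    def _refill(self):
        """Pull from the front queues until some new host gets a back
        queue; URLs for hosts that already have one are appended there."""
        for url in self.front:
            host = urlsplit(url).netloc
            if host in self.by_host:
                self.by_host[host].append(url)
            else:
                self.by_host[host] = deque([url])
                # new host: eligible immediately (a real crawler would
                # derive t_e from that host's last access time)
                heapq.heappush(self.heap, (time.monotonic(), host))
                return

    def next_url(self):
        """Return the next URL to crawl, respecting the politeness gap."""
        if not self.heap:
            self._refill()
        if not self.heap:
            return None                        # frontier exhausted
        t, host = heapq.heappop(self.heap)     # root: earliest permitted host
        wait = t - time.monotonic()
        if wait > 0:
            time.sleep(wait)                   # enforce politeness
        url = self.by_host[host].popleft()
        if self.by_host[host]:                 # still non-empty: re-schedule
            heapq.heappush(self.heap, (time.monotonic() + GAP, host))
        else:                                  # empty: retire it and refill
            del self.by_host[host]
            self._refill()
        return url
```

The heap root plays exactly the role described above: it names the back queue whose host may be hit soonest, and re-pushing with `now + GAP` is the time-buffer heuristic in its crudest form.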
Connectivity servers
- Connectivity Server: CS1 [Bhar98b]; CS2 and CS3 [Rand01]
- Support for fast queries on the web graph: Which URLs point to a given URL? Which URLs does a given URL point to?
- Stores mappings in memory from URL to outlinks and from URL to inlinks
- Applications: crawl control, web graph analysis, link analysis

Most recent published work
- Boldi and Vigna (/proceedings/docs/1p595.pdf)
- WebGraph: a set of algorithms and a Java implementation
- Fundamental goal: maintain node adjacency lists in memory
- For this, compressing the adjacency lists is the critical component

Adjacency lists
- The set of neighbors of a node
- Assume each URL is represented by an integer; e.g., for a 4 billion page web, we need 32 bits per node
- Naively, this demands 64 bits to represent each hyperlink

Adjacency list compression
- Properties exploited in compression: similarity (between lists); locality (many links from a page go to "nearby" pages)
- Use gap encodings in sorted lists; exploit the distribution of gap values
- Storage: Boldi and Vigna get down to an average of 3 bits per link (URL-to-URL edge)
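Gap encoding of sorted adjacency lists, as described above, can be sketched in a few lines. Here each gap is written as variable-length bytes (7 data bits per byte), which already exploits the skewed gap distribution; the bit-level codes Boldi and Vigna actually use, plus similarity between lists, are what push the average down toward 3 bits per link.

```python
def encode(neighbors):
    """Delta-encode a sorted adjacency list, then varint-encode each
    gap (7 data bits per byte; high bit set means 'more bytes follow')."""
    out = bytearray()
    prev = 0
    for n in sorted(neighbors):
        gap = n - prev          # locality makes most gaps small
        prev = n
        while gap >= 0x80:
            out.append((gap & 0x7F) | 0x80)
            gap >>= 7
        out.append(gap)         # final byte has the high bit clear
    return bytes(out)

def decode(data):
    """Invert encode(): read varint gaps and re-accumulate node ids."""
    neighbors, value, shift, prev = [], 0, 0, 0
    for byte in data:
        value |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7          # continuation byte
        else:
            prev += value       # gap complete: add it to the running id
            neighbors.append(prev)
            value, shift = 0, 0
    return neighbors

links = [1000000007, 1000000015, 1000000100]  # neighbors of one node
blob = encode(links)
# 5 bytes for the first id, then 1 byte per small gap: 7 bytes total,
# versus 12 bytes for three raw 32-bit ids.
assert decode(blob) == links and len(blob) == 7
```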