An Introduction to IR, 14th Course
Chapter 20: Web crawling and indexes

Today's lecture
- Crawling
- Connectivity servers

Basic crawler operation
- Begin with known "seed" pages
- Fetch and parse them
- Extract the URLs they point to
- Place the extracted URLs on a queue
- Fetch each URL on the queue and repeat
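The basic crawler operation above can be sketched as a simple breadth-first loop. This is a minimal, single-threaded sketch with none of the refinements discussed later in the lecture (no politeness, robots.txt, URL filtering, or content-seen check); the `fetch` callable and the seed URLs are placeholders supplied by the caller.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects href attributes from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_urls(html, base_url):
    """Parse a page and return the absolute URLs it points to."""
    parser = LinkExtractor()
    parser.feed(html)
    return [urljoin(base_url, href) for href in parser.links]

def crawl(seeds, fetch, limit=100):
    """Basic crawler operation: seed the queue, then fetch, parse,
    extract, enqueue, and repeat.  `fetch(url)` returns HTML text."""
    frontier = deque(seeds)   # URLs waiting to be fetched
    seen = set(seeds)         # naive duplicate-URL elimination
    while frontier and limit > 0:
        url = frontier.popleft()
        limit -= 1
        for link in extract_urls(fetch(url), url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return seen
```

The `seen` set already hints at why the simple picture breaks down at scale: on the real web this set, the frontier, and the fetching all have to be distributed, as the following slides discuss.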
It's so easy to get a single web page
- HTTP/0.9 and prior:
    % telnet www.zzz.yyy 80
    GET /FilePath\r\n
- HTTP/1.1:
    % telnet www.zzz.yyy 80
    GET %s HTTP/1.1\r\n
    Host: %s\r\n\r\n

Crawling picture
- Seed pages; URLs crawled and parsed; the URL frontier; the unseen Web

Simple picture: complications
- Web crawling isn't feasible with one machine
  - All of the above steps must be distributed
- Even non-malicious pages pose challenges
  - Latency/bandwidth to remote servers vary
  - Webmasters' stipulations: how "deep" should you crawl a site's URL hierarchy?
  - Site mirrors and duplicate pages
- Malicious pages
  - Spam pages
  - Spider traps, including dynamically generated ones
- Politeness: don't hit a server too often
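The telnet session shown above for fetching a single page translates directly into a few lines of Python over a raw socket. A sketch, reusing the slide's placeholder host `www.zzz.yyy`:

```python
import socket

def build_request(host, path="/"):
    """The HTTP/1.1 request from the slide, assembled by hand."""
    return (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "Connection: close\r\n\r\n"   # ask the server to close when done
    )

def fetch(host, path="/", port=80):
    """Send the request over a raw socket; return the raw response bytes."""
    with socket.create_connection((host, port)) as sock:
        sock.sendall(build_request(host, path).encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:      # server closed the connection
                break
            chunks.append(data)
    return b"".join(chunks)
```

A real crawler would of course use an HTTP client library (redirects, chunked encoding, TLS), but the point of the slide stands: fetching one page is the easy part.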
What any crawler must do
- Be polite: respect implicit and explicit politeness considerations for a website
  - Only crawl pages you're allowed to
  - Respect robots.txt (more on this shortly)
- Be robust: be immune to spider traps and other malicious behavior from web servers

What any crawler should do
- Be capable of distributed operation: designed to run on multiple distributed machines
- Be scalable: designed to increase the crawl rate by adding more machines
- Performance/efficiency: permit full use of available processing and network resources
- Fetch pages of "higher quality" first
- Continuous operation: continue fetching fresh copies of previously fetched pages
- Extensible: adapt to new data formats and protocols

Updated crawling picture
- Seed pages feed the URL frontier; crawling threads pull URLs from the frontier; URLs are crawled and parsed; the rest is the unseen Web

URL frontier
- Can include multiple pages from the same host
- Must avoid trying to fetch them all at the same time
- Must try to keep all crawling threads busy

Explicit and implicit politeness
- Explicit politeness: specifications from webmasters on what portions of a site can be crawled (robots.txt)
- Implicit politeness: even with no specification, avoid hitting any site too often

Robots.txt
- Protocol for giving spiders ("robots") limited access to a website, originally from 1994 (/wc/norobots.html)
- The website announces its request on what can (and cannot) be crawled
- For a URL, create a file URL/robots.txt; this file specifies access restrictions

Robots.txt example
- No robot should visit any URL starting with /yoursite/temp/, except the robot called "searchengine":

    User-agent: *
    Disallow: /yoursite/temp/

    User-agent: searchengine
    Disallow:

Processing steps in crawling
- Pick a URL from the frontier
- Fetch the document at the URL
- Parse the document; extract links from it to other docs (URLs)
- Check if the document has content already seen; if not, add it to the indexes
- For each extracted URL:
  - Ensure it passes certain URL filter tests (e.g., only crawl .edu, obey robots.txt, etc.)
  - Check if it is already in the frontier (duplicate URL elimination)

Basic crawl architecture
- WWW → Fetch (with DNS) → Parse → Content seen? (doc fingerprints) → URL filter (robots filters) → Dup URL elim (URL set) → URL frontier

DNS (Domain Name Server)
- A lookup service on the internet: given a URL, retrieve its IP address
- The service is provided by a distributed set of servers, so lookup latencies can be high (even seconds)
- Common OS implementations of DNS lookup are blocking: only one outstanding request at a time
- Solutions:
  - DNS caching
  - A batch DNS resolver that collects requests and sends them out together

Parsing: URL normalization
- When a fetched document is parsed, some of the extracted links are relative URLs
- E.g., at en.wikipedia.org/wiki/Main_Page we have a relative link to /wiki/Wikipedia:General_disclaimer, which is the same as the absolute URL en.wikipedia.org/wiki/Wikipedia:General_disclaimer
- Must expand such relative URLs

Content seen?
- Duplication is widespread on the web
- If the page just fetched is already in the index, do not process it further
- This is verified using document fingerprints or shingles

Filters and robots.txt
- Filters: regular expressions for URLs to be crawled (or not)
- Once a robots.txt file is fetched from a site, there is no need to fetch it repeatedly; doing so burns bandwidth and hits the web server
- Cache robots.txt files

Duplicate URL elimination
- For a non-continuous (one-shot) crawl, test whether an extracted and filtered URL has already been passed to the frontier
- For a continuous crawl, see the details of the frontier implementation
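Respecting and caching robots.txt, as recommended above, is available out of the box in Python's standard library. A sketch with a per-host cache; to keep it self-contained, it parses the slide's example file from a string where a real crawler would fetch http://host/robots.txt once per host:

```python
from urllib.robotparser import RobotFileParser

# The example robots.txt from the slide: every robot is barred from
# /yoursite/temp/, except the robot called "searchengine".
ROBOTS_TXT = """\
User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow:
"""

_cache = {}  # host -> parsed robots.txt rules (fetched once per host)

def rules_for(host):
    """Return cached rules for a host, parsing on first request."""
    if host not in _cache:
        rp = RobotFileParser()
        # A real crawler would call rp.set_url(...) and rp.read() here;
        # we parse the slide's example file instead.
        rp.parse(ROBOTS_TXT.splitlines())
        _cache[host] = rp
    return _cache[host]

rp = rules_for("www.zzz.yyy")
print(rp.can_fetch("somebot", "/yoursite/temp/page.html"))       # False
print(rp.can_fetch("searchengine", "/yoursite/temp/page.html"))  # True
```

The cache is exactly the "cache robots.txt files" point from the slide: parsing is cheap, but refetching the file on every extracted URL would hammer the very servers politeness is meant to protect.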
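URL normalization as described above is likewise handled by the standard library's `urljoin`. A small sketch, reusing the Wikipedia example from the slide (dropping `#fragment` parts as well, since fragments never reach the server):

```python
from urllib.parse import urljoin, urldefrag

def normalize(base_url, href):
    """Expand a relative link found on `base_url` into an absolute URL
    and strip any #fragment."""
    absolute, _fragment = urldefrag(urljoin(base_url, href))
    return absolute

base = "http://en.wikipedia.org/wiki/Main_Page"
print(normalize(base, "/wiki/Wikipedia:General_disclaimer"))
# http://en.wikipedia.org/wiki/Wikipedia:General_disclaimer
```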
Distributing the crawler
- Run multiple crawl threads, under different processes, potentially at different nodes
- Geographically distributed nodes
- Partition the hosts being crawled among the nodes; a hash function is used for the partition
- How do these nodes communicate?

Communication between nodes
- The output of the URL filter at each node is sent to the duplicate URL eliminator at all nodes
- Per-node pipeline: WWW → Fetch (DNS) → Parse → Content seen? → URL filter → Host splitter (to/from other nodes) → Dup URL elim → URL frontier
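The host splitter above can be as simple as hashing the host name modulo the number of nodes. A sketch; the cluster size and function names are invented for illustration:

```python
import hashlib
from urllib.parse import urlsplit

NUM_NODES = 4  # assumed size of the crawler cluster

def node_for(url):
    """Route a URL to the node responsible for its host.  A stable hash
    (not Python's per-process salted hash()) keeps the mapping
    consistent across processes and restarts."""
    host = urlsplit(url).netloc.lower()
    digest = hashlib.sha1(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_NODES

# All URLs on one host land on the same node, so politeness toward
# that host can be enforced locally, without cross-node coordination.
assert node_for("http://example.com/a") == node_for("http://example.com/b")
```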
URL frontier: two main considerations
- Politeness: do not hit a web server too frequently
- Freshness: crawl some pages more often than others (e.g., pages, such as news sites, whose content changes often)
- These goals may conflict with each other (e.g., a simple priority queue fails: many links out of a page go to its own site, creating a burst of accesses to that site)

Politeness: challenges
- Even if we restrict only one thread to fetch from a host, we can hit it repeatedly
- Common heuristic: insert a time gap between successive requests to a host that is >> the time for the most recent fetch from that host

URL frontier: Mercator scheme
- URLs flow in from the top into the frontier
- A prioritizer feeds K front queues
- A biased front queue selector and a back queue router feed B back queues, with a single host on each
- A back queue selector serves each crawl thread requesting a URL
- Front queues manage prioritization; back queues enforce politeness; each queue is FIFO

Front queues
- The prioritizer assigns to each URL an integer priority between 1 and K and appends the URL to the corresponding queue
- Heuristics for assigning priority: refresh rate sampled from previous crawls; application-specific (e.g., "crawl news sites more often")

Biased front queue selector
- When a back queue requests a URL (in a sequence to be described), it picks a front queue from which to pull a URL
- This choice can be round robin biased toward queues of higher priority, or some more sophisticated variant; it can be randomized

Back queues
- Fed by the back queue router; drained by the back queue selector (queues 1..B), which uses a heap
Back queue invariants
- Each back queue is kept non-empty while the crawl is in progress
- Each back queue contains only URLs from a single host
- Maintain a table from hosts to back queues

Back queue heap
- One entry for each back queue
- The entry is the earliest time t_e at which the host corresponding to that back queue can be hit again
- This earliest time is determined from the last access to that host, plus any time-buffer heuristic we choose

Back queue processing
- A crawler thread seeking a URL to crawl:
  - Extracts the root of the heap
  - Fetches the URL at the head of the corresponding back queue q (looked up from the table)
  - Checks if queue q is now empty; if so, pulls a URL v from the front queues
    - If there is already a back queue z for v's host, appends v to z and pulls another URL from the front queues; repeat
    - Else adds v to q
  - When q is non-empty, creates a new heap entry for it

Number of back queues B
- Keep all threads busy while respecting politeness
- Mercator recommendation: three times as many back queues as crawler threads
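The back-queue machinery above can be sketched with `heapq` and per-host FIFO queues. This is a deliberately simplified, single-process sketch: the K front queues are collapsed into one iterator of already-prioritized URLs, the time buffer is a fixed `GAP` constant rather than a heuristic, and class and variable names are invented for illustration.

```python
import heapq
import time
from collections import deque
from urllib.parse import urlsplit

GAP = 2.0  # seconds between hits on one host (simplified heuristic)

class BackQueues:
    def __init__(self, front_queue):
        self.front = front_queue  # iterator of prioritized URLs
        self.by_host = {}         # host -> deque of URLs (one host per queue)
        self.heap = []            # (earliest next-hit time t_e, host)

    def _refill(self):
        """Pull from the front queues until some new host gets a back
        queue; URLs for hosts that already have one are appended there."""
        for url in self.front:
            host = urlsplit(url).netloc
            if host in self.by_host:
                self.by_host[host].append(url)
            else:
                self.by_host[host] = deque([url])
                # new host: eligible immediately (a real crawler would
                # derive t_e from that host's last access time)
                heapq.heappush(self.heap, (time.monotonic(), host))
                return

    def next_url(self):
        """Return the next URL to crawl, respecting the politeness gap."""
        if not self.heap:
            self._refill()
        if not self.heap:
            return None                        # frontier exhausted
        t, host = heapq.heappop(self.heap)     # root: earliest permitted host
        wait = t - time.monotonic()
        if wait > 0:
            time.sleep(wait)                   # enforce politeness
        url = self.by_host[host].popleft()
        if self.by_host[host]:                 # still non-empty: re-schedule
            heapq.heappush(self.heap, (time.monotonic() + GAP, host))
        else:                                  # empty: retire it and refill
            del self.by_host[host]
            self._refill()
        return url
```

The heap root plays exactly the role described above: it names the back queue whose host may be hit soonest, and re-pushing with `now + GAP` is the time-buffer heuristic in its crudest form.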
Connectivity servers
- Connectivity Server: CS1 [Bhar98b]; CS2 and CS3 [Rand01]
- Support for fast queries on the web graph: Which URLs point to a given URL? Which URLs does a given URL point to?
- Stores mappings in memory from URL to outlinks and from URL to inlinks
- Applications: crawl control, web graph analysis, link analysis

Most recent published work
- Boldi and Vigna (/proceedings/docs/1p595.pdf)
- WebGraph: a set of algorithms and a Java implementation
- Fundamental goal: maintain node adjacency lists in memory
- For this, compressing the adjacency lists is the critical component

Adjacency lists
- The set of neighbors of a node
- Assume each URL is represented by an integer; e.g., for a 4 billion page web, we need 32 bits per node
- Naively, this demands 64 bits to represent each hyperlink

Adjacency list compression
- Properties exploited in compression: similarity (between lists); locality (many links from a page go to "nearby" pages)
- Use gap encodings in sorted lists; exploit the distribution of gap values
- Storage: Boldi and Vigna get down to an average of 3 bits per link (URL-to-URL edge)
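Gap encoding of sorted adjacency lists, as described above, can be sketched in a few lines. Here each gap is written as variable-length bytes (7 data bits per byte), which already exploits the skewed gap distribution; the bit-level codes Boldi and Vigna actually use, plus similarity between lists, are what push the average down toward 3 bits per link.

```python
def encode(neighbors):
    """Delta-encode a sorted adjacency list, then varint-encode each
    gap (7 data bits per byte; high bit set means 'more bytes follow')."""
    out = bytearray()
    prev = 0
    for n in sorted(neighbors):
        gap = n - prev          # locality makes most gaps small
        prev = n
        while gap >= 0x80:
            out.append((gap & 0x7F) | 0x80)
            gap >>= 7
        out.append(gap)         # final byte has the high bit clear
    return bytes(out)

def decode(data):
    """Invert encode(): read varint gaps and re-accumulate node ids."""
    neighbors, value, shift, prev = [], 0, 0, 0
    for byte in data:
        value |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7          # continuation byte
        else:
            prev += value       # gap complete: add it to the running id
            neighbors.append(prev)
            value, shift = 0, 0
    return neighbors

links = [1000000007, 1000000015, 1000000100]  # neighbors of one node
blob = encode(links)
# 5 bytes for the first id, then 1 byte per small gap: 7 bytes total,
# versus 12 bytes for three raw 32-bit ids.
assert decode(blob) == links and len(blob) == 7
```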