




Nutch Crawler System Analysis

Contents
Nutch Analysis
1 Nutch Overview
1.1 Nutch Architecture
2 The Crawl Component
2.1 Crawler Data Structures and Their Meaning
2.2 Crawl Directory Analysis
2.3 Crawl Process Overview
2.4 Crawl Process Analysis
2.4.1 The inject method
2.4.2 The generate method
2.4.3 The fetch method
2.4.4 The parse method
2.4.5 The update method
2.4.6 The invert method
2.4.7 The index method
2.4.8 The dedup method
2.4.9 The merge method
3 Configuration File Analysis
3.1 nutch-default.xml
3.1.1 - 3.1.25
3.2 regex-urlfilter.txt
3.3 regex-normalize.xml
3.4 Summary
4 References

1 Nutch Overview

1.1 Nutch Architecture

2 The Crawl Component

2.1 Crawler Data Structures and Their Meaning

The crawler system is driven by Nutch's crawl tool, which ties a series of tools to the construction and maintenance of several kinds of data structures: the web database, a set of segments, and the index. These are described in detail below. Physically, the three live under the crawl output directory, in the crawldb, segments, and index folders respectively. So what does each of them store?

The web database, or WebDB, stores the link structure between the pages the crawler has fetched. It is used only by the crawler and plays no part in the searcher's work. The WebDB holds two kinds of entities: pages and links. A page entity represents an actual web page by recording its characteristic features; since there are many pages to describe, the WebDB indexes page entities in two ways, by page URL and by an MD5 hash of the page content. The features a page entity records include the number of links inside the page, fetch-related information such as when the page was fetched, and an importance score for the page. A link entity, in turn, describes the link relation between two page entities. The WebDB thus forms a link graph of the fetched pages, in which page entities are the nodes and link entities are the edges.

A crawl produces many segments. Each segment holds the pages the crawler fetched in one fetch cycle, together with an index of those pages. When crawling, the crawler follows some crawl strategy over the link relations in the WebDB to generate the fetchlist for each fetch cycle; the fetcher then fetches and indexes the pages named by the URLs in the fetchlist and stores them in the segment. A segment is time-limited: once its pages are re-fetched by the crawler, the segment produced by the earlier fetch becomes obsolete. On disk, segment folders are named by their creation time, which makes it convenient to delete obsolete segments and save storage space.

The index covers all pages the crawler has fetched; it is obtained by merging the indexes of all the individual segments. Nutch indexes with Lucene, so Lucene's index manipulation interfaces work on a Nutch index as well. Note, though, that a Lucene segment and a Nutch segment are different things: a Lucene segment is a part of an index, while a Nutch segment is only the content and index of one portion of the WebDB's pages, and the final index generated from those segments no longer has any relationship with them.

2.2 Crawl Directory Analysis

A crawl produces five folders in all:

- crawldb: holds the downloaded URLs and their download dates, used to time page-update checks.
- linkdb: holds the interlinking relations between URLs, obtained by analysis after downloading completes.
- segments: holds the fetched pages. The number of subdirectories corresponds to the number of page levels fetched; each level of pages is normally stored in a subdirectory of its own, named by timestamp for easy management. For example, only one level was fetched here, so only the 20090508173137 directory was generated. Each subdirectory in turn contains six subfolders:
  content: the content of each downloaded page.
  crawl_fetch: the status of each downloaded URL.
  crawl_generate: the set of URLs awaiting download.
  crawl_parse: the outlink library used to update the crawldb.
  parse_data: the outlinks and metadata parsed out of each URL.
  parse_text: the parsed text content of each URL.
- indexes: holds an independent index directory for each download run.
- index: a Lucene-format index directory, the complete index obtained by merging everything under indexes.

2.3 Crawl Process Overview

Nine classes are principally involved:

1. nutch.crawl.Injector: the injector that adds URLs to the crawl database.
2. nutch.crawl.Generator: the generator that produces the list of download tasks.
3. nutch.fetcher.Fetcher: the fetcher that downloads the selected pages.
4. nutch.parse.ParseSegment: the parser that extracts content and parses it for next-level URLs.
5. nutch.crawl.CrawlDb: the database tool responsible for managing the crawl database.
6. nutch.crawl.LinkDb: responsible for link management.
7. nutch.indexer.Indexer: the indexer that builds the index.
8. nutch.indexer.DeleteDuplicates: removes duplicate data.
9. nutch.indexer.IndexMerger: the index merger that merges the partial index of the current download with the historical index.

2.4 Crawl Process Analysis

The crawler works roughly as follows. First, the crawler generates from the WebDB a set of URLs to fetch, called a fetchlist; then the fetcher download threads retrieve those pages. When there are many download threads, many fetchlists are generated, one fetchlist per fetcher. The crawler then updates the WebDB with the fetched pages, generates from the updated WebDB a new fetchlist of not-yet-fetched and newly discovered URLs, and the next fetch cycle begins. This loop is called the "generate/fetch/update" cycle.

URLs pointing to web resources on the same host are normally assigned to the same fetchlist, which keeps too many fetchers from crawling one host simultaneously and overloading it. Nutch also honors the robots exclusion protocol, so a site can control the crawler through a custom robots.txt.

In Nutch, the crawler operation is implemented through a series of sub-operations, and Nutch exposes each one as a sub-command that can be invoked on its own. The sub-operations are listed below with their commands in parentheses; a code sketch of the whole sequence follows the list.

1. Create a new WebDB (admin db -create).
2. Write the seed URLs into the WebDB (inject).
3. Generate a fetchlist from the WebDB and write it into a corresponding segment (generate).
4. Fetch the pages named in the fetchlist (fetch).
5. Update the WebDB from the fetched pages (updatedb).
6. Repeat steps 3-5 until the preset crawl depth is reached.
7. Analyze the link relations and generate inverted links. (This step is new to 1.0; the original leaves its exact purpose as an open question.)
8. Index the fetched pages (index).
9. Discard pages with duplicate content, and duplicate URLs, from the index (dedup).
10. Merge the segment indexes into the final index used for search (merge).
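For concreteness, here is a minimal sketch of that sequence using the Nutch 1.0 tool classes, following the shape of org.apache.nutch.crawl.Crawl. The method signatures are quoted from memory of the 1.0 codebase and may differ slightly between releases; step 1 (admin db -create) appears to be a holdover from the old 0.7-era WebDB tooling and has no counterpart here, since inject creates the crawldb itself.

import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.nutch.crawl.CrawlDb;
import org.apache.nutch.crawl.Generator;
import org.apache.nutch.crawl.Injector;
import org.apache.nutch.crawl.LinkDb;
import org.apache.nutch.fetcher.Fetcher;
import org.apache.nutch.indexer.DeleteDuplicates;
import org.apache.nutch.indexer.IndexMerger;
import org.apache.nutch.indexer.Indexer;
import org.apache.nutch.parse.ParseSegment;
import org.apache.nutch.util.HadoopFSUtil;
import org.apache.nutch.util.NutchConfiguration;

public class CrawlCycleSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = NutchConfiguration.create();
    FileSystem fs = FileSystem.get(conf);
    Path dir = new Path("20090508");           // crawl output dir used in this walkthrough
    Path crawlDb = new Path(dir, "crawldb");
    Path linkDb = new Path(dir, "linkdb");
    Path segments = new Path(dir, "segments");
    Path indexes = new Path(dir, "indexes");
    Path index = new Path(dir, "index");
    int depth = 3, threads = 10;
    long topN = 50;

    new Injector(conf).inject(crawlDb, new Path("urls"));         // step 2: inject seed urls
    for (int i = 0; i < depth; i++) {                             // step 6: repeat steps 3-5
      Path segment = new Generator(conf).generate(crawlDb, segments,
          -1, topN, System.currentTimeMillis());                  // step 3: generate fetchlist
      if (segment == null) break;                                 // no more urls due for fetching
      new Fetcher(conf).fetch(segment, threads, Fetcher.isParsing(conf)); // step 4: fetch
      if (!Fetcher.isParsing(conf)) {
        new ParseSegment(conf).parse(segment);                    // parse, if the fetcher didn't
      }
      new CrawlDb(conf).update(crawlDb, new Path[] { segment },
          true, true);                                            // step 5: updatedb
    }
    new LinkDb(conf).invert(linkDb, segments, true, true, false); // step 7: invert links

    FileStatus[] segStats = fs.listStatus(segments,
        HadoopFSUtil.getPassDirectoriesFilter(fs));
    new Indexer(conf).index(indexes, crawlDb, linkDb,
        Arrays.asList(HadoopFSUtil.getPaths(segStats)));          // step 8: index each segment
    new DeleteDuplicates(conf).dedup(new Path[] { indexes });     // step 9: dedup
    FileStatus[] idxStats = fs.listStatus(indexes,
        HadoopFSUtil.getPassDirectoriesFilter(fs));
    new IndexMerger(conf).merge(HadoopFSUtil.getPaths(idxStats),
        index, new Path(dir, "tmp"));                             // step 10: merge final index
  }
}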
The crawler's detailed workflow is as follows. After a WebDB has been created (step 1), the "generate/fetch/update" cycle (steps 3-6) starts up from a handful of seed URLs. When that loop has run to completion, the crawler builds indexes from the segments generated during the crawl (steps 8-10). Each segment's index is independent (step 8) until duplicate URLs are removed (step 9). Finally, the separate segment indexes are merged into a single final index (step 10).

There is one subtlety here: the dedup operation removes duplicate URLs from the segment indexes, but we know duplicate URLs are not allowed to exist in the WebDB in the first place, so why deduplicate at all? The reason is re-fetching for updates. Say you fetched a set of pages a month ago and re-fetch them a month later to refresh them: until the old segments are deleted, they remain in effect, and duplicates must then be removed between the old and the new segments.

What follows are the results of setting breakpoints in the Crawl class and debugging each method in turn.

2.4.1 The inject method

Description: initializes the crawldb for the crawl, reading the URL seed files and injecting their contents into the crawl database. It first looks for urls, the directory of URL seed files to read; if this directory has not been created, Nutch 1.0 reports an error. It then obtains Hadoop's working temp folder, /tmp/hadoop-administrator/mapred/. The log reads:

2009-05-08 15:41:36,640 info injector - injector: starting
2009-05-08 15:41:37,031 info injector - injector: crawldb: 20090508/crawldb
2009-05-08 15:41:37,781 info injector - injector: urldir: urls
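This stage can also be driven directly from Java rather than through bin/nutch inject. A minimal sketch, assuming the Nutch 1.0 Injector API and the paths used in this walkthrough (crawl dir 20090508, seed dir urls):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.nutch.crawl.Injector;
import org.apache.nutch.util.NutchConfiguration;

public class InjectSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = NutchConfiguration.create();  // loads nutch-default.xml / nutch-site.xml
    Injector injector = new Injector(conf);
    // reads every url file under urls/ and injects the entries into the crawldb
    injector.inject(new Path("20090508/crawldb"), new Path("urls"));
  }
}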
Next it sets up some initialization state and calls the JobClient.runJob method of the Hadoop package; stepping into JobClient's submitJob method follows the submission of the whole process. The underlying mechanics belong to the analysis of another open-source project, Hadoop, with its elaborate MapReduce architecture, and are not analyzed here. Looking inside submitJob: it first obtains a job id; once configureCommandLineOptions has run, a system folder is generated under the temp folder above, and a job_local_0001 folder under that. writeSplitsFile then generates the job.split file under job_local_0001, writeXml writes job.xml, and jobSubmitClient.submitJob formally submits the whole job flow. The log is as follows:

2009-05-08 15:41:36,640 info injector - injector: starting
2009-05-08 15:41:37,031 info injector - injector: crawldb: 20090508/crawldb
2009-05-08 15:41:37,781 info injector - injector: urldir: urls
2009-05-08 15:52:41,734 info injector - injector: converting injected urls to crawl db entries.
2009-05-08 15:56:22,203 info jvmmetrics - initializing jvm metrics with processname=jobtracker, sessionid=
2009-05-08 16:08:20,796 warn jobclient - use genericoptionsparser for parsing the arguments. applications should implement tool for the same.
2009-05-08 16:08:20,984 warn jobclient - no job jar file set. user classes may not be found. see jobconf(class) or jobconf#setjar(string).
2009-05-08 16:24:42,593 info fileinputformat - total input paths to process : 1
2009-05-08 16:38:29,437 info fileinputformat - total input paths to process : 1
2009-05-08 16:38:29,546 info maptask - numreducetasks: 1
2009-05-08 16:38:29,562 info maptask - io.sort.mb = 100
2009-05-08 16:38:29,687 info maptask - data buffer = 79691776/99614720
2009-05-08 16:38:29,687 info maptask - record buffer = 262144/327680
2009-05-08 16:38:29,718 info pluginrepository - plugins: looking in: d:\work\workspace\nutch_crawl\bin\plugins
2009-05-08 16:38:29,921 info pluginrepository - plugin auto-activation mode: true
2009-05-08 16:38:29,921 info pluginrepository - registered plugins:
2009-05-08 16:38:29,921 info pluginrepository - the nutch core extension points (nutch-extensionpoints)
2009-05-08 16:38:29,921 info pluginrepository - basic query filter (query-basic)
2009-05-08 16:38:29,921 info pluginrepository - basic url normalizer (urlnormalizer-basic)
2009-05-08 16:38:29,921 info pluginrepository - basic indexing filter (index-basic)
2009-05-08 16:38:29,921 info pluginrepository - html parse plug-in (parse-html)
2009-05-08 16:38:29,921 info pluginrepository - site query filter (query-site)
2009-05-08 16:38:29,921 info pluginrepository - basic summarizer plug-in (summary-basic)
2009-05-08 16:38:29,921 info pluginrepository - http framework (lib-http)
2009-05-08 16:38:29,921 info pluginrepository - text parse plug-in (parse-text)
2009-05-08 16:38:29,921 info pluginrepository - pass-through url normalizer (urlnormalizer-pass)
2009-05-08 16:38:29,921 info pluginrepository - regex url filter (urlfilter-regex)
2009-05-08 16:38:29,921 info pluginrepository - http protocol plug-in (protocol-http)
2009-05-08 16:38:29,921 info pluginrepository - xml response writer plug-in (response-xml)
2009-05-08 16:38:29,921 info pluginrepository - regex url normalizer (urlnormalizer-regex)
2009-05-08 16:38:29,921 info pluginrepository - opic scoring plug-in (scoring-opic)
2009-05-08 16:38:29,921 info pluginrepository - cyberneko html parser (lib-nekohtml)
2009-05-08 16:38:29,921 info pluginrepository - anchor indexing filter (index-anchor)
2009-05-08 16:38:29,921 info pluginrepository - javascript parser (parse-js)
2009-05-08 16:38:29,921 info pluginrepository - url query filter (query-url)
2009-05-08 16:38:29,921 info pluginrepository - regex url filter framework (lib-regex-filter)
2009-05-08 16:38:29,921 info pluginrepository - json response writer plug-in (response-json)
2009-05-08 16:38:29,921 info pluginrepository - registered extension-points:
2009-05-08 16:38:29,921 info pluginrepository - nutch summarizer (org.apache.nutch.searcher.summarizer)
2009-05-08 16:38:29,921 info pluginrepository - nutch protocol (org.apache.nutch.protocol.protocol)
2009-05-08 16:38:29,921 info pluginrepository - nutch analysis (org.apache.nutch.analysis.nutchanalyzer)
2009-05-08 16:38:29,921 info pluginrepository - nutch field filter (org.apache.nutch.indexer.field.fieldfilter)
2009-05-08 16:38:29,921 info pluginrepository - html parse filter (org.apache.nutch.parse.htmlparsefilter)
2009-05-08 16:38:29,921 info pluginrepository - nutch query filter (org.apache.nutch.searcher.queryfilter)
2009-05-08 16:38:29,921 info pluginrepository - nutch search results response writer (org.apache.nutch.searcher.response.responsewriter)
2009-05-08 16:38:29,921 info pluginrepository - nutch url normalizer (org.apache.nutch.net.urlnormalizer)
2009-05-08 16:38:29,921 info pluginrepository - nutch url filter (org.apache.nutch.net.urlfilter)
2009-05-08 16:38:29,921 info pluginrepository - nutch online search results clustering plugin (org.apache.nutch.clustering.onlineclusterer)
2009-05-08 16:38:29,921 info pluginrepository - nutch indexing filter (org.apache.nutch.indexer.indexingfilter)
2009-05-08 16:38:29,921 info pluginrepository - nutch content parser (org.apache.nutch.parse.parser)
2009-05-08 16:38:29,921 info pluginrepository - nutch scoring (org.apache.nutch.scoring.scoringfilter)
2009-05-08 16:38:29,921 info pluginrepository - ontology model loader (org.apache.nutch.ontology.ontology)
2009-05-08 16:38:29,968 info configuration - found resource crawl-urlfilter.txt at file:/d:/work/workspace/nutch_crawl/bin/crawl-urlfilter.txt
2009-05-08 16:38:29,984 warn regexurlnormalizer - cant find rules for scope inject, using default
2009-05-08 16:38:29,984 info maptask - starting flush of map output
2009-05-08 16:38:30,203 info maptask - finished spill 0
2009-05-08 16:38:30,203 info taskrunner - task:attempt_local_0001_m_000000_0 is done. and is in the process of commiting
2009-05-08 16:38:30,218 info localjobrunner - file:/d:/work/workspace/nutch_crawl/urls/site.txt:0+19
2009-05-08 16:38:30,218 info taskrunner - task attempt_local_0001_m_000000_0 done.
2009-05-08 16:38:30,234 info localjobrunner -
2009-05-08 16:38:30,250 info merger - merging 1 sorted segments
2009-05-08 16:38:30,265 info merger - down to the last merge-pass, with 1 segments left of total size: 53 bytes
2009-05-08 16:38:30,265 info localjobrunner -
2009-05-08 16:38:30,390 info taskrunner - task:attempt_local_0001_r_000000_0 is done. and is in the process of commiting
2009-05-08 16:38:30,390 info localjobrunner -
2009-05-08 16:38:30,390 info taskrunner - task attempt_local_0001_r_000000_0 is allowed to commit now
2009-05-08 16:38:30,406 info fileoutputcommitter - saved output of task attempt_local_0001_r_000000_0 to file:/tmp/hadoop-administrator/mapred/temp/inject-temp-474192304
2009-05-08 16:38:30,406 info localjobrunner - reduce reduce
2009-05-08 16:38:30,406 info taskrunner - task attempt_local_0001_r_000000_0 done.
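An aside on the long pluginrepository block in this log: it shows Nutch's plugin system activating, and during inject every seed URL is passed through the registered normalizer and filter plugins before it becomes a crawldb entry. The "cant find rules for scope inject" warning comes from the regex URL normalizer falling back to its default rules for that scope. A minimal sketch of that per-URL code path, assuming the Nutch 1.0 URLNormalizers/URLFilters APIs and a hypothetical seed URL:

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.net.URLFilters;
import org.apache.nutch.net.URLNormalizers;
import org.apache.nutch.util.NutchConfiguration;

public class SeedUrlCheck {
  public static void main(String[] args) throws Exception {
    Configuration conf = NutchConfiguration.create();
    // the "inject" scope is what triggers the warning seen in the log
    URLNormalizers normalizers = new URLNormalizers(conf, URLNormalizers.SCOPE_INJECT);
    URLFilters filters = new URLFilters(conf);
    String url = "http://lucene.apache.org/nutch/";                // hypothetical seed url
    url = normalizers.normalize(url, URLNormalizers.SCOPE_INJECT); // urlnormalizer-* plugins
    url = filters.filter(url);          // urlfilter-regex etc.; null means rejected
    System.out.println(url == null ? "rejected" : "accepted: " + url);
  }
}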
After execution, the returned running value is as follows:

job: job_local_0001
file: file:/tmp/hadoop-administrator/mapred/system/job_local_0001/job.xml
tracking url: http://localhost:8080/

2009-05-08 16:47:14,093 info jobclient - running job: job_local_0001
2009-05-08 16:49:51,859 info jobclient - job complete: job_local_0001
2009-05-08 16:51:36,062 info jobclient - counters: 11
2009-05-08 16:51:36,062 info jobclient - file systems
2009-05-08 16:51:36,062 info jobclient - local bytes read=51591
2009-05-08 16:51:36,062 info jobclient - local bytes written=104337
2009-05-08 16:51:36,062 info jobclient - map-reduce framework
2009-05-08 16:51:36,062 info jobclient - reduce input groups=1
2009-05-08 16:51:36,062 info jobclient - combine output records=0
2009-05-08 16:51:36,062 info jobclient - map input records=1
2009-05-08 16:51:36,062 info jobclient - reduce output records=1
2009-05-08 16:51:36,062 info jobclient - map output bytes=49
2009-05-08 16:51:36,062 info jobclient - map input bytes=19
2009-05-08 16:51:36,062 info jobclient - combine input records=0
2009-05-08 16:51:36,062 info jobclient - map output records=1
2009-05-08 16:51:36,062 info jobclient - reduce input records=1

At this point the first runJob call is finished. Summary: to be written.

The next step generates the crawldb folder and merges the injected URLs into it:

JobClient.runJob(mergeJob);
CrawlDb.install(mergeJob, crawlDb);

This first generates a job_local_0002 directory under the temp folder mentioned earlier and, just as before, produces job.split and job.xml; it then completes the creation of the crawldb and finally deletes the files under the temp folder. At this point the inject process is finished. The last part of the log:

2009-05-08 17:03:57,250 info injector - injector: merging injected urls into crawl db.
2009-05-08 17:10:01,015 info jvmmetrics - cannot initialize jvm metrics with processname=jobtracker, sessionid= - already initialized
2009-05-08 17:10:15,953 warn jobclient - use genericoptionsparser for parsing the arguments. applications should implement tool for the same.
2009-05-08 17:10:16,156 warn jobclient - no job jar file set. user classes may not be found. see jobconf(class) or jobconf#setjar(string).
2009-05-08 17:12:15,296 info fileinputformat - total input paths to process : 1
2009-05-08 17:13:40,296 info fileinputformat - total input paths to process : 1
2009-05-08 17:13:40,406 info maptask - numreducetasks: 1
2009-05-08 17:13:40,406 info maptask - io.sort.mb = 100
2009-05-08 17:13:40,515 info maptask - data buffer = 79691776/99614720
2009-05-08 17:13:40,515 info maptask - record buffer = 262144/327680
2009-05-08 17:13:40,546 info maptask - starting flush of map output
2009-05-08 17:13:40,765 info maptask - finished spill 0
2009-05-08 17:13:40,765 info taskrunner - task:attempt_local_0002_m_000000_0 is done. and is in the process of commiting
2009-05-08 17:13:40,765 info localjobrunner - file:/tmp/hadoop-administrator/mapred/temp/inject-temp-474192304/part-00000:0+143
2009-05-08 17:13:40,765 info taskrunner - task attempt_local_0002_m_000000_0 done.
2009-05-08 17:13:40,796 info localjobrunner -
2009-05-08 17:13:40,796 info merger - merging 1 sorted segments
2009-05-08 17:13:40,796 info merger - down to the last merge-pass, with 1 segments left of total size: 53 bytes
2009-05-08 17:13:40,796 info localjobrunner -
2009-05-08 17:13:40,906 warn nativecodeloader - unable to load native-hadoop library for your platform. using builtin-java classes where applicable
2009-05-08 17:13:40,906 info codecpool - got brand-new compressor
2009-05-08 17:13:40,906 info taskrunner - task:attempt_local_0002_r_000000_0 is done. and is in the process of commiting
2009-05-08 17:13:40,906 info localjobrunner -
2009-05-08 17:13:40,906 info taskrunner - task attempt_local_0002_r_000000_0 is allowed to commit now
2009-05-08 17:13:40,921 info fileoutputcommitter - saved output of task attempt_local_0002_r_000000_0 to file:/d:/work/workspace/nutch_crawl/20090508/crawldb/1896567745
2009-05-08 17:13:40,921 info localjobrunner - reduce reduce
2009-05-08 17:13:40,937 info taskrunner - task attempt_local_0002_r_000000_0 done.
2009-05-08 17:13:46,781 info jobclient - running job: job_local_0002
2009-05-08 17:14:55,125 info jobclient - job complete: job_local_0002
2009-05-08 17:14:59,328 info jobclient - counters: 11
2009-05-08 17:14:59,328 info jobclient - file systems
2009-05-08 17:14:59,328 info jobclient - local bytes read=103875
2009-05-08 17:14:59,328 info jobclient - local bytes written=209385
2009-05-08 17:14:59,328 info jobclient - map-reduce framework
2009-05-08 17:14:59,328 info jobclient - reduce input groups=1
2009-05-08 17:14:59,328 info jobclient - combine output records=0
2009-05-08 17:14:59,328 info jobclient - map input records=1
2009-05-08 17:14:59,328 info jobclient - reduce output records=1
2009-05-08 17:14:59,328 info jobclient - map output bytes=49
2009-05-08 17:14:59,328 info jobclient - map input bytes=57
2009-05-08 17:14:59,328 info jobclient - combine input records=0
2009-05-08 17:14:59,328 info jobclient - map output records=1
2009-05-08 17:14:59,328 info jobclient - reduce input records=1
2009-05-08 17:17:30,984 info jvmmetrics - cannot initialize jvm metrics with processname=jobtracker, sessionid= - already initialized
2009-05-08 17:20:02,390 info injector - injector: done
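With inject finished, the result can be spot-checked directly. A minimal sketch, assuming the Nutch 1.0 on-disk layout in which crawldb/current/part-00000 is a Hadoop MapFile whose data file holds Text URL keys mapped to CrawlDatum values (bin/nutch readdb <crawldb> -dump does the same job more robustly):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.util.NutchConfiguration;

public class CrawlDbPeek {
  public static void main(String[] args) throws IOException {
    Configuration conf = NutchConfiguration.create();
    FileSystem fs = FileSystem.get(conf);
    // the data file inside the MapFile directory written by the merge job
    Path data = new Path("20090508/crawldb/current/part-00000/data");
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, data, conf);
    try {
      Text url = new Text();
      CrawlDatum datum = new CrawlDatum();
      while (reader.next(url, datum)) {           // one entry per injected url
        System.out.println(url + "\n" + datum);   // status, fetch time, score, ...
      }
    } finally {
      reader.close();
    }
  }
}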
2.4.2 The generate method

Description: generates a new segment from the crawl database, then generates the list of download tasks (the fetchlist) from it.

LockUtil.createLockFile(fs, lock, force);

Executing the method above first generates a .locked file under the crawldb directory, presumably to keep the crawldb data from being modified; its real purpose remains to be verified. The execution that follows is much the same as above and can be compared against the earlier steps. The log:

2009-05-08 17:37:18,218 info generator - generator: selecting best-scoring urls due for fetch.
2009-05-08 17:37:18,625 info generator - generator: starting
2009-05-08 17:37:18,937 info generator - generator: segment: 20090508/segments/20090508173137
2009-05-08 17:37:19,468 info generator - generator: filtering: true
2009-05-08 17:37:22,312 info generator - generator: topn: 50
2009-05-08 17:37:51,203 info generator - generator: jobtracker is local, generating exactly one partition.
2009-05-08 17:39:57,609 info jvmmetrics - cannot initialize jvm metrics with processname=jobtracker, sessionid=
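The call behind this log can likewise be driven programmatically. A minimal sketch, assuming the five-argument Generator.generate signature of Nutch 1.0 (later releases add filtering/force parameters) and the paths and topN value shown in the log:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.nutch.crawl.Generator;
import org.apache.nutch.util.NutchConfiguration;

public class GenerateSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = NutchConfiguration.create();
    Generator generator = new Generator(conf);
    Path segment = generator.generate(
        new Path("20090508/crawldb"),    // crawl database to select urls from
        new Path("20090508/segments"),   // parent dir; the new segment is named by timestamp
        -1,                              // numLists: -1 = exactly one partition (local jobtracker)
        50,                              // topN best-scoring urls, matching "topn: 50" in the log
        System.currentTimeMillis());     // select urls whose fetch is due before now
    System.out.println("generated segment: " + segment); // e.g. .../segments/20090508173137
  }
}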