Heritrix User Manual

Internet Archive

Kristinn Sigurðsson
Michael Stack
Igor Ranitovic

Table of Contents

1. Introduction
2. Installing and running Heritrix
2.1. Obtaining and installing Heritrix
2.2. Running Heritrix
3. Web based user interface
4. A quick guide to running your first crawl job
5. Creating jobs and profiles
5.1. Crawl job
5.2. Profile
6. Configuring jobs and profiles
6.1. Modules (Scope, Frontier, and Processors)
6.2. Submodules
6.3. Settings
6.4. Overrides
6.5. Refinements
7. Running a job
7.1. Web Console
7.2. Pending jobs
7.3. Monitoring a running job
7.4. Editing a running job
8. Analysis of jobs
8.1. Completed jobs
8.2. Logs
8.3. Reports
9. Outside the user interface
9.1. Generated files
9.2. Helpful scripts
9.3. Recovery of Frontier State and recover.gz
9.4. Checkpointing
9.5. Remote Monitoring and Control
Glossary

1. Introduction

Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler.

This document explains how to create, configure and run crawls using Heritrix. It is intended for users of the software and presumes at least a general familiarity with the concept of web crawling.

For a general overview of Heritrix, see An Introduction to Heritrix.

If you want to build Heritrix from source, or if you'd like to make contributions and want to know about contribution conventions, see the Developer's Manual instead.

2. Installing and running Heritrix

This chapter explains how to set up Heritrix.

Because Heritrix is a pure Java program it can (in theory, anyway) run on any platform that has a Java 1.4 VM. However, we are only committed to supporting its operation on Linux, so this chapter covers setup on that platform only; what follows assumes basic Linux administration skills. The other chapters of the user manual are platform agnostic.

This chapter also only covers installing and running the prepackaged binary distributions of Heritrix.
For information about downloading and compiling the source, see the Developer's Manual.

2.1. Obtaining and installing Heritrix

The packaged binary can be downloaded from the project's SourceForge home page. Each release comes in four flavors: packaged as .tar.gz or .zip, and with or without source.

For installation on Linux get the file heritrix-?.?.?.tar.gz (where ?.?.? is the most recent version number).

The packaged binary comes largely ready to run. Once downloaded it can be untarred into the desired directory:

 % tar xfz heritrix-?.?.?.tar.gz

Once you have downloaded and untarred the correct file you can move on to the next step.

2.1.1. System requirements

Java Runtime Environment

The Heritrix crawler is implemented purely in Java. This means that the only true requirement for running it is that you have a JRE installed (building requires a JDK). Heritrix makes use of Java 1.4 features, so your JRE must be at least of a 1.4.x pedigree.

We currently include in the distribution package all of the free/open source third-party libraries necessary to run Heritrix. See the dependencies list for the complete set (licenses for all of the listed libraries appear in the dependencies section of the raw project.xml at the root of the source download, or on SourceForge).

Installing Java

If you do not have Java installed, a JRE can be downloaded from Sun or IBM.

Hardware

By default Heritrix runs with a Java heap of 256MB RAM, which is usually suitable for crawls that range over hundreds of hosts. Assign more of your available RAM to the heap (see the JAVA_OPTS section below for how) if you are crawling thousands of hosts or experience Java out-of-memory problems.

Linux

The Heritrix crawler has been built and tested primarily on Linux. It has seen some informal use on Macintosh, Windows 2000 and Windows XP, but it is not tested, packaged, nor supported on platforms other than Linux at this time.
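Before moving on, the JRE requirement above can be verified from the shell. A minimal sketch (plain POSIX shell; the messages are illustrative, not from the manual):

```shell
# Check whether a Java runtime is on the PATH before installing Heritrix.
# Heritrix requires a 1.4.x or newer JRE.
if command -v java >/dev/null 2>&1; then
  echo "java found: $(command -v java)"
else
  echo "no java on PATH: install a JRE 1.4 or newer first"
fi
```

Running `java -version` afterwards shows the actual version string so you can confirm the 1.4.x-or-newer requirement.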
2.2. Running Heritrix

To run Heritrix, first do the following:

 % export HERITRIX_HOME=/PATH/TO/BUILT/HERITRIX

where $HERITRIX_HOME is the location of your untarred heritrix-?.?.?.tar.gz. Next run:

 % cd $HERITRIX_HOME
 % chmod u+x $HERITRIX_HOME/bin/heritrix
 % $HERITRIX_HOME/bin/heritrix -help

This should give you usage output like the following:

 Usage: heritrix -help
 Usage: heritrix -nowui ORDER_FILE
 Usage: heritrix -port=PORT -admin=LOGIN:PASSWORD -run ORDER_FILE
 Usage: heritrix -port=PORT -selftest=TESTNAME
 Version: 0.11.0
 Options:
  -a,-admin     Login and password for web user interface administration.
                Default: admin/letmein.
  -h,-help      Prints this message and exits.
  -n,-nowui     Put heritrix into run mode and begin crawl using ORDER_FILE.
                Do not put up web user interface.
  -p,-port      Port to run web user interface on. Default: 8080.
  -r,-run       Put heritrix into run mode. If ORDER_FILE, begin crawl.
  -s,-selftest  Run the integrated selftests. Pass a test name to run only
                that test (case sensitive: e.g. pass Charset to run the
                charset selftest).
 Arguments:
  ORDER_FILE    Crawl order to run.

Launch the crawler with the UI enabled by doing the following:

 % $HERITRIX_HOME/bin/heritrix

This will start up Heritrix, printing a startup message that looks like the following:

 b116-dyn-60 619 heritrix-0.4.0 ./bin/heritrix
 Tue Feb 10 17:03:01 PST 2004 Starting heritrix.
 Tue Feb 10 17:03:05 PST 2004 Heritrix 0.4.0 is running.
 Web UI is at: :8080/admin
 Login and password: admin/letmein

See Section 3, “Web based user interface” and Section 4, “A quick guide to running your first crawl job” to get your first crawl up and running.

2.2.1. Command line options

A quick overview of the most useful command line options. It is not necessary to specify any command line options to run the crawler.

-port=PORT

Set what port the web based user interface runs on. By default this is port 8080.

-admin=LOGIN:PASSWORD

Change the default admin username and password.
If you do not do this then the default username and password will be in effect. Since they are widely known, that may not be desirable. The default username is admin and the default password is letmein.

2.2.2. Environment variables

Below are environment variables that affect Heritrix operation.

HERITRIX_HOME

Set this environment variable to point at the Heritrix home directory. For example, if you've unpacked Heritrix in your home directory and Heritrix is sitting in the heritrix-1.0.0 directory, you'd set HERITRIX_HOME as follows (assuming your shell is bash):

 % export HERITRIX_HOME=~/heritrix-1.0.0

If you don't set this environment variable, the Heritrix start script makes a guess at the Heritrix home. It doesn't always guess correctly.

JAVA_HOME

This environment variable may already exist. It should point to the Java installation on the machine. An example of how this might be set (assuming your shell is bash):

 % export JAVA_HOME=/usr/local/java/jre/

JAVA_OPTS

Pass options to the Heritrix JVM by populating the JAVA_OPTS environment variable. For example, if you want Heritrix to run with a larger heap, say 512 megs, you could do either of the following (assuming your shell is bash):

 % export JAVA_OPTS=-Xmx512M
 % $HERITRIX_HOME/bin/heritrix

Or you could do it all on one line, as follows:

 % JAVA_OPTS=-Xmx512m $HERITRIX_HOME/bin/heritrix

2.2.3. System properties

Below we document the system properties that can be passed on the command line to influence Heritrix's behavior. If you are using the bin/heritrix script to launch Heritrix you may have to edit it to change or set these properties, or else pass them as part of JAVA_OPTS.

heritrix.properties

Set this property to point at an alternate heritrix.properties file - e.g. -Dheritrix.properties=/tmp/heritrix.properties - when you want Heritrix to use a properties file other than the one found at conf/heritrix.properties.

heritrix.development

Set this property when you want to run the crawler from Eclipse. This property takes no arguments.
When this property is set, the conf and webapps directories will be found in their development locations and startup messages will show on the text console (standard out).

heritrix.home

Where Heritrix is homed; usually passed by the heritrix launch script.

heritrix.out

Where stdout/stderr are sent, usually heritrix_out.log; passed by the heritrix launch script.

heritrix.version

Version of Heritrix, set by the Heritrix build into heritrix.properties.

heritrix.jobsdir

Where to drop Heritrix jobs. Usually empty. Default location is $HERITRIX_HOME/jobs.

heritrix.cmdline

This set of system properties is rarely used. They are for use when Heritrix has NOT been started from the command line - e.g. it has been embedded in another application - and the startup configuration that is usually set by command-line options instead needs to be done via system properties alone.

heritrix.cmdline.admin

Value is a colon-delimited user name and password String for the admin GUI.

heritrix.cmdline.nowui

If set to true, do not put up the web user interface; begin the crawl directly (the equivalent of the -nowui command-line option).

heritrix.cmdline.order

Value is the crawl order (ORDER_FILE) to run on startup.

heritrix.cmdline.port

Value is the port the GUI is to run on.

heritrix.cmdline.run

If true, the crawler is set into run mode on startup.

javax.net.ssl.trustStore

Heritrix has its own trust store at conf/heritrix.cacerts that it uses if the FetcherHTTP is configured to use a trust level other than open (open is the default setting). In the unusual case where you'd like Heritrix to use an alternate truststore, point at the alternate by supplying the JSSE javax.net.ssl.trustStore property on the command line.

java.util.logging.config.file

The Heritrix conf directory includes a file named heritrix.properties. A section of this file specifies the default Heritrix logging configuration. To override these settings, point java.util.logging.config.file at a properties file with an alternate logging configuration.
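As an illustration, an alternate logging configuration might quiet the progress-statistics log while keeping warnings on the console. The logger names below are taken from the default configuration reproduced below; the chosen levels and the file name are hypothetical:

```
# Hypothetical override, saved e.g. as my-logging.properties and passed
# via -Djava.util.logging.config.file=my-logging.properties.
# Console handler as in the defaults, but progress-statistics quieted.
handlers = java.util.logging.ConsoleHandler
java.util.logging.ConsoleHandler.level = ALL
.level = WARNING
crawl.level = INFO
uri-errors.level = INFO
progress-statistics.level = WARNING
```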
Below we reproduce the default logging section of heritrix.properties for reference:

 # Basic logging setup; to console, all levels
 handlers = java.util.logging.ConsoleHandler
 java.util.logging.ConsoleHandler.level = ALL
 # Default global logging level: only warnings or higher
 .level = WARNING
 # currently necessary (?) for standard logs to work
 crawl.level = INFO
 runtime-errors.level = INFO
 uri-errors.level = INFO
 progress-statistics.level = INFO
 recover.level = INFO
 # HttpClient is too chatty; only want to hear about severe problems
 org.apache.commons.httpclient.level = SEVERE

Here's an example of how you might specify an override:

 % JAVA_OPTS=-Djava.util.logging.config.file=heritrix.properties ./bin/heritrix -nowui order.xml

Alternatively you could edit the default file.

java.io.tmpdir

Specify an alternate tmp directory. Default is /tmp.

com.sun.management.jmxremote.port

What port to start the JMX Agent on. Default is 8849. See also the JMX_PORT environment variable.

3. Web based user interface

After Heritrix has been launched from the command line, the web based user interface (WUI) becomes accessible.

The URI to access the WUI is printed on the text console from which the program was launched (typically on port 8080 under the /admin/ path).

The WUI is password protected. By default the username is admin and the password is letmein. The username and password can be specified at startup, and users are encouraged to change them. The currently valid username and password combination is printed to the console along with the access path.

The WUI can then be accessed via any browser. While we've endeavoured to make certain that it functions in all recent browsers, Mozilla 5 or newer is recommended. IE 6 or newer should also work without problems.

The initial login page takes the standard username/password combination discussed above. Logins are valid for 24 hours; once a login times out, the user will need to log in again.

Caution: Access to the WUI is not encrypted! Passwords will be submitted over the network in plain text.
4. A quick guide to running your first crawl job

Once you've installed Heritrix and logged into the WUI (see above) you are presented with the web Console page. Near the top there is a row of tabs.

Step 1. Create a job

To create a new job choose the Jobs tab; this will take you to the Jobs page. There you are presented with three options for creating a new job. Select With defaults. This creates a new job based on the default profile (see Section 5.2, “Profile”).

On the screen that comes next you will be asked to supply a name, a description and a seed list for the new job.

For the name, supply a short text with no special characters or spaces (except dash and underscore). You can skip the description if you like. In the seeds list, type in the URLs of the sites you are interested in harvesting, one URL per line.

Creating a job is covered in greater detail in Section 5, “Creating jobs and profiles”.

Step 2. Configure the job

Once you've entered this information you are ready to go to the configuration pages. Click the Modules button in the row of buttons at the bottom of the page.

This takes you to the modules configuration page (more details in Section 6.1, “Modules (Scope, Frontier, and Processors)”). For now we are only interested in the option second from the top, named Select crawl scope. It allows you to specify the limits of the crawl. By default the crawl is limited to the domains that your seeds span. This may be suitable for your purposes. If not, you can choose a broad scope (not limited to the domains of its seeds) or the more restrictive host scope, which limits the crawl to the hosts that its seeds span. For more on scopes refer to Section 6.1.1, “Crawl Scope”.

To change scopes, select the new one from the combobox and click the Change button.

Next turn your attention to the second row of tabs at the top of the page, below the usual tabs. You are currently on the far left tab.
Now select the tab called Settings near the middle of the row.

This takes you to the Settings page, which allows you to configure various details of the crawl. Exhaustive coverage of this page can be found in Section 6.3, “Settings”. For now we are only interested in the two settings under http-headers: the user-agent and from fields of the HTTP headers in the crawler's requests. You must set them to valid values before a crawl can be run; the current values mark in upper case what needs replacing. If you have trouble with that, refer to the HTTP headers section for what is regarded as valid values.

Once you've set the http-headers settings to proper values (and made any other desired changes), you can click the Submit job tab at the far right of the second row of tabs. The crawl job is now configured and ready to run.

Configuring a job is covered in greater detail in Section 6, “Configuring jobs and profiles”.

Step 3. Running the job

New jobs that have been submitted are placed in a queue of pending jobs. The crawler does not start processing jobs from this queue until the crawler is started; while the crawler is stopped, jobs are simply held.

To start the crawler, click on the Console tab. Once on the Console page, you will find the option Start at the top of the Crawler Status box, just to the right of the indicator of current status. Clicking this option puts the crawler into Crawling Jobs mode, where it will begin crawling the next pending job, such as the job you just created and configured.

The Console will update to display progress information about the ongoing crawl. Click the Refresh option (or the top-left Heritrix logo) to update this information.

For more information about running a job see Section 7, “Running a job”. Detailed information about evaluating the progress of a job can be found in Section 8, “Analysis of jobs”.

5. Creating jobs and profiles

In order to run a crawl, a configuration must be created that defines it.
In Heritrix such a configuration is called a crawl job.

5.1. Crawl job

A crawl job encompasses the configurations needed to run a single crawl. It also contains some additional elements such as file locations, status, etc.

Once logged onto the WUI, new jobs can be created by going to the Jobs tab. Once the Jobs page loads, users can create jobs by choosing one of the following three options:

Based on existing job
This option allows the user to create a job by basing it on any existing job, regardless of whether it has been crawled or not. This can be useful for repeating crawls, or for recovering a crawl that had problems (see Section 9.3, “Recovery of Frontier State and recover.gz”).

Based on a profile
This option allows the user to create a job by basing it on any existing profile.

With defaults
This option creates a new crawl job based on the default profile.

Options 1 and 2 will display a list of available choices. Initially there are two profiles and no existing jobs. All crawl jobs are created by basing them on profiles (see Section 5.2, “Profile”) or on existing jobs.

Once the proper profile or job has been chosen to base the new job on, a simple page will appear asking for the new job's:

Name
The name must only contain letters, numbers, dash (-) and underscore (_); no other characters are allowed. This name will be used to identify the crawl in the WUI, but it need not be unique. The name can not be changed later.

Description
A short description of the job. This is a freetext input.
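The job-name rule described above can be checked locally before submitting a job. A minimal sketch (plain shell; the sample name "my-crawl_01" is hypothetical, and the check simply mirrors the WUI's letters/digits/dash/underscore rule):

```shell
# Validate a proposed crawl job name against the allowed character set:
# letters, numbers, dash (-) and underscore (_) only.
name="my-crawl_01"
if printf '%s' "$name" | grep -Eq '^[A-Za-z0-9_-]+$'; then
  echo "valid job name: $name"
else
  echo "invalid job name: $name"
fi
```

Names with spaces or punctuation (e.g. "my crawl!") would take the invalid branch.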
