Differences between revisions 3 and 4
Revision 3 as of 2006-03-07 22:11:15
Size: 2189
Editor: JeffRitchie
Comment: Examples, Config Values
Revision 4 as of 2009-09-20 23:10:15
Size: 2189
Editor: localhost
Comment: converted to 1.6 markup

"generate" is an alias for "org.apache.nutch.crawl.Generator"

Generates a new Fetcher Segment from the Crawl Database

Usage

  • nutch-0.8-dev/bin/nutch org.apache.nutch.crawl.Generator <crawldb> <segments_dir> [-topN <num>] [-numFetchers <fetchers>] [-adddays <days>]

    • <crawldb>: Path to the crawldb directory.
      <segments_dir>: Path to the directory where the Fetcher Segments are created.
      [-topN <num>]: Selects the top <num> ranking URLs for this segment. Default: Long.MAX_VALUE
      [-numFetchers <fetchers>]: The number of fetch partitions. Default: Configuration key -> mapred.map.tasks -> 1
      [-adddays <days>]: Adds <days> to the current time so that URLs already fetched can be crawled again sooner than db.default.fetch.interval. Default: 0
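
As a rough illustration of how the optional arguments combine, the sketch below assembles a generate command line and prints it without running anything. The function name generate_cmd and the NUTCH_BIN variable are hypothetical helpers introduced here, not part of Nutch; only the flags listed in the usage above are used.

```shell
#!/bin/sh
# Assumption: path to the nutch launcher; adjust to the local install.
NUTCH_BIN="${NUTCH_BIN:-nutch-0.8-dev/bin/nutch}"

# Hypothetical helper: print (do not run) a generate command line.
# $1 = crawldb, $2 = segments_dir, $3 = optional -topN, $4 = optional -adddays
generate_cmd() {
    crawldb="$1"
    segments_dir="$2"
    top_n="$3"
    add_days="$4"

    cmd="$NUTCH_BIN generate $crawldb $segments_dir"
    # Append each optional flag only when a value was supplied.
    if [ -n "$top_n" ]; then
        cmd="$cmd -topN $top_n"
    fi
    if [ -n "$add_days" ]; then
        cmd="$cmd -adddays $add_days"
    fi
    echo "$cmd"
}
```

Calling generate_cmd /my/crawldb /my/segments 100 20 prints a command with both -topN and -adddays appended; leaving the last two arguments empty prints the bare two-argument form.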

Configuration Files

  • hadoop-default.xml
    hadoop-site.xml
    nutch-default.xml
    nutch-site.xml

Configuration Values

  • The following properties directly affect how the Generator generates fetch segments.

    • generate.max.per.host -- Sets the maximum number of URLs from a single host to be generated for this fetch run. Default: unlimited.
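
A minimal sketch of overriding this property, assuming nutch-site.xml follows the Hadoop-style layout of name/value properties inside a configuration element. The function write_max_per_host is a hypothetical helper, and the limit passed to it is an arbitrary example value. The function overwrites its target file, so it is best pointed at a scratch file whose contents are then merged by hand.

```shell
#!/bin/sh
# Hypothetical helper: write a minimal site override for generate.max.per.host.
# $1 = output file (overwritten!), $2 = maximum URLs per host
write_max_per_host() {
    site_file="$1"
    max_urls="$2"
    # Assumption: Hadoop-style <configuration>/<property> layout.
    cat > "$site_file" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>generate.max.per.host</name>
    <value>$max_urls</value>
  </property>
</configuration>
EOF
}
```

For example, write_max_per_host /tmp/nutch-site-snippet.xml 100 would cap each host at 100 URLs per fetch run once the property is placed in nutch-site.xml.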

Other Files

  • None.

Caveats and Notes

  • Differences from 0.7.1
    • In 0.7.1, -numFetchers influenced the number of fetcher segments created. For instance, if -numFetchers 2 was specified, 2 fetcher segments were created under <segments_dir>. Under 0.8 this is no longer the case.

Examples

 nutch-0.8-dev/bin/nutch generate /my/crawldb /my/segments
  • This example will generate a fetch list that contains all URLs ready to be fetched from the Crawl Database. The Crawl Database is located at /my/crawldb and the Generator will output the fetch list to /my/segments/yyyyMMddHHmmss.

 nutch-0.8-dev/bin/nutch generate /my/crawldb /my/segments -topN 100 -adddays 20
  • In this example the Generator will add 20 days to the current date/time when determining the top 100 scoring pages to fetch.
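
Because each run writes to a directory named yyyyMMddHHmmss, the most recent segment is also the last one in lexical order. The sketch below relies on that to locate the newest segment. It assumes the segments directory is on the local filesystem and contains only segment directories; newest_segment is a hypothetical helper, not a Nutch command.

```shell
#!/bin/sh
# Hypothetical helper: print the newest segment directory under $1.
# Shell globs expand in sorted order, and timestamp names sort
# chronologically, so the last directory seen is the most recent.
newest_segment() {
    newest=""
    for dir in "$1"/*; do
        if [ -d "$dir" ]; then
            newest="$dir"
        fi
    done
    echo "$newest"
}
```

After a generate run, newest_segment /my/segments prints the path of the segment that was just created, which can then be handed to the next step of the crawl. If the directory holds no subdirectories, the function prints an empty line.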

DevelopmentCommandLineOptions

nutch-0.8-dev/bin/nutch_generate (last edited 2009-09-20 23:10:15 by localhost)