"generate" is an alias for "org.apache.nutch.crawl.Generator"
Generates a new Fetcher Segment from the Crawl Database
Usage
nutch-0.8-dev/bin/nutch org.apache.nutch.crawl.Generator <crawldb> <segments_dir> [-topN <num>] [-numFetchers <fetchers>] [-adddays <days>]
<crawldb>: Path to the crawldb directory.
<segments_dir>: Path to the directory where the Fetcher Segments are created.
[-topN <num>]: Selects the top <num> ranking URLs for this segment. Default: Long.MAX_VALUE
[-numFetchers <fetchers>]: The number of fetch partitions. Default: the value of the configuration key mapred.map.tasks, falling back to 1 if that key is not set.
[-adddays <days>]: Adds <days> to the current time so that URLs already fetched can be crawled again sooner than db.default.fetch.interval would allow. Default: 0
Configuration Files
hadoop-default.xml
hadoop-site.xml
nutch-default.xml
nutch-site.xml
Configuration Values
The following properties directly affect how the Generator generates fetch segments.
generate.max.per.host -- Sets the maximum number of URLs from a single host to be generated for this fetch run. Default: unlimited.
Other Files
- None.
Caveats and Notes
- Differences from 0.7.1
In 0.7.1, -numFetchers influenced the number of fetcher segments created. For instance, if -numFetchers 2 was specified, 2 fetcher segments were created under <segments_dir>. Under 0.8 this is no longer the case: -numFetchers sets the number of fetch partitions instead.
Examples
nutch-0.8-dev/bin/nutch generate /my/crawldb /my/segments
- This example generates a fetch list that contains all URLs ready to be fetched from the Crawl Database. The Crawl Database is located at /my/crawldb and the Generator writes the fetch list to /my/segments/yyyyMMddHHmmss.
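Because the segment directory is named with a yyyyMMddHHmmss timestamp, the most recently generated segment is the one whose name sorts last. The following is a minimal sketch of a helper that finds it, assuming every subdirectory of the segments directory is a segment. The function name latest_segment is hypothetical and is not part of Nutch.

```shell
#!/bin/sh
# latest_segment SEGMENTS_DIR
# Hypothetical helper (not part of Nutch): prints the newest segment
# directory. Segment names follow yyyyMMddHHmmss, so lexical order
# matches chronological order.
latest_segment() {
    segments_dir=$1
    newest=""
    for dir in "$segments_dir"/*/; do
        [ -d "$dir" ] || continue   # skip the literal pattern if nothing matched
        newest=${dir%/}             # globs expand in sorted order; keep the last
    done
    [ -n "$newest" ] && printf '%s\n' "$newest"
}

# Possible usage, with the paths from the example above:
# segment=$(latest_segment /my/segments)
# echo "Newest segment: $segment"
```

The resulting path can then be handed to whichever command processes the segment next.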
nutch-0.8-dev/bin/nutch generate /my/crawldb /my/segments -topN 100 -adddays 20
- In this example the Generator will add 20 days to the current date/time when determining the top 100 scoring pages to fetch.
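The effect of -adddays can be pictured as moving the cutoff time forward: a URL counts as ready when its next fetch time is no later than the current time plus <days>. The sketch below illustrates that rule with plain shell arithmetic. It is a simplified illustration under that assumption, not the Generator's actual code, and the function name is_due and its arguments are hypothetical.

```shell
#!/bin/sh
# is_due FETCH_TIME NOW [ADDDAYS]
# Hypothetical illustration (not Nutch code). All times are seconds since
# the epoch. Succeeds (exit status 0) when FETCH_TIME is no later than
# NOW shifted forward by ADDDAYS days; ADDDAYS defaults to 0.
is_due() {
    fetch_time=$1
    now=$2
    adddays=${3:-0}
    cutoff=$((now + adddays * 86400))   # 86400 seconds per day
    [ "$fetch_time" -le "$cutoff" ]
}

# Possible usage: a URL whose next fetch is 10 days away.
# now=$(date +%s)
# next_fetch=$((now + 10 * 86400))
# is_due "$next_fetch" "$now" 0  || echo "not selected without -adddays"
# is_due "$next_fetch" "$now" 20 && echo "selected with -adddays 20"
```

Under this rule, a page that is not due for another 10 days is skipped by default but becomes eligible with -adddays 20; -topN 100 then limits the segment to the 100 highest-ranking eligible URLs.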