Generate is an alias for org.apache.nutch.crawl.Generator

This class generates a subset of a crawl db to fetch. This version allows us to generate fetchlists for several segments in one go. Unlike in the initial version (FetchListTool), the IP resolution is done ONLY on the entries which have been selected for fetching. The URLs are partitioned by IP, domain or host within a segment. We can choose separately how to count the URLs, i.e. by domain or by host, to limit the entries.

Usage: bin/nutch generate <crawldb> <segments_dir> [-force] [-topN N] [-numFetchers numFetchers] [-adddays numDays] [-noFilter] [-noNorm] [-maxNumSegments num]

<crawldb>: Path to the location of our crawldb directory.

<segments_dir>: Path to the location of our segments directory where the Fetcher Segments are created.

[-force]: This argument forces an update even if there appears to be a lock. CAUTION advised.

[-topN N]: Where N is the number of top URLs to be selected. Normally, the "generate" command prepares a fetchlist out of all unfetched pages and those whose fetch interval has already expired. But if you use -topN, then instead of all unfetched URLs you only get the N URLs with the highest score, potentially the most interesting ones, which should be prioritized in fetching.
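
The selection that -topN describes can be pictured as "sort by score, keep the first N". The following is a minimal sketch of that idea only, not the Generator's code: the function name top_n and the tab-separated "url, score" input format are hypothetical, since the real Generator reads its candidates from the crawldb.

```shell
#!/bin/sh
# Illustration only: keep the N highest-scoring URLs from "url<TAB>score" lines.
# The input format is hypothetical; the real Generator reads the crawldb.
top_n() {
    n="$1"
    # Sort numerically on the second (score) field, highest first, then cut at N.
    LC_ALL=C sort -t "$(printf '\t')" -k2,2nr | head -n "$n"
}

# Three made-up candidates; only the two best-scoring ones are kept.
printf 'http://a.example/\t0.5\nhttp://b.example/\t2.0\nhttp://c.example/\t1.25\n' | top_n 2
```

Run as written, this prints the b.example line followed by the c.example line; the lowest-scoring URL is left for a later fetchlist.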

[-numFetchers numFetchers]: The number of fetch partitions. Default: the value of the configuration key mapred.map.tasks, which is 1 in local mode and possibly more in deploy/distributed mode.

[-adddays numDays]: Adds numDays days to the current time, to facilitate crawling URLs already fetched sooner than db.default.fetch.interval. Default: 0
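
The effect of -adddays can be pictured as plain time arithmetic. This is a minimal sketch of the idea only, not the Generator's code: the function name is_due and the use of epoch seconds are assumptions made for illustration.

```shell
#!/bin/sh
# Illustration only: a URL counts as due when its next fetch time is not later
# than "now + numDays". All times are epoch seconds (an assumption of this sketch).
is_due() {
    next_fetch="$1"   # when the URL is next scheduled to be fetched
    now="$2"          # current time
    num_days="$3"     # value given to -adddays
    shifted_now=$(( now + num_days * 86400 ))   # 86400 seconds per day
    [ "$next_fetch" -le "$shifted_now" ]
}

now=1000000
next_fetch=$(( now + 10 * 86400 ))   # scheduled 10 days from now

if is_due "$next_fetch" "$now" 0;  then echo "due with -adddays 0";  else echo "not due with -adddays 0";  fi
if is_due "$next_fetch" "$now" 20; then echo "due with -adddays 20"; else echo "not due with -adddays 20"; fi
```

In this sketch a URL scheduled ten days ahead is skipped with the default of 0, but becomes eligible once the clock is shifted forward by 20 days.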

[-noFilter]: Whether to filter URLs or not is read from the crawl.generate.filter property in the nutch-site.xml/nutch-default.xml configuration files. If the property is not found, the URLs are filtered.

[-noNorm]: The same applies to the normalisation parameter as to the filtering option above.

[-maxNumSegments num]:

Configuration Files

Configuration Values

Examples

bin/nutch org.apache.nutch.crawl.Generator /my/crawldb /my/segments

bin/nutch org.apache.nutch.crawl.Generator /my/crawldb /my/segments -topN 100 -adddays 20
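
The invocations above can also be assembled by a small helper before they are run. The following is a minimal sketch, not something shipped with Nutch: the function name build_generate_cmd and the NUTCH_BIN variable are hypothetical, paths are assumed to contain no spaces, and only flags from the usage line are used. It prints the command line rather than executing it, so the result can be reviewed first.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with Nutch): assemble a "generate" command
# line from a crawldb path, a segments dir, and optional -topN / -adddays values.
NUTCH_BIN="${NUTCH_BIN:-bin/nutch}"   # assumed location of the nutch script

build_generate_cmd() {
    crawldb="$1"
    segments_dir="$2"
    top_n="$3"      # optional: empty means "no -topN"
    add_days="$4"   # optional: empty means "no -adddays"

    cmd="$NUTCH_BIN generate $crawldb $segments_dir"
    if [ -n "$top_n" ]; then cmd="$cmd -topN $top_n"; fi
    if [ -n "$add_days" ]; then cmd="$cmd -adddays $add_days"; fi
    printf '%s\n' "$cmd"
}

# Print the command instead of running it, so it can be inspected first.
build_generate_cmd /my/crawldb /my/segments 100 20
```

Leaving the third and fourth arguments empty yields the plain form of the command, without -topN or -adddays.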


bin/nutch_generate (last edited 2011-08-23 14:05:00 by LewisJohnMcgibbney)