bin/nutch generate

generate is an alias for org.apache.nutch.tools.FetchListTool

The generate command creates a new fetchlist from the webdb. The fetchlist contains the URLs that can then be fetched with the fetch tool.

Usage: bin/nutch org.apache.nutch.tools.FetchListTool (-local | -ndfs <namenode:port>)
<db> <segment_dir> [-refetchonly] [-anchoroptimize linkdb] [-topN N]
[-cutoff cutoffscore] [-numFetchers numFetchers] [-adddays numDays]

Command line parameters:

-topN N, where N is the maximum number of pages to put in the fetchlist.

Normally, the "generate" command prepares a fetchlist from all unfetched pages plus the pages whose fetch interval has already expired. With -topN, the fetchlist instead contains only the N URLs with the highest scores. These are potentially the most interesting pages, so they should be prioritized in fetching.
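
The effect of -topN can be illustrated with a small sketch. This is not the FetchListTool code; it only shows the idea of keeping the N highest-scoring candidates. The function name select_top_n, the (url, score) pair layout, and the example URLs and scores are all assumptions made for illustration.

```python
import heapq
from typing import Iterable, List, Tuple


def select_top_n(candidates: Iterable[Tuple[str, float]], n: int) -> List[str]:
    """Return the URLs of the n highest-scoring candidates, best first.

    Hypothetical helper: `candidates` stands in for the pages that are
    due for fetching, given as (url, score) pairs.
    """
    # heapq.nlargest avoids sorting the whole candidate set when n is small.
    best = heapq.nlargest(n, candidates, key=lambda pair: pair[1])
    return [url for url, _score in best]


# Made-up candidates; real scores come from the webdb.
candidates = [
    ("http://example.com/a", 0.2),
    ("http://example.com/b", 1.5),
    ("http://example.com/c", 0.9),
    ("http://example.com/d", 1.1),
]

# Equivalent in spirit to running generate with -topN 2.
fetchlist = select_top_n(candidates, 2)
```

If N is larger than the number of candidates, the sketch simply returns every candidate, ordered by score.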


last edited 2006-01-09 22:41:04 by JerryRussell