Nutch Command Line Options of bin/nutch
The following is a complete list of Nutch command line options, so some of them may not be available in the particular version of Nutch you are using. The version columns of the table mark which commands exist in 1.x and in 2.x. Once you know that a command exists in your Nutch distribution, see the relevant wiki entry for a detailed description of the tool.
The script bin/nutch is a helper that picks the Java class to run for a given command.
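As a rough illustration of that dispatch, the sketch below maps a few command names to class names with a case statement and falls back to treating the argument as a class name. The function name resolve_class is hypothetical, and the mappings shown are assumed to match Nutch 1.x; check the bin/nutch script in your own distribution before relying on them.

```shell
#!/bin/sh
# Illustrative sketch only, not the real bin/nutch script.
# resolve_class is a hypothetical helper; the command-to-class pairs
# are assumed to match Nutch 1.x and may differ in your version.
resolve_class() {
  case "$1" in
    inject)   echo "org.apache.nutch.crawl.Injector" ;;
    generate) echo "org.apache.nutch.crawl.Generator" ;;
    fetch)    echo "org.apache.nutch.fetcher.Fetcher" ;;
    readdb)   echo "org.apache.nutch.crawl.CrawlDbReader" ;;
    # Anything else is treated as a fully qualified class name (CLASSNAME).
    *)        echo "$1" ;;
  esac
}
```

This fallback branch is why a fully qualified class name can be passed directly, as in the class listings further down the page.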
Newer versions also include the script bin/crawl (NUTCH-1087), which is written for running crawls and replaces the bin/nutch crawl command.
Note: Most commands print help when invoked without parameters.
See each entry for details of the command arguments and options.
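Since most commands print their help when run without parameters, a small wrapper can be used to look up the usage of a command or class. This is a minimal sketch: show_usage and the NUTCH_BIN variable are hypothetical names, and the script is assumed to be reachable as bin/nutch from the current directory.

```shell
#!/bin/sh
# Hypothetical helper: run a command or class with no parameters so that
# it prints its help text. NUTCH_BIN is an assumed override for the path
# to the script; it defaults to bin/nutch in the current directory.
show_usage() {
  # Usage text is often written to stderr, so merge it into stdout.
  "${NUTCH_BIN:-bin/nutch}" "$1" 2>&1
}
```

For example, show_usage org.apache.nutch.net.URLFilterChecker would print the usage of that class, provided it follows the convention above.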
| command | function | 1.x | 2.x |
|---|---|---|---|
| | One-step crawler for intranets | X | X |
| | Read / dump crawl db | X | X |
| | Merge crawldb-s, with optional filtering | X | |
| | Read / dump link db | X | |
| | Inject new URLs into the database | X | X |
| | Inject new URLs into the host database | | X |
| | Generate new segments to fetch from crawldb | X | X |
| | Generate new segments to fetch from text files | X | |
| | Fetch a segment's pages | X | X |
| | Parse a segment's pages | X | X |
| | Read / dump segment data | X | |
| | Merge multiple segments, with optional filtering and slicing | X | |
| | Update crawldb (from segments if in 1.x) after fetching | X | X |
| | Update hostdb after fetching | | X |
| | Create a linkdb from parsed segments | X | |
| | Merge linkdb-s, with optional filtering | X | |
| | Run the Elasticsearch indexer on parsed batches | | X |
| | Run the Solr indexer on parsed segments and linkdb | X | X |
| | Remove duplicate documents from Solr | X | X |
| | Remove HTTP 301 and 404 documents from Solr | X | |
| | Check the parser for a given URL | X | X |
| | Check the indexing filters for a given URL | X | |
| | Calculate domain statistics from crawldb | X | |
| | Generate a web graph from existing segments | X | |
| | Run a link analysis program on the generated web graph | X | |
| | Update the crawldb with linkrank scores | X | |
| | Dump the web graph's node scores | X | |
| | Load a plugin and run the main() of one of its classes | X | X |
| | Run a (local) Nutch server on a user-defined port | | X |
| | Run the given JUnit test | X | X |
| | Run the class named CLASSNAME | X | X |
Webgraph classes
bin/nutch org.apache.nutch.scoring.webgraph.WebGraph
bin/nutch org.apache.nutch.scoring.webgraph.Loops
bin/nutch org.apache.nutch.scoring.webgraph.LinkRank
bin/nutch org.apache.nutch.scoring.webgraph.ScoreUpdater
bin/nutch org.apache.nutch.scoring.webgraph.NodeDumper
bin/nutch org.apache.nutch.scoring.webgraph.NodeReader
bin/nutch org.apache.nutch.scoring.webgraph.LoopReader
bin/nutch org.apache.nutch.scoring.webgraph.LinkDumper
Useful Plugin Classes
bin/nutch plugin urlnormalizer-regex org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer
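The line above follows the pattern bin/nutch plugin, then the plugin id, then the class whose main() should run. The sketch below only assembles and prints such a command line so it can be inspected before running it; plugin_cmd and NUTCH_BIN are hypothetical names, and any trailing arguments are passed through unchanged on the assumption that the class's main() accepts them.

```shell
#!/bin/sh
# Hypothetical dry-run helper: print the bin/nutch plugin command line
# for a plugin id and class name, followed by any extra arguments.
# NUTCH_BIN is an assumed override for the script path.
plugin_cmd() {
  plugin_id=$1
  class_name=$2
  shift 2
  echo "${NUTCH_BIN:-bin/nutch}" plugin "$plugin_id" "$class_name" "$@"
}
```

Calling plugin_cmd urlnormalizer-regex org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer prints the same command line as shown above.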
Other Classes
bin/nutch org.apache.nutch.net.URLFilterChecker
bin/nutch org.apache.nutch.net.URLNormalizerChecker
bin/nutch org.apache.nutch.tools.CrawlDBScanner
bin/nutch org.apache.nutch.protocol.RobotRulesParser