Differences between revisions 55 and 56
Revision 55 as of 2013-04-29 00:49:29
Size: 4629
Editor: TejasPatil
Comment:
Revision 56 as of 2013-05-03 19:33:36
Size: 4517
Editor: TejasPatil
Comment:
Line 63: Line 63:
 * bin/nutch org.apache.nutch.protocol.RobotRulesParser (for 1.x only)
 * bin/nutch plugin lib-http org.apache.nutch.protocol.http.api.RobotRulesParser (for 2.x only)
Line 71: Line 69:
 * bin/nutch org.apache.nutch.protocol.RobotRulesParser

Nutch Command Line Options of bin/nutch

The following is a list of Nutch command line options across versions, so some of the options may not be available in the particular version of Nutch you are using. For version-specific availability, see the version columns (1.x, 2.x). Once you know that a command exists in your Nutch distribution, you can navigate to the relevant wiki entry for a detailed description of the tool.

The script bin/nutch is a helper that selects the Java class to run for a given command.
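As a rough illustration of that dispatch, the sketch below maps a command name to a Java class and falls back to treating the argument as a class name. The function name resolve_class is hypothetical, and the three mappings are the 1.x class names as best understood here; the bin/nutch script shipped with your distribution is the authoritative list.

```shell
#!/bin/sh
# Illustrative sketch only, not the real bin/nutch.
# resolve_class is a hypothetical helper; the class names are assumed
# 1.x mappings and should be checked against your own bin/nutch.
resolve_class() {
  case "$1" in
    inject)   echo "org.apache.nutch.crawl.Injector" ;;
    generate) echo "org.apache.nutch.crawl.Generator" ;;
    readdb)   echo "org.apache.nutch.crawl.CrawlDbReader" ;;
    # Anything else is passed through as a fully qualified class name,
    # which corresponds to the "bin/nutch CLASSNAME" form.
    *)        echo "$1" ;;
  esac
}
```

For example, resolve_class inject prints the injector class, while resolve_class org.apache.nutch.scoring.webgraph.Loops prints its argument unchanged.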

Newer versions also include the script bin/crawl (NUTCH-1087), which is written for running crawls and replaces the bin/nutch crawl command.

Note: Most commands print help when invoked without parameters.

See each entry for details of the command arguments and options.

command                | function                                                     | 1.x | 2.x
bin/nutch crawl        | One-step crawler for intranets                               | X   | X
bin/nutch readdb       | Read / dump crawl db                                         | X   | X
bin/nutch mergedb      | Merge crawldb-s, with optional filtering                     | X   |
bin/nutch readlinkdb   | Read / dump link db                                          | X   |
bin/nutch inject       | Inject new urls into the database                            | X   | X
bin/nutch hostinject   | Inject new urls into the hostdatabase                        |     | X
bin/nutch generate     | Generate new segments to fetch from crawldb                  | X   | X
bin/nutch freegen      | Generate new segments to fetch from text files               | X   |
bin/nutch fetch        | Fetch a segment's pages                                      | X   | X
bin/nutch parse        | Parse a segment's pages                                      | X   | X
bin/nutch readseg      | Read / dump segment data                                     | X   |
bin/nutch mergesegs    | Merge multiple segments, with optional filtering and slicing | X   |
bin/nutch updatedb     | Update crawldb (from segments if in 1.x) after fetching      | X   | X
bin/nutch updatehostdb | Update hostdb after fetching                                 |     | X
bin/nutch invertlinks  | Create a linkdb from parsed segments                         | X   |
bin/nutch mergelinkdb  | Merge linkdb-s, with optional filtering                      | X   |
bin/nutch elasticindex | Run the Elasticsearch indexer on parsed batches              |     | X
bin/nutch solrindex    | Run the Solr indexer on parsed segments and linkdb           | X   | X
bin/nutch solrdedup    | Remove duplicate documents from Solr                         | X   | X
bin/nutch solrclean    | Remove HTTP 301 and 404 documents from Solr                  | X   |
bin/nutch parsechecker | Check the parser for a given url                             | X   | X
bin/nutch indexchecker | Check the indexing filters for a given url                   | X   |
bin/nutch domainstats  | Calculate domain statistics from crawldb                     | X   |
bin/nutch webgraph     | Generate a web graph from existing segments                  | X   |
bin/nutch linkrank     | Run a link analysis program on the generated web graph       | X   |
bin/nutch scoreupdater | Update the crawldb with linkrank scores                      | X   |
bin/nutch nodedumper   | Dump the web graph's node scores                             | X   |
bin/nutch plugin       | Load a plugin and run the main() of one of its classes       | X   | X
bin/nutch nutchserver  | Run a (local) Nutch server on a user-defined port            |     | X
bin/nutch junit        | Run the given JUnit test                                     | X   | X
bin/nutch CLASSNAME    | Run the class named CLASSNAME                                | X   | X
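To show how several of these commands are typically chained, here is a minimal sketch of one crawl round for 1.x. The directory names (crawl, urls), the NUTCH and DRY_RUN variables, and the helper functions run and crawl_round are assumptions made for this example; argument order follows the 1.x usage messages as recalled here, so invoke each command without parameters to confirm it for your version.

```shell
#!/bin/sh
# Sketch of a single Nutch 1.x crawl round; not an official script.
# NUTCH, DRY_RUN, run and crawl_round are names invented for this example.
NUTCH="${NUTCH:-bin/nutch}"

# Print each command; execute it unless DRY_RUN is set.
run() {
  echo "+ $*"
  [ -n "$DRY_RUN" ] || "$@"
}

crawl_round() {
  crawl_dir="$1"   # e.g. crawl
  seed_dir="$2"    # directory containing seed URL text files

  run "$NUTCH" inject "$crawl_dir/crawldb" "$seed_dir" || return 1
  run "$NUTCH" generate "$crawl_dir/crawldb" "$crawl_dir/segments" || return 1

  # generate creates a new segment directory; pick the last one listed.
  segment=$(ls -d "$crawl_dir"/segments/* 2>/dev/null | tail -1)
  # Placeholder so that a dry run still prints a readable command line.
  segment="${segment:-$crawl_dir/segments/SEGMENT}"

  run "$NUTCH" fetch "$segment" || return 1
  run "$NUTCH" parse "$segment" || return 1
  run "$NUTCH" updatedb "$crawl_dir/crawldb" "$segment" || return 1
  run "$NUTCH" invertlinks "$crawl_dir/linkdb" -dir "$crawl_dir/segments"
}
```

Running DRY_RUN=1 crawl_round crawl urls only prints the six commands, which is a cheap way to review the sequence before starting a real crawl.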

Webgraph classes

  • bin/nutch org.apache.nutch.scoring.webgraph.WebGraph

  • bin/nutch org.apache.nutch.scoring.webgraph.Loops

  • bin/nutch org.apache.nutch.scoring.webgraph.LinkRank

  • bin/nutch org.apache.nutch.scoring.webgraph.ScoreUpdater

  • bin/nutch org.apache.nutch.scoring.webgraph.NodeDumper

  • bin/nutch org.apache.nutch.scoring.webgraph.NodeReader

  • bin/nutch org.apache.nutch.scoring.webgraph.LoopReader

  • bin/nutch org.apache.nutch.scoring.webgraph.LinkDumper
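The classes above are usually run in sequence: build the web graph, run the link analysis, then write the scores back to the crawldb. The sketch below assumes a crawl directory layout (crawl/segments, crawl/crawldb, crawl/webgraphdb) and option names (-segmentDir, -webgraphdb, -crawldb) as recalled here for 1.x; the helpers run and score_with_webgraph and the NUTCH and DRY_RUN variables are invented for the example. Run each class without parameters to check the options in your version.

```shell
#!/bin/sh
# Sketch of a webgraph scoring pass for Nutch 1.x; not an official script.
# Option names are assumptions to be verified against each class's help.
NUTCH="${NUTCH:-bin/nutch}"

# Print each command; execute it unless DRY_RUN is set.
run() {
  echo "+ $*"
  [ -n "$DRY_RUN" ] || "$@"
}

score_with_webgraph() {
  crawl_dir="$1"                       # e.g. crawl
  webgraphdb="$crawl_dir/webgraphdb"   # assumed location for the graph

  # 1. Build the web graph from the fetched and parsed segments.
  run "$NUTCH" org.apache.nutch.scoring.webgraph.WebGraph \
      -segmentDir "$crawl_dir/segments" -webgraphdb "$webgraphdb" || return 1

  # 2. Run the link analysis on the graph.
  run "$NUTCH" org.apache.nutch.scoring.webgraph.LinkRank \
      -webgraphdb "$webgraphdb" || return 1

  # 3. Write the resulting scores back into the crawldb.
  run "$NUTCH" org.apache.nutch.scoring.webgraph.ScoreUpdater \
      -crawldb "$crawl_dir/crawldb" -webgraphdb "$webgraphdb"
}
```

As with any of these tools, DRY_RUN=1 score_with_webgraph crawl prints the three command lines without executing them.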

Useful Plugin Classes

  • bin/nutch plugin urlnormalizer-regex org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer

Other Classes


CommandLineOptions (last edited 2014-09-27 19:32:12 by LewisJohnMcgibbney)