Differences between revisions 1 and 2
Revision 1 as of 2013-03-20 18:04:26
Size: 1119
Comment: change of url from last crawl page
Revision 2 as of 2013-11-14 13:36:07
Size: 1172
Editor: JulienNioche
REMOVED AS OF NUTCH 1.8 AND NUTCH 2.3

Crawl is an alias for org.apache.nutch.crawl.Crawl

This class performs a complete crawl, starting from a set of root URLs.
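As a rough illustration of what a complete crawl involves, the dry-run sketch below prints the individual commands that such a crawl roughly corresponds to. It assumes Nutch 1.x command names and the usual crawldb/segments/linkdb layout; the directory names, the values of DEPTH, TOPN and THREADS, and the helper plan_crawl are hypothetical, and <segment> is a placeholder for the segment directory created in each round. The optional Solr indexing step is left out.

```shell
#!/bin/sh
# Dry run: prints the per-step commands instead of executing them.
# All values below are assumed for illustration only.
URL_DIR="urls"
CRAWL_DIR="crawl"
DEPTH=2
TOPN=50
THREADS=10

# plan_crawl is a hypothetical helper, not part of Nutch.
plan_crawl() {
  # Seed the crawl database with the root URLs.
  echo "bin/nutch inject $CRAWL_DIR/crawldb $URL_DIR"
  round=1
  while [ "$round" -le "$DEPTH" ]; do
    # <segment> stands for the directory that generate creates in this round.
    echo "bin/nutch generate $CRAWL_DIR/crawldb $CRAWL_DIR/segments -topN $TOPN"
    echo "bin/nutch fetch <segment> -threads $THREADS"
    echo "bin/nutch parse <segment>"
    echo "bin/nutch updatedb $CRAWL_DIR/crawldb <segment>"
    round=$((round + 1))
  done
  # Build the link database once all rounds are done.
  echo "bin/nutch invertlinks $CRAWL_DIR/linkdb -dir $CRAWL_DIR/segments"
}

plan_crawl
```

Printing the plan rather than running it keeps the sketch safe to try on a machine without Nutch installed; each round repeats the same four steps.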

Usage:

bin/nutch crawl <urlDir> [-solr <solrURL>] [-dir d] [-threads n] [-depth i] [-topN N]

<urlDir>: An existing directory containing text files that list the URLs to crawl, for example ${NUTCH_HOME}/urls.

[-solr <solrURL>]: The URL of the Solr instance to index into. Passing it here lets the crawl command handle indexing with Solr as well.

[-dir d]: The directory in which Nutch stores the crawl data.

[-threads n]: The number of threads Nutch uses for fetching.

[-depth i]: How deep Nutch should crawl, counted in link levels from the root URLs. If no value is given, the default is 5. For example, with -depth 1 Nutch fetches only the first level (the root URLs themselves). With -depth 2 or more, Nutch follows outlinks for that many levels.

[-topN N]: The maximum number of pages Nutch will fetch at each level of the crawl; only the N top-scoring URLs are selected.
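The options above can be combined as in the following sketch. The seed URL, the Solr URL, the directory names and the numeric values are all assumptions chosen for illustration; none of them come from this page. The script only prepares a seed directory and prints the command, so it can be reviewed before being run on a Nutch release that still ships the crawl command.

```shell
#!/bin/sh
# Assumed, illustrative values; adjust for your installation.
URL_DIR="urls"                            # seed directory; must exist
SOLR_URL="http://localhost:8983/solr/"    # assumed local Solr instance
CRAWL_DIR="crawl"                         # where crawl data is written

# The seed directory holds plain-text files with one URL per line.
mkdir -p "$URL_DIR"
printf '%s\n' 'http://nutch.apache.org/' > "$URL_DIR/seed.txt"

# Assemble the command and print it for review.
CRAWL_CMD="bin/nutch crawl $URL_DIR -solr $SOLR_URL -dir $CRAWL_DIR -threads 10 -depth 3 -topN 50"
echo "$CRAWL_CMD"

# Uncomment to run from ${NUTCH_HOME} on a release before 1.8 / 2.3:
# $CRAWL_CMD
```

With these values, at most 50 pages would be fetched per level over 3 levels, using 10 fetcher threads.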

bin/nutch crawl (last edited 2013-11-14 13:36:07 by JulienNioche)