Crawl is an alias for org.apache.nutch.crawl.Crawl

This class performs a complete crawl, given a set of root URLs.


bin/nutch crawl <urlDir> [-solr <solrURL>] [-dir d] [-threads n] [-depth i] [-topN N]

<urlDir>: A directory containing text files that list the root URLs. The directory must already exist. For example: ${NUTCH_HOME}/urls

[-solr <solrURL>]: The URL of your Solr instance. Passing it here lets the crawl results be indexed with Solr as part of the same command.

[-dir d]: The directory Nutch should use to store the crawl data.

[-threads n]: The number of threads Nutch should use when crawling.

[-depth i]: How deep Nutch should crawl, counted in link levels from the root URLs. If you do not pass a value, Nutch uses 5 as the default. For example, with -depth 1 Nutch will only index the first level, that is, the root URLs themselves. With -depth 2 (or more), Nutch will follow outlinks down to that number of levels.

[-topN N]: The maximum number of pages Nutch will retrieve at each level, up to the given depth.
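
The options above can be combined into a single invocation. The following is a minimal sketch rather than a recommended setup: the helper build_crawl_cmd, the seed file name seed.txt, the seed directory under /tmp, the crawl directory name, the numeric values and the Solr URL are all assumptions chosen for illustration. The script prints the command instead of running it, so it works without a Nutch installation.

```shell
#!/bin/sh
# Sketch only: every path, URL, and number below is an assumed example value.

# Hypothetical helper: compose a "bin/nutch crawl" command line.
# Usage: build_crawl_cmd <urlDir> <crawlDir> <threads> <depth> <topN> [solrURL]
build_crawl_cmd() {
    cmd="bin/nutch crawl $1"
    # -solr is optional; add it only when a Solr URL was supplied.
    if [ -n "${6:-}" ]; then
        cmd="$cmd -solr $6"
    fi
    cmd="$cmd -dir $2 -threads $3 -depth $4 -topN $5"
    printf '%s\n' "$cmd"
}

# <urlDir> must exist and contain text files with one URL per line.
# The location and file name here are assumptions.
seed_dir="${TMPDIR:-/tmp}/nutch-example/urls"
mkdir -p "$seed_dir"
printf '%s\n' "http://nutch.apache.org/" > "$seed_dir/seed.txt"

# Print the resulting command (assumed values: 10 threads, depth 3, topN 50).
build_crawl_cmd "$seed_dir" crawl 10 3 50 "http://localhost:8983/solr/"
```

To perform the crawl, run the printed command from ${NUTCH_HOME}. Leaving out the last argument to build_crawl_cmd produces the same command without the -solr option.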


bin/nutch_crawl (last edited 2011-07-13 19:00:52 by JoeLencioni)