bin/nutch crawl

crawl is an alias for org.apache.nutch.tools.CrawlTool

Performs complete crawling and indexing, given a set of root URLs.

Usage: bin/nutch org.apache.nutch.tools.CrawlTool (-local | -ndfs <nameserver:port>) <root_url_file> [-dir d] [-threads n] [-depth i] [-showThreadID]

Usage (version 0.8): bin/nutch org.apache.nutch.tools.CrawlTool (-local | -ndfs <nameserver:port>) <dir_with_url_files> [-dir d] [-threads n] [-depth i] [-showThreadID]

<dir_with_url_files>: contains text files with URL lists. This must be an existing directory.

[-showThreadID]

[-depth i]: Sets how deep Nutch crawls. If you do not specify a value, Nutch uses 5 as the default. For example, with -depth 1 Nutch indexes only the first level. Nutch follows links only with -depth 2 or more.

[-dir d]: Sets the directory where Nutch saves the index. If you do not specify a directory, Nutch creates its own directory in the directory where you started the crawl. Example of a -dir parameter: -dir /usr/local/index/

[-threads n]: Sets how many threads Nutch uses.

-local

-ndfs <nameserver:port>
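
As an illustration only, the options above can be combined into one invocation. The sketch below assembles the command line from the usage shown on this page and prints it instead of running it. The values "urls" and 4 threads are assumptions chosen for the example, not defaults, and "urls" stands for the root URL file (or, in version 0.8, the directory with URL files).

```shell
# Hypothetical example values; adjust them to your installation.
URL_SOURCE="urls"              # assumed name: root URL file, or directory with URL files in 0.8
CRAWL_DIR="/usr/local/index/"  # where Nutch should save the index (-dir)
THREADS=4                      # assumed thread count (-threads)
DEPTH=2                        # 2 or more is needed for Nutch to follow links (-depth)

# Assemble the command in the order given by the usage line on this page.
CMD="bin/nutch crawl -local $URL_SOURCE -dir $CRAWL_DIR -threads $THREADS -depth $DEPTH"

# Print the command for review; run it from the Nutch installation directory once it looks right.
echo "$CMD"
```

Printing the command first makes it easy to check the option order before starting a crawl, which can take a long time at larger depths.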


last edited 2006-03-13 12:52:02 by fha