crawl is an alias for org.apache.nutch.tools.CrawlTool
Performs complete crawling and indexing, given a set of root URLs.
Usage: bin/nutch org.apache.nutch.tools.CrawlTool (-local | -ndfs <nameserver:port>) <root_url_file> [-dir d] [-threads n] [-depth i] [-showThreadID]
Usage (version 0.8): bin/nutch org.apache.nutch.tools.CrawlTool (-local | -ndfs <nameserver:port>) <dir_with_url_files> [-dir d] [-threads n] [-depth i] [-showThreadID]
<dir_with_url_files>: contains text files with URL lists. This must be an existing directory.
[-showThreadID]
[-depth i]: Sets how deep Nutch should crawl. If no value is given, Nutch uses 5 as the default. For example, with -depth 1 Nutch indexes only the first level. Nutch follows links only with -depth 2 or more.
[-dir d]: Sets the directory in which Nutch saves the index. If no directory is given, Nutch creates its own directory inside the directory where the crawl was started. Example of a -dir parameter: -dir /usr/local/index/
[-threads n]: Sets how many threads Nutch uses.
-local
-ndfs <nameserver:port>
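As an illustration of how the options above fit together, the following sketch composes a CrawlTool command line in the order given by the usage line for version 0.8. The helper name build_crawl_cmd, the directory names urls and crawl.test, and the thread and depth values are assumptions chosen for the example, not defaults taken from Nutch. The script only prints the command and does not start a crawl.

```shell
#!/bin/sh
# Sketch only: composes a CrawlTool command line from the usage shown above.
# The helper name, directory names and numeric values are illustrative
# assumptions, not Nutch defaults.

# build_crawl_cmd <dir_with_url_files> <crawl_dir> <threads> <depth>
build_crawl_cmd() {
    printf '%s\n' "bin/nutch org.apache.nutch.tools.CrawlTool -local $1 -dir $2 -threads $3 -depth $4"
}

# Print the command for a local crawl seeded from ./urls, two levels deep.
build_crawl_cmd urls crawl.test 4 2
```

To start the crawl, run the printed command from the Nutch installation directory. For the 0.8 usage, the directory passed as <dir_with_url_files> must already exist and contain the text files with the URL lists.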