The bin/crawl script gives you more control over a crawl. It chains the individual steps (inject -> generate -> fetch -> parse -> updatedb) that make up a crawl.
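To make the step sequence concrete, here is a rough sketch of one round run by hand with bin/nutch. It is a simplification and not the actual contents of bin/crawl, which does more work per round. The install path /opt/nutch, the directory names crawl and urls, and the run helper are assumptions made for illustration. By default the script only prints the commands; set DRY_RUN=0 to execute them.

```shell
#!/bin/sh
# Sketch only: paths below are assumptions, adjust them to your setup.
NUTCH="${NUTCH_HOME:-/opt/nutch}/bin/nutch"  # assumed install location
CRAWL_DIR="crawl"                            # assumed crawl directory
SEED_DIR="urls"                              # assumed seed directory
DRY_RUN="${DRY_RUN:-1}"                      # 1 = print only, 0 = execute

# Hypothetical helper: print a command, and run it unless in dry-run mode.
run() {
  echo "$*"
  if [ "$DRY_RUN" = "0" ]; then
    "$@"
  fi
}

# inject: add the seed URLs to the crawldb
run "$NUTCH" inject "$CRAWL_DIR/crawldb" "$SEED_DIR"

# generate: create a new segment containing a fetch list
run "$NUTCH" generate "$CRAWL_DIR/crawldb" "$CRAWL_DIR/segments"

# Pick the newest segment (a placeholder name is used in dry-run mode).
if [ "$DRY_RUN" = "0" ]; then
  SEGMENT=$(ls -d "$CRAWL_DIR"/segments/2* | tail -1)
else
  SEGMENT="$CRAWL_DIR/segments/<timestamp>"
fi

# fetch, parse, then fold the results back into the crawldb
run "$NUTCH" fetch "$SEGMENT"
run "$NUTCH" parse "$SEGMENT"
run "$NUTCH" updatedb "$CRAWL_DIR/crawldb" "$SEGMENT"
```

Repeating the generate, fetch, parse and updatedb steps corresponds to one additional round.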
Usage: crawl [options] <crawl_dir> <num_rounds>

Arguments:
    <crawl_dir>     Directory where the crawl/host/link/segments dirs are saved
    <num_rounds>    The number of rounds to run this crawl for

Options:
    -i|--index
        Indexes crawl results into a configured indexer
    -D
        A Java property to pass to Nutch calls
    -w|--wait <NUMBER[SUFFIX]>
        Time to wait before generating a new segment when no URLs are scheduled for fetching. Suffix can be: s for second, m for minute, h for hour and d for day. If no suffix is specified second is used by default. [default: -1]
    -s <seed_dir>
        Path to seeds file(s)
    -sm <sitemap_dir>
        Path to sitemap URL file(s)
    --hostdbupdate
        Boolean flag showing if we either update or not update hostdb for each round
    --hostdbgenerate
        Boolean flag showing if we use hostdb in generate or not
    --num-slaves <num_slaves>
        Number of slave nodes [default: 1]
        Note: This can only be set when running in distribution mode
    --num-tasks <num_tasks>
        Number of reducer tasks [default: 2]
    --size-fetchlist <size_fetchlist>
        Number of URLs to fetch in one iteration [default: 50000]
    --time-limit-fetch <time_limit_fetch>
        Number of minutes allocated to the fetching [default: 180]
    --num-threads <num_threads>
        Number of threads for fetching / sitemap processing [default: 50]
    --sitemaps-from-hostdb <frequency>
        Whether and how often to process sitemaps based on HostDB. Supported values are:
          - never [default]
          - always (processing takes place in every iteration)
          - once (processing only takes place in the first iteration)
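As an illustration of the usage above, the following sketch assembles one possible invocation. The install path /opt/nutch, the directory names, and the option values are assumptions chosen for the example, not recommended settings. The command is only printed unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch only: paths and values below are assumptions, not Nutch defaults.
NUTCH_HOME="${NUTCH_HOME:-/opt/nutch}"  # assumed install location
SEED_DIR="urls"     # assumed directory holding the seed URL file(s)
CRAWL_DIR="crawl"   # crawl/host/link/segments dirs are saved here
ROUNDS=2            # number of rounds to run

# Options come first, followed by <crawl_dir> and <num_rounds>.
# -i only makes sense when an indexer has been configured.
set -- "$NUTCH_HOME/bin/crawl" \
  -i \
  -s "$SEED_DIR" \
  --num-threads 20 \
  --size-fetchlist 1000 \
  --wait 30s \
  "$CRAWL_DIR" "$ROUNDS"

# Print the command for review; set DRY_RUN=0 to actually execute it.
echo "$*"
if [ "${DRY_RUN:-1}" = "0" ]; then
  "$@"
fi
```

Here each of the two rounds fetches at most 1000 URLs with 20 threads, and the script waits 30 seconds before generating a new segment when no URLs are scheduled for fetching.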
Need assistance?
Please message us on the user mailing list if you run into any issues.