Description

The bin/crawl script gives you more control over a crawl. It runs the crawl as a sequence of individual steps (inject->generate->fetch->parse->updatedb).
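
The step sequence above can be sketched as a dry run. This is a simplified illustration, not the actual contents of bin/crawl: the function name print_crawl_plan, the directory names crawl and urls, and the SEGMENT_n placeholders are assumptions made for this example, and the real script passes many more options to each step. It only prints the bin/nutch commands; it does not execute them.

```shell
# Dry-run sketch: print the Nutch commands for a crawl instead of running them.
# print_crawl_plan and the SEGMENT_n names are hypothetical; a real segment is
# a timestamped directory created by the generate step.
print_crawl_plan() (
  crawl_dir="$1"
  seed_dir="$2"
  rounds="$3"

  # Seed URLs are injected into the CrawlDb before the first round.
  echo "bin/nutch inject $crawl_dir/crawldb $seed_dir"

  round=1
  while [ "$round" -le "$rounds" ]; do
    segment="$crawl_dir/segments/SEGMENT_$round"
    echo "bin/nutch generate $crawl_dir/crawldb $crawl_dir/segments"
    echo "bin/nutch fetch $segment"
    echo "bin/nutch parse $segment"
    echo "bin/nutch updatedb $crawl_dir/crawldb $segment"
    round=$((round + 1))
  done
)

print_crawl_plan crawl urls 2
```

Each round repeats generate, fetch, parse and updatedb, so the CrawlDb updated at the end of one round feeds the generate step of the next.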

Usage

Nutch 1.x

Usage: crawl [options] <crawl_dir> <num_rounds>

Arguments:
  <crawl_dir>                           Directory where the crawl/host/link/segments dirs are saved
  <num_rounds>                          The number of rounds to run this crawl for

Options:
  -i|--index                            Indexes crawl results into a configured indexer
  -D                                    A Java property to pass to Nutch calls
  -w|--wait <NUMBER[SUFFIX]>            Time to wait before generating a new segment when no URLs
                                        are scheduled for fetching. Suffix can be: s for second,
                                        m for minute, h for hour and d for day. If no suffix is
                                        specified second is used by default. [default: -1]
  -s <seed_dir>                         Path to seeds file(s)
  -sm <sitemap_dir>                     Path to sitemap URL file(s)
  --hostdbupdate                        Boolean flag showing if we either update or not update hostdb for each round
  --hostdbgenerate                      Boolean flag showing if we use hostdb in generate or not
  --num-slaves <num_slaves>             Number of slave nodes [default: 1]
                                        Note: This can only be set when running in distribution mode
  --num-tasks <num_tasks>               Number of reducer tasks [default: 2]
  --size-fetchlist <size_fetchlist>     Number of URLs to fetch in one iteration [default: 50000]
  --time-limit-fetch <time_limit_fetch> Number of minutes allocated to the fetching [default: 180]
  --num-threads <num_threads>           Number of threads for fetching / sitemap processing [default: 50]
  --sitemaps-from-hostdb <frequency>    Whether and how often to process sitemaps based on HostDB.
                                        Supported values are:
                                          - never [default]
                                          - always (processing takes place in every iteration)
                                          - once (processing only takes place in the first iteration)
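
The NUMBER[SUFFIX] format accepted by -w|--wait can be illustrated with a small conversion function. This is a sketch of the documented semantics (s, m, h, d, with seconds as the default), not the code bin/crawl itself uses; the function name to_seconds is an assumption made for this example.

```shell
# Convert NUMBER[SUFFIX] to seconds, following the --wait description:
# s = second, m = minute, h = hour, d = day, no suffix = seconds.
# to_seconds is a hypothetical helper, not part of bin/crawl.
to_seconds() (
  value="$1"
  case "$value" in
    *s) factor=1;     number="${value%s}" ;;
    *m) factor=60;    number="${value%m}" ;;
    *h) factor=3600;  number="${value%h}" ;;
    *d) factor=86400; number="${value%d}" ;;
    *)  factor=1;     number="$value" ;;
  esac

  # Reject anything that is not an optionally negative integer.
  case "$number" in
    ''|-|*[!0-9-]*|?*-*)
      echo "invalid wait time: $value" >&2
      exit 1
      ;;
  esac

  echo $(( $number * $factor ))
)

to_seconds 5m
to_seconds 2h
to_seconds -1
```

With these semantics, an invocation such as "bin/crawl -i -s urls/ -w 5m crawl/ 2" (directory names are illustrative) would index the results, read seeds from urls/, wait five minutes when no URLs are scheduled for fetching, and run two rounds.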

Nutch 2.x

Need Assistance?

Please message us on the user mailing list if you find any issues.

bin/crawl (last edited 2018-08-15 13:38:57 by SebastianNagel)