Differences between revisions 6 and 7
Revision 6 as of 2006-03-05 00:03:34
Size: 1131
Editor: JeffRitchie
Comment:
Revision 7 as of 2009-09-20 23:09:32
Size: 1131
Editor: localhost
Comment: converted to 1.6 markup

"crawl" is an alias for "org.apache.nutch.crawl.Crawl"

Performs a complete crawl and index run, given a set of root URLs.

Usage

  • nutch-0.8-dev/bin/nutch org.apache.nutch.crawl.Crawl <urlDir> [-dir d] [-threads n] [-depth i] [-topN num]

    • <urlDir>: Directory containing text files that list the root URLs. The directory must already exist.
    • [-dir <d>]: Directory where Nutch saves the crawl files. Default value: ./crawl-[date], where [date] is the current date.
    • [-threads <n>]: Number of fetcher threads to use. Overrides the configuration key fetcher.threads.fetch. Default value: 10
    • [-depth <i>]: Number of crawl iterations to run. Default value: 5
    • [-topN <num>]: Limits each iteration to the top <num> links. Default value: Integer.MAX_VALUE
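
The options above can be combined as in the following minimal sketch. The directory name urls, the seed file name seeds.txt, the seed URL, the output directory crawl-example, and the chosen option values are all assumptions made for illustration; only the command path, class name, and option names come from this page. The script prepares the URL directory and prints the command rather than running it.

```shell
# Hypothetical locations; adjust to your own layout.
URL_DIR="urls"             # must exist before the crawl starts
CRAWL_DIR="crawl-example"  # passed to -dir instead of relying on the default

# Create the URL directory and a seed list with one root URL per line.
mkdir -p "$URL_DIR"
printf '%s\n' 'http://example.org/' > "$URL_DIR/seeds.txt"

# Assemble the crawl command: 10 fetcher threads, 3 iterations,
# at most 1000 links per iteration (illustrative values).
CRAWL_CMD="nutch-0.8-dev/bin/nutch org.apache.nutch.crawl.Crawl $URL_DIR -dir $CRAWL_DIR -threads 10 -depth 3 -topN 1000"

# Print the command so it can be reviewed before it is run.
echo "$CRAWL_CMD"
```

To start the crawl, run the printed command from the directory that contains nutch-0.8-dev.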

Configuration Files

  • hadoop-default.xml
  • hadoop-site.xml
  • nutch-default.xml
  • nutch-site.xml
  • crawl-tool.xml

Other Files

  • crawl-urlfilter.txt

Caveats and Notes

  • None.

DevelopmentCommandLineOptions

nutch-0.8-dev/bin/nutch_crawl (last edited 2009-09-20 23:09:32 by localhost)