Differences between revisions 2 and 3
Revision 2 as of 2006-03-05 02:38:52
Size: 1364
Editor: JeffRitchie
Comment:
Revision 3 as of 2009-09-20 23:09:34
Size: 1364
Editor: localhost
Comment: converted to 1.6 markup

"fetch" is an alias for "org.apache.nutch.fetcher.Fetcher"

Runs the Fetcher on a segment.

Usage

  • nutch-0.8-dev/bin/nutch org.apache.nutch.fetcher.Fetcher <segment> [-threads <n>] [-noParsing]

    • <segment>: Path to the segment to fetch.
      [-threads <n>]: The number of fetcher threads to run. Default: 10, as set by the configuration key fetcher.threads.fetch.
      [-noParsing]: Disables automatic parsing of the segment's data. See nutch-0.8-dev/bin/nutch_parse
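A minimal sketch of an invocation using the "fetch" alias with both optional arguments. The segment path and the thread count are hypothetical values chosen for illustration; substitute a segment that exists in your own crawl directory. The script only assembles and prints the command so that it can be reviewed before it is run.

```shell
#!/bin/sh
# Hypothetical segment path (assumption): replace with a real segment directory.
SEGMENT="crawl/segments/20060305023852"
# Hypothetical thread count (assumption): overrides the fetcher.threads.fetch default.
THREADS=20

# "fetch" is the alias for org.apache.nutch.fetcher.Fetcher.
# -noParsing skips automatic parsing, so the segment must be parsed separately.
FETCH_CMD="nutch-0.8-dev/bin/nutch fetch $SEGMENT -threads $THREADS -noParsing"

# Print the assembled command line for review.
echo "$FETCH_CMD"
```

Once the printed command looks right, paste it into a shell from the directory that contains nutch-0.8-dev.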

Configuration Files

  • hadoop-default.xml
    hadoop-site.xml
    nutch-default.xml
    nutch-site.xml

Other Files

  • None.

Caveats and Notes

  • The Fetcher depends upon several plugins to fetch various protocols. Current protocols and the plugins supporting them are as follows:

    http:

    • protocol-http
      protocol-httpclient

    https:

    • protocol-httpclient

    ftp:

    • protocol-ftp

    file:

    • protocol-file

    When fetching documents from the Internet, do not use protocol-file; it is intended only for fetching files local to the system the Fetcher is running on. To fetch both http and https URLs, protocol-httpclient alone is sufficient.
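The plugin advice above can be checked with a small script. This is only a sketch: it assumes the enabled plugins are listed in a single value whose alternatives are separated by "|" (as in a plugin.includes-style setting in nutch-site.xml), and the sample value and the has_plugin helper are hypothetical.

```shell
#!/bin/sh
# Assumed sample plugin list (hypothetical); a real value would come from nutch-site.xml.
PLUGIN_INCLUDES='protocol-httpclient|urlfilter-regex|parse-(text|html)|index-basic'

# has_plugin NAME: succeeds if NAME appears as a whole "|"-separated alternative.
# -x matches the entire line and -F treats NAME as a literal string, so
# "protocol-http" does not match "protocol-httpclient".
has_plugin() {
  printf '%s\n' "$PLUGIN_INCLUDES" | tr '|' '\n' | grep -qxF "$1"
}

# protocol-file is meant for local files only, so flag it for Internet crawls.
if has_plugin protocol-file; then
  echo "warning: protocol-file is enabled"
fi

# protocol-httpclient handles both http and https.
if has_plugin protocol-httpclient; then
  echo "http and https are covered by protocol-httpclient"
fi
```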

DevelopmentCommandLineOptions

nutch-0.8-dev/bin/nutch_fetch (last edited 2009-09-20 23:09:34 by localhost)