"fetch" is an alias for "org.apache.nutch.fetcher.Fetcher"
Runs the Fetcher on a segment.
Usage
nutch-0.8-dev/bin/nutch org.apache.nutch.fetcher.Fetcher <segment> [-threads <n>] [-noParsing]
<segment>: Path to the segment to fetch.
[-threads <n>]: The number of fetcher threads to run. Default: the value of the configuration key fetcher.threads.fetch, which is 10.
[-noParsing]: Disables automatic parsing of the segment's data. See nutch-0.8-dev/bin/nutch_parse
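To illustrate the usage above, the following sketch assembles a fetch command line from its parts and prints it instead of running it. The helper name build_fetch_cmd and the segment path are hypothetical and only serve the example; the short form "fetch" relies on the alias noted at the top of this page.

```shell
#!/bin/sh
# Sketch: build (but do not run) a fetch command line.
# build_fetch_cmd is a hypothetical helper, not part of Nutch.
build_fetch_cmd() {
    segment="$1"      # path to the segment (required)
    threads="$2"      # number of fetcher threads; empty keeps the configured default
    no_parsing="$3"   # "yes" appends -noParsing

    cmd="nutch-0.8-dev/bin/nutch fetch $segment"
    if [ -n "$threads" ]; then
        cmd="$cmd -threads $threads"
    fi
    if [ "$no_parsing" = "yes" ]; then
        cmd="$cmd -noParsing"
    fi
    printf '%s\n' "$cmd"
}

# Example with an assumed segment path; prints the command for inspection.
build_fetch_cmd "crawl/segments/20060101000000" 20 yes
```

Printing the command first makes it easy to check the option order before starting a long-running fetch.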
Configuration Files
hadoop-default.xml
hadoop-site.xml
nutch-default.xml
nutch-site.xml
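The thread default mentioned under Usage comes from fetcher.threads.fetch, and site-specific values normally go in nutch-site.xml rather than nutch-default.xml. Below is a minimal sketch of such an override, assuming the Hadoop-style <configuration> layout used by Nutch 0.8. The helper name write_fetcher_override and the scratch directory are hypothetical; the file is written to a directory you pass in, not to a real conf directory.

```shell
#!/bin/sh
# Sketch: write a site-level override for the fetcher thread count.
# write_fetcher_override is a hypothetical helper, not part of Nutch.
write_fetcher_override() {
    conf_dir="$1"   # directory to write nutch-site.xml into
    threads="$2"    # desired value for fetcher.threads.fetch

    mkdir -p "$conf_dir" || return 1
    cat > "$conf_dir/nutch-site.xml" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fetcher.threads.fetch</name>
    <value>$threads</value>
  </property>
</configuration>
EOF
}

# Example: write an override of 20 threads into a scratch directory (assumed path).
scratch_dir="${TMPDIR:-/tmp}/nutch-conf-example"
write_fetcher_override "$scratch_dir" 20
cat "$scratch_dir/nutch-site.xml"
```

A -threads value given on the command line still applies to that single run, so the override only changes what is used when the option is omitted.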
Other Files
- None.
Caveats and Notes
The Fetcher relies on protocol plugins to retrieve content, one or more per protocol. The current protocols and the plugins that support them are as follows:
- http: protocol-http, protocol-httpclient
- https: protocol-httpclient
- ftp: protocol-ftp
- file: protocol-file
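A protocol plugin is only used if it is enabled. The sketch below assumes plugins are enabled through the plugin.includes property, whose value is a regular expression that must match the whole plugin id. The helper name plugin_enabled and the sample expression are hypothetical, and grep -E only approximates the Java regular expression syntax Nutch itself uses.

```shell
#!/bin/sh
# Sketch: check whether a plugin id is matched by a plugin.includes-style regex.
# plugin_enabled is a hypothetical helper, not part of Nutch.
plugin_enabled() {
    includes="$1"   # regex in the style of a plugin.includes value (assumed)
    plugin="$2"     # plugin id, e.g. protocol-ftp

    # Anchor the expression so that it has to match the whole plugin id.
    printf '%s\n' "$plugin" | grep -Eq "^($includes)\$"
}

# Assumed sample value, not taken from any shipped configuration.
includes='protocol-(http|ftp)|urlfilter-regex|parse-(text|html)'

for p in protocol-http protocol-httpclient protocol-ftp protocol-file; do
    if plugin_enabled "$includes" "$p"; then
        echo "$p: enabled"
    else
        echo "$p: not enabled"
    fi
done
```

With the sample expression, protocol-http and protocol-ftp are reported as enabled, while protocol-httpclient and protocol-file are not, because the expression has to match the entire id.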