bin/nutch fetch

The fetcher logs to stderr with fetcher output codes.

called java class

net.nutch.fetcher.RequestScheduler

command line options

bin/nutch fetch [-verbose] <dir>

-verbose

config file options

http.agent.name

Our HTTP 'User-Agent' request header.

http.robots.agents

The agent strings we'll look for in robots.txt files, comma-separated, in decreasing order of precedence.

http.agent.description

Further description of our bot- this text is used in the User-Agent header. It appears in parenthesis after the agent name.

http.agent.url

A URL to advertise in the User-Agent header. This will appear in parenthesis after the agent name.

http.agent.email

An email address to advertise in the HTTP 'From' request header and User-Agent header.

http.agent.version

A version string to advertise in the User-Agent header.

http.timeout

The default network timeout, in milliseconds.

http.content.limit

The default length limit for downloaded content, in bytes. Content longer than this is truncated.

http.version.1.1

If true, the fetcher will attempt to use HTTP version 1.1 and gzip encoding.

fetcher.server.delay

The number of seconds the fetcher will delay between successive requests to the same server.

fetcher.threads.fetch

The number of FetcherThreads the fetcher should use. This is also determines the maximum number of requests that are made at once (each FetcherThread handles one connection).

fetcher.threads.output

The number of OutputThreads to use. When adjusting this, remember that each thread could be holding a raw page, it's DOM structure, plaintext, and extracted links in memory.

fetcher.stats.minutes

Controls how often the fetcher will dump progress statistics to the logs, in minutes.

fetcher.request.queue

The maximum number of unfetched requests to queue in memory.

fetcher.output.queue

The maximum number of completed (but unwritten) requests to queue in memory before throttling the fetcher.

fetcher.active.servers

The maximum number of distinct servers that may be referenced by queued requests.

fetcher.robots.cache

The minimum number of robots.txt files to cache for inactive servers.

fetcher.server.maxurls

The maximum number of URLs that may be queued at once for a single host.

fetcher.lowservers.threshold

When there are fewer than this many servers in the fetcher's active queues, each server's queue of URLs will be pruned to fetcher.lowservers.maxurls.

fetcher.lowservers.maxurls

See description of fetcher.lowservers.threshold.

fetcher.retry.max

The maximum number of times the fetcher will attempt to get a page that has encountered recoverable errors.

fetcher.redirect.max

The maximum number of redirects the fetcher will follow when trying to fetch a page.

fetcher.host.consecutive.failures

The maximum number of consecutive failures, excluding 404 errors, to allow on a given server before declaring it dead (note: each failure will have had up to fetcher.retry.max retries).

fetcher.host.max.failerr.rate

The maximum fetch error rate, excluding 404s, to allow for a given server before declaring it dead. Note: errors include transient issues, and multiple retries contribute to the score (so, getting the first page on the 3rd try gives you a .66 "failerr.rate").

fetcher.host.min.requests.rate

A threshold on the minimum number of requests we issue to a host before applying fetcher.host.max.failerr.rate. At least this many requests will be issued before declaring a host dead due to error rate. Note: this setting does not affect fetcher.host.consecutive.failures!

excludehosts.suffix.file

Filename which contains list of hostnames we shouldn't fetch from.

fetcher.trace.longmsg

Whether to use "long messages" is the trace portion of the logged output (if set to false, terse messages will be used).

fetcher.trace.success

Whether to log successful fetches in the trace log.

fetcher.trace.not.found

Whether to log 404/Not Found errors in the trace log.

fetcher.throttle.period

How often throttling behavior should be readjusted based on current bandwidth usage, measured in seconds. Set to -1 to disable throttling.

fetcher.throttle.bandwidth

The desired amount of bandwidth the fetcher should use (aside from DNS and TCP overhead), in kbits/s. Set to -1 to disable throttling. Note: This is not a cap, this is a target for bandwidth usage over time.

fetcher.throttle.initial.threads

The number of threads that should be active initially.

MatthiasJaekle - 13 Mar 2004

  • No labels