Protocol implementations allow Nutch to fetch documents over different protocols (ftp, http, file, etc.). They are implemented as plugins, which allows users

  • to activate only the required protocol implementations, e.g. to block file:// access by simply keeping protocol-file deactivated
  • to choose between alternative implementations of the same protocol scheme: Nutch has multiple implementations of the http/https protocol scheme, each plugin focusing on different features
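
As an illustration of the first point, protocol plugins are activated through the plugin.includes property, a regular expression matched against plugin names. The fragment below is a minimal sketch for nutch-site.xml (placed inside the configuration element). The protocol part comes from the plugin names on this page; the remaining plugins in the value are only an assumed example set and should be adapted to the actual setup.

```xml
<!-- Sketch: activate protocol-okhttp as the only protocol plugin.
     protocol-file is not listed, so file:// URLs cannot be fetched.
     The non-protocol plugins in the value are illustrative assumptions. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-okhttp|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```

Switching to another HTTP/HTTPS implementation would then amount to replacing protocol-okhttp in the value with protocol-http or protocol-httpclient.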

HTTP/HTTPS protocol plugins

protocol-http

Simple (no third-party dependencies) but error-tolerant HTTP/HTTPS protocol implementation (HTTP 1.0 and 1.1).

protocol-httpclient

HTTP/HTTPS protocol based on Apache HttpClient, optionally with Basic, Digest and NTLM authentication schemes, form/post authentication and support for proxy servers. See HttpAuthenticationSchemes and HttpPostAuthentication.

protocol-okhttp

HTTP/HTTPS protocol based on OkHttp, which supports

  • HTTP/1.1 or HTTP/2 (property http.useHttp2)
  • proxy servers
  • efficient connection reuse through a configurable connection pool (NUTCH-2896 and PR#697)
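
The HTTP/2 option mentioned above could be switched on with a fragment like the following in nutch-site.xml. This is a sketch that assumes http.useHttp2 takes a boolean value and that protocol-okhttp is the active HTTP/HTTPS plugin; the description of the property in nutch-default.xml is the authoritative reference for its default and exact behavior.

```xml
<!-- Sketch: ask protocol-okhttp to use HTTP/2 where the server supports it.
     The boolean value is an assumption; check nutch-default.xml. -->
<property>
  <name>http.useHttp2</name>
  <value>true</value>
</property>
```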

Browser-based HTTP/HTTPS protocol plugins

Nutch provides a couple of protocol plugins that fetch content indirectly, through an intermediate web browser controlled via the Selenium browser automation library.

protocol-selenium

See README.

protocol-interactiveselenium

See README.

protocol-htmlunit




file:// access – protocol-file


ftp:// access – protocol-ftp


Samba – protocol-smb

(under development, see NUTCH-2856)
