Protocol implementations allow Nutch to fetch documents over different protocols (ftp, http, file, etc.). They are implemented as plugins, which allows users

HTTP/HTTPS protocol plugins

protocol-http

Simple (no third-party dependencies) but error-tolerant HTTP/HTTPS protocol implementation (HTTP 1.0 and 1.1).
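
As a minimal sketch of how a protocol plugin is typically activated, the fragment below overrides the plugin.includes property in conf/nutch-site.xml. Only the protocol-http entry relates to this section. The remaining plugin ids in the value are illustrative assumptions and should be taken from your own installation's nutch-default.xml.

```xml
<!-- conf/nutch-site.xml: site-specific overrides of nutch-default.xml -->
<configuration>
  <property>
    <name>plugin.includes</name>
    <!-- Regular expression matched against plugin ids.
         protocol-http selects the plugin described above; the other ids
         are placeholders for whatever your crawl setup already uses. -->
    <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>
</configuration>
```

Replacing protocol-http in the value with the id of another plugin listed on this page (for example protocol-httpclient or protocol-okhttp) should select that implementation instead. Enabling several plugins for the same URL scheme at once is best avoided.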

protocol-httpclient

HTTP/HTTPS protocol implementation based on Apache HttpClient, with optional support for the Basic, Digest and NTLM authentication schemes, form-based (POST) authentication, and proxy servers. See HttpAuthenticationSchemes and HttpPostAuthentication.

protocol-okhttp

HTTP/HTTPS protocol based on OkHttp, supports

Browser-based HTTP/HTTPS protocol plugins

Nutch provides several protocol plugins that fetch content indirectly, through an intermediate web browser controlled via the Selenium browser automation library.

protocol-selenium

See README.

protocol-interactiveselenium

See README.

protocol-htmlunit




file:// access – protocol-file


ftp:// access – protocol-ftp


Samba – protocol-smb

(under development, see NUTCH-2856)