Welcome to the Apache Nutch Wiki

Please contribute your knowledge about Nutch here!

Or browse the open issues, open a new Jira ticket, or check the Nutch source code on git.

What is Apache Nutch?

Apache Nutch is a highly extensible and scalable open source web crawler software project. Stemming from Apache Lucene, the project comprises two codebases, namely:

Nutch 1.x (ACTIVE): A mature, production-ready crawler. 1.x enables fine-grained configuration and relies on Apache Hadoop data structures, which are well suited to batch processing.


Nutch 2.x (INACTIVE): An alternative taking direct inspiration from 1.x, but which differs in one key area: storage is abstracted away from any specific underlying data store by using Apache Gora to handle object-to-persistent mappings. This makes it possible to implement an extremely flexible model/stack for storing everything (fetch time, status, content, parsed text, outlinks, inlinks, etc.) in a number of NoSQL storage solutions. No more releases or bug fixes are anticipated for this codebase.


Being pluggable and modular has its benefits: Nutch provides extensible interfaces such as Parse, Index, and ScoringFilter for custom implementations, e.g. Apache Tika for parsing. Additionally, pluggable indexing exists for Apache Solr, Elasticsearch, etc.
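
To give a feel for what a pluggable extension point looks like, here is a minimal, self-contained Java sketch of a chain of scoring plugins. It is only an illustration of the general pattern: the names PageScorer, DepthPenaltyScorer, HostBoostScorer and ScorerChain are hypothetical and are not part of the Nutch API, and the real Nutch interfaces and plugin registration differ. Consult the Nutch source code for the actual extension points.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical extension point for illustration only; this is NOT Nutch's real ScoringFilter.
interface PageScorer {
    double adjust(String url, double score);
}

// Example plugin: divides the score by (1 + number of '/' characters after the scheme).
class DepthPenaltyScorer implements PageScorer {
    @Override
    public double adjust(String url, double score) {
        int schemeEnd = url.indexOf("://");
        String rest = schemeEnd >= 0 ? url.substring(schemeEnd + 3) : url;
        long depth = rest.chars().filter(c -> c == '/').count();
        return score / (1.0 + depth);
    }
}

// Example plugin: multiplies the score for URLs on one configured host.
class HostBoostScorer implements PageScorer {
    private final String host;
    private final double factor;

    HostBoostScorer(String host, double factor) {
        this.host = host;
        this.factor = factor;
    }

    @Override
    public double adjust(String url, double score) {
        return url.contains("://" + host + "/") ? score * factor : score;
    }
}

// The "core" only knows the interface and applies registered plugins in order.
class ScorerChain {
    private final List<PageScorer> scorers = new ArrayList<>();

    ScorerChain register(PageScorer scorer) {
        scorers.add(scorer);
        return this;
    }

    double score(String url, double initialScore) {
        double current = initialScore;
        for (PageScorer scorer : scorers) {
            current = scorer.adjust(url, current);
        }
        return current;
    }
}

class PluginChainDemo {
    public static void main(String[] args) {
        ScorerChain chain = new ScorerChain()
                .register(new DepthPenaltyScorer())
                .register(new HostBoostScorer("nutch.apache.org", 2.0));
        System.out.println(chain.score("https://nutch.apache.org/docs/index.html", 1.0));
    }
}
```

The point of the design is that the chain depends only on the interface, so a new behaviour can be added by registering another implementation without touching the core loop.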

Nutch can run on a single machine, but gains much of its strength from running in a Hadoop cluster.

You can download Nutch here.

Nutch is a project of the Apache Software Foundation and is part of the larger Apache community of developers and users.

Nutch Version Administration

Tutorials

Nutch 1.X tutorial(s)

Other Tutorial(s)

Configuration

General Information

Nutch Development

Archive and Old Nutch Versions

How to edit this Wiki

This Wiki is a collaborative site: anyone can contribute and share. To help avoid spam, the Nutch wiki is only editable by known accounts. If you would like to help out with the Nutch wiki, add a new page, or work on an existing one, please first create a wiki account by clicking "Sign Up", or click "Log in" if you already have an account.