Differences between revisions 279 and 280
Revision 279 as of 2014-06-20 02:43:28
Size: 8418
Comment:
Revision 280 as of 2014-06-25 05:13:19
Size: 8518
Comment:
Line 90:
 Removed: * [[Website_Update_HOWTO]]
 Added: * [[CMS_Website_Update_HOWTO]] - How to edit the Nutch website based on the [[http://www.apache.org/dev/cms.html|Apache CMS]].

Welcome to the Apache Nutch Wiki


Please contribute your knowledge about Nutch here!

If you would like to update any content, add your own content, or see something added, please browse the Documentation issues and open a Jira ticket (tagging it with the Documentation label).

What is Apache Nutch?

Apache Nutch is a highly extensible and scalable open source web crawler software project. Stemming from Apache Lucene, the project has diversified and now comprises two codebases:

  • Nutch 1.x: A mature, production-ready crawler. 1.x enables fine-grained configuration and relies on Apache Hadoop data structures, which are well suited to batch processing.

  • Nutch 2.x: An emerging alternative that takes direct inspiration from 1.x but differs in one key area: storage is abstracted away from any specific underlying data store by using Apache Gora to handle object-to-persistent-store mappings. This allows an extremely flexible model/stack for storing everything (fetch time, status, content, parsed text, outlinks, inlinks, etc.) in a number of NoSQL storage solutions.

Being pluggable and modular has its benefits: Nutch provides extensible interfaces such as Parse, Index and ScoringFilter for custom implementations, e.g. Apache Tika for parsing. Additionally, pluggable indexing exists for Apache Solr, Elasticsearch, etc.

Nutch can run on a single machine, but it gains much of its strength from running in a Hadoop cluster.

You can download Nutch here.

For more information about Apache Nutch, please see the Nutch wiki.

Nutch is a project of the Apache Software Foundation and is part of the larger Apache community of developers and users.

Nutch Version Administration

Tutorials

Nutch 1.X tutorial(s)

  • NutchTutorial - How to configure Nutch to crawl in local mode and post to Apache Solr for search/index.

Nutch 2.X tutorial(s)

Other Tutorial(s)

Configuration

General Information

Nutch Development

Nutch 2.x

Pre Nutch 1.3 and Archive

How to edit this Wiki

This Wiki is a collaborative site; anyone can contribute and share:

  • Create an account by clicking the "Login" link at the top of any page, and picking a username and password.
  • Edit any page by pressing Edit at the top or the bottom of the page.

There are some conventions used on the Nutch wiki:

  • /!\ :TODO: /!\ is used to denote sections that definitely need to be cleaned up.

Some general info on using this Wiki Software:

FrontPage (last edited 2014-09-27 18:54:44 by LewisJohnMcgibbney)