Welcome to the Apache Nutch Wiki
Please contribute your knowledge about Nutch here!
Nutch Version Administration
Current CommandLineOptions: Command line options for 1.X and 2.X
JavaDocs -- The JavaDocs for the most recent Nutch-1.X release.
JavaDocs -- The JavaDocs for the most recent Nutch-2.X release.
Nutch 1.X tutorial(s)
NutchTutorial - How to configure Nutch to crawl in local mode and post to Apache Solr for search/index.
Nutch 2.X tutorial(s)
Nutch2Tutorial -- How to get Nutch 2.X to use HBase as the persistence layer for Gora
Setting up Nutch 2.0 with MySQL to handle UTF-8 - A step-by-step tutorial
Accumulo, Nutch, and Gora - A step-by-step tutorial
Setting up Nutch 2.x with Cassandra - How to set up and run Nutch 2.x using Cassandra as storage.
Hadoop Tutorial - Because Nutch is built on Hadoop, it helps to have a good understanding of Hadoop.
Nutch Hadoop Tutorial - How to set up and run Nutch in deploy mode over a Hadoop cluster.
Running Nutch in (pseudo) distributed mode - How to set up and run Nutch in Hadoop pseudo-distributed mode.
RunNutchInEclipse - How to configure, build, crawl and debug Nutch within Eclipse
Intranet Document Search - Index and search Microsoft Office, PDF etc. documents in a file system hierarchy with a Solr backend.
Recrawling with Nutch - How to re-crawl with Nutch.
Ajax-Solr Tutorial: Nutch - Quick and easy guide to getting a nice UI on top of your Nutch crawl data.
OverviewDeploymentConfigs - This page requires a complete update to reflect recent Nutch releases.
NutchConfigurationFiles: An overview from Nutch developers.
NutchPropertiesCompleteList: A fine-grained account of all Nutch configuration properties.
HttpAuthenticationSchemes - How to enable Nutch to authenticate itself using NTLM, Basic or Digest authentication schemes.
NonDefaultIntranetCrawlingOptions - Desirable options to add to your Nutch intranet crawling configuration.
OptimizingCrawls - How to optimise your crawling/fetching speed with Nutch.
ErrorMessages -- What they mean and suggestions for getting rid of them. This page requires extensive updating to reflect recent Nutch releases; in addition, the legacy indexing and searching material should be archived.
SetupProxyForNutch - using Tinyproxy on Ubuntu
IndexStructure - This page needs a slight update to provide more information on plugins and the data they send to Solr for indexing.
Features - :TODO: This needs to be completely overhauled to reflect recent Nutch features.
Current Nutch Gotchas
PublicServers running Nutch
Presentations on Nutch
Evaluations of Search Quality
Commercial Support & developers for hire
AcademicArticles that deal with Nutch
Becoming a Nutch Developer - Start developing and contributing to Nutch.
PluginCentral -- How to write your own plugins and use other people's.
InternalDocumentation -- How Nutch works.
FixingOpicScoring - In planning.
TaskList -- Tasks for Nutch developers. A major update is required.
Committer's_Rules -- Guidelines committers should follow when deciding which branch to commit patches to and when to commit.
NutchMeetUps - Records of previous Nutch community meetups, hackathons, barcamps, etc.
Nutch2Crawling - A description of the crawling jobs and field-to-database mappings.
Nutch2Architecture - A high-level overview of the new architecture and design
Nutch2Roadmap -- Discussions on the architecture and features of Nutch 2.0
NewScoring -- New, stable PageRank-like webgraph and link-analysis jobs.
NewScoringIndexingExample -- Two full fetch cycles of commands using new scoring and indexing systems.
Build Nutch 2.0 in Eclipse -- How to set up your IDE environment comfortably.
ErrorMessagesInNutch2 -- What they mean and suggestions for getting rid of them.
NutchConfigurationFiles-2.x -- Configuration files that are specific to Nutch-2.x
Pre Nutch 1.3 and Archive
How to edit this Wiki
This Wiki is a collaborative site; anyone can contribute and share:
- Create an account by clicking the "Login" link at the top of any page, and picking a username and password.
- Edit any page by pressing Edit at the top or the bottom of the page.
There are some conventions used on the Nutch wiki:
:TODO: (/!\ :TODO: /!\ ) is used to denote sections that definitely need to be cleaned up.
Some general info on using this Wiki Software: