TikaPlugin - Comments on the Tika integration and differences with existing parse plugins
Welcome to the Apache Nutch Wiki
Please contribute your knowledge about Nutch here!
Nutch Version 1.3 Administration
RunningNutchAndSolr - How to configure Nutch 1.3 to crawl in local mode and post to Apache Solr for search/index.
RunningNutchInDeployMode - How to configure Nutch 1.3 to crawl in deploy mode. :TODO:This tutorial is under construction.
Hadoop Tutorial - Because Nutch is based on Hadoop, it helps to have a good understanding of Hadoop.
OverviewDeploymentConfigs :This entire page requires a complete update to reflect the Nutch 1.3 release:
HttpAuthenticationSchemes - How to enable Nutch to authenticate itself using NTLM, Basic or Digest authentication schemes.
NonDefaultIntranetCrawlingOptions - Desirable options to add to your intranet crawling configuration. :This is configured for Nutch versions before 1.3; it requires an update, and the old page should be archived:
OptimizingCrawls - How to optimize your crawling/fetching speed with Nutch.
ErrorMessages -- What they mean and suggestions for getting rid of them. :This requires extensive updating to reflect Nutch 1.3. In addition, the legacy indexing and searching material should be archived. We also need to create a similar page for Nutch 2.0, as both the errors and the solutions required to fix them are different in nature:
SetupProxyForNutch - Using Tinyproxy on Ubuntu. :Requires slight updating to correct references and subheadings:
Features :TODO:This needs to be completely overhauled to reflect Nutch 1.3 features.
Current Nutch Gotchas
PublicServers running Nutch
Presentations on Nutch
Evaluations of Search Quality
Help_Wanted - Organizations hiring Nutch expertise
Commercial Support and developers for hire
AcademicArticles that deal with Nutch
FAQ :The Indexing and Searching sections require update/archive to reflect the new 1.3 release:
Becoming a Nutch Developer - Start developing and contributing to Nutch.
PluginCentral -- How to write your own plugins and use other people's. :This page requires a huge update to reflect plugins included in Nutch 1.3:
InternalDocumentation -- How Nutch works.
MultiLingualSupport - In development.
FixingOpicScoring - In planning.
TaskList -- Tasks for Nutch developers.
Committer's_Rules -- Committers should follow these guidelines when deciding which branch to use for committing patches and when to commit.
JavaDemoApplication - A simple demonstration of how to use the Nutch API in a Java application
ApacheConUs2009MeetUp - List of topics for MeetUp at ApacheCon US 2009 in Oakland (Nov 2-6)
Nutch2Roadmap -- Discussions on the architecture and features of Nutch 2.0
NewScoring -- New stable PageRank-like webgraph and link-analysis jobs.
NewScoringIndexingExample -- Two full fetch cycles of commands using new scoring and indexing systems.
GORA_HBase -- Configuring Nutch 2.0 with GORA and HBASE
Build Nutch 2.0 in Eclipse -- How to set up your IDE environment comfortably.
ErrorMessagesInNutch2.0 -- What they mean and suggestions for getting rid of them. :This page is under construction:
Pre Nutch 1.3 and Archive
How to edit this Wiki
This Wiki is a collaborative site; anyone can contribute and share:
- Create an account by clicking the "Login" link at the top of any page, and picking a username and password.
- Edit any page by pressing Edit at the top or the bottom of the page.
There are some conventions used on the Nutch wiki:
:TODO: (/!\ :TODO: /!\ ) is used to denote sections that definitely need to be cleaned up.
Some general info on using this Wiki Software: