Revision 267 as of 2013-03-28 17:34:30
Welcome to the Apache Nutch Wiki
Please contribute your knowledge about Nutch here!
Contents
Nutch Version Administration
Current CommandLineOptions: Command line options for 1.X and 2.X
JavaDocs -- The JavaDocs for the most recent Nutch-1.X release.
JavaDocs -- The JavaDocs for the most recent Nutch-2.X release.
Tutorials
Nutch 1.X tutorial(s)
NutchTutorial - How to configure Nutch to crawl in local mode and post to Apache Solr for search/index.
Nutch 2.X tutorial(s)
Nutch2Tutorial -- How to get Nutch 2.X to use HBase as persistence layer for Gora
Setting up Nutch 2.0 with MySQL to handle UTF-8 - A step-by-step tutorial
Accumulo, Nutch, and Gora - A step-by-step tutorial
Other Tutorial(s)
Hadoop Tutorial - Since Nutch is based on Hadoop, it helps to have a good understanding of Hadoop.
Nutch Hadoop Tutorial - How to set up and run Nutch in deploy mode over a Hadoop cluster.
RunNutchInEclipse - How to configure, build, crawl and debug Nutch within Eclipse
Intranet Document Search - Index and search Microsoft Office, PDF, and other documents in a file system hierarchy with a Solr backend.
Recrawling with Nutch - How to re-crawl with Nutch.
Ajax-Solr Tutorial: Nutch - Quick and easy guide to getting a nice UI on top of your Nutch crawl data.
Configuration
OverviewDeploymentConfigs
:This full page requires a complete update to reflect recent Nutch releases:
NutchConfigurationFiles: An overview from Nutch developers.
NutchPropertiesCompleteList: A fine-grained account of all Nutch configuration properties.
HttpAuthenticationSchemes - How to enable Nutch to authenticate itself using NTLM, Basic or Digest authentication schemes.
NonDefaultIntranetCrawlingOptions - Desirable options to add to your Nutch intranet crawling configuration.
OptimizingCrawls - How to optimise your crawling/fetching speed with Nutch.
ErrorMessages -- What they mean and suggestions for getting rid of them.
:This requires extensive updating to reflect recent Nutch releases. In addition, the legacy indexing and searching material should be archived:
SetupProxyForNutch - using Tinyproxy on Ubuntu
IndexStructure
:This page needs a slight update to provide more information on plugins and the data they send to Solr for indexing:
General Information
Features
:TODO: This needs to be completely overhauled to reflect recent Nutch features.
Current Nutch Gotchas
PublicServers running Nutch
Presentations on Nutch
Press Articles
Evaluations of Search Quality
Commercial Support & developers for hire
Mailing Lists
AcademicArticles that deal with Nutch
Nutch Development
Becoming a Nutch Developer - Start developing and contributing to Nutch.
PluginCentral -- How to write your own plugins and use other people's.
InternalDocumentation -- How Nutch works.
FixingOpicScoring - In planning.
TaskList -- Tasks for Nutch developers.
:Severe update required:
Committer's_Rules -- Guidelines committers should follow when deciding which branch to commit patches to and when to commit.
NutchMeetUps - Records of previous Nutch community meetups, hackathons, barcamps etc.
Nutch 2.x
Nutch2Crawling - A description of the crawling jobs and field to database mappings.
Nutch2Architecture - A high-level overview of the new architecture and design
Nutch2Roadmap -- Discussions on the architecture and features of Nutch 2.0
NewScoring -- New stable PageRank-like webgraph and link-analysis jobs.
NewScoringIndexingExample -- Two full fetch cycles of commands using new scoring and indexing systems.
Build Nutch 2.0 in Eclipse -- How to set up your IDE environment comfortably.
ErrorMessagesInNutch2 -- What they mean and suggestions for getting rid of them.
NutchConfigurationFiles-2.x -- Configuration files that are specific to Nutch-2.x
Understanding the columns/fields in Nutch 2.0 Webpage - Detailed article
Pre Nutch 1.3 and Archive
How to edit this Wiki
This Wiki is a collaborative site; anyone can contribute and share:
- Create an account by clicking the "Login" link at the top of any page, and picking a username and password.
- Edit any page by pressing Edit at the top or the bottom of the page.
There are some conventions used on the Nutch wiki:
:TODO:
(/!\ :TODO: /!\ ) is used to denote sections that definitely need to be cleaned up.
Some general info on using this Wiki Software:
Create a link to another page with joined capitalized words (like WikiSandBox) or with ["quoted words in brackets"]
See HelpForBeginners to get you going, and HelpContents for all help pages.
HelpOnMoinWikiSyntax: quick access to wiki syntax