Differences between revisions 1 and 260 (spanning 259 versions)
Revision 1 as of 2005-01-27 17:31:40
Size: 1408
Editor: gci
Comment:
Revision 260 as of 2013-03-21 00:04:03
Size: 6093
Editor: 128
Comment:
Line 1: Line 1:
##language:en
#pragma section-numbers off
= Welcome to the Apache Nutch Wiki =
{{http://www.interadvertising.co.uk/files/nutch_logo_medium.gif}}
Line 4: Line 4:
= Welcome to the Apache Nutch Wiki =
Please contribute your knowledge about Nutch here! <<TableOfContents(4)>>
Line 6: Line 6:
This wiki has just been set up as part of the [wiki:ApacheGeneral:FrontPage big Apache Wiki Farm]. It does not contain anything yet.
== Nutch Version Administration ==
 * DownloadingNutch
 * Current CommandLineOptions: Command line options for 1.X and 2.X
 * [[http://nutch.apache.org/apidocs-1.6/index.html|JavaDocs]] -- The !JavaDocs for the most recent Nutch-1.X release.
 * [[http://nutch.apache.org/apidocs-2.1/index.html|JavaDocs]] -- The !JavaDocs for the most recent Nutch-2.X release.
Line 8: Line 12:
InternalDocumentation on Nutch
=== Tutorials ===
Line 10: Line 14:
= 'Special' Wiki pages =
==== Nutch 1.X tutorial(s) ====
 * NutchTutorial - How to configure Nutch to crawl in local mode and post to Apache Solr for search/index.
Line 12: Line 17:
  '''TitleIndex'''
    A list of all pages on this wiki.
==== Nutch 2.X tutorial(s) ====
 * Nutch2Tutorial -- How to get Nutch 2.X to use HBase as persistence layer for Gora
 * [[http://nlp.solutions.asia/?p=180|Setting up Nutch 2.0 with MySQL to handle UTF-8]] - A step-by-step tutorial
 * [[http://www.covert.io/post/18414889381/accumulo-nutch-and-gora|Accumulo, Nutch, and Gora]] - A step-by-step tutorial
Line 15: Line 22:
  '''HelpContents'''
    A basic guide to the MoinMoin wiki (including information about wiki syntax).
==== Other Tutorial(s) ====
 * [[http://hadoop.apache.org/common/docs/stable/|Hadoop Tutorial]] - Because Nutch is built on Hadoop, it helps to have a good understanding of Hadoop.
 * [[NutchHadoopTutorial|Nutch Hadoop Tutorial]] - How to set up and run Nutch in deploy mode over a Hadoop cluster.
 * RunNutchInEclipse - How to configure, build, crawl and debug Nutch within Eclipse
 * [[IntranetDocumentSearch|Intranet Document Search]] - Index and search Microsoft Office, PDF etc. documents in a file system hierarchy with a Solr backend.
 * [[http://pascaldimassimo.com/2010/06/11/how-to-re-crawl-with-nutch/|Recrawling with Nutch]] - How to re-crawl with Nutch.
 * [[https://github.com/evolvingweb/ajax-solr/wiki/Tutorial%3A-Nutch|Ajax-Solr Tutorial: Nutch]] - Quick and easy guide to getting a nice UI on top of your Nutch crawl data.
Line 18: Line 30:
  '''WordIndex'''
    A list of all the words that appear in the titles of the pages on this wiki, with links to pages that include that word.
=== Configuration ===
 * OverviewDeploymentConfigs /!\ :This full page requires a complete update to reflect recent Nutch releases: /!\
 * NutchConfigurationFiles: An overview from Nutch developers.
 * NutchPropertiesCompleteList: A fine-grained account of all Nutch configuration properties.
 * HttpAuthenticationSchemes - How to enable Nutch to authenticate itself using NTLM, Basic or Digest authentication schemes.
 * NonDefaultIntranetCrawlingOptions - Desirable options to add to your Nutch intranet crawling configuration.
 * OptimizingCrawls - How to optimise your crawling/fetching speed with Nutch.
 * ErrorMessages -- What they mean and suggestions for getting rid of them. /!\ :This requires extensive updating to reflect recent Nutch releases. In addition, the legacy indexing and searching material should be archived. /!\
 * SetupProxyForNutch - using Tinyproxy on Ubuntu
 * IndexStructure /!\ :This page needs a slight update to provide more information on plugins and the data they send to Solr for indexing: /!\
Line 21: Line 41:
  '''FindPage'''
    A full-text search of the wiki.
== General Information ==
 * [[http://nutch.apache.org|Nutch Website]]
 * [[Features]] /!\ :TODO:This needs to be completely overhauled to reflect recent Nutch features. /!\
 * Current [[NutchGotchas|Nutch Gotchas]]
 * PublicServers running Nutch
 * [[Presentations]] on Nutch
 * Press [[Articles]]
 * [[Evaluations]] of Search Quality
 * Commercial [[Support]] and developers for hire
 * [[Mailing]] Lists
 * AcademicArticles that deal with Nutch
 * [[FAQ]]
 * HardwareRequirements
 * NutchResources
Line 24: Line 56:
  '''WantedPages'''
    All the "broken links" -- a list of all the pages on this wiki that are linked to, but do not exist.
== Nutch Development ==
 * [[Becoming_A_Nutch_Developer|Becoming a Nutch Developer]] - Start developing and contributing to Nutch.
 * PluginCentral -- How to write your own plugins and use other people's.
 * InternalDocumentation -- How Nutch works.
 * [[http://nutch.apache.org/version_control.html|Nutch Version Control]]
 * FixingOpicScoring - ''In planning''.
 * HowToContribute
 * TaskList -- Tasks for Nutch developers. /!\ :Severe update required: /!\
 * [[Committer's_Rules]] -- Guidelines committers should follow when deciding which branch to commit patches to and when to commit.
 * [[Release_HOWTO]]
 * [[Website_Update_HOWTO]]
 * [[Image_Search_Design]]
 * StrategicGoals
 * [[Getting_Started]]
 * NutchMeetUps - Records of previous Nutch community meetups, hackathons, barcamps etc.
 * [[NutchMavenSupport|Using Nutch as a Maven dependency]]
Line 27: Line 73:
  '''OrphanedPages'''
    All pages on this wiki that are not linked to from anywhere else (and are thus very hard to reach).
== Nutch 2.x ==
 * Nutch2Crawling - A description of the crawling jobs and field-to-database mappings.
 * Nutch2Architecture - A high level overview of the new architecture and design
 * Nutch2Roadmap -- Discussions on the architecture and features of Nutch 2.0
 * NewScoring -- New stable PageRank-like webgraph and link-analysis jobs.
 * NewScoringIndexingExample -- Two full fetch cycles of commands using new scoring and indexing systems.
 * [[http://techvineyard.blogspot.com/2010/12/build-nutch-20.html|Build Nutch 2.0 in Eclipse]] -- How to set up your IDE environment comfortably.
 * ErrorMessagesInNutch2 -- What they mean and suggestions for getting rid of them.
 * [[NutchConfigurationFiles-2.x]] -- Configuration files that are specific to Nutch-2.x
 * [[http://nlp.solutions.asia/?p=232|Understanding the columns/fields in Nutch 2.0 Webpage - Detailed article]]
Line 30: Line 84:
  '''RandomPage'''
    Generates a list of 75 random pages on this wiki.
== Pre Nutch 1.3 and Archive ==
 * [[Archive and Legacy]]
Line 33: Line 87:
  '''PageSize'''
    Generates a graph and some statistics about the sizes of pages on this wiki.
== How to edit this Wiki ==
This Wiki is a collaborative site; anyone can contribute and share:
Line 36: Line 90:
  '''EventStats/HitCounts'''
    Generates a graph of page views and page visits.
 * Create an account by clicking the "Login" link at the top of any page, and picking a username and password.
 * Edit any page by pressing '''<<GetText(Edit)>>''' at the top or the bottom of the page
Line 39: Line 93:
  '''EventStats/UserAgents'''
    Generates a graph of the web browsers used in visiting this page.
There are some conventions used on the Nutch wiki:
Line 42: Line 95:
  '''SystemInformation'''
    Shows basic information about this wiki installation, the extensions it has installed, etc.
 * /!\ :TODO: /!\ (`/!\ :TODO: /!\` ) is used to denote sections that definitely need to be cleaned up.

Some general info on using this Wiki Software:

 * Create a link to another page with joined capitalized words (like WikiSandBox) or with {{{["quoted words in brackets"]}}}
 * See HelpForBeginners to get you going, HelpContents for all help pages.
 * HelpOnMoinWikiSyntax: quick access to wiki syntax

FrontPage (last edited 2018-09-27 15:44:39 by RoannelFernandez)