Differences between revisions 1 and 224 (spanning 223 versions)
Revision 1 as of 2005-01-27 17:31:40
Size: 1408
Editor: gci
Comment:
Revision 224 as of 2011-08-26 16:32:34
Size: 4799
Comment:
Line 1: Line 1:
##language:en
#pragma section-numbers off
= Welcome to the Apache Nutch Wiki =
{{http://www.interadvertising.co.uk/files/nutch_logo_medium.gif}}
Line 4: Line 4:
= Welcome to the Apache Nutch Wiki =
Please contribute your knowledge about Nutch here!
<<TableOfContents(3)>>
Line 6: Line 7:
This wiki has just been set up as part of the [wiki:ApacheGeneral:FrontPage big Apache Wiki Farm]. It does not contain anything yet.
== Nutch Version 1.3 Administration ==
 * DownloadingNutch
 * Current CommandLineOptions
 * [[http://nutch.apache.org/apidocs-1.3/index.html|JavaDocs]] -- The !JavaDocs for Nutch-1.3 release.
=== Tutorials ===
 * RunningNutchAndSolr - How to configure Nutch 1.3 to crawl in local mode and post to Apache Solr for search/index.
 * RunningNutchInDeployMode - How to configure Nutch 1.3 to crawl in deploy mode. /!\ :TODO: This tutorial is under construction. /!\
 * [[http://hadoop.apache.org/common/docs/stable/|Hadoop Tutorial]] -- Nutch is based on Hadoop, so a better understanding of Hadoop helps.
 * RunNutchInEclipse - How to configure, build, crawl and debug Nutch 1.3 within Eclipse
=== Configuration ===
 * OverviewDeploymentConfigs /!\ :This whole page requires a complete update to reflect the Nutch 1.3 release. /!\
 * NutchConfigurationFiles
 * HttpAuthenticationSchemes - How to enable Nutch to authenticate itself using NTLM, Basic or Digest authentication schemes.
 * NonDefaultIntranetCrawlingOptions - Desirable options to add to your Nutch 1.3 intranet crawling configuration.
 * OptimizingCrawls - How to optimize your crawling/fetching speed with Nutch.
 * ErrorMessages -- What they mean and suggestions for getting rid of them. /!\ :This requires extensive updating to reflect Nutch 1.3. In addition, the legacy indexing and searching material should be archived. /!\
 * SetupProxyForNutch - using Tinyproxy on Ubuntu
Line 8: Line 25:
InternalDocumentation on Nutch
== General Information ==
 * [[http://nutch.apache.org|Nutch Website]]
 * [[Features]] /!\ :TODO:This needs to be completely overhauled to reflect Nutch 1.3 features. /!\
 * Current [[NutchGotchas|Nutch Gotchas]]
 * PublicServers running Nutch
 * [[Presentations]] on Nutch
 * Press [[Articles]]
 * [[Evaluations]] of Search Quality
 * [[Help_Wanted]] organizations hiring Nutch expertise
 * Commercial [[Support]] and developers for hire
 * [[Mailing]] Lists
 * AcademicArticles that deal with Nutch
 * [[FAQ]] /!\ :The Indexing and Searching sections need to be updated or archived to reflect the new 1.3 release. /!\
 * HardwareRequirements
 * NutchResources
Line 10: Line 41:
= 'Special' Wiki pages =
== Nutch Development ==
 * [[Becoming_A_Nutch_Developer|Becoming a Nutch Developer]] - Start developing and contributing to Nutch.
 * PluginCentral -- How to write your own plugins and use other people's.
 * InternalDocumentation -- How Nutch works.
 * [[http://nutch.apache.org/version_control.html|Nutch Version Control]]
 * MultiLingualSupport - ''In development''.
 * FixingOpicScoring - ''In planning''.
 * HowToContribute
 * TaskList -- Tasks for Nutch developers.
 * [[Committer's_Rules]] -- Committers should follow these guidelines when deciding which branch to use for committing patches and when to commit.
 * [[Release_HOWTO]]
 * [[Website_Update_HOWTO]]
 * [[Image_Search_Design]]
 * [[NutchOSGi]]
 * StrategicGoals
 * IndexStructure /!\ :This page needs a slight update to provide more information on plugins and the data they send to Solr for indexing. /!\
 * [[Getting_Started]]
 * ApacheConUs2009MeetUp - List of topics for !MeetUp at !ApacheCon US 2009 in Oakland (Nov 2-6)
 * [[NutchMavenSupport|Using Nutch as a Maven dependency]]
Line 12: Line 61:
  '''TitleIndex'''
    A list of all pages on this wiki.
== Nutch 2.0 ==
 * Nutch2Roadmap -- Discussions on the architecture and features of Nutch 2.0
 * NewScoring -- New, stable, PageRank-like webgraph and link-analysis jobs.
 * NewScoringIndexingExample -- Two full fetch cycles of commands using new scoring and indexing systems.
 * [[GORA_HBase]] -- Configuring Nutch 2.0 with GORA and HBASE
 * [[http://techvineyard.blogspot.com/2010/12/build-nutch-20.html|Build Nutch 2.0 in Eclipse]] -- How to set up your IDE environment comfortably.
 * ErrorMessagesInNutch2 -- What they mean and suggestions for getting rid of them.
Line 15: Line 69:
  '''HelpContents'''
    A basic guide to the MoinMoin wiki (including information about wiki syntax).
== Pre Nutch 1.3 and Archive ==
 * [[Archive and Legacy]]
Line 18: Line 72:
  '''WordIndex'''
    A list of all the words that appear in the titles of the pages on this wiki, with links to pages that include that word.
== How to edit this Wiki ==
This Wiki is a collaborative site; anyone can contribute and share:
Line 21: Line 75:
  '''FindPage'''
    A full-text search of the wiki.
 * Create an account by clicking the "Login" link at the top of any page, and picking a username and password.
 * Edit any page by pressing '''<<GetText(Edit)>>''' at the top or the bottom of the page
Line 24: Line 78:
  '''WantedPages'''
    All the "broken links" -- a list of all the pages on this wiki that are linked to, but do not exist.
There are some conventions used on the Nutch wiki:
Line 27: Line 80:
  '''OrphanedPages'''
    All pages on this wiki that are not linked to from anywhere else (and are thus very hard to reach).
 * /!\ :TODO: /!\ (`/!\ :TODO: /!\` ) is used to denote sections that definitely need to be cleaned up.
Line 30: Line 82:
  '''RandomPage'''
    Generates a list of 75 random pages on this wiki.
Some general info on using this Wiki Software:
Line 33: Line 84:
  '''PageSize'''
    Generates a graph and some statistics about the sizes of pages on this wiki.

  '''EventStats/HitCounts'''
    Generates a graph of page views and page visits.

  '''EventStats/UserAgents'''
    Generates a graph of the web browsers used in visiting this page.

  '''SystemInformation'''
    Shows basic information about this wiki installation, the extensions it has installed, etc.
 * Create a link to another page with joined capitalized words (like WikiSandBox) or with {{{["quoted words in brackets"]}}}
 * See HelpForBeginners to get you going, HelpContents for all help pages.
 * HelpOnMoinWikiSyntax: quick access to wiki syntax


FrontPage (last edited 2018-09-27 15:44:39 by RoannelFernandez)