Differences between revisions 259 and 310 (spanning 51 versions)
Revision 259 as of 2013-02-10 02:09:46
Size: 6003
Comment:
Revision 310 as of 2018-09-27 15:44:39
Size: 10200
Comment:
Line 2: Line 2:
{{http://www.interadvertising.co.uk/files/nutch_logo_medium.gif}} {{attachment:nutch_logo_medium.gif}}
Line 4: Line 4:
Please contribute your knowledge about Nutch here! <<TableOfContents(4)>> Please contribute your knowledge about Nutch here!

'''To update existing content, add your own, or request an addition, please
 * forward your wiki username to the dev [at] nutch.apache.org mailing list (someone will give you permissions)
 * browse the [[http://s.apache.org/73z|Documentation issues]] and open a [[https://issues.apache.org/jira/browse/NUTCH|Jira ticket]] (tagging it with the [[http://s.apache.org/73z|Documentation label]]).'''

<<TableOfContents(4)>>

== What is Apache Nutch? ==

Apache Nutch is a highly extensible and scalable open source web crawler software project. Stemming from [[http://lucene.apache.org/|Apache Lucene]], the project has diversified and now comprises two codebases, namely:

 * Nutch 1.x: A mature, production-ready crawler. 1.x enables fine-grained configuration and relies on [[http://hadoop.apache.org/|Apache Hadoop]] data structures, which are well suited to batch processing.
 * Nutch 2.x: An emerging alternative that takes direct inspiration from 1.x but differs in one key area: storage is abstracted away from any specific underlying data store by using [[http://gora.apache.org|Apache Gora]] to handle object-to-persistent mappings. This makes it possible to implement an extremely flexible model/stack for storing everything (fetch time, status, content, parsed text, outlinks, inlinks, etc.) in a number of NoSQL storage solutions.

Being pluggable and modular has its benefits: Nutch provides extensible interfaces such as Parse, Index and ScoringFilter for custom implementations, e.g. [[http://tika.apache.org|Apache Tika]] for parsing. Additionally, pluggable indexing exists for [[http://lucene.apache.org/solr|Apache Solr]], [[http://www.elasticsearch.org|Elasticsearch]], etc.

Nutch can run on a single machine, but gains much of its strength from running in a Hadoop cluster.

You can download Nutch [[http://nutch.apache.org/downloads.html|here]].

For more information about Apache Nutch, please see the Nutch wiki.

Nutch is a project of the Apache Software Foundation and is part of the larger Apache community of developers and users.
Line 9: Line 33:
 * [[http://nutch.apache.org/apidocs-1.6/index.html|JavaDocs]] -- The !JavaDocs for the most recent Nutch-1.X release.
 * [[http://nutch.apache.org/apidocs-2.1/index.html|JavaDocs]] -- The !JavaDocs for the most recent Nutch-2.X release.
 * [[https://nutch.apache.org/apidocs/apidocs-1.15/index.html|JavaDocs]] -- The !JavaDocs for the most recent Nutch-1.X release.
 * [[https://nutch.apache.org/apidocs/apidocs-2.3.1/index.html|JavaDocs]] -- The !JavaDocs for the most recent Nutch-2.X release.
Line 16: Line 40:
 * QuickStartparseChecker - Quick start tutorial on how to use the ParseChecker tool to quickly scrape a website.
 * [[https://wiki.apache.org/nutch/Nutch_1.X_RESTAPI|Nutch 1.X RESTAPI]] - An overview of the entire Nutch 1.X REST API.
Line 18: Line 44:
 * Nutch2Tutorial -- How to get Nutch 2.X to use HBase as persistence layer for Gora 
 * [[http://nlp.solutions.asia/?p=180|Setting up Nutch 2.0 with MySQL to handle UTF-8]] - A step-by-step tutorial
 * [[http://www.covert.io/post/18414889381/accumulo-nutch-and-gora|Accumulo, Nutch, and Gora]] - A step-by-step tutorial
 * Nutch2Tutorial -- How to get Nutch 2.X to use HBase as persistence layer for Gora. This is the primary Nutch 2.X tutorial.
 * [[Nutch2Cassandra|Setting up Nutch 2.x with Cassandra]] - How to set up and run Nutch 2.x using Cassandra as storage.
 * [[https://wiki.apache.org/nutch/HBase%20Hive%20MetaStore%20Mapping%20for%20Nutch%202.x|How to map your Nutch 2.x HBase table to Hive]] - Sample query for Hive mapping.
 * [[http://www.covert.io/post/18414889381/accumulo-nutch-and-gora|Accumulo, Nutch, and Gora]] - A step-by-step tutorial /!\ Very Old /!\
Line 23: Line 50:
 * Focused Crawling with Nutch using [[SimilarityScoringFilter|Cosine Similarity]], [[NaiveBayesParseFilter|Naive Bayes]] or the [[Anthelion|Anthelion]] mechanisms.
Line 24: Line 52:
 * [[NutchHadoopTutorial|Nutch Hadoop Tutorial]] - How to set up and run Nutch in deploy mode over a Hadoop cluster.   * [[NutchHadoopSingleNodeTutorial|Running Nutch in (pseudo) distributed mode]] - How to set up and run Nutch in Hadoop pseudo-distributed mode.
Line 29: Line 57:
 * [[http://soryy.com/blog/2014/ajax-javascript-enabled-parsing-apache-nutch-selenium/|AJAX/JavaScript Enabled Parsing with Apache Nutch and Selenium]]
 * SetupProxyForNutch - using Tinyproxy on Ubuntu
 * SetupNutchAndTor - Crawling .onion hidden services using Nutch behind Polipo HTTP Proxy
 * [[http://digitalpebble.blogspot.co.uk/2015/09/index-web-with-aws-cloudsearch.html|CloudSearch]] - Step by step instructions on using Nutch with Cloudsearch, including pseudo distributed mode
 * [[https://t.co/c9BsaXhN80|Webcast]] : running Apache Nutch on [[https://aws.amazon.com/elasticmapreduce/|Elastic MapReduce]]
Line 38: Line 72:
 * SetupProxyForNutch - using Tinyproxy on Ubuntu
Line 40: Line 73:
 * IndexWriters: How to configure the index writers for the indexing step.
 * [[Exchanges]]: How to configure the exchanges for the indexing step.
Line 49: Line 84:
 * Commercial [[Support]] and developers for hire  * Commercial [[Support]] & developers for hire
Line 55: Line 90:
 * NutchScoring - The whats and wheres of Scoring implementations in Apache Nutch
 * NutchFileFormats - Provides information on the Nutch file formats
Line 61: Line 98:
 * FixingOpicScoring - ''In planning''.  * [[UsingGit]] - A guide to using Git with Nutch. Nutch's source code is no longer managed in Subversion; it is managed in Git.
Line 63: Line 100:
 * TaskList -- Tasks for Nutch developers. /!\ :Severe update required: /!\
Line 66: Line 102:
 * [[Website_Update_HOWTO]]  * [[CMS_Website_Update_HOWTO]] - How to edit the Nutch website based on the [[http://www.apache.org/dev/cms.html|Apache CMS]].
Line 72: Line 108:
 * GoogleSummerOfCode - An area dedicated to GSoC projects and student/mentor development/documentation sandbox.
 * AdvancedAjaxInteraction - Discussion centered on enabling Nutch to not only fetch, but also interact with JavaScript
 * WhiteListRobots - User guide for the new host robots.txt whitelist capability
Line 77: Line 116:
 * NewScoring -- New stable pagerank like webgraph and link-analysis jobs.
 * NewScoringIndexingExample -- Two full fetch cycles of commands using new scoring and indexing systems.
Line 81: Line 118:
 * [[NutchConfigurationFiles-2.x]] -- Configuration files that are specific to Nutch-2.x
Line 82: Line 120:
 * WorkingWithGoraSnapshots - A step by step guide to working with Gora development code within your Nutch 2.x deployment
 * [[NutchRESTAPI]] - A UML diagram and overview of the entire Nutch 2.X REST API.

Welcome to the Apache Nutch Wiki


Please contribute your knowledge about Nutch here!

To update existing content, add your own, or request an addition, please

What is Apache Nutch?

Apache Nutch is a highly extensible and scalable open source web crawler software project. Stemming from Apache Lucene, the project has diversified and now comprises two codebases, namely:

  • Nutch 1.x: A mature, production-ready crawler. 1.x enables fine-grained configuration and relies on Apache Hadoop data structures, which are well suited to batch processing.

  • Nutch 2.x: An emerging alternative that takes direct inspiration from 1.x but differs in one key area: storage is abstracted away from any specific underlying data store by using Apache Gora to handle object-to-persistent mappings. This makes it possible to implement an extremely flexible model/stack for storing everything (fetch time, status, content, parsed text, outlinks, inlinks, etc.) in a number of NoSQL storage solutions.
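
The storage abstraction described for Nutch 2.x can be pictured with a small, language-neutral sketch. It is written in Python for brevity and is not the Apache Gora or Nutch API: the names PageStore, InMemoryStore and record_fetch are hypothetical and exist only for illustration. The point is that crawler-side code talks to an interface, so the backing store could be swapped without touching that code.

```python
from typing import Dict, Optional, Protocol


class PageStore(Protocol):
    """Hypothetical storage interface; crawler code depends only on this."""

    def put(self, url: str, record: Dict[str, object]) -> None: ...

    def get(self, url: str) -> Optional[Dict[str, object]]: ...


class InMemoryStore:
    """Hypothetical backend that keeps records in a dict.

    A stand-in for any real data store; a different backend would
    implement the same two methods.
    """

    def __init__(self) -> None:
        self._rows: Dict[str, Dict[str, object]] = {}

    def put(self, url: str, record: Dict[str, object]) -> None:
        # Store a copy so later caller-side mutation does not leak in.
        self._rows[url] = dict(record)

    def get(self, url: str) -> Optional[Dict[str, object]]:
        row = self._rows.get(url)
        return dict(row) if row is not None else None


def record_fetch(store: PageStore, url: str, status: int, fetch_time: str) -> None:
    """Merge fetch metadata into the stored record without knowing the backend."""
    row = store.get(url) or {}
    row.update({"status": status, "fetch_time": fetch_time})
    store.put(url, row)


if __name__ == "__main__":
    store = InMemoryStore()
    record_fetch(store, "http://example.com/", 200, "2018-09-27T15:44:39Z")
    print(store.get("http://example.com/"))
```

Because record_fetch only relies on put and get, replacing InMemoryStore with another class that offers the same two methods would require no change to the crawler-side function.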

Being pluggable and modular has its benefits: Nutch provides extensible interfaces such as Parse, Index and ScoringFilter for custom implementations, e.g. Apache Tika for parsing. Additionally, pluggable indexing exists for Apache Solr, Elasticsearch, etc.
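
As a rough illustration of what a pluggable extension point looks like in general, the sketch below chains two scoring plugins over a page record. It is written in Python and does not reproduce Nutch's actual plugin interfaces: Page, ScoringFilter, KeywordBoost, DepthPenalty and run_filters are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import Iterable, Protocol
from urllib.parse import urlsplit


@dataclass
class Page:
    """Hypothetical stand-in for a crawled page record (not a Nutch class)."""
    url: str
    text: str
    score: float = 1.0


class ScoringFilter(Protocol):
    """Hypothetical extension point: each plugin returns an adjusted score."""

    def apply(self, page: Page) -> float: ...


class KeywordBoost:
    """Example plugin: multiply the score when a keyword appears in the text."""

    def __init__(self, keyword: str, factor: float) -> None:
        self.keyword = keyword.lower()
        self.factor = factor

    def apply(self, page: Page) -> float:
        if self.keyword in page.text.lower():
            return page.score * self.factor
        return page.score


class DepthPenalty:
    """Example plugin: halve the score for each path segment beyond the first."""

    def apply(self, page: Page) -> float:
        segments = [s for s in urlsplit(page.url).path.split("/") if s]
        extra = max(0, len(segments) - 1)
        return page.score * (0.5 ** extra)


def run_filters(page: Page, filters: Iterable[ScoringFilter]) -> Page:
    """Apply each configured plugin in order, updating the page score."""
    for scoring_filter in filters:
        page.score = scoring_filter.apply(page)
    return page


if __name__ == "__main__":
    page = Page("http://example.com/a/b", "Apache Nutch crawler")
    run_filters(page, [KeywordBoost("nutch", 2.0), DepthPenalty()])
    print(page.score)
```

The caller decides which plugins to pass to run_filters and in what order, which is the essence of a configurable plugin chain; a new behaviour is added by writing another class with an apply method rather than by editing the core loop.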

Nutch can run on a single machine, but gains much of its strength from running in a Hadoop cluster.

You can download Nutch here.

For more information about Apache Nutch, please see the Nutch wiki.

Nutch is a project of the Apache Software Foundation and is part of the larger Apache community of developers and users.

Nutch Version Administration

Tutorials

Nutch 1.X tutorial(s)

Nutch 2.X tutorial(s)

Other Tutorial(s)

Configuration

  • OverviewDeploymentConfigs /!\ :This full page requires a complete update to reflect recent Nutch releases: /!\

  • NutchConfigurationFiles: An overview from Nutch developers.

  • NutchPropertiesCompleteList: A fine-grained account of all Nutch property configuration.

  • HttpAuthenticationSchemes - How to enable Nutch to authenticate itself using NTLM, Basic or Digest authentication schemes.

  • NonDefaultIntranetCrawlingOptions - Desirable options to add to your Nutch intranet crawling configuration.

  • OptimizingCrawls - How to optimise your crawling/fetching speed with Nutch.

  • ErrorMessages -- What they mean and suggestions for getting rid of them. /!\ :This requires extensive updating to reflect recent Nutch releases. In addition the legacy indexing and searching material should be archived. /!\

  • IndexStructure /!\ :This page needs a slight update to provide more information on plugins and the data they send to Solr for indexing: /!\

  • IndexWriters: How to configure the index writers for the indexing step.

  • Exchanges: How to configure the exchanges for the indexing step.

General Information

Nutch Development

Nutch 2.x

Pre Nutch 1.3 and Archive

How to edit this Wiki

This Wiki is a collaborative site; anyone can contribute and share:

  • Create an account by clicking the "Login" link at the top of any page, and picking a username and password.
  • Edit any page by pressing Edit at the top or the bottom of the page.

There are some conventions used on the Nutch wiki:

  • /!\ :TODO: /!\ (/!\ :TODO: /!\ ) is used to denote sections that definitely need to be cleaned up.

Some general info on using this Wiki Software:

FrontPage (last edited 2018-09-27 15:44:39 by RoannelFernandez)