Differences between revisions 5 and 6
Revision 5 as of 2013-03-26 21:30:46
Size: 1805
Revision 6 as of 2013-03-26 21:31:18
Size: 1811
Line 4 (revision 5):
 * --Storage Abstraction--<<BR>>
  * --initially with back end implementations for HBase and HDFS--
  * --extend it to other storages later e.g. MySQL etc...--
Line 4 (revision 6):
 * --(Storage Abstraction)--<<BR>>
  * --(initially with back end implementations for HBase and HDFS)--
  * --(extend it to other storages later e.g. MySQL etc...)--


Here is a list of the features and architectural changes that will be implemented in Nutch 2.0.

  • Storage Abstraction

    • initially with back end implementations for HBase and HDFS

    • extend it to other storage back ends later, e.g. MySQL

  • Plugin cleanup : Tika only for parsing document formats (see http://wiki.apache.org/nutch/TikaPlugin)

    • keep only the HtmlParseFilters (probably with a different API) so that we can post-process the DOM created in Tika from whatever original format.

    • Modify the code so that a parser can generate multiple documents, which 1.x does but 2.0 does not
  • Externalize functionalities to crawler-commons project [http://code.google.com/p/crawler-commons/]

    • robots handling, URL filtering and URL normalization, URL state management, perhaps deduplication. We should coordinate our efforts and share code freely so that other projects (bixo, heritrix, droids) may contribute to this shared pool of functionality, much like Tika does for the common need of parsing complex formats.
  • Remove index / search and delegate to SOLR

    • we may still keep a thin abstract layer to allow other indexing/search backends (ElasticSearch?), but the current mess of indexing/query filters and competing indexing frameworks (lucene, fields, solr) should go away. We should go directly from DOM to a NutchDocument, and stop there.

  • Rewrite SOLR deduplication : do everything using the webtable and avoid retrieving content from SOLR
  • Various new functionalities
    • e.g. sitemap support, canonical tag, better handling of redirects, detecting duplicated sites, detection of spam cliques, tools to manage the webgraph, etc.
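
To make the storage abstraction and the webtable-based deduplication items more concrete, here is a minimal sketch. It is written in Python for brevity (Nutch itself is Java), and every name in it (WebTableStore, InMemoryStore, find_duplicates, the "signature" column) is hypothetical and not part of any Nutch API. It only illustrates the idea that crawl code could depend on one storage interface, and that duplicates could be found by scanning the webtable without retrieving content from SOLR.

```python
import hashlib
from abc import ABC, abstractmethod
from typing import Dict, Iterator, List, Optional, Tuple

Row = Dict[str, str]


class WebTableStore(ABC):
    """Hypothetical storage abstraction: crawl code talks only to this interface."""

    @abstractmethod
    def put(self, url: str, row: Row) -> None: ...

    @abstractmethod
    def get(self, url: str) -> Optional[Row]: ...

    @abstractmethod
    def scan(self) -> Iterator[Tuple[str, Row]]: ...


class InMemoryStore(WebTableStore):
    """Stand-in back end; an HBase, HDFS or MySQL back end would implement the same methods."""

    def __init__(self) -> None:
        self._rows: Dict[str, Row] = {}

    def put(self, url: str, row: Row) -> None:
        self._rows[url] = dict(row)

    def get(self, url: str) -> Optional[Row]:
        return self._rows.get(url)

    def scan(self) -> Iterator[Tuple[str, Row]]:
        # Iterate in URL order so results are deterministic.
        return iter(sorted(self._rows.items()))


def signature(content: str) -> str:
    """Content fingerprint stored with each row (SHA-256 is an assumption of this sketch)."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def find_duplicates(store: WebTableStore) -> List[str]:
    """Return URLs whose signature already appeared at an earlier URL in scan order."""
    seen = set()
    duplicates = []
    for url, row in store.scan():
        sig = row["signature"]
        if sig in seen:
            duplicates.append(url)
        else:
            seen.add(sig)
    return duplicates


store = InMemoryStore()
store.put("http://example.org/a", {"signature": signature("same page")})
store.put("http://example.org/b", {"signature": signature("same page")})
store.put("http://example.org/c", {"signature": signature("other page")})
print(find_duplicates(store))  # ['http://example.org/b']
```

Because find_duplicates only uses the WebTableStore interface, swapping the in-memory stand-in for another back end would not change the deduplication logic, which is the point of the abstraction.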

This document is meant to serve as a basis for discussion; feel free to contribute to it.

Nutch2Roadmap (last edited 2016-01-16 14:28:03 by LewisJohnMcgibbney)