Adding HTML5 Support to Apache Nutch 2.x


This project aims to add HTML5 support to Apache Nutch 2.x using a Java library. It has two goals. The first is to implement a new parser that follows the WHATWG HTML5 specification. The second is to implement a new plugin that uses the newly implemented parser and extracts the new HTML5 elements.


Reports will be added here.


Documents will be added here.

Jira Issues

Issues will be added here.
