Tika Plugin

The Tika plugin in http://issues.apache.org/jira/browse/NUTCH-766 is a first attempt at delegating parsing to Tika, so that Nutch no longer has to maintain its own parser plugins. This page lists the differences in coverage and functionality between the Tika plugin and the existing Nutch parsers. Tika also supports formats that Nutch does not cover; these are not described here. In addition, Tika has a more generic way of representing structured content, which can be useful for HtmlParseFilters (currently limited to HTML content).
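
As a rough illustration of what the delegation looks like on the configuration side, Nutch selects its parsers through the plugin.includes property. The fragment below is only a sketch: it assumes the plugin is registered under the id parse-tika (the name used in later Nutch releases), and the other plugin names in the value are examples rather than a recommended set.

```xml
<!-- nutch-site.xml (sketch; the plugin list is illustrative) -->
<property>
  <name>plugin.includes</name>
  <!-- parse-(html|tika) keeps the native HTML parser and adds the Tika plugin -->
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```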

html: comparable

js: ?

mp3: Nutch identifies several fields (Title, Album, Artist), whereas the Tika plugin knows only about titles; the rest is stored as paragraphs.

Tika-app can also identify ID3v1 and ID3v2 tags in an mp3 file, such as album, artist, audioSampleRate, composer, genre, logComment, releaseDate and trackNumber, using the XMPDM interface.

msexcel: comparable (in addition, Tika can represent the content in a structured way as XHTML tables, which can be useful for HTML parser plugins)

mspowerpoint: comparable

msword: Tika does not support Word 95; other versions are comparable

openoffice: comparable

pdf: comparable

rss: Tika identifies only the MIME type but does nothing with the content

rtf: deactivated in Nutch for licensing reasons; works in Tika

swf: not yet covered in Tika (see https://issues.apache.org/jira/browse/TIKA-337)

text: comparable

zip: ?
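
The XHTML table representation mentioned under msexcel can be consumed with ordinary SAX handling. The following is a minimal sketch using only the JDK, not the plugin's actual code: the class name XhtmlTableRows, the helper extractRows and the sample markup are hypothetical, and it assumes each spreadsheet row arrives as a tr element containing td cells, without nested tables.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: collects table cells from XHTML, grouped by row.
class XhtmlTableRows {

    /** SAX handler that records the text of every td, one list per tr. */
    static class TableHandler extends DefaultHandler {
        final List<List<String>> rows = new ArrayList<>();
        private List<String> currentRow; // non-null while inside a tr
        private StringBuilder cell;      // non-null while inside a td

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if ("tr".equals(qName)) {
                currentRow = new ArrayList<>();
            } else if ("td".equals(qName)) {
                cell = new StringBuilder();
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // Text outside a td (for example headings) is ignored.
            if (cell != null) {
                cell.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("td".equals(qName) && currentRow != null) {
                currentRow.add(cell.toString().trim());
                cell = null;
            } else if ("tr".equals(qName) && currentRow != null) {
                rows.add(currentRow);
                currentRow = null;
            }
        }
    }

    /** Parses a well-formed XHTML string and returns its table rows. */
    static List<List<String>> extractRows(String xhtml) throws Exception {
        TableHandler handler = new TableHandler();
        // The default factory is not namespace-aware, so qName is the plain tag name.
        SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)), handler);
        return handler.rows;
    }

    public static void main(String[] args) throws Exception {
        // Sample markup is an assumption about the general shape of the output.
        String xhtml = "<html><body><table>"
                + "<tr><td>Name</td><td>Qty</td></tr>"
                + "<tr><td>Widget</td><td>3</td></tr>"
                + "</table></body></html>";
        for (List<String> row : extractRows(xhtml)) {
            System.out.println(String.join("\t", row));
        }
    }
}
```

Running main prints one tab-separated line per table row. A parse filter working on SAX events or a DOM could apply the same row and cell grouping to spreadsheet content as it would to an HTML table.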
