Composite Parser Discussion
A given mime type may be supported by several parsers. Work on TIKA-1445 (adding metadata back into OCR'd text) raised the prominence of this issue. Currently, the CompositeParser picks the first parser that supports a given mime type. In discussion on TIKA-1445 other potential use cases were identified.
The purpose of this page is to track a unified vision of the strategies that we'll implement in Tika.
The JIRA issue for this is TIKA-1509.
This page is just a start. Please contribute
Sort the parsers by non-tika vs tika and then alphabetically by class name. Pick the first parser that will handle a given mime type.
Concatenate the results (metadata and content) for several parsers
We need a better name for this!
Try one parser and if the output doesn't meet some criterion, apply another. One use case for this might be: if a file is identified as XML, try the XMLParser and if that throws an exception, try the HTMLParser.
Pick the Best Output
One use case for this: the charset detector identifies two equally likely charsets. Apply both and use the wished-for junk detector (TIKA-1443) to determine which output is more likely to be not junk.
If there are two parsers, use the faster one even if it might mean lower quality (eg avoid OCR)
Mime type hierarchies
Consider a case like:
If there were two parsers available for application/vnd.ms-excel, and another for application/x-tika-msoffice, should it be possible to specify in a strategy that a parser for the parent type also be used? Should it be possible to set a strategy like "use the dita concept, then the general dita, then the dita topic", hopping around up and down the hierarchy?
Or do we keep the current behaviour where once a point in the hierachy with a parser is found, it is parsed at that point?
Allowing the User to select a strategy
The right strategy for one user may not be the right for another. The right strategy for one file may not be the right one for another. We therefore need to allow users to pick their strategy, on an overall basis, and on a per-file basis
Currently, a great many Tika users just call TikaConfig.getDefaultConfig() and go with that.
It might be nice if they could also do things like TikaConfig.getMaxiumMetadataConfig() or TikaConfig.getTryEachInTurnConfig() to pick a different strategy
(Naming TBC, align with above)
With a Tika Configuration file
Users may wish to have full control over what parsers are used, what strategies are used for which mime types etc
For example, they might want default behaviour for most types, but to send XML through a fallback parser, and combine Image + GDAL + OCR for jpeg. The configuration file needs to support this
<parsers> <!-- Most things can use the default --> <parser class="org.apache.tika.parser.DefaultParser"> <!-- Don't use DefaultParser for these mimetypes, alternate config below --> <mime-exclude>image/jpeg</mime-exclude> <mime-exclude>application/xml</mime-exclude> <mime-exclude>application/pdf</mime-exclude> <!-- Exclude (blacklist) these parsers, have them ignored by DefaultParser --> <parser-exclude class="org.apache.tika.parser.jdbc.SQLite3Parser"/> <parser-exclude class="org.apache.tika.parser.executable.ExecutableParser"/> </parser> <!-- No PDF, thank you! --> <parser class="org.apache.tika.parser.EmptyParser"> <mime>application/pdf</mime> </parser> <!-- JPEG needs special handling - try+combine everything --> <parser class="org.apache.tika.parser.(suppliment)"> <parser class="org.apache.tika.parser.ocr.TesseractOCRParser" /> <parser class="org.apache.tika.parser.image.ImageParser" /> <parser class="org.apache.tika.parser.jpeg.JpegParser" /> <parser class="org.apache.tika.parser.gdal.GDALParser" /> <!-- TODO DO we need to give mimetypes here too? Or can we get implicitly? --> </parser> <!-- XML needs special handling - use fallbacks to get something --> <parser class="org.apache.tika.parser.(fallback)"> <parser class="my.custom.xml.parser" /> <parser class="org.apache.tika.parser.xml.XMLParser" /> <parser class="org.apache.tika.parser.html.HTMLParser" /> <parser class="org.apache.tika.parser.txt.TXTParser" /> <mime>application/xml</mime> </parser> </parsers>
Whatever we do, this must be available from code too, much as how today people can create custom CompositeParser instances, or wrap things up with custom ParserDecorator instances
We also need examples for all of these, not only in unit tests, but also in the examples package.