Troubleshooting Apache Tika

Apache Tika is great when it works, but by default it can be silently forgiving of configuration mistakes. Here we'll try to cover some of the main problems and how to go about diagnosing them.

Note that while the underlying cause is often the same no matter how you call Tika, the way of telling what's wrong can vary between the different ways of calling it (Tika App, Tika Server, the Tika Facade or the Tika Java classes).

Wrong Content Extracted

No Content Extracted

Wrong Parser Used

Content Incorrectly Detected

Tika detects content types based on mime magic, format (normally container) specific detectors, content type hints and filename hints.

Things to check:

Parsers Missing

In order for a Parser to be loaded by Apache Tika, it needs:

To check what parsers you have, see Identifying what Parsers your Tika install supports

To check if any parsers were defined but failed to load see Identifying if any Parsers failed to be loaded

To create a service file for auto-loading, see the quickstart guide
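As a rough diagnostic sketch, the following JDK-only program lists every parser service file visible on the classpath and the class names it declares. It assumes the usual Java service-provider layout, i.e. files named META-INF/services/org.apache.tika.parser.Parser (and META-INF/services/org.apache.tika.detect.Detector for detectors). The class and method names (ServiceFileLister, parseServiceLines, listServiceEntries) are made up for this illustration and are not part of Tika.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class ServiceFileLister {

    /** Strips '#' comments and blank lines, leaving only the declared class names. */
    public static List<String> parseServiceLines(List<String> lines) {
        List<String> names = new ArrayList<>();
        for (String line : lines) {
            int hash = line.indexOf('#');
            String name = (hash >= 0 ? line.substring(0, hash) : line).trim();
            if (!name.isEmpty()) {
                names.add(name);
            }
        }
        return names;
    }

    /** Reads every copy of the service file for the given interface found on the classpath. */
    public static List<String> listServiceEntries(String serviceInterface) throws IOException {
        List<String> found = new ArrayList<>();
        ClassLoader loader = ServiceFileLister.class.getClassLoader();
        String resource = "META-INF/services/" + serviceInterface;
        for (URL url : Collections.list(loader.getResources(resource))) {
            List<String> lines = new ArrayList<>();
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    lines.add(line);
                }
            }
            for (String name : parseServiceLines(lines)) {
                found.add(url + " -> " + name);
            }
        }
        return found;
    }

    public static void main(String[] args) throws IOException {
        // Prints nothing if no parser service file is on the classpath
        for (String entry : listServiceEntries("org.apache.tika.parser.Parser")) {
            System.out.println(entry);
        }
    }
}
```

Run it with the same classpath as your application. If a parser you expect is not listed, its jar or its service file is probably missing. Passing "org.apache.tika.detect.Detector" to listServiceEntries should do the same for detectors.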

Detectors Missing

In order for a Detector to be loaded by Apache Tika, it needs:

To check what detectors you have, see Identifying what Detectors your Tika install supports

To check if any detectors were defined but failed to load see Identifying if any Detectors failed to be loaded

Mime Type Missing

Identifying your Tika Version

Tika App

java -jar tika-app-blah.jar --version

Tika Server

Go to http://localhost:9998/version
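If you would rather check the server from code than from a browser, here is a minimal sketch using the JDK's own HTTP client (Java 11 or later). It assumes a Tika Server is listening on localhost:9998 as in the URL above. The class and method names (TikaServerProbe, endpoint, fetch) are invented for this example.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class TikaServerProbe {

    /** Joins a base URL and a path, tolerating a trailing or leading slash. */
    public static URI endpoint(String baseUrl, String path) {
        String base = baseUrl.endsWith("/") ? baseUrl.substring(0, baseUrl.length() - 1) : baseUrl;
        String rel = path.startsWith("/") ? path.substring(1) : path;
        return URI.create(base + "/" + rel);
    }

    /** Performs a GET and returns the response body, failing on any non-200 status. */
    public static String fetch(URI uri) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(uri).GET().build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("Unexpected HTTP status " + response.statusCode() + " from " + uri);
        }
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        // Assumes the server address used elsewhere on this page
        System.out.println(fetch(endpoint("http://localhost:9998", "version")));
    }
}
```

A connection error here usually means the server isn't running or is listening on a different port. The same helper can be pointed at the other server paths mentioned on this page, such as mime-types, parsers and detectors.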

Tika Facade

// Get your Tika object, eg
Tika tika = new Tika();
// Call toString() to get the version
String version = tika.toString();

Tika Java classes

// Get your Tika Config, eg
TikaConfig config = TikaConfig.getDefaultConfig();
// Go via the Tika Facade
String version = (new Tika(config)).toString();

Identifying what Mime Types your Tika install supports

Tika App

java -jar tika-app-blah.jar --list-supported-types

Tika Server

Go to http://localhost:9998/mime-types

Tika Facade

This is not directly possible from the Tika Facade class. Instead, follow the Tika Java classes route below.

Tika Java classes

// Get your Tika Config, eg
TikaConfig config = TikaConfig.getDefaultConfig();
// Get the registry
MediaTypeRegistry registry = config.getMediaTypeRegistry();
// List
for (MediaType type : registry.getTypes()) {
   String typeStr = type.toString();
}

Identifying what Parsers your Tika install supports

Tika App

java -jar tika-app-blah.jar --list-parsers

Tika Server

Go to http://localhost:9998/parsers

Tika Facade

// Get your Tika object, eg
Tika tika = new Tika();
// Get the root parser
CompositeParser parser = (CompositeParser)tika.getParser();
// Fetch the types it supports
for (MediaType type : parser.getSupportedTypes(new ParseContext())) {
   String typeStr = type.toString();
}
// Fetch the parsers that make it up (note - may need to recurse if any are a CompositeParser too)
for (Parser p : parser.getAllComponentParsers()) {
   String parserName = p.getClass().getName();
}

Tika Java classes

// Get your Tika Config, eg
TikaConfig config = TikaConfig.getDefaultConfig();
// Get the root parser
CompositeParser parser = (CompositeParser)config.getParser();
// Fetch the types it supports
for (MediaType type : parser.getSupportedTypes(new ParseContext())) {
   String typeStr = type.toString();
}
// Fetch the parsers that make it up (note - may need to recurse if any are a CompositeParser too)
for (Parser p : parser.getAllComponentParsers()) {
   String parserName = p.getClass().getName();
   if (p instanceof CompositeParser) {
      // Check child ones too
   }
}

Identifying what Detectors your Tika install supports

Tika App

java -jar tika-app-blah.jar --list-detectors

Tika Server

Go to http://localhost:9998/detectors

Tika Facade

// Get your Tika object, eg
Tika tika = new Tika();
// Get the root detector
CompositeDetector detector = (CompositeDetector)tika.getDetector();
// Fetch the detectors that make it up (note - may need to recurse if any are a CompositeDetector too)
for (Detector d : detector.getDetectors()) {
   String detectorName = d.getClass().getName();
}

Tika Java classes

// Get your Tika Config, eg
TikaConfig config = TikaConfig.getDefaultConfig();
// Get the root detector
CompositeDetector detector = (CompositeDetector)config.getDetector();
// Fetch the detectors that make it up (note - may need to recurse if any are a CompositeDetector too)
for (Detector d : detector.getDetectors()) {
   String detectorName = d.getClass().getName();
   if (d instanceof CompositeDetector) {
      // Check child ones too
   }
}

Identifying if any Parsers failed to be loaded

When starting your JVM, if you pass in -Dorg.apache.tika.service.error.warn=true then you'll get warnings logged if any Parsers or Detectors couldn't be loaded. With the default logging configuration, you'll see things like this printed to the standard output of the JVM:

WARNING: Unable to load org.apache.tika.parser.microsoft.OfficeParser
java.lang.NoClassDefFoundError: org/apache/poi/poifs/filesystem/DirectoryEntry
        at java.lang.Class.getDeclaredConstructors0(Native Method)
        at java.lang.Class.privateGetDeclaredConstructors(Class.java:2585)
        at java.lang.Class.getConstructor0(Class.java:2885)
        at java.lang.Class.newInstance(Class.java:350)
        at org.apache.tika.config.ServiceLoader.loadStaticServiceProviders(ServiceLoader.java:315)
        at org.apache.tika.parser.DefaultParser.getDefaultParsers(DefaultParser.java:52)
        at org.apache.tika.parser.DefaultParser.<init>(DefaultParser.java:61)
        at org.apache.tika.parser.DefaultParser.<init>(DefaultParser.java:66)
        at org.apache.tika.config.TikaConfig.getDefaultParser(TikaConfig.java:76)
        at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:182)
        at org.apache.tika.config.TikaConfig.getDefaultConfig(TikaConfig.java:291)
        at org.apache.tika.Tika.<init>(Tika.java:115)
        at org.apache.tika.cli.TikaCLI.version(TikaCLI.java:629)
        at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:365)
        at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:134)
Caused by: java.lang.ClassNotFoundException: org.apache.poi.poifs.filesystem.DirectoryEntry
        at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
        ... 15 more

In this case, the error is telling us that we're missing the Apache POI jars which are a required dependency of Tika Parsers, and of the org.apache.tika.parser.microsoft.OfficeParser parser.
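To confirm a diagnosis like this, it can help to check directly whether the missing class is visible to your application. The sketch below uses only the JDK. The class name it probes is taken from the stack trace above, while ClasspathCheck and isClassAvailable are names made up for this example.

```java
public class ClasspathCheck {

    /** Returns true if the named class can be found by this class's class loader. */
    public static boolean isClassAvailable(String className) {
        try {
            // initialize=false: only locate the class, don't run its static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The class reported as missing in the warning above
        String poiClass = "org.apache.poi.poifs.filesystem.DirectoryEntry";
        System.out.println(poiClass + " available: " + isClassAvailable(poiClass));
    }
}
```

Run it with the same classpath as the failing application. If it reports false, the dependency jars are not reaching the JVM, and the fix lies in your build or deployment rather than in Tika itself.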

TODO describe how to use a ServiceLoader.LoadErrorHandler.ERROR to trigger an exception

Identifying if any Detectors failed to be loaded

When starting your JVM, if you pass in -Dorg.apache.tika.service.error.warn=true then you'll get warnings logged if any Parsers or Detectors couldn't be loaded. With the default logging configuration, you'll see things like this printed to the standard output of the JVM:

WARNING: Unable to load org.apache.tika.parser.microsoft.POIFSContainerDetector
java.lang.NoClassDefFoundError: org/apache/poi/poifs/filesystem/DirectoryEntry
        at java.lang.Class.getDeclaredConstructors0(Native Method)
        at java.lang.Class.privateGetDeclaredConstructors(Class.java:2585)
        at java.lang.Class.getConstructor0(Class.java:2885)
        at java.lang.Class.newInstance(Class.java:350)
        at org.apache.tika.config.ServiceLoader.loadStaticServiceProviders(ServiceLoader.java:315)
        at org.apache.tika.detect.DefaultDetector.getDefaultDetectors(DefaultDetector.java:55)
        at org.apache.tika.detect.DefaultDetector.<init>(DefaultDetector.java:66)
        at org.apache.tika.config.TikaConfig.getDefaultDetector(TikaConfig.java:71)
        at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:183)
        at org.apache.tika.config.TikaConfig.getDefaultConfig(TikaConfig.java:291)
        at org.apache.tika.Tika.<init>(Tika.java:115)
        at org.apache.tika.cli.TikaCLI.version(TikaCLI.java:629)
        at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:365)
        at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:134)
Caused by: java.lang.ClassNotFoundException: org.apache.poi.poifs.filesystem.DirectoryEntry
        at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
        ... 14 more

In this case, the error is telling us that we're missing the Apache POI jars which are a required dependency of Tika Parsers, and of the org.apache.tika.parser.microsoft.POIFSContainerDetector detector.

TODO describe how to use a ServiceLoader.LoadErrorHandler.ERROR to trigger an exception

PDF Text Problems

If Tika isn't extracting the right text from a PDF, and/or is giving errors, the first thing to do is identify if this is a Tika issue, or an issue with the underlying Apache PDFBox library used.

To check, grab the latest Apache PDFBox pdfbox-app jar and use the ExtractText command line tool on your problematic PDF:

java -jar pdfbox-app.X.Y.jar ExtractText problematicPDF.pdf

If that shows the same problem, it's a PDFBox bug. Please file an Apache PDFBox bug report and attach at least one failing file to the bug. Once the fix is released, Tika will pick it up when it moves to the new PDFBox release.

If PDFBox ExtractText works fine, it may* be a Tika bug. Please report an Apache Tika bug, attach at least one failing file, and mention that PDFBox ExtractText doesn't have the issue.

*PDFBox's ExtractText does not pull text from Annotations or Acroforms, so a problem that ExtractText doesn't show may still come from PDFBox's handling of Annotations or Acroforms. It might be a bug in Tika, too. When in doubt, ask.

See also: PDFParser notes.

Troubleshooting Tika (last edited 2016-11-10 19:36:27 by TimothyAllison)