...

Note that while the underlying cause is often the same no matter how you call Tika, the way to diagnose what's wrong can vary depending on how you call it.

Wrong Content Extracted

  • Make sure you're passing Tika the source file you meant to pass, and it hasn't been corrupted in the transfer process
  • Make sure Tika is able to correctly detect your file's type, see Content Incorrectly Detected
  • Make sure Tika used the parser you meant it to, see Wrong Parser Used
  • Make sure you're actually using the version of Tika you meant to use! See Identifying your Tika Version
  • Problems with a PDF? See PDF Text Problems
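
One way to rule out corruption in transfer is to compare a checksum of the original file with a checksum of the copy that actually reached Tika. The following is a minimal sketch using only the Java standard library; the class name FileChecksum and the choice of SHA-256 are illustrative assumptions, not part of Tika.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not a Tika class
class FileChecksum {
    /** Returns the SHA-256 digest of a file as lowercase hex. */
    static String sha256(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                digest.update(buffer, 0, read);
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        // args[0]: the original file, args[1]: the copy that was handed to Tika
        String original = sha256(Paths.get(args[0]));
        String copy = sha256(Paths.get(args[1]));
        System.out.println(original.equals(copy)
                ? "identical"
                : "DIFFERENT: the file changed somewhere in transfer");
    }
}
```

If the two digests differ, fix the transfer first; there is little point debugging the parser on a damaged file.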

No Content Extracted

  • Make sure Tika is able to correctly detect your file's type, see Content Incorrectly Detected
  • Make sure Tika has the parser for your format, and its dependencies, available and working. See Parsers Missing
  • Make sure you're actually using the version of Tika you meant to use! See Identifying your Tika Version

Wrong Parser Used

  • Make sure Tika is able to correctly detect your file's type, see Content Incorrectly Detected
  • Make sure the parser you wanted to use is available to Tika. See Identifying what Parsers your Tika install supports, Parsers Missing and Identifying if any Parsers failed to be loaded

Content Incorrectly Detected

...

  • Does Tika know about your type? See Identifying what Mime Types your Tika install supports
  • If the mime type isn't listed there, see Mime Type Missing
  • Does Tika have all its detectors? See Identifying what Detectors your Tika install supports and Detectors Missing
  • Is your file a different version of the format? Check the first few hundred bytes in a hex editor, and compare to the built-in mime type
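
If no hex editor is to hand, the first few hundred bytes can be dumped with a few lines of Java. This is a rough sketch using only the standard library; the class name HeadHexDump and the 256-byte limit are arbitrary choices for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper, not a Tika class
class HeadHexDump {
    /** Formats bytes as rows of 16: offset, hex values, printable ASCII. */
    static String hexDump(byte[] data) {
        StringBuilder out = new StringBuilder();
        for (int row = 0; row < data.length; row += 16) {
            StringBuilder hex = new StringBuilder();
            StringBuilder ascii = new StringBuilder();
            for (int i = row; i < row + 16; i++) {
                if (i < data.length) {
                    int b = data[i] & 0xff;
                    hex.append(String.format("%02x ", b));
                    // Show non-printable bytes as '.'
                    ascii.append(b >= 0x20 && b < 0x7f ? (char) b : '.');
                } else {
                    hex.append("   "); // pad a short final row
                }
            }
            out.append(String.format("%08x  %s %s%n", row, hex, ascii));
        }
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        // args[0]: the file whose type is being misdetected
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            byte[] head = new byte[256];
            int n = in.readNBytes(head, 0, head.length); // needs Java 9 or later
            System.out.print(hexDump(Arrays.copyOf(head, n)));
        }
    }
}
```

The ASCII column makes common signatures easy to spot, for example a PDF normally starts with the bytes %PDF-.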

Parsers Missing

In order for a Parser to be loaded by Apache Tika, it needs:

  • The parser class to be on the classpath at runtime
  • And all of its dependencies
  • For most parsers, that means the tika-parsers jar and dependencies
  • One of:
    • a Tika Config which explicitly lists the parser class
    • a Tika Config (e.g. the default one) which uses DefaultParser, together with a service file for the parser, and which does not exclude that parser or the parser's type

To check what parsers you have, see Identifying what Parsers your Tika install supports
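
The service file requirement above can also be checked directly. The sketch below, using only the Java standard library, reads every META-INF/services/org.apache.tika.parser.Parser file visible to a class loader and reports whether each listed class can be found. The class name ServiceFileCheck is a hypothetical helper. Note the limits of the check: a class that is found may still fail later if one of its own dependencies is missing, and the check knows nothing about exclusions in your Tika Config.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;

// Hypothetical helper, not a Tika class
class ServiceFileCheck {
    /** Lists the class names declared in every copy of a service file on the classpath. */
    static List<String> declaredServices(ClassLoader loader, String serviceInterface)
            throws IOException {
        List<String> names = new ArrayList<>();
        Enumeration<URL> files = loader.getResources("META-INF/services/" + serviceInterface);
        while (files.hasMoreElements()) {
            URL url = files.nextElement();
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    int hash = line.indexOf('#'); // strip comments
                    if (hash >= 0) {
                        line = line.substring(0, hash);
                    }
                    line = line.trim();
                    if (!line.isEmpty()) {
                        names.add(line);
                    }
                }
            }
        }
        return names;
    }

    /** True if the named class can be found, without initialising it. */
    static boolean isLoadable(ClassLoader loader, String className) {
        try {
            Class.forName(className, false, loader);
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        ClassLoader loader = ServiceFileCheck.class.getClassLoader();
        for (String name : declaredServices(loader, "org.apache.tika.parser.Parser")) {
            System.out.println((isLoadable(loader, name) ? "ok      " : "MISSING ") + name);
        }
    }
}
```

Run it with the same classpath as your application. No output at all means no parser service file is visible, which usually points to the parsers jar not being on the classpath.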

...

  • The detector class to be on the classpath at runtime
  • And all of its dependencies
  • For most detectors, that means the tika-parsers jar and dependencies (the container detectors are generally stored along with the parsers)
  • One of:
    • a Tika Config which explicitly lists the detector class
    • a Tika Config (eg default one) which uses DefaultDetector and a service file for the detector

To check what detectors you have, see Identifying what Detectors your Tika install supports

...

  • If Tika doesn't know about your type out of the box, you need to add a custom mimetypes file. See the quick guide for how
  • If you have written a custom mimetypes file, it needs to be present on your classpath at runtime with the exact name of org/apache/tika/mime/custom-mimetypes.xml. Double check that you added it to your classpath and that it has exactly that name (no typos, no prefix directories, no suffixes etc), then use Identifying what Mime Types your Tika install supports to see if you've loaded it or not
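
A quick way to confirm that the file really is on the classpath under exactly that name is to ask the class loader for it. This is a minimal sketch using only the Java standard library; CustomMimeTypesCheck is a hypothetical helper name. It only proves the resource is visible, not that its contents are valid.

```java
import java.io.IOException;
import java.net.URL;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not a Tika class
class CustomMimeTypesCheck {
    // The exact resource name given in the text above
    static final String RESOURCE = "org/apache/tika/mime/custom-mimetypes.xml";

    /** Returns the location of every copy of the resource visible to the loader. */
    static List<URL> findCopies(ClassLoader loader) throws IOException {
        return Collections.list(loader.getResources(RESOURCE));
    }

    public static void main(String[] args) throws IOException {
        List<URL> copies = findCopies(CustomMimeTypesCheck.class.getClassLoader());
        if (copies.isEmpty()) {
            System.out.println("Not found on the classpath: " + RESOURCE);
        }
        for (URL url : copies) {
            System.out.println("Found: " + url);
        }
    }
}
```

Run it with the same classpath as your application. If nothing is found, check the file name and its directory path inside the jar or classes directory before looking anywhere else.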

Identifying your Tika Version

...

If Tika isn't extracting the right text from a PDF, and/or is giving errors, the first thing to do is identify if this is a Tika issue, or an issue with the underlying Apache PDFBox library used, or an issue with the PDF itself.

To check, grab the latest Apache PDFBox pdfbox-app jar and use the ExtractText command line tool on your problematic PDF:

java -jar pdfbox-app.X.Y.jar ExtractText problematicPDF.pdf

If PDFBox reports that there are unmapped Unicode characters or other problems, there may be a problem with the PDF itself.  Try opening it in, for example, Adobe Reader and "saving as text" or copying and pasting the text. 

If the "saved text" is just as garbled as what Tika was extracting, there's a problem with the PDF file itself.

If the "saved text" is in good shape, then there may be a problem in PDFBox. If PDFBox ExtractText shows the same problem as Tika, it's a PDFBox bug. Please file an Apache PDFBox bug report and attach at least one failing file to the bug. Once that is fixed, Tika will get the fix when it picks up the new PDFBox release.

If PDFBox ExtractText works fine, it may be a Tika bug. Please report an Apache Tika bug, attach at least one failing file, and mention that PDFBox ExtractText doesn't have the issue.
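
The checks above amount to a small decision table. The sketch below only restates that reasoning in code; the names PdfTriage and Verdict are made up for illustration, and the two inputs are things you judge by eye after running ExtractText and "saving as text" from a viewer.

```java
// Hypothetical restatement of the triage steps, not a Tika or PDFBox API
class PdfTriage {
    enum Verdict { PROBLEM_IN_PDF_ITSELF, LIKELY_PDFBOX_BUG, POSSIBLE_TIKA_BUG }

    /**
     * @param pdfboxTextOk whether PDFBox ExtractText produced good text
     * @param viewerTextOk whether a viewer's "save as text" or copy and paste produced good text
     */
    static Verdict triage(boolean pdfboxTextOk, boolean viewerTextOk) {
        if (pdfboxTextOk) {
            // PDFBox alone is fine, so the fault may lie in Tika
            return Verdict.POSSIBLE_TIKA_BUG;
        }
        // PDFBox is also wrong: the viewer tells us whether the PDF is extractable at all
        return viewerTextOk ? Verdict.LIKELY_PDFBOX_BUG : Verdict.PROBLEM_IN_PDF_ITSELF;
    }

    public static void main(String[] args) {
        System.out.println(triage(false, true));
    }
}
```

Whichever project you report to, attach at least one failing file so the problem can be reproduced.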

...