...

<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
    </parser>
  </parsers>
</properties>

In Tika 2.x, you can turn off OCR for a single parse programmatically by setting skipOcr on a TesseractOCRConfig and passing that config in the ParseContext. This affects only that one call to parse.

// Skip OCR for this parse call only; other calls are unaffected.
TesseractOCRConfig config = new TesseractOCRConfig();
config.setSkipOcr(true);
ParseContext context = new ParseContext();
context.set(TesseractOCRConfig.class, config);

// inputStream, handler and metadata are supplied by the caller.
Parser parser = new AutoDetectParser();
parser.parse(inputStream, handler, metadata, context);

In Tika 2.x with tika-server, add this header to skip OCR for a single request: X-Tika-OCRskipOcr: true
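
As a rough illustration of the per-request header, the sketch below builds a PUT request with the JDK's built-in HTTP client. It assumes a tika-server running locally on port 9998 with the /tika endpoint; the class name SkipOcrRequest and the build helper are hypothetical names chosen for this example and are not part of Tika.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;

class SkipOcrRequest {
    // Assumption: tika-server is running locally on port 9998.
    static final URI TIKA_ENDPOINT = URI.create("http://localhost:9998/tika");

    // Builds a PUT request that asks tika-server to skip OCR for this one document.
    static HttpRequest build(URI endpoint, byte[] document) {
        return HttpRequest.newBuilder(endpoint)
                .header("X-Tika-OCRskipOcr", "true") // per-request OCR switch
                .header("Accept", "text/plain")      // ask for plain-text output
                .PUT(HttpRequest.BodyPublishers.ofByteArray(document))
                .build();
    }

    public static void main(String[] args) throws Exception {
        // The file to parse is passed as the first command-line argument.
        byte[] document = Files.readAllBytes(Path.of(args[0]));
        HttpRequest request = build(TIKA_ENDPOINT, document);
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}
```

Because the header travels with the individual request, other requests to the same server still run OCR as configured.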

Optional Dependencies

Tika can preprocess images (rotation detection and image normalization with ImageMagick) before sending them to Tesseract. This happens only if the user has installed the dependencies listed below and has opted in to these preprocessing steps.

...