...


<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
    </parser>
  </parsers>
</properties>

In Tika 2.x, you can turn off OCR for a single parse programmatically by setting skipOcr on a TesseractOCRConfig and passing that config in the ParseContext. This affects only that one call to parse.


        // Per-parse OCR settings: skipOcr disables OCR for this call only
        TesseractOCRConfig config = new TesseractOCRConfig();
        config.setSkipOcr(true);
        ParseContext context = new ParseContext();
        context.set(TesseractOCRConfig.class, config);

        // inputStream, handler and metadata are supplied by the caller
        Parser parser = new AutoDetectParser();
        parser.parse(inputStream, handler, metadata, context);

In Tika 2.x, with tika-server, add this header to skip OCR per request: X-Tika-OCRskipOcr: true
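
As a rough illustration of the header above, the following sketch builds such a request with the JDK's own java.net.http client. The class name SkipOcrRequest, the build helper, the Accept header, and the endpoint http://localhost:9998/tika are assumptions made for this example (a local tika-server on its usual port); only the X-Tika-OCRskipOcr header comes from the text above.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical helper class, not part of Tika.
final class SkipOcrRequest {

    // Builds a PUT request carrying the document bytes and the skip-OCR header.
    static HttpRequest build(URI endpoint, byte[] document) {
        return HttpRequest.newBuilder(endpoint)
                .header("X-Tika-OCRskipOcr", "true") // header named in the text above
                .header("Accept", "text/plain")      // assumption: plain-text output is wanted
                .PUT(HttpRequest.BodyPublishers.ofByteArray(document))
                .build();
    }

    public static void main(String[] args) {
        // Assumed endpoint: a tika-server instance running locally.
        HttpRequest request = build(URI.create("http://localhost:9998/tika"), new byte[0]);
        System.out.println(request.method() + " " + request.uri());
        System.out.println(request.headers().map());
    }
}
```

To actually send it, the request can be passed to HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString()). Because the header is set per request, other requests to the same server are unaffected.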

Optional Dependencies

Tika can preprocess images (rotation detection and image normalization with ImageMagick) before sending them to Tesseract, provided that the user has included the dependencies listed below and has opted in to these preprocessing steps.

...
