...
No Format |
---|
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
    </parser>
  </parsers>
</properties> |
In Tika 2.x, you can selectively turn off OCR per parse programmatically by setting skipOcr on a TesseractOCRConfig. This will only affect that one call to parse.
No Format |
---|
TesseractOCRConfig config = new TesseractOCRConfig();
config.setSkipOcr(true);
ParseContext context = new ParseContext();
context.set(TesseractOCRConfig.class, config);
Parser parser = new AutoDetectParser();
parser.parse(inputStream, handler, metadata, context); |
In Tika 2.x, with tika-server, add this header to skip OCR per request: X-Tika-OCRskipOcr: true
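As a rough illustration of the per-request header, the sketch below builds an HTTP request with the JDK's built-in HttpClient. Only the X-Tika-OCRskipOcr header comes from the text above. The endpoint URL (http://localhost:9998/tika), the Accept header, and the SkipOcrRequest class and its build method are assumptions made for this example, so adjust them to match your own tika-server setup.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class, named for this example only.
final class SkipOcrRequest {

    // Assumption: tika-server is running locally and exposes a /tika endpoint on port 9998.
    static final String DEFAULT_ENDPOINT = "http://localhost:9998/tika";

    // Builds a PUT request carrying the document bytes and the skip-OCR header.
    static HttpRequest build(String endpoint, byte[] document) {
        return HttpRequest.newBuilder(URI.create(endpoint))
                // Header taken from the text above: skips OCR for this request only.
                .header("X-Tika-OCRskipOcr", "true")
                // Assumption: ask the server for plain-text output.
                .header("Accept", "text/plain")
                .PUT(HttpRequest.BodyPublishers.ofByteArray(document))
                .build();
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.err.println("usage: SkipOcrRequest <file>");
            return;
        }
        byte[] document = Files.readAllBytes(Path.of(args[0]));
        HttpRequest request = build(DEFAULT_ENDPOINT, document);
        // Sends the request; requires a reachable tika-server at the endpoint above.
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}
```

Because the header is attached to a single request, other requests to the same server are unaffected, which mirrors the per-parse behavior of the TesseractOCRConfig approach.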
Optional Dependencies
Tika can preprocess images (rotation detection and image normalization with ImageMagick) before sending them to Tesseract, provided the user has installed the optional dependencies listed below and has opted in to these preprocessing steps.
...