Overriding Default Configuration

When using the OCR parser, Tika applies the following default settings:

Tesseract installation path = ""
Language dictionary = "eng"
Page Segmentation Mode = "1"
Minimum file size = 0
Maximum file size = 2147483647
Timeout = 120 (seconds)

To change these settings, you can either modify the existing TesseractOCRConfig.properties file in tika-parser/src/main/resources/org/apache/tika/parser/ocr, or override it by creating your own copy and placing it in the package org/apache/tika/parser/ocr on your classpath.

Note that if you do this with one of the executable JARs (tika-app or tika-server), you must run them without the -jar option, because java ignores any -cp classpath setting when -jar is given. For example, something like the following for tika-app or tika-server, respectively:

java -cp /path/to/your/classpath:/path/to/tika-app-X.X.jar org.apache.tika.cli.TikaCLI
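
To check which copy of TesseractOCRConfig.properties is actually picked up, a small stand-alone check can help. The following is a minimal sketch, not part of Tika: the class name OcrConfigCheck is hypothetical, and the property key names are assumptions inferred from the defaults listed above, so verify them against the properties file shipped with your Tika version. It relies only on the standard behaviour that ClassLoader.getResource returns the first match in classpath order.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URL;
import java.util.Properties;
import java.util.TreeSet;

/** Hypothetical helper (not part of Tika): shows which properties file wins on the classpath. */
class OcrConfigCheck {

    // Resource name built from the package path mentioned above.
    static final String RESOURCE =
            "org/apache/tika/parser/ocr/TesseractOCRConfig.properties";

    /** Defaults as listed at the top of this section; the key names are assumed. */
    static Properties defaults() {
        Properties p = new Properties();
        p.setProperty("tesseractPath", "");
        p.setProperty("language", "eng");
        p.setProperty("pageSegMode", "1");
        p.setProperty("minFileSizeToOcr", "0");
        p.setProperty("maxFileSizeToOcr", "2147483647");
        p.setProperty("timeout", "120");
        return p;
    }

    /** Layers the given properties stream (may be null) over the defaults. */
    static Properties load(InputStream override) {
        Properties merged = new Properties(defaults());
        if (override != null) {
            try (InputStream in = override) {
                merged.load(in);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        ClassLoader cl = Thread.currentThread().getContextClassLoader();
        // getResource returns the first match in classpath order.
        URL url = cl.getResource(RESOURCE);
        System.out.println("Resolved from: " + (url == null ? "(not on classpath)" : url));

        Properties effective = load(cl.getResourceAsStream(RESOURCE));
        for (String key : new TreeSet<>(defaults().stringPropertyNames())) {
            System.out.println(key + " = " + effective.getProperty(key));
        }
    }
}
```

Run it with the same -cp value you pass to Tika, plus the directory containing the compiled class. If the printed location is inside the Tika JAR rather than your own directory, your override is not first on the classpath.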

...

In Tika 2.x, users can modify the configuration via a tika-config.xml file. With the exception of the paths, the values shown in the following are the defaults:

TesseractOCR Configuration

<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <!-- this is not formally necessary, but prevents loading an unnecessary parser -->
      <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
    </parser>
    <parser class="org.apache.tika.parser.ocr.TesseractOCRParser">
      <params>
        <!-- these are the defaults; you only need to specify the ones you want
             to modify -->
        <param name="applyRotation" type="bool">false</param>
        <param name="colorSpace" type="string">gray</param>
        <param name="density" type="int">300</param>
        <param name="depth" type="int">4</param>
        <param name="enableImagePreprocessing" type="bool">false</param>
        <param name="filter" type="string">triangle</param>
        <param name="imageMagickPath" type="string">/my/custom/imageMagicPath</param>
        <param name="language" type="string">eng</param>
        <param name="maxFileSizeToOcr" type="long">2147483647</param>
        <param name="minFileSizeToOcr" type="long">0</param>
        <param name="pageSegMode" type="string">1</param>
        <param name="pageSeparator" type="string"></param>
        <param name="preserveInterwordSpacing" type="bool">false</param>
        <param name="resize" type="int">200</param>
        <param name="skipOcr" type="bool">false</param>
        <param name="tessdataPath" type="string">/my/custom/data</param>
        <param name="tesseractPath" type="string">/my/custom/path</param>
        <param name="timeoutSeconds" type="int">120</param>
      </params>
    </parser>
  </parsers>
</properties>
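
Because only the parameters you want to change need to be listed, it can be useful to see exactly which overrides a given tika-config.xml contains. The following is a minimal sketch using only the JDK's XML parser. The class TesseractParamLister is hypothetical, and this is not how Tika itself reads the file. It simply collects the param entries under the TesseractOCRParser element shown above.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical helper (not part of Tika): lists the OCR params set in a tika-config.xml. */
class TesseractParamLister {

    static final String OCR_PARSER = "org.apache.tika.parser.ocr.TesseractOCRParser";

    /** Returns param name -> value for the TesseractOCRParser element, in file order. */
    static Map<String, String> read(InputStream xml) {
        Map<String, String> params = new LinkedHashMap<>();
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(xml);
            // <parser-exclude> elements have a different tag name and are not matched here.
            NodeList parsers = doc.getElementsByTagName("parser");
            for (int i = 0; i < parsers.getLength(); i++) {
                Element parser = (Element) parsers.item(i);
                if (!OCR_PARSER.equals(parser.getAttribute("class"))) {
                    continue;
                }
                NodeList list = parser.getElementsByTagName("param");
                for (int j = 0; j < list.getLength(); j++) {
                    Element param = (Element) list.item(j);
                    params.put(param.getAttribute("name"), param.getTextContent().trim());
                }
            }
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not read config", e);
        }
        return params;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: TesseractParamLister /path/to/tika-config.xml");
            return;
        }
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            read(in).forEach((name, value) -> System.out.println(name + " = " + value));
        }
    }
}
```

Any parameter that does not appear in the output is left at its default.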

OCR and PDFs

See also PDFParser notes for more details on options for performing OCR on PDFs.

...
