...

See also PDFParser notes for more details on options for performing OCR on PDFs.

Note: With Tika server 1.x, the PDFConfig is generated anew for each document, so any settings you specify in the tika-config.xml file that you pass to tika-server on startup are overwritten. This behavior changed in Tika 2.x: the PDFConfig retains the settings from tika-config.xml, and custom settings sent via headers are applied only temporarily.

To go with option 1 for OCR'ing PDFs (run OCR against inline images), you need to specify configurations for the PDFParser like so:

...

To go with option 2 (render each page and then run OCR on that rendered image), you need to specify the OCR strategy:
curl -T testOCR.pdf http://localhost:9998/tika --header "X-Tika-PDFOcrStrategy: ocr_only"

Note: These two options are independent. If you set extractInlineImages to true and select an OcrStrategy that includes OCR on the rendered page, Tika will run OCR on both the extracted inline images and the rendered page.
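
As a rough illustration of how the two options combine in a single request, here is a minimal Python sketch that mirrors the curl call (an HTTP PUT to the /tika endpoint on localhost:9998). The helper names build_ocr_headers and put_pdf are hypothetical. The X-Tika-PDFOcrStrategy header is taken from the example above; the X-Tika-PDFextractInlineImages header name is an assumption based on the same X-Tika-PDF prefix convention and should be checked against your Tika version.

```python
import urllib.request


def build_ocr_headers(ocr_strategy=None, extract_inline_images=False):
    """Build the per-request headers for the two independent OCR options."""
    headers = {}
    if ocr_strategy is not None:
        # Option 2: OCR on the rendered page, e.g. "ocr_only".
        headers["X-Tika-PDFOcrStrategy"] = ocr_strategy
    if extract_inline_images:
        # Option 1: OCR on inline images.
        # Assumption: header name follows the X-Tika-PDF<setting> convention.
        headers["X-Tika-PDFextractInlineImages"] = "true"
    return headers


def put_pdf(pdf_path, headers, url="http://localhost:9998/tika"):
    """Send the PDF with an HTTP PUT (what `curl -T` does) and return the body."""
    with open(pdf_path, "rb") as f:
        data = f.read()
    req = urllib.request.Request(url, data=data, headers=headers, method="PUT")
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8", errors="replace")


# Example usage (requires a running tika-server):
# text = put_pdf("testOCR.pdf", build_ocr_headers("ocr_only", extract_inline_images=True))
```

Setting both headers in this way would request OCR on the inline images and on the rendered page, so expect the request to take correspondingly longer.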

Disable OCR in Tika

...
