...
To go with option 2 (render each page and then run OCR on that rendered image), you need to specify the OCR strategy:

curl -T testOCR.pdf http://localhost:9998/tika --header "X-Tika-PDFOcrStrategy: ocr_only"
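The same request can also be sent from a script. The following is a minimal sketch using only Python's standard library, assuming a Tika server listening at http://localhost:9998 as in the curl call. The helper names build_ocr_request and extract_text are hypothetical and are not part of Tika.

```python
import urllib.request


def build_ocr_request(pdf_bytes, url="http://localhost:9998/tika", strategy="ocr_only"):
    """Build (but do not send) a PUT request equivalent to the curl call."""
    return urllib.request.Request(
        url,
        data=pdf_bytes,
        method="PUT",  # curl -T uploads the file with PUT
        headers={"X-Tika-PDFOcrStrategy": strategy},
    )


def extract_text(pdf_path, **kwargs):
    """Send the PDF to the Tika server and return the response body as text."""
    with open(pdf_path, "rb") as f:
        request = build_ocr_request(f.read(), **kwargs)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

With the server running, extract_text("testOCR.pdf") should return whatever the curl command prints. Keeping request construction separate from sending makes it possible to inspect the header without a live server.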
Disable
...
OCR in Tika
Tika's OCR runs on images embedded in other files, such as office documents, as well as on images you upload directly. Because OCR slows Tika down, you may want to disable it if you don't need the results. The simplest way to disable OCR is to uninstall Tesseract; if that's not an option, here is a tika.xml config file that disables OCR:
...