This page is an in-progress attempt to document best practice for Parser authors on handling problematic files.

Allowed Responses

The Parser contract is that a Tika parser will populate the Metadata object and send XML events to the ContentHandler, or throw one of IOException, SAXException or TikaException.

Suggested Responses

Corrupt File

If the file is corrupted in some way and cannot be processed, a TikaException should be thrown (see the Parser contract).

File cannot be read

If an I/O problem occurs while reading the document, an IOException should be thrown (see the Parser contract).
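The distinction between the two cases above can be sketched roughly as follows. This is only an illustration, not code from Tika: the "DEMO" magic number, the HeaderCheckSketch class and its helper methods are all hypothetical, and the nested TikaException is a stand-in for org.apache.tika.exception.TikaException so that the sketch compiles without Tika on the classpath.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class HeaderCheckSketch {
    // Stand-in for org.apache.tika.exception.TikaException (an assumption made
    // so the sketch is self-contained); a real parser throws Tika's own class.
    static class TikaException extends Exception {
        TikaException(String message) { super(message); }
    }

    // Hypothetical magic number for an imaginary file format.
    private static final byte[] MAGIC = "DEMO".getBytes(StandardCharsets.US_ASCII);

    // Reads the header. A failed read propagates as an IOException, untouched.
    // A short or unexpected header means the file is corrupt: TikaException.
    static void checkHeader(InputStream stream) throws IOException, TikaException {
        byte[] header = new byte[MAGIC.length];
        int read = 0;
        while (read < header.length) {
            int n = stream.read(header, read, header.length - read); // may throw IOException
            if (n < 0) {
                break; // end of stream before the header was complete
            }
            read += n;
        }
        if (read < header.length) {
            throw new TikaException("Truncated header: only " + read + " bytes");
        }
        if (!Arrays.equals(header, MAGIC)) {
            throw new TikaException("Unrecognised header");
        }
    }

    // Reports which of the three outcomes occurred for a given stream.
    static String outcome(InputStream stream) {
        try {
            checkHeader(stream);
            return "ok";
        } catch (TikaException e) {
            return "TikaException";
        } catch (IOException e) {
            return "IOException";
        }
    }

    // Helper: an in-memory stream over ASCII text.
    static InputStream bytes(String text) {
        return new ByteArrayInputStream(text.getBytes(StandardCharsets.US_ASCII));
    }

    // Helper: a stream whose reads always fail, simulating an I/O problem.
    static InputStream failingStream() {
        return new InputStream() {
            @Override
            public int read() throws IOException {
                throw new IOException("simulated read failure");
            }
        };
    }

    public static void main(String[] args) {
        System.out.println(outcome(bytes("DEMO payload"))); // ok
        System.out.println(outcome(bytes("DE")));           // TikaException
        System.out.println(outcome(failingStream()));       // IOException
    }
}
```

The point of the sketch is that the parser does not wrap the IOException: a caller can then tell "the bytes could not be read" apart from "the bytes were read but make no sense".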

"Empty" File (No Text)

If there is no text in the file, either because it is empty (e.g. a 0 byte text file) or because it is in a format that doesn't have text (e.g. an image), then ???

TBC - should the body be opened then immediately closed, or something else?

File is password protected

An EncryptedDocumentException (a subtype of TikaException) should be thrown if the file is password protected and either no password or an incorrect password is given.

(To supply a password, a PasswordProvider should be placed on the ParseContext.)
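A rough sketch of the password check described above, using simplified stand-ins rather than the real Tika types. The assumptions: the nested TikaException, EncryptedDocumentException and PasswordProvider types only mimic their Tika namesakes (in Tika the provider is looked up from the ParseContext and its getPassword method takes the Metadata as an argument), and the PasswordSketch class and the expected password are hypothetical.

```java
class PasswordSketch {
    // Stand-ins for the Tika types of the same names, declared here only so
    // the sketch compiles without Tika on the classpath.
    static class TikaException extends Exception {
        TikaException(String message) { super(message); }
    }

    static class EncryptedDocumentException extends TikaException {
        EncryptedDocumentException(String message) { super(message); }
    }

    // Simplified stand-in: the real PasswordProvider receives the Metadata.
    interface PasswordProvider {
        String getPassword();
    }

    // Hypothetical password protecting the imaginary document.
    private static final String EXPECTED = "s3cret";

    // Throws EncryptedDocumentException when there is no provider, the
    // provider returns no password, or the password does not match.
    static String open(PasswordProvider provider) throws EncryptedDocumentException {
        String password = (provider == null) ? null : provider.getPassword();
        if (password == null) {
            throw new EncryptedDocumentException("No password supplied");
        }
        if (!EXPECTED.equals(password)) {
            throw new EncryptedDocumentException("Incorrect password");
        }
        return "decrypted content";
    }

    // Reports the result of open(), or the exception it raised.
    static String outcome(PasswordProvider provider) {
        try {
            return open(provider);
        } catch (EncryptedDocumentException e) {
            return "EncryptedDocumentException: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(outcome(null));           // EncryptedDocumentException: No password supplied
        System.out.println(outcome(() -> "wrong"));  // EncryptedDocumentException: Incorrect password
        System.out.println(outcome(() -> "s3cret")); // decrypted content
    }
}
```

Because EncryptedDocumentException is a subtype of TikaException, callers that only catch TikaException still handle it, while callers that care can catch the subtype and prompt for a password.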

Parser can't handle File

If the file is in a sub-format that the parser can't handle (e.g. the parser supports v2 and v3, the document is v1, and all share the same mimetype), or uses some options that mean the parser can't sensibly handle it, then ???

TBC - should this be an exception, or treated as an empty file?

Document Structure is Broken

If the file or its structure is so badly broken that it will be impossible to output valid XML for it, then a SAXException is probably the right thing to throw.
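As an illustration of the case above, the following sketch emits SAX events from a toy token stream and throws a SAXException once the nesting can no longer produce well-formed XML. The token format ("+name" opens an element, "-name" closes it, anything else is text) and the BrokenStructureSketch class are hypothetical; the org.xml.sax classes are the standard ones shipped with the JDK.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

class BrokenStructureSketch {
    // Turns tokens into SAX events, tracking open elements on a stack.
    // A mismatched or missing close cannot be expressed as well-formed XML,
    // so the method gives up with a SAXException.
    static void emit(String[] tokens, ContentHandler handler) throws SAXException {
        Deque<String> open = new ArrayDeque<>();
        handler.startDocument();
        for (String token : tokens) {
            if (token.startsWith("+")) {
                String name = token.substring(1);
                open.push(name);
                handler.startElement("", name, name, new AttributesImpl());
            } else if (token.startsWith("-")) {
                String name = token.substring(1);
                if (open.isEmpty() || !open.peek().equals(name)) {
                    throw new SAXException("Cannot close <" + name + ">: nesting is broken");
                }
                open.pop();
                handler.endElement("", name, name);
            } else {
                char[] chars = token.toCharArray();
                handler.characters(chars, 0, chars.length);
            }
        }
        if (!open.isEmpty()) {
            throw new SAXException("Unclosed element <" + open.peek() + ">");
        }
        handler.endDocument();
    }

    // Runs emit() against a do-nothing handler and reports the outcome.
    static String outcome(String... tokens) {
        try {
            emit(tokens, new DefaultHandler());
            return "ok";
        } catch (SAXException e) {
            return "SAXException";
        }
    }

    public static void main(String[] args) {
        System.out.println(outcome("+p", "hello", "-p")); // ok
        System.out.println(outcome("+p", "-div"));        // SAXException
    }
}
```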

ErrorsAndExceptions (last edited 2013-06-24 14:30:28 by NickBurch)