This table highlights some of the differences among the handlers. I've temporarily left question marks on items we still need to confirm.

| Feature | /tika (text\|body) | /tika (html) | /tika (json) | /rmeta | /meta | /unpack |
| --- | --- | --- | --- | --- | --- | --- |
| Text (including text of embedded documents) | Y | Y | Y | Y | N | Y (with /unpack/all) |
| Metadata of main document | N | Y | Y | Y | Y | Y (with /unpack/all) |
| Metadata of embedded documents/attachments | N | N | N | Y | N | N |
| Notification of parse exception | Y/N[1] | Y/N[1] | Y | Y | Y | Y? |
| Specific stacktrace if server is started with the -s (stacktrace) option | N | N | Y | Y | N | N |
| MetadataFilters are applied (see ModifyingContentWithHandlersAndMetadataFilters) | N | N | Y | Y | N | N |
| Notification of parse exception in embedded document | N | N | Y as of 2.4.1 | Y | N | N? |
| Specific stacktrace for parse exception in embedded document | N | N | Y as of 2.4.1 | Y | N | N |
| Streaming write[2] | Y | Y | N | N | N | N |
| WriteLimit with the writeLimit header | N | N | Y | Y | N/A | N |
| Actual attachments (raw bytes) | N | N | N | N | N | Y |

[1] If the parse exception occurs early in the parse, before streaming starts (as with an EncryptedDocumentException), you'll get an HTTP status 422 from /tika (text) and /tika (html). With /tika (text), if the parse exception happens after content has started streaming, the stream simply stops and nothing tells you that there was a parse exception. With /tika (html), the same situation leaves you with truncated HTML.
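To make the consequence of this footnote concrete, here is a minimal client-side sketch in Python. It assumes the server is listening at http://localhost:9998 and that plain text is requested by sending a PUT to /tika with an Accept: text/plain header; neither detail is stated on this page, and the function names are made up for illustration. The point is only that a 422 is a reliable signal of an early parse exception, while a 200 does not guarantee that the text is complete.

```python
import urllib.error
import urllib.request

# Assumption: the server is running locally on this port.
TIKA_URL = "http://localhost:9998"


def build_text_request(data: bytes, base_url: str = TIKA_URL) -> urllib.request.Request:
    """Build a PUT to /tika that asks for plain text (the /tika (text) column)."""
    return urllib.request.Request(
        base_url + "/tika",
        data=data,
        method="PUT",
        headers={"Accept": "text/plain"},
    )


def describe_status(status: int) -> str:
    """Map an HTTP status to what footnote 1 lets a client conclude."""
    if status == 422:
        return "parse exception before streaming started"
    if status == 200:
        return "text returned, but a later parse exception may have cut it short"
    return "unexpected status"


def extract_text(data: bytes, base_url: str = TIKA_URL) -> tuple[int, str]:
    """Send the document; return (status, text). Text is empty on an HTTP error."""
    try:
        with urllib.request.urlopen(build_text_request(data, base_url)) as resp:
            return resp.status, resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        # urllib raises for 4xx/5xx responses, including the 422 described above.
        return err.code, ""
```

If you need to know for certain whether a parse exception occurred, the table suggests using /tika (json) or /rmeta instead, since both report exceptions even when they happen late in the parse.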

[2] Tika tries to stream both while parsing and while writing the output. For some file formats, however, the parsers currently load the full document into memory and only then write the content. This row therefore covers only whether Tika streams the writing of the content, not whether it streams the read/parse of the file.
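As a rough illustration of the /rmeta column in the table, the sketch below builds a request that sets the writeLimit header and then scans the response for parse exceptions in the main document and in embedded documents. Several details are assumptions not confirmed on this page: the local URL and port, the use of PUT, that /rmeta returns a JSON list with the main document first, and that exception entries use keys beginning with "X-TIKA:EXCEPTION:". The function names are hypothetical.

```python
import json
import urllib.request

# Assumption: the server is running locally on this port.
TIKA_URL = "http://localhost:9998"
# Assumption: exception metadata keys start with this prefix.
EXCEPTION_PREFIX = "X-TIKA:EXCEPTION:"


def build_rmeta_request(data: bytes, write_limit=None,
                        base_url: str = TIKA_URL) -> urllib.request.Request:
    """Build a PUT to /rmeta, optionally capping output with the writeLimit header."""
    headers = {"Accept": "application/json"}
    if write_limit is not None:
        # urllib normalizes header capitalization; HTTP header names are
        # case-insensitive, so the server still sees the same header.
        headers["writeLimit"] = str(write_limit)
    return urllib.request.Request(
        base_url + "/rmeta", data=data, method="PUT", headers=headers
    )


def find_exceptions(rmeta_json: str) -> list[tuple[int, str]]:
    """Return (document index, key) for every exception key in the response.

    Index 0 is assumed to be the main document; higher indexes are assumed
    to be embedded documents/attachments.
    """
    docs = json.loads(rmeta_json)
    return [
        (index, key)
        for index, doc in enumerate(docs)
        for key in doc
        if key.startswith(EXCEPTION_PREFIX)
    ]
```

A caller would send the request with urllib.request.urlopen, decode the body, and pass it to find_exceptions; an empty result means no exception keys were found under the assumed prefix.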
