Introduction

This page documents tika-server, Tika's JSR 311 (JAX-RS) network server. The server uses the Apache CXF framework, which provides an implementation of JAX-RS for Java. The Tika server component builds to a standalone package in Tika, tika-server.

Contents

  1. Introduction
    1. Installation
    2. Building from source
    3. Running the Tika Server
    4. Using prebuilt Docker image
  2. Services
    1. Metadata Resource
      1. Multipart Support
    2. Tika Resource
      1. Get HELLO message back
      2. Get the Text of a Document
      3. Multipart Support
    3. Detector Resource
      1. PUT an RTF file and get back RTF
      2. PUT a CSV file without filename hint and get back text/plain
      3. PUT a CSV file with filename hint and get back text/csv
    4. Language Resource
      1. PUT a TXT file with English This is English! and get back en
      2. PUT a TXT file with French comme çi comme ça and get back fr
      3. PUT a string with English This is English! and get back en
      4. PUT a string with French comme çi comme ça and get back fr
    5. Translate Resource
      1. PUT a TXT file named sentences with Spanish me falta práctica en Español and get back the English translation using Lingo24
      2. PUT a TXT file named sentences with Spanish me falta práctica en Español and get back the English translation using Microsoft
      3. PUT a TXT file named sentences with Spanish me falta práctica en Español and get back the English translation using Google
      4. PUT a TXT file named sentences2 with French comme çi comme ça and get back the English translation using Google auto-detecting the language
    6. Recursive Metadata and Content
      1. Multipart Support
    7. Unpack Resource
      1. PUT zip file and get back met file zip
      2. PUT doc file and get back met file tar
      3. PUT doc file and get back the content and metadata
  3. Information Services
    1. Available Endpoints
    2. Defined Mime Types
    3. Available Detectors
    4. Available Parsers
    5. Specifying a URL Instead of Putting Bytes

Installation

The easiest way to get the Tika JAXRS server is to download the latest stable release binary. This is available from the Apache Tika downloads page, via your favourite local mirror. You want the tika-server-1.x.jar file, e.g. tika-server-1.13.jar.

Alternatively, you can use the unofficial Docker image from Dave Meikle.

Building from source

If you need to customise the server in some way, or need the very latest version to try out a fix, you can build from source:

  1. Check out the source from SVN as detailed on the Apache Tika contributions page, or retrieve the latest code from GitHub.
  2. Build the source using Maven.
  3. Run the Apache Tika JAXRS server runnable jar.

svn co https://svn.apache.org/repos/asf/tika/trunk/ tika-trunk
cd ./tika-trunk/
mvn install
cd ./tika-server/target/
java -jar tika-server-x.x.jar

Remember to replace x.x with the version you have built.

Running the Tika Server

The Tika Server binary is a standalone runnable jar. You can start it by calling java with the -jar option, eg something like java -jar tika-server-1.13.jar

You will then see a message such as the following:

$ java -jar tika-server-1.8-SNAPSHOT.jar
19-Jan-2015 14:23:36 org.apache.tika.server.TikaServerCli main
INFO: Starting Apache Tika 1.8-SNAPSHOT server
19-Jan-2015 14:23:36 org.apache.cxf.endpoint.ServerImpl initDestination
INFO: Setting the server's publish address to be http://localhost:9998/
19-Jan-2015 14:23:36 org.slf4j.impl.JCLLoggerAdapter info
INFO: jetty-8.y.z-SNAPSHOT
19-Jan-2015 14:23:36 org.slf4j.impl.JCLLoggerAdapter info
INFO: Started SelectChannelConnector@localhost:9998
19-Jan-2015 14:23:36 org.apache.tika.server.TikaServerCli main
INFO: Started

This confirms that the server started correctly.

You can specify additional information to change the host name and port number:

java -jar tika-server-x.x.jar --host=intranet.local --port=12345

Once the server is running, you can visit the server's URL in your browser (eg http://localhost:9998/), and the basic welcome page will confirm that the Server is running, and give links to the various endpoints available.

Below is some basic documentation on how to interact with the services using cURL and HTTP.

Using prebuilt Docker image

You can also download and start the server as a Docker container with

docker pull logicalspark/docker-tikaserver # only on initial download/update
docker run --rm -p 9998:9998 logicalspark/docker-tikaserver

With the --rm option, the container is deleted as soon as it stops. The Dockerfile can be found on GitHub.

Services

All services that take files use HTTP "PUT" requests. When "PUT" is used, the original file must be sent in the request body without any additional encoding (do not use multipart/form-data or other containers).

Additionally, the TikaResource, Metadata and RecursiveMetadata services accept POST multipart/form-data requests, where the original file is sent as a single attachment.

Information services (e.g. defined mimetypes, defined parsers, etc.) work with HTTP "GET" requests.

You may optionally specify the content type in the "Content-Type" header. If you do not specify a mime type, Tika will use its detectors to guess it.

You may specify an additional identifier in the URL after the resource name, such as "/tika/my-file-i-sent-to-tika-resource" for the "/tika" resource. tika-server uses this name only for logging, so you may put a file name, UUID or any other identifier there (do not forget to URL-encode any special characters).
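As a rough illustration of URL-encoding such an identifier, here is a minimal POSIX shell sketch. The function names urlencode and tika_resource_url are hypothetical helpers, not part of Tika, and the encoder assumes ASCII input only.

```shell
# Percent-encode a string for use as a URL path segment.
# Assumption: ASCII input only; non-ASCII names would need byte-wise
# UTF-8 encoding. The body runs in a subshell, so variables do not leak.
urlencode() (
    LC_ALL=C
    s=$1
    out=
    while [ -n "$s" ]; do
        rest=${s#?}          # everything after the first character
        c=${s%"$rest"}       # the first character itself
        case $c in
            [A-Za-z0-9._~-]) out=$out$c ;;                   # unreserved: keep
            *) out=$out$(printf '%%%02X' "'$c") ;;           # others: %XX
        esac
        s=$rest
    done
    printf '%s\n' "$out"
)

# Build <base>/<resource>/<encoded identifier> (hypothetical helper).
tika_resource_url() {
    printf '%s/%s/%s\n' "$1" "$2" "$(urlencode "$3")"
}

tika_resource_url http://localhost:9998 tika 'my report.pdf'
```

The resulting URL could then be passed to curl -T in place of the plain "/tika" address. Since the server uses the identifier only for logging, a UUID avoids the encoding question entirely.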

Resources may return the following HTTP codes:

  • 200 OK - request completed successfully
  • 204 No Content - request completed successfully, result is empty
  • 422 Unprocessable Entity - unsupported mime-type, encrypted document, etc.
  • 500 Error - error while processing document
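A client script may want to branch on these codes. The following is a minimal sketch, assuming a hypothetical helper named describe_tika_status; the curl invocation in the comment uses curl's standard -s, -o and -w '%{http_code}' options, and price.xls is only a placeholder file.

```shell
# Map an HTTP status code from tika-server to a short description.
# Returns 0 for the two success codes and 1 for everything else.
describe_tika_status() {
    case $1 in
        200) echo "200: request completed successfully" ;;
        204) echo "204: request completed successfully, result is empty" ;;
        422) echo "422: unsupported mime-type, encrypted document, etc."; return 1 ;;
        500) echo "500: error while processing document"; return 1 ;;
        *)   echo "$1: status not listed on this page"; return 1 ;;
    esac
}

# Possible use against a running server (not executed here):
#   status=$(curl -s -o out.txt -w '%{http_code}' -T price.xls http://localhost:9998/tika)
#   describe_tika_status "$status" || echo "extraction failed" >&2
describe_tika_status 204
```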

Metadata Resource

/meta

HTTP PUT a document to the /meta service and you get back the metadata as "text/csv".

Some Example calls with cURL:

$ curl -X PUT --data-binary @zipcode.csv http://localhost:9998/meta --header "Content-Type: text/csv"
$ curl -T price.xls http://localhost:9998/meta

Returns:

"Content-Encoding","ISO-8859-2"
"Content-Type","text/plain"

Get metadata as JSON:

$ curl -T test_recursive_embedded.docx http://localhost:9998/meta --header "Accept: application/json"

Or XMP:

$ curl -T test_recursive_embedded.docx http://localhost:9998/meta --header "Accept: application/rdf+xml"

Get a specific metadata key's value as a simple text string:

$ curl -T test_recursive_embedded.docx http://localhost:9998/meta/Content-Type --header "Accept: text/plain"

Returns:

application/vnd.openxmlformats-officedocument.wordprocessingml.document

Get a specific metadata key's value(s) as CSV:

$ curl -T test_recursive_embedded.docx http://localhost:9998/meta/Content-Type --header "Accept: text/csv"

Or JSON:

$ curl -T test_recursive_embedded.docx http://localhost:9998/meta/Content-Type --header "Accept: application/json"

Or XMP:

$ curl -T test_recursive_embedded.docx http://localhost:9998/meta/Content-Type --header "Accept: application/rdf+xml"

Note: when requesting a specific metadata key's value(s) in XMP, make sure to request the XMP name, e.g. "dc:creator" rather than "Author".

Multipart Support

The Metadata Resource also accepts files as multipart/form-data attachments with POST. Posting files as multipart attachments may be beneficial when the files are too big to be PUT directly in the request body. Note that the Tika JAX-RS server makes a best effort to store some of the multipart content on disk while still supporting streaming:

curl -F upload=@price.xls http://localhost:9998/meta/form

Note that the address has an extra "/form" path segment.

Tika Resource

/tika

HTTP PUT a document to the /tika service and you get back the extracted text. An HTTP GET prints a greeting stating that the server is up.

Some Example calls with cURL:

Get HELLO message back

$ curl -X GET http://localhost:9998/tika
This is Tika Server. Please PUT

Get the Text of a Document

$ curl -X PUT --data-binary @GeoSPARQL.pdf http://localhost:9998/tika --header "Content-type: application/pdf"
$ curl -T price.xls http://localhost:9998/tika --header "Accept: text/html"
$ curl -T price.xls http://localhost:9998/tika --header "Accept: text/plain"

Multipart Support

The Tika Resource also accepts files as multipart/form-data attachments with POST. Posting files as multipart attachments may be beneficial when the files are too big to be PUT directly in the request body. Note that the Tika JAX-RS server makes a best effort to store some of the multipart content on disk while still supporting streaming:

curl -F upload=@price.xls http://localhost:9998/tika/form

Note that the address has an extra "/form" path segment.

Detector Resource

/detect/stream

HTTP PUT a document and the server uses Tika's Default Detector to identify its MIME/media type. Providing a filename hint can improve the quality of detection.

Default return is a string of the Media type name.

Some Example calls with cURL:

PUT an RTF file and get back RTF

$ curl -X PUT --data-binary @TODO.rtf http://localhost:9998/detect/stream

PUT a CSV file without filename hint and get back text/plain

$ curl -X PUT --upload-file foo.csv http://localhost:9998/detect/stream

PUT a CSV file with filename hint and get back text/csv

$ curl -X PUT -H "Content-Disposition: attachment; filename=foo.csv" --upload-file foo.csv http://localhost:9998/detect/stream

Language Resource

/language/stream

HTTP PUTs or POSTs a document to the LanguageIdentifier to identify its language.

Default return is a string of the 2 character identified language.

Some Example calls with cURL:

PUT a TXT file with English This is English! and get back en

$ curl -X PUT --data-binary @foo.txt http://localhost:9998/language/stream
en

PUT a TXT file with French comme çi comme ça and get back fr

$ curl -X PUT --data-binary @foo.txt http://localhost:9998/language/stream
fr

/language/string

HTTP PUTs or POSTs a text string to the LanguageIdentifier to identify its language.

Default return is a string of the 2 character identified language.

Some Example calls with cURL:

PUT a string with English This is English! and get back en

$ curl -X PUT --data "This is English!" http://localhost:9998/language/string
en

PUT a string with French comme çi comme ça and get back fr

$ curl -X PUT --data "comme çi comme ça" http://localhost:9998/language/string
fr

Translate Resource

/translate/all/translator/src/dest

HTTP PUT or POST a document to the identified translator, which translates it from the src language to the dest language.

Default return is the translated string if successful, else the original string back.

Note that:

  • translator should be a fully qualified Tika class name (with package), e.g. org.apache.tika.language.translate.Lingo24Translator
  • src should be the 2 character short code for the source language, e.g. 'en' for English
  • dest should be the 2 character short code for the dest language, e.g. 'es' for Spanish
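The path layout described above can be assembled with a small helper. This is only a sketch: translate_url is a hypothetical function name, and the check for two lowercase letters is an assumption about what a valid short code looks like.

```shell
# Build a /translate/all URL from its parts (hypothetical helper).
# Fails unless src and dest look like 2 character language codes.
# The body runs in a subshell, so "exit" only ends the function.
translate_url() (
    LC_ALL=C
    base=$1 translator=$2 src=$3 dest=$4
    for code in "$src" "$dest"; do
        case $code in
            [a-z][a-z]) ;;   # looks like a short code, accept it
            *) echo "not a 2 character language code: $code" >&2; exit 1 ;;
        esac
    done
    printf '%s/translate/all/%s/%s/%s\n' "$base" "$translator" "$src" "$dest"
)

translate_url http://localhost:9998 \
    org.apache.tika.language.translate.Lingo24Translator es en
```

The printed URL can be used as the target of a curl -X PUT --data-binary call.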

Some Example calls with cURL:

PUT a TXT file named sentences with Spanish me falta práctica en Español and get back the English translation using Lingo24

$ curl -X PUT --data-binary @sentences http://localhost:9998/translate/all/org.apache.tika.language.translate.Lingo24Translator/es/en
lack of practice in Spanish

PUT a TXT file named sentences with Spanish me falta práctica en Español and get back the English translation using Microsoft

$ curl -X PUT --data-binary @sentences http://localhost:9998/translate/all/org.apache.tika.language.translate.MicrosoftTranslator/es/en
I need practice in Spanish

PUT a TXT file named sentences with Spanish me falta práctica en Español and get back the English translation using Google

$ curl -X PUT --data-binary @sentences http://localhost:9998/translate/all/org.apache.tika.language.translate.GoogleTranslator/es/en
I need practice in Spanish

/translate/all/translator/dest

HTTP PUT or POST a document to the identified translator, which auto-detects the src language using LanguageIdentifiers and then translates from src to dest.

Default return is the translated string if successful, else the original string back.

Note that:

  • translator should be a fully qualified Tika class name (with package), e.g. org.apache.tika.language.translate.Lingo24Translator
  • dest should be the 2 character short code for the dest language, e.g. 'es' for Spanish

PUT a TXT file named sentences2 with French comme çi comme ça and get back the English translation using Google auto-detecting the language

$ curl -X PUT --data-binary @sentences2 http://localhost:9998/translate/all/org.apache.tika.language.translate.GoogleTranslator/en
so so

Recursive Metadata and Content

/rmeta

Returns a JSONified list of Metadata objects for the container document and all embedded documents. The text that is extracted from each document is stored in the metadata object under "X-TIKA:content".

$ curl -T test_recursive_embedded.docx http://localhost:9998/rmeta

Returns:

[
 {"Application-Name":"Microsoft Office Word",
  "Application-Version":"15.0000",
  "X-Parsed-By":["org.apache.tika.parser.DefaultParser","org.apache.tika.parser.microsoft.ooxml.OOXMLParser"],
  "X-TIKA:content":"embed_0 ",
  ...
 },
 {"Content-Encoding":"ISO-8859-1",
  "Content-Length":"8",
  "Content-Type":"text/plain; charset=ISO-8859-1",
  "X-TIKA:content":"embed_1b",
  ...
 }
 ...
]

The default format for "X-TIKA:content" is XML. However, you can select "text only" with

/rmeta/text

HTML with

/rmeta/html

and no content (metadata only) with

/rmeta/ignore
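As a small sketch of choosing between these variants in a script, the hypothetical helper below maps a content format keyword to the matching path. The keywords xml, text, html and ignore are labels chosen for this example; only the paths themselves come from this page.

```shell
# Map a content format keyword to the matching /rmeta path
# (hypothetical helper; keywords are this example's own labels).
rmeta_path() {
    case $1 in
        xml)    echo /rmeta ;;          # default: content as XML
        text)   echo /rmeta/text ;;     # text only
        html)   echo /rmeta/html ;;     # HTML
        ignore) echo /rmeta/ignore ;;   # metadata only, no content
        *) echo "unknown content format: $1" >&2; return 1 ;;
    esac
}

# Possible use against a running server (not executed here):
#   curl -T test_recursive_embedded.docx "http://localhost:9998$(rmeta_path text)"
rmeta_path text
```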

Multipart Support

The Recursive Metadata resource also accepts files as multipart/form-data attachments with POST. Posting files as multipart attachments may be beneficial when the files are too big to be PUT directly in the request body. Note that the Tika JAX-RS server makes a best effort to store some of the multipart content on disk while still supporting streaming:

curl -F upload=@test_recursive_embedded.docx http://localhost:9998/rmeta/form

Note that the address has an extra "/form" path segment.

Unpack Resource

/unpack

HTTP PUT a document that contains embedded resources to the /unpack service and you get back a zip or tar of the extracted text for each resource filename in the original document. You can also use /unpack/all to get back both the text and metadata.

Default return type is ZIP (without internal compression). Use "Accept" header for TAR return type.

Please note the mapping of this resource was changed in Apache Tika 1.6 from /unpacker/id to /unpack/id /all/id & /unpack/all/id (TIKA-1324).

Some Example calls with cURL:

PUT zip file and get back met file zip

$ curl -X PUT --data-binary @foo.zip http://localhost:9998/unpack --header "Content-type: application/zip"

PUT doc file and get back met file tar

$ curl -T Doc1_ole.doc -H "Accept: application/x-tar" http://localhost:9998/unpack > /var/tmp/x.tar

PUT doc file and get back the content and metadata

$ curl -T Doc1_ole.doc http://localhost:9998/unpack/all > /var/tmp/x.zip

Text is stored in the __TEXT__ file, and the metadata CSV in __METADATA__. Use the "Accept" header if you want TAR output.

Information Services

Available Endpoints

/

Hitting the root of the server in your web browser will give a basic report of all the endpoints defined in the server, what URLs they have, etc.

Defined Mime Types

/mime-types

Lists the mime types, their aliases, their supertype, and the parser. Available as plain text, JSON or human-readable HTML.

Available Detectors

/detectors

Lists the top-level Detector to be used, and any child detectors within it. Available as plain text, JSON or human-readable HTML.

Available Parsers

/parsers

Lists all of the parsers currently available

/parsers/details

List all the available parsers, along with what mimetypes they support

Specifying a URL Instead of Putting Bytes

In Tika 1.10, we removed this capability because it posed a security vulnerability (CVE-2015-3271). Anyone with access to the service had the server's access rights; someone could request local files via file:/// or pages from an intranet that they might not otherwise have access to.

In Tika 1.14, we added the capability back, but the user has to acknowledge the security risk by including two commandline arguments:

$ java -jar tika-server-x.x.jar -enableUnsecureFeatures -enableFileUrl

This allows the user to specify a fileUrl in the header:

curl -i -H "fileUrl:http://tika.apache.org" -H "Accept:text/plain" -X PUT http://localhost:9998/tika

or

curl -i -H "fileUrl:file:///C:/data/my_test_doc.pdf" -H "Accept:text/plain" -X PUT http://localhost:9998/tika

By adding back this capability, we did not remove the security vulnerability. Rather, if a user is confident that only authorized clients are able to submit a request, the user can choose to operate tika-server with this insecure setting. BE CAREFUL!

Also, please be polite. This feature was added as a convenience. Please consider using a robust crawler (instead of our simple TikaInputStream.get(new URL(fileUrl))) that will allow for better configuration of redirects, timeouts, cookies, etc.; and a robust crawler will respect robots.txt!
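A client-side precaution can complement, but not replace, restricting who may reach the server. The sketch below uses a hypothetical helper, file_url_header, that emits the fileUrl header only for http and https URLs and refuses everything else, including file:///. The lowercase-only scheme match is an assumption made for simplicity.

```shell
# Print a fileUrl header only for http(s) URLs (hypothetical helper).
# Anything else, including file:/// URLs, is refused with status 1.
# This is a client-side check only; it does not secure the server.
file_url_header() {
    case $1 in
        http://*|https://*) printf 'fileUrl:%s\n' "$1" ;;
        *) echo "refusing non-HTTP fileUrl: $1" >&2; return 1 ;;
    esac
}

# Possible use against a running server (not executed here):
#   header=$(file_url_header "http://tika.apache.org") &&
#       curl -i -H "$header" -H "Accept:text/plain" -X PUT http://localhost:9998/tika
file_url_header http://tika.apache.org
```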

TikaJAXRS (last edited 2016-09-23 18:45:07 by TimothyAllison)