Introduction

This page documents Tika's JSR 311 (JAX-RS) network server, tika-server. The server uses the Apache CXF framework, which provides an implementation of JAX-RS for Java, and builds as a standalone package within Tika. Releases of Tika from version 1.2 onward will ship with tika-server enabled; to get the software before then and experiment with it, follow the steps below:

  1. svn export http://svn.apache.org/repos/asf/tika/trunk/tika-server
  2. mvn install (run from the exported tika-server directory)
  3. java -jar target/tika-server-X.Y.jar

You will then see a message such as the following:

$ java -jar target/tika-server-1.2-SNAPSHOT.jar
Apr 4, 2012 7:48:49 AM org.apache.tika.server.TikaServerCli main
INFO: Starting Tikaserver ${project.version}
Apr 4, 2012 7:48:50 AM org.apache.cxf.endpoint.ServerImpl initDestination
INFO: Setting the server's publish address to be http://localhost:9998/
2012-04-04 07:48:50.316:INFO:oejs.Server:jetty-7.x.y-SNAPSHOT
2012-04-04 07:48:50.375:INFO:oejs.AbstractConnector:Started SelectChannelConnector@localhost:9998 STARTING
2012-04-04 07:48:50.399:INFO:oejsh.ContextHandler:started o.e.j.s.h.ContextHandler{,null}
Apr 4, 2012 7:48:50 AM org.apache.tika.server.TikaServerCli main
INFO: Started

This output indicates that the server started correctly. Below is some basic documentation on how to interact with the services using cURL and HTTP.

Services

All services accept documents through HTTP "PUT" requests. The original file must be sent in the request body without any additional encoding (do not use multipart/form-data or other containers).

You may optionally specify the content type in the "Content-Type" header. If you do not specify a MIME type, Tika will use its detectors to guess it.

You may specify an additional identifier in the URL after the resource name, like "/tika/my-file-i-sent-to-tika-resource" for the "/tika" resource. The server uses this name only for logging, so you may put a file name, UUID or any other identifier there (do not forget to URL-encode any special characters).
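As a rough illustration of these conventions, the following Python sketch builds the request path and headers and sends the raw bytes with a PUT. The function names `build_put` and `put_document` are hypothetical, and the default host and port are assumed to match the publish address shown in the startup log. It uses `http.client` so that no Content-Type header is sent unless one is supplied.

```python
import http.client
import urllib.parse


def build_put(resource, identifier=None, content_type=None):
    """Return (path, headers) for a PUT to a tika-server resource."""
    path = "/" + resource.strip("/")
    if identifier:
        # The identifier is only used for logging; URL-encode special characters.
        path += "/" + urllib.parse.quote(identifier, safe="")
    headers = {}
    if content_type:
        # Optional: without this header Tika guesses the type itself.
        headers["Content-Type"] = content_type
    return path, headers


def put_document(resource, data, identifier=None, content_type=None,
                 host="localhost", port=9998):
    """Send raw bytes in the request body (no multipart or other container)."""
    path, headers = build_put(resource, identifier, content_type)
    conn = http.client.HTTPConnection(host, port)
    try:
        conn.request("PUT", path, body=data, headers=headers)
        response = conn.getresponse()
        return response.status, response.read()
    finally:
        conn.close()
```

With a server running locally, `put_document("meta", open("price.xls", "rb").read())` would send the file to /meta without a Content-Type header, leaving type detection to Tika.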

Resources may return the following HTTP codes:

Metadata Resource

/meta

PUT a document to the /meta service to get back its metadata as "text/csv".

Some example calls with cURL:

$ curl -X PUT --data-binary @zipcode.csv http://localhost:9998/meta --header "Content-Type: text/csv"
$ curl -T price.xls http://localhost:9998/meta

Returns:

"Content-Encoding","ISO-8859-2"
"Content-Type","text/plain"

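The CSV response can be loaded into a dictionary for further processing. The sketch below assumes that each row holds a metadata name followed by one or more values, as in the sample response above; `parse_meta_csv` is a hypothetical helper name, not part of Tika.

```python
import csv
import io


def parse_meta_csv(body):
    """Turn a text/csv body from /meta into a dict of name -> list of values."""
    metadata = {}
    for row in csv.reader(io.StringIO(body)):
        if not row:
            continue  # skip blank lines
        name, values = row[0], row[1:]
        # Assumption: a name may carry several values, so keep them in a list.
        metadata.setdefault(name, []).extend(values)
    return metadata
```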
Tika Resource

/tika

PUT a document to the /tika service to get back the extracted text. An HTTP GET returns a greeting stating that the server is up.

Some example calls with cURL:

Get the HELLO message back

$ curl -X GET http://localhost:9998/tika
This is Tika Server. Please PUT

Get the Text of a Document

$ curl -X PUT --data-binary @GeoSPARQL.pdf http://localhost:9998/tika --header "Content-Type: application/pdf"
$ curl -T price.xls http://localhost:9998/tika

Unpacker Resource

/unpacker

PUT a document that contains embedded documents to the /unpacker service to get back a ZIP or TAR archive of the extracted text, with one entry for each resource filename in the original document.

The default return type is ZIP (without internal compression). Send an "Accept: application/x-tar" header to get TAR instead.

Some example calls with cURL:

PUT a zip file and get back a met file zip

$ curl -X PUT --data-binary @foo.zip http://localhost:9998/unpacker --header "Content-Type: application/zip"

PUT a doc file and get back a met file tar

$ curl -T Doc1_ole.doc -H "Accept: application/x-tar" http://localhost:9998/unpacker > /var/tmp/x.tar
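Once the archive has been saved or read into memory, its entries can be listed with the Python standard library. This is a minimal sketch: `list_unpacked_names` and its `as_tar` flag are hypothetical names, and the caller is assumed to know which format it requested through the "Accept" header.

```python
import io
import tarfile
import zipfile


def list_unpacked_names(archive_bytes, as_tar=False):
    """Return the entry names of an /unpacker response held in memory."""
    buffer = io.BytesIO(archive_bytes)
    if as_tar:
        # Requested with "Accept: application/x-tar" (uncompressed TAR).
        with tarfile.open(fileobj=buffer, mode="r:") as archive:
            return archive.getnames()
    # Default return type: ZIP.
    with zipfile.ZipFile(buffer) as archive:
        return archive.namelist()
```

For instance, `list_unpacked_names(open("/var/tmp/x.tar", "rb").read(), as_tar=True)` would list the entries of the TAR file written by the last command.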

"All" resource

Get text, metadata and attachments in one request.

$ curl -T Doc1_ole.doc http://localhost:9998/all > /var/tmp/x.zip

Text is stored in the __TEXT__ file and the metadata CSV in __METADATA__. Use the "Accept" header if you want TAR output.
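The default ZIP response can be split into its three parts as sketched below. The helper name `split_all_response` is hypothetical, and every entry other than __TEXT__ and __METADATA__ is assumed to be an attachment. Payloads are returned as bytes because the text encoding is not stated here.

```python
import io
import zipfile


def split_all_response(zip_bytes):
    """Split an /all ZIP response into (text, metadata_csv, attachments)."""
    text = None
    metadata_csv = None
    attachments = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            payload = archive.read(name)
            if name == "__TEXT__":
                text = payload
            elif name == "__METADATA__":
                metadata_csv = payload
            else:
                # Assumption: anything else is an embedded attachment.
                attachments[name] = payload
    return text, metadata_csv, attachments
```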

Extracting A Document From A URL

It is possible to use a remote file with TikaJAXRS by first downloading it via its URL and then piping it to the appropriate service:

$ curl -s "http://url/to/my.file" | curl -X PUT -T - http://localhost:9998/meta
$ curl -s "http://url/to/my.file" | curl -X PUT -T - http://localhost:9998/tika

The caveat with the above approach is that it fetches the entire file, so large files such as videos can take some time to download. You may therefore wish to use curl to get preliminary information (content type, name and size) about the file before you proceed:

$ curl -I http://url/to/my.file

If the file should be parsed (e.g. you only want information about MP3s, MP4s and PDFs), send it on to TikaJAXRS.
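The same pre-check can be scripted. The sketch below issues a HEAD request and decides whether the file is worth downloading; the names `should_parse`, `head_content_type` and `WANTED_TYPES` are hypothetical, and the media types listed are an assumption about what the remote server reports for MP3, MP4 and PDF files.

```python
import urllib.request

# Assumed media types for MP3, MP4 and PDF; adjust to what your sources send.
WANTED_TYPES = {"audio/mpeg", "video/mp4", "application/pdf"}


def should_parse(content_type, wanted=WANTED_TYPES):
    """Decide from a Content-Type header value whether to fetch the file."""
    if not content_type:
        return False
    # Drop parameters such as "; charset=utf-8" and normalise the case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type in wanted


def head_content_type(url):
    """Fetch only the headers of a remote file and return its Content-Type."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request) as response:
        return response.headers.get("Content-Type")
```

A caller would download the file and PUT it to the server only when `should_parse(head_content_type(url))` is true.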

TikaJAXRS (last edited 2013-10-31 18:36:11 by HaydenYoung)