Introduction
This page documents tika-server, Tika's JSR 311 network server. The server uses the Apache CXF framework, which provides an implementation of JAX-RS for Java, and builds to a standalone package within Tika. Releases of Tika from version 1.2 onward will ship with tika-server enabled; to get the software before then and experiment with it, follow the steps below:
- svn export http://svn.apache.org/repos/tika/trunk/tika-server
- mvn install
- java -jar target/tika-server-X.Y.jar
You will then see a message such as the following:
$ java -jar target/tika-server-1.2-SNAPSHOT.jar
Apr 4, 2012 7:48:49 AM org.apache.tika.server.TikaServerCli main
INFO: Starting Tikaserver ${project.version}
Apr 4, 2012 7:48:50 AM org.apache.cxf.endpoint.ServerImpl initDestination
INFO: Setting the server's publish address to be http://localhost:9998/
2012-04-04 07:48:50.316:INFO:oejs.Server:jetty-7.x.y-SNAPSHOT
2012-04-04 07:48:50.375:INFO:oejs.AbstractConnector:Started SelectChannelConnector@localhost:9998 STARTING
2012-04-04 07:48:50.399:INFO:oejsh.ContextHandler:started o.e.j.s.h.ContextHandler{,null}
Apr 4, 2012 7:48:50 AM org.apache.tika.server.TikaServerCli main
INFO: Started
This lets you know that the server started correctly. Below is some basic documentation on how to interact with the services using cURL and HTTP.
Services
All services use HTTP "PUT" requests. The original file must be sent in the request body without any additional encoding (do not use multipart/form-data or other containers).
You may optionally specify the content type in the "Content-Type" header. If you do not specify a MIME type, Tika will use its detectors to guess it.
You may specify an additional identifier in the URL after the resource name, such as "/tika/my-file-i-sent-to-tika-resource" for the "/tika" resource. Tikaserver uses this name only for logging, so you may put a file name, UUID or any other identifier there (do not forget to URL-encode any special characters).
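The request rules above can be sketched in Python as follows. This is a minimal sketch rather than part of tika-server: the helper names build_put and put_document are hypothetical, and the host and port defaults simply mirror the localhost:9998 address from the startup log. It uses http.client because that module sends the body as-is and does not add a Content-Type header on its own, so Tika's detectors are still used when no type is given.

```python
import http.client
from urllib.parse import quote


def build_put(resource, content_type=None, identifier=None):
    """Return the (path, headers) pair for a PUT to a tika-server resource."""
    path = "/" + resource.strip("/")
    if identifier is not None:
        # The identifier is only used for logging, but it is part of the URL,
        # so every special character (including "/") is percent-encoded.
        path += "/" + quote(identifier, safe="")
    headers = {}
    if content_type is not None:
        # Optional: without this header Tika guesses the type itself.
        headers["Content-Type"] = content_type
    return path, headers


def put_document(data, resource, host="localhost", port=9998,
                 content_type=None, identifier=None):
    """PUT raw bytes (no multipart wrapper) and return (status, body)."""
    path, headers = build_put(resource, content_type, identifier)
    conn = http.client.HTTPConnection(host, port)
    try:
        conn.request("PUT", path, body=data, headers=headers)
        response = conn.getresponse()
        return response.status, response.read()
    finally:
        conn.close()
```

For example, assuming a server is running locally, put_document(open("price.xls", "rb").read(), "tika", identifier="price.xls") would send the file to /tika/price.xls.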
Resources may return the following HTTP codes:
- 200 Ok - request completed successfully
- 204 No content - request completed successfully, result is empty
- 422 Unprocessable Entity - unsupported mime-type, encrypted document, etc.
- 500 Error - error while processing document
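A client can branch on these codes along the following lines. This is a sketch, not tika-server code: the names STATUS_MEANINGS, describe_status and has_result are hypothetical, and the fallback text for codes not listed on this page is an assumption.

```python
# Meanings taken from the list of HTTP codes on this page.
STATUS_MEANINGS = {
    200: "request completed successfully",
    204: "request completed successfully, result is empty",
    422: "unsupported mime-type, encrypted document, etc.",
    500: "error while processing document",
}


def describe_status(code):
    """Return a short description for a tika-server response code."""
    # Assumption: any code not documented here is reported as unlisted.
    return STATUS_MEANINGS.get(code, "status code not listed on this page")


def has_result(code):
    """True only when the response is expected to carry a result body."""
    return code == 200
```

A caller would typically read the body only when has_result(status) is true and log describe_status(status) otherwise.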
Metadata Resource
/meta
HTTP PUT a document to the /meta service and you get back the metadata as "text/csv".
Some example calls with cURL:
$ curl -X PUT --data-binary @zipcode.csv http://localhost:9998/meta --header "Content-Type: text/csv"
$ curl -T price.xls http://localhost:9998/meta
Returns:
"Content-Encoding","ISO-8859-2"
"Content-Type","text/plain"
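The CSV response can be read with Python's csv module. This is a minimal sketch, assuming each row holds a metadata name followed by one or more values (the sample above shows one value per row); the function name parse_metadata_csv is hypothetical.

```python
import csv
import io


def parse_metadata_csv(text):
    """Turn the /meta CSV response into a dict of name -> list of values."""
    metadata = {}
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue  # skip blank lines
        name, values = row[0], row[1:]
        # Assumption: a name may repeat or carry several values, so the
        # values are collected in a list rather than overwritten.
        metadata.setdefault(name, []).extend(values)
    return metadata
```

Applied to the sample response above, this gives a dict with the keys "Content-Encoding" and "Content-Type".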
Tika Resource
/tika
HTTP PUT a document to the /tika service and you get back the extracted text. An HTTP GET prints a greeting stating that the server is up.
Some example calls with cURL:
Get HELLO message back
$ curl -X GET http://localhost:9998/tika
This is Tika Server. Please PUT
Get the Text of a Document
$ curl -X PUT --data-binary @GeoSPARQL.pdf http://localhost:9998/tika --header "Content-type: application/pdf"
$ curl -T price.xls http://localhost:9998/tika
Unpacker Resource
/unpacker
HTTP PUT a document that contains embedded documents to the /unpacker service and you get back a zip or tar archive of the extracted text for each embedded resource, stored under its filename from the original document.
The default return type is ZIP (without internal compression). Use the "Accept" header to request TAR instead.
Some example calls with cURL:
PUT a zip file and get back a met file zip
$ curl -X PUT --data-binary @foo.zip http://localhost:9998/unpacker --header "Content-type: application/zip"
PUT a doc file and get back a met file tar
$ curl -T Doc1_ole.doc -H "Accept: application/x-tar" http://localhost:9998/unpacker > /var/tmp/x.tar
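Once the response body has been saved or read into memory, Python's zipfile and tarfile modules can list what came back. This is a sketch under assumptions: the helper names unpacker_headers and list_members are hypothetical, and the caller is expected to remember which format it requested.

```python
import io
import tarfile
import zipfile


def unpacker_headers(fmt="zip"):
    """Return the request headers for the wanted archive format."""
    if fmt == "zip":
        return {}  # ZIP is the default, so no Accept header is needed
    if fmt == "tar":
        return {"Accept": "application/x-tar"}
    raise ValueError("fmt must be 'zip' or 'tar'")


def list_members(payload, fmt="zip"):
    """List the file names inside an /unpacker response body (bytes)."""
    buffer = io.BytesIO(payload)
    if fmt == "zip":
        with zipfile.ZipFile(buffer) as archive:
            return archive.namelist()
    if fmt == "tar":
        # "r:" reads a plain, uncompressed tar stream.
        with tarfile.open(fileobj=buffer, mode="r:") as archive:
            return archive.getnames()
    raise ValueError("fmt must be 'zip' or 'tar'")
```

For the tar example above, list_members(open("/var/tmp/x.tar", "rb").read(), "tar") would return the names of the unpacked entries.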
"All" resource
Get text, metadata and attachments in one request.
$ curl -T Doc1_ole.doc http://localhost:9998/all > /var/tmp/x.zip
The text is stored in the __TEXT__ file and the metadata CSV in __METADATA__. Use the "Accept" header if you want TAR output.
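The ZIP returned by /all can be taken apart with Python's zipfile module. This is a minimal sketch: the function name split_all_response is hypothetical, and decoding the text and metadata as UTF-8 is an assumption that this page does not confirm.

```python
import io
import zipfile


def split_all_response(payload):
    """Split an /all ZIP body into (text, metadata_csv, attachments)."""
    text = None
    metadata_csv = None
    attachments = {}
    with zipfile.ZipFile(io.BytesIO(payload)) as archive:
        for name in archive.namelist():
            data = archive.read(name)
            if name == "__TEXT__":
                # Assumption: the extracted text is UTF-8 encoded.
                text = data.decode("utf-8")
            elif name == "__METADATA__":
                # Assumption: the metadata CSV is UTF-8 encoded.
                metadata_csv = data.decode("utf-8")
            else:
                # Everything else is treated as an attachment, kept as bytes.
                attachments[name] = data
    return text, metadata_csv, attachments
```

With the example above, split_all_response(open("/var/tmp/x.zip", "rb").read()) would return the text, the raw metadata CSV, and a dict of attachments keyed by file name.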