Differences between revisions 21 and 22
Revision 21 as of 2014-05-04 20:47:43
Size: 5633
Editor: NickBurch
Comment: Give accept examples
Revision 22 as of 2014-06-14 18:43:25
Size: 6442
Comment: - add in docs on DetectorResource

Introduction

This page documents Tika's JSR 311 (JAX-RS) network server, tika-server. The server uses the Apache CXF framework, which provides a JAX-RS implementation for Java, and builds as a standalone package within Tika. Releases of Tika from version 1.2 onward will ship with tika-server enabled; to get the software before then, or to experiment, follow the steps below:

  1. svn export http://svn.apache.org/repos/asf/tika/trunk/tika-server
  2. mvn install
  3. java -jar target/tika-server-X.Y.jar

You will then see a message such as the following:

$ java -jar target/tika-server-1.2-SNAPSHOT.jar
Apr 4, 2012 7:48:49 AM org.apache.tika.server.TikaServerCli main
INFO: Starting Tikaserver ${project.version}
Apr 4, 2012 7:48:50 AM org.apache.cxf.endpoint.ServerImpl initDestination
INFO: Setting the server's publish address to be http://localhost:9998/
2012-04-04 07:48:50.316:INFO:oejs.Server:jetty-7.x.y-SNAPSHOT
2012-04-04 07:48:50.375:INFO:oejs.AbstractConnector:Started SelectChannelConnector@localhost:9998 STARTING
2012-04-04 07:48:50.399:INFO:oejsh.ContextHandler:started o.e.j.s.h.ContextHandler{,null}
Apr 4, 2012 7:48:50 AM org.apache.tika.server.TikaServerCli main
INFO: Started

This output indicates that the server started correctly. Below is some basic documentation on how to interact with the services using cURL and HTTP.

Services

All services that take files use HTTP "PUT" requests. The original file must be sent in the request body without any additional encoding (do not use multipart/form-data or other containers).

Information services (e.g. defined MIME types, defined parsers, etc.) work with HTTP "GET" requests.

You may optionally specify the content type in the "Content-Type" header. If you do not specify a MIME type, Tika will use its detectors to guess it.

You may specify an additional identifier in the URL after the resource name, like "/tika/my-file-i-sent-to-tika-resource" for the "/tika" resource. Tikaserver uses this name only for logging, so you may put a file name, UUID or any other identifier there (do not forget to URL-encode any special characters).
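
As a rough illustration of the URL-encoding step, the sketch below percent-encodes an identifier before appending it to the "/tika" resource. The helper names urlencode and tika_url are hypothetical, the host and port are the ones from the startup log above, and the encoder assumes ASCII input only.

```shell
# Percent-encode a string for use as a URL path segment.
# Assumption: input is ASCII; multi-byte characters are not handled.
urlencode() {
  rest="$1"
  encoded=""
  while [ -n "$rest" ]; do
    first="${rest%"${rest#?}"}"   # first character of the remaining input
    rest="${rest#?}"              # drop that character
    case "$first" in
      [A-Za-z0-9._~-]) encoded="$encoded$first" ;;                  # unreserved: keep
      *) encoded="$encoded$(printf '%%%02X' "'$first")" ;;          # everything else: %XX
    esac
  done
  printf '%s\n' "$encoded"
}

# Build a /tika URL that carries a logging identifier (hypothetical helper).
tika_url() {
  printf 'http://localhost:9998/tika/%s\n' "$(urlencode "$1")"
}
```

The result can then be passed to cURL, for example: curl -T "my file.doc" "$(tika_url 'my file.doc')"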

Resources may return the following HTTP status codes:

  • 200 OK - request completed successfully
  • 204 No Content - request completed successfully, result is empty
  • 422 Unprocessable Entity - unsupported MIME type, encrypted document, etc.
  • 500 Error - error while processing the document
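
A script that calls the server may want to branch on these codes. The following is a minimal sketch, not part of tika-server itself: describe_status and put_and_check are hypothetical helper names, and the short labels they print are arbitrary. It relies on cURL's -w '%{http_code}' option to capture the status code while the response body goes to a file.

```shell
# Map a status code from the list above to a short label.
describe_status() {
  case "$1" in
    200) echo "ok" ;;
    204) echo "empty" ;;
    422) echo "unprocessable" ;;
    500) echo "error" ;;
    *)   echo "unexpected" ;;
  esac
}

# PUT a file to a resource URL, save the response body, and print the label.
# Arguments: <file> <resource URL> <output file>
put_and_check() {
  code=$(curl -s -o "$3" -w '%{http_code}' -T "$1" "$2")
  describe_status "$code"
}
```

For example, put_and_check price.xls http://localhost:9998/tika /var/tmp/price.txt would print "ok" when text was extracted and "empty" when the result has no content.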

Metadata Resource

/meta

HTTP PUT a document to the /meta service to get back its metadata as "text/csv".

Some example calls with cURL:

$ curl -X PUT --data-binary @zipcode.csv http://localhost:9998/meta --header "Content-Type: text/csv"
$ curl -T price.xls http://localhost:9998/meta

Returns:

"Content-Encoding","ISO-8859-2"
"Content-Type","text/plain"
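
To pick a single value out of that response, a small filter may be enough. This is a sketch under the assumption that every row has the simple two-column quoted form shown above; it does not handle values that contain embedded quotes or the "," sequence. The function name meta_value is hypothetical.

```shell
# Read /meta CSV output on stdin and print the value for one key.
# Assumption: rows look like "key","value" with no embedded quotes.
meta_value() {
  awk -F'","' -v key="$1" '
    { k = $1; sub(/^"/, "", k) }                          # strip the leading quote from the key
    k == key { v = $2; sub(/"\r?$/, "", v); print v; exit }  # strip the trailing quote from the value
  '
}
```

For example: curl -s -T price.xls http://localhost:9998/meta | meta_value Content-Type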

Tika Resource

/tika

HTTP PUT a document to the /tika service to get back the extracted text. An HTTP GET prints a greeting stating that the server is up.

Some example calls with cURL:

Get HELLO message back

$ curl -X GET http://localhost:9998/tika
This is Tika Server. Please PUT
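
Scripts that start the server and then send documents right away could use this greeting as a readiness probe. The sketch below polls the /tika resource until it answers; it assumes the greeting is returned with a successful status code, and the function name wait_for_tika, the default of 30 attempts and the one-second pause are all arbitrary choices.

```shell
# Poll the greeting URL until the server responds or the attempts run out.
# Arguments: [URL] [number of attempts]
wait_for_tika() {
  probe_url="${1:-http://localhost:9998/tika}"
  tries="${2:-30}"
  while [ "$tries" -gt 0 ]; do
    # -f makes curl return non-zero on HTTP errors; the body is discarded.
    if curl -s -f -o /dev/null "$probe_url"; then
      return 0
    fi
    tries=$((tries - 1))
    sleep 1
  done
  return 1
}
```

For example: wait_for_tika && curl -T price.xls http://localhost:9998/tika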

Get the Text of a Document

$ curl -X PUT --data-binary @GeoSPARQL.pdf http://localhost:9998/tika --header "Content-type: application/pdf"
$ curl -T price.xls http://localhost:9998/tika --header "Accept: text/html"
$ curl -T price.xls http://localhost:9998/tika --header "Accept: text/plain"

Detector Resource

/detect/stream

HTTP PUT a document to have Tika's Default Detector identify its MIME/media type. Providing a filename hint can improve the quality of detection.

By default, the response is the media type name as a string.

Some example calls with cURL:

PUT an RTF file and get back RTF

$ curl -X PUT --upload-file TODO.rtf http://localhost:9998/detect/stream

PUT a CSV file without filename hint and get back text/plain

$ curl -X PUT --upload-file foo.csv http://localhost:9998/detect/stream

PUT a CSV file with filename hint and get back text/csv

$ curl -X PUT -H "Content-Disposition: attachment; filename=foo.csv" --upload-file foo.csv http://localhost:9998/detect/stream
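
Since the filename hint is derived from the file being uploaded, the two can be tied together in a small wrapper. This is only a sketch: detect_type is a hypothetical function name and TIKA_URL is a hypothetical environment variable that defaults to the address used in the examples above.

```shell
# Detect the media type of a local file, always sending its base name as a hint.
detect_type() {
  hint=$(basename "$1")
  curl -s -X PUT \
    -H "Content-Disposition: attachment; filename=$hint" \
    --upload-file "$1" \
    "${TIKA_URL:-http://localhost:9998}/detect/stream"
}
```

For example: detect_type /var/tmp/foo.csv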

Unpacker Resource

/unpacker

HTTP PUT a document that contains embedded resources to the /unpacker service to get back a zip or tar archive holding the extracted content of each embedded resource, stored under its original filename.

The default return type is ZIP (without internal compression). Use the "Accept" header to request TAR instead.

Some example calls with cURL:

PUT zip file and get back met file zip

$ curl -X PUT --data-binary @foo.zip http://localhost:9998/unpacker --header "Content-type: application/zip"

PUT doc file and get back met file tar

$ curl -T Doc1_ole.doc -H "Accept: application/x-tar" http://localhost:9998/unpacker > /var/tmp/x.tar

"All" resource

Get text, metadata and attachments in one request.

$ curl -T Doc1_ole.doc http://localhost:9998/all > /var/tmp/x.zip

Text is stored in the __TEXT__ file and the metadata CSV in __METADATA__. Use the "Accept" header if you want TAR output.
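
If only the text is of interest, it can be pulled straight out of the archive. The sketch below requests TAR output and prints the __TEXT__ entry; it assumes that /all accepts the same "Accept: application/x-tar" value shown for /unpacker and that the entry sits at the top level of the archive. The function name all_text and the TIKA_URL variable are hypothetical.

```shell
# Send a document to /all and print the extracted text from the TAR response.
all_text() {
  archive=$(mktemp) || return 1
  curl -s -T "$1" -H "Accept: application/x-tar" -o "$archive" \
    "${TIKA_URL:-http://localhost:9998}/all" &&
    tar -xOf "$archive" __TEXT__      # -O writes the entry to stdout
  result=$?
  rm -f "$archive"                    # clean up the temporary archive
  return $result
}
```

For example: all_text Doc1_ole.doc > /var/tmp/Doc1_ole.txt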

Information Services

Defined Mime Types

/mime-types

Lists the MIME types, their aliases, their supertype, and the parser. Available as plain text, JSON or human-readable HTML.

Available Detectors

/detectors

Lists the top-level Detector to be used, and any child detectors within it. Available as plain text, JSON or human-readable HTML.
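
The page does not say how a format is chosen; the sketch below assumes it is selected with the "Accept" header, as for the other resources, and that the usual media type names apply. Both helper names and the TIKA_URL variable are hypothetical.

```shell
# Map a short format name to the media type assumed to select it.
accept_for() {
  case "$1" in
    text) echo "text/plain" ;;
    json) echo "application/json" ;;
    html) echo "text/html" ;;
    *)    echo "unknown format: $1" >&2; return 1 ;;
  esac
}

# GET an information resource in the requested format.
# Arguments: <resource name, e.g. mime-types or detectors> <text|json|html>
tika_info() {
  accept=$(accept_for "$2") || return 1
  curl -s -H "Accept: $accept" "${TIKA_URL:-http://localhost:9998}/$1"
}
```

For example: tika_info detectors json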

Available Parsers

TODO

Extracting A Document From A URL

It is possible to use a remote file with TikaJAXRS by first downloading it via its URL and then piping it to the appropriate service:

$ curl -s "http://url/to/my.file" | curl -X PUT -T - http://localhost:9998/meta
$ curl -s "http://url/to/my.file" | curl -X PUT -T - http://localhost:9998/tika

The caveat with the above approach is that it fetches the entire file, so large files such as video can take some time to download. Therefore, you may wish to use cURL to get preliminary information (content type, name and size) about the file before you proceed:

$ curl -I http://url/to/my.file

If the file should be parsed (e.g. you only want to get information about MP3s, MP4s and PDFs), send it on to TikaJAXRS.
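
The check and the hand-off can be combined in one script. The following is a sketch rather than a recommended implementation: the function names and the TIKA_URL variable are hypothetical, the media types chosen for MP3, MP4 and PDF are assumptions about what the remote server reports, and redirects are not followed.

```shell
# Succeed only for the media types we want parsed (assumed names).
wanted_type() {
  case "$1" in
    audio/mpeg*|video/mp4*|application/pdf*) return 0 ;;
    *) return 1 ;;
  esac
}

# Print the Content-Type reported by a HEAD request to the remote URL.
remote_content_type() {
  curl -sI "$1" | tr -d '\r' |
    awk -F': *' 'tolower($1) == "content-type" { print $2; exit }'
}

# Download and forward the file to /tika only when its type is wanted.
parse_if_wanted() {
  if wanted_type "$(remote_content_type "$1")"; then
    curl -s "$1" | curl -s -X PUT -T - "${TIKA_URL:-http://localhost:9998}/tika"
  else
    echo "skipping $1" >&2
    return 1
  fi
}
```

For example: parse_if_wanted "http://url/to/my.file"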

TikaJAXRS (last edited 2014-06-14 18:43:25 by ChrisMattmann)