Differences between revisions 24 and 25
Revision 24 as of 2014-08-29 07:02:28
Size: 6841
Editor: HaydenYoung
Comment: Link to Github project.
Revision 25 as of 2014-11-18 12:15:14
Size: 6906
Editor: DaveMeikle
Comment: Moved the /all resource to /unpacker/all

Introduction

This page documents tika-server, Apache Tika's JSR 311 (JAX-RS) network server. The server uses the Apache CXF framework, which provides an implementation of JAX-RS for Java, and builds as a standalone package within Tika.

Installation

To install:

  1. Download the latest stable source from the Apache Tika download page or retrieve the latest code from GitHub.
  2. Build the source using Maven.
  3. Run the Apache Tika JAXRS server.

wget http://mirror.vorboss.net/apache/tika/tika-x.x-src.zip
unzip tika-x.x-src
cd ./tika-x.x/
mvn install
cd ./tika-server/target/
java -jar tika-server-x.x.jar

Remember to replace x.x with the version you have downloaded.

You will then see a message such as the following:

$ java -jar target/tika-server-1.2-SNAPSHOT.jar
Apr 4, 2012 7:48:49 AM org.apache.tika.server.TikaServerCli main
INFO: Starting Tikaserver ${project.version}
Apr 4, 2012 7:48:50 AM org.apache.cxf.endpoint.ServerImpl initDestination
INFO: Setting the server's publish address to be http://localhost:9998/
2012-04-04 07:48:50.316:INFO:oejs.Server:jetty-7.x.y-SNAPSHOT
2012-04-04 07:48:50.375:INFO:oejs.AbstractConnector:Started SelectChannelConnector@localhost:9998 STARTING
2012-04-04 07:48:50.399:INFO:oejsh.ContextHandler:started o.e.j.s.h.ContextHandler{,null}
Apr 4, 2012 7:48:50 AM org.apache.tika.server.TikaServerCli main
INFO: Started

This output confirms that the server started correctly.

You can pass additional options to change the host name and port number:

java -jar tika-server-x.x.jar --host=intranet.local --port=12345

Below is some basic documentation on how to interact with the services using cURL and HTTP.

Services

All services that take files use HTTP "PUT" requests. The original file must be sent in the request body without any additional encoding (do not use multipart/form-data or other containers).

Information services (e.g. defined MIME types, defined parsers) use HTTP "GET" requests.

You may optionally specify the content type in the "Content-Type" header. If you do not specify a MIME type, Tika will use its detectors to guess it.
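The request rules above can be sketched in Python. This is a minimal illustration using only the standard library, not part of tika-server: the helper names put_headers and put_file are hypothetical, and the host and port are the defaults shown in the startup log. http.client is used here because it sends the body bytes as given and does not add a Content-Type header of its own, so Tika's detectors are used whenever no type is passed.

```python
import http.client

HOST, PORT = "localhost", 9998  # default address from the startup log


def put_headers(content_type=None, accept=None):
    """Return the optional headers for a PUT; anything omitted is left to Tika."""
    headers = {}
    if content_type:
        headers["Content-Type"] = content_type
    if accept:
        headers["Accept"] = accept
    return headers


def put_file(resource, path, content_type=None, accept=None):
    """PUT the raw bytes of `path` to `resource` and return (status, body)."""
    with open(path, "rb") as f:
        body = f.read()  # sent as-is: no multipart or form encoding
    conn = http.client.HTTPConnection(HOST, PORT)
    try:
        conn.request("PUT", resource, body=body,
                     headers=put_headers(content_type, accept))
        resp = conn.getresponse()
        return resp.status, resp.read()
    finally:
        conn.close()
```

With a server running, a call such as put_file("/tika", "price.xls", accept="text/plain") would mirror the cURL examples on this page.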

You may specify an additional identifier in the URL after the resource name, like "/tika/my-file-i-sent-to-tika-resource" for the "/tika" resource. Tikaserver uses this name only for logging, so you may put a file name, UUID or any other identifier there (do not forget to URL-encode any special characters).
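As a small sketch of the URL-encoding step, the following Python builds a resource URL with an encoded identifier. The function name resource_url and the sample file name are hypothetical; the base address is the default from the startup log.

```python
from urllib.parse import quote


def resource_url(resource, identifier=None, base="http://localhost:9998"):
    """Append an optional, URL-encoded logging identifier to a resource URL."""
    url = base + resource
    if identifier is not None:
        # safe="" also encodes "/", so the identifier stays one path segment
        url += "/" + quote(identifier, safe="")
    return url


print(resource_url("/tika", "my report (final).pdf"))
# prints http://localhost:9998/tika/my%20report%20%28final%29.pdf
```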

Resources may return the following HTTP codes:

  • 200 OK - request completed successfully
  • 204 No Content - request completed successfully, result is empty
  • 422 Unprocessable Entity - unsupported MIME type, encrypted document, etc.
  • 500 Error - error while processing the document
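A client might translate these codes along the following lines. This is only a sketch: TikaError and handle_response are hypothetical names, and treating any other status as an error is an assumption rather than documented server behaviour.

```python
class TikaError(Exception):
    """Raised for the failure codes listed above (hypothetical helper type)."""


def handle_response(status, body):
    """Return the body for 200, None for 204, and raise for anything else."""
    if status == 200:
        return body
    if status == 204:
        return None  # request succeeded, but the result is empty
    if status == 422:
        raise TikaError("unprocessable: unsupported MIME type or encrypted document")
    if status == 500:
        raise TikaError("server error while processing the document")
    # Assumption: codes not in the list above are treated as failures too.
    raise TikaError("unexpected status %d" % status)
```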

Metadata Resource

/meta

HTTP PUT a document to the /meta service to get back its metadata as "text/csv".
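Because the response is CSV with one metadata key per row, it can be read with a standard CSV parser. The sketch below assumes each row is a key followed by one or more values; the function name parse_meta_csv is hypothetical, and keeping extra fields as a list is an assumption about multi-valued keys. The sample rows are the ones from this page's example output.

```python
import csv
import io


def parse_meta_csv(text):
    """Turn /meta CSV text into a dict of key -> value (or list of values)."""
    meta = {}
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue  # skip blank lines
        # Assumption: more than one value after the key means a multi-valued key.
        meta[row[0]] = row[1] if len(row) == 2 else row[1:]
    return meta


sample = '"Content-Encoding","ISO-8859-2"\n"Content-Type","text/plain"\n'
print(parse_meta_csv(sample))
# prints {'Content-Encoding': 'ISO-8859-2', 'Content-Type': 'text/plain'}
```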

Some example calls with cURL:

$ curl -X PUT --data-binary @zipcode.csv http://localhost:9998/meta --header "Content-Type: text/csv"
$ curl -T price.xls http://localhost:9998/meta

Returns:

"Content-Encoding","ISO-8859-2"
"Content-Type","text/plain"

Tika Resource

/tika

HTTP PUT a document to the /tika service to get back the extracted text. An HTTP GET returns a greeting confirming that the server is up.

Some example calls with cURL:

Get HELLO message back

$ curl -X GET http://localhost:9998/tika
This is Tika Server. Please PUT

Get the Text of a Document

$ curl -X PUT --data-binary @GeoSPARQL.pdf http://localhost:9998/tika --header "Content-type: application/pdf"
$ curl -T price.xls http://localhost:9998/tika --header "Accept: text/html"
$ curl -T price.xls http://localhost:9998/tika --header "Accept: text/plain"

Detector Resource

/detect/stream

HTTP PUT a document to have Tika's Default Detector identify its MIME/media type. Providing a filename hint can improve the quality of detection.

By default, the response is the media type name as a string.

Some example calls with cURL:

PUT an RTF file and get back RTF

$ curl -X PUT --upload-file TODO.rtf http://localhost:9998/detect/stream

PUT a CSV file without filename hint and get back text/plain

$ curl -X PUT --upload-file foo.csv http://localhost:9998/detect/stream

PUT a CSV file with filename hint and get back text/csv

$ curl -X PUT -H "Content-Disposition: attachment; filename=foo.csv" --upload-file foo.csv http://localhost:9998/detect/stream
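The filename hint can be added from Python in the same way. The following is a sketch using only the standard library; detect_headers and detect are hypothetical names, the header form is copied from the cURL example above, and decoding the response as UTF-8 is an assumption.

```python
import http.client
import os


def detect_headers(filename=None):
    """Headers for a /detect/stream PUT, with an optional filename hint."""
    if filename is None:
        return {}
    # Same header form as the cURL example; names containing spaces or
    # quotes would need extra escaping, which this sketch does not attempt.
    return {"Content-Disposition": "attachment; filename=" + filename}


def detect(path, hint=True, host="localhost", port=9998):
    """PUT a file to /detect/stream and return the media type as text."""
    with open(path, "rb") as f:
        body = f.read()
    name = os.path.basename(path) if hint else None
    conn = http.client.HTTPConnection(host, port)
    try:
        conn.request("PUT", "/detect/stream", body=body,
                     headers=detect_headers(name))
        return conn.getresponse().read().decode("utf-8").strip()
    finally:
        conn.close()
```

Calling detect("foo.csv") and detect("foo.csv", hint=False) against a running server would correspond to the last two cURL examples.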

Unpacker Resource

/unpacker

HTTP PUT a document that contains embedded resources to the /unpacker service to get back a zip or tar archive of the extracted text for each resource filename in the original document. You can also use /unpacker/all to get back both the text and metadata.

The default return type is ZIP (without internal compression). Use the "Accept" header to request TAR instead.

Some example calls with cURL:

PUT zip file and get back met file zip

$ curl -X PUT --data-binary @foo.zip http://localhost:9998/unpacker --header "Content-type: application/zip"

PUT doc file and get back met file tar

$ curl -T Doc1_ole.doc -H "Accept: application/x-tar" http://localhost:9998/unpacker > /var/tmp/x.tar

PUT doc file and get back the content and metadata

$ curl -T Doc1_ole.doc http://localhost:9998/unpacker/all > /var/tmp/x.zip

The text is stored in the __TEXT__ file and the metadata CSV in the __METADATA__ file. Use the "Accept" header if you want TAR output.
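Once the ZIP from /unpacker/all has been saved or held in memory, its parts can be separated by entry name. The sketch below uses Python's zipfile module; read_unpacked is a hypothetical helper, and treating every other entry as an attachment is an assumption about the archive layout.

```python
import io
import zipfile


def read_unpacked(zip_bytes):
    """Split an /unpacker/all ZIP into text, metadata CSV and other entries."""
    text = metadata = None
    attachments = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for name in zf.namelist():
            data = zf.read(name)
            if name == "__TEXT__":
                text = data
            elif name == "__METADATA__":
                metadata = data
            else:
                # Assumption: anything else is an embedded resource.
                attachments[name] = data
    return text, metadata, attachments
```

For the file written by the cURL example, open("/var/tmp/x.zip", "rb").read() would supply the bytes to pass in.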

Information Services

Defined Mime Types

/mime-types

Lists the MIME types, their aliases, their supertype, and the parser. Available as plain text, JSON or human-readable HTML.

Available Detectors

/detectors

Lists the top-level Detector to be used and any child detectors within it. Available as plain text, JSON or human-readable HTML.

Available Parsers

TODO

Extracting A Document From A URL

You can process a remote file with TikaJAXRS by first downloading it from its URL and then piping it to the appropriate service:

$ curl -s "http://url/to/my.file" | curl -X PUT -T - http://localhost:9998/meta
$ curl -s "http://url/to/my.file" | curl -X PUT -T - http://localhost:9998/tika

The caveat is that this approach fetches the entire file, so large files such as video can take some time to download. You may therefore wish to use curl to get preliminary information (content type, name and size) about the file before you proceed:

$ curl -I http://url/to/my.file

If the file should be parsed (e.g. you only want information about MP3s, MP4s and PDFs), send it on to TikaJAXRS.
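The pre-check described above could be scripted roughly as follows. This is a sketch under assumptions: head_content_type and should_parse are hypothetical names, and the default media types (audio/mpeg, video/mp4, application/pdf) are the usual ones for MP3, MP4 and PDF, though a given server may report something different.

```python
import urllib.request


def head_content_type(url):
    """Send a HEAD request and return the Content-Type header, if any."""
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req) as resp:
        return resp.headers.get("Content-Type")


def should_parse(content_type,
                 wanted=("audio/mpeg", "video/mp4", "application/pdf")):
    """Decide whether a file is worth downloading and sending to Tika."""
    if not content_type:
        return False
    # Drop parameters such as "; charset=..." before comparing.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type in wanted
```

A caller would check should_parse(head_content_type(url)) first and only then download the file and PUT it to /meta or /tika.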

TikaJAXRS (last edited 2014-11-18 12:15:14 by DaveMeikle)