What is Apache Tika Pipes?

In Apache Tika 3.x, I am working on an experimental gRPC server to front Apache Tika Pipes: https://github.com/apache/tika/pull/1702

See tika-pipes for some initial information on Apache Tika Pipes in general. In summary, Apache Tika Pipes is a Java program in which multiple workers pull in "FetchInput" data that tells Tika how to get a file; Tika Pipes then responds with the parsed output, "FetchOutput".

The result is a pool of Tika Pipes VMs that process large amounts of data quickly by running Apache Tika parsers in parallel.
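The worker-pool idea can be sketched in plain Java. This is only an illustration under stated assumptions, not Tika Pipes code: the `FetchInput` and `FetchOutput` classes, their fields, and the `fetchAndParse` placeholder are hypothetical stand-ins for whatever the real implementation uses.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class PipesWorkerPoolSketch {

    // Hypothetical stand-in for "FetchInput": which fetcher to use and what to fetch.
    static final class FetchInput {
        final String fetcherName;
        final String fetchKey;

        FetchInput(String fetcherName, String fetchKey) {
            this.fetcherName = fetcherName;
            this.fetchKey = fetchKey;
        }
    }

    // Hypothetical stand-in for the parsed "FetchOutput".
    static final class FetchOutput {
        final String fetchKey;
        final String content;

        FetchOutput(String fetchKey, String content) {
            this.fetchKey = fetchKey;
            this.content = content;
        }
    }

    // Placeholder for the real fetch-and-parse step (assumption: no Tika call is made here).
    static FetchOutput fetchAndParse(FetchInput input) {
        return new FetchOutput(input.fetchKey, "parsed:" + input.fetchKey);
    }

    // Runs one task per input on a fixed pool of workers; results come back in input order.
    static List<FetchOutput> processAll(List<FetchInput> inputs, int numWorkers) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(numWorkers);
        try {
            List<Callable<FetchOutput>> tasks = new ArrayList<>();
            for (FetchInput input : inputs) {
                tasks.add(() -> fetchAndParse(input));
            }
            List<FetchOutput> outputs = new ArrayList<>();
            for (Future<FetchOutput> future : pool.invokeAll(tasks)) {
                outputs.add(future.get());
            }
            return outputs;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<FetchInput> inputs = List.of(
                new FetchInput("http-fetcher", "a.pdf"),
                new FetchInput("http-fetcher", "b.docx"));
        for (FetchOutput output : processAll(inputs, 2)) {
            System.out.println(output.fetchKey + " -> " + output.content);
        }
    }
}
```

The pool size plays the role of the number of workers; in a real deployment each worker would do the actual fetch and parse instead of returning a placeholder string.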

Introducing the Apache Tika Pipes gRPC Server

I have created a gRPC server that fronts Apache Tika Pipes. Here is the protobuf definition for the Apache Tika Pipes gRPC service: https://github.com/apache/tika/blob/2f251cd3e3994c64af1d4f049e340da5796ec243/tika-pipes/tika-grpc/src/main/proto/tika.proto

The use-case is that you have worker threads (we will refer to these as the "parse client") that want to parse some documents with multi-node parallel processing.

The parse client worker threads are responsible for finding documents that need to be parsed and then submitting them to the Apache Tika Pipes gRPC server.

Using gRPC, a bidirectional connection is made: an input stream of "Fetch Item"s tells the server what to parse, while an output stream of "Parse Output"s simultaneously describes what was parsed.
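To make the shape of that exchange concrete, here is a minimal pure-Java simulation of a bidirectional stream. It does not use gRPC or the generated Tika stubs: the local `Observer` interface only mirrors the shape of grpc-java's `StreamObserver` (with `onError` left out for brevity), and `openStream`, `parseAll`, and the "parsed:" reply format are hypothetical names chosen for illustration.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

final class BiDirectionalStreamSketch {

    // Local stand-in that mirrors the shape of grpc-java's StreamObserver.
    interface Observer<T> {
        void onNext(T value);

        void onCompleted();
    }

    // Fake in-process "server": returns the observer the client writes requests to,
    // and pushes one reply per request to the client's response observer.
    static Observer<String> openStream(Observer<String> responses, ExecutorService serverThread) {
        return new Observer<String>() {
            @Override
            public void onNext(String fetchKey) {
                serverThread.execute(() -> responses.onNext("parsed:" + fetchKey));
            }

            @Override
            public void onCompleted() {
                serverThread.execute(responses::onCompleted);
            }
        };
    }

    // Client side: sends every key, then waits until the server closes the output stream.
    static List<String> parseAll(List<String> fetchKeys) throws InterruptedException {
        ExecutorService serverThread = Executors.newSingleThreadExecutor();
        List<String> received = new CopyOnWriteArrayList<>();
        CountDownLatch done = new CountDownLatch(1);
        Observer<String> requests = openStream(new Observer<String>() {
            @Override
            public void onNext(String output) {
                received.add(output);
            }

            @Override
            public void onCompleted() {
                done.countDown();
            }
        }, serverThread);
        try {
            for (String key : fetchKeys) {
                requests.onNext(key); // outputs may already be arriving while we keep sending
            }
            requests.onCompleted();
            if (!done.await(5, TimeUnit.SECONDS)) {
                throw new IllegalStateException("stream did not complete in time");
            }
            return received;
        } finally {
            serverThread.shutdown();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(parseAll(List.of("a.pdf", "b.docx")));
    }
}
```

With real gRPC the same pattern applies: the client hands the stub a response observer and gets back a request observer, so sending and receiving proceed independently on the same call.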

Running a Simple Tika Pipes gRPC Bi-Directional Streaming Example

Here are the requirements for the example we are going to run:

  • Download test documents and extract them on the local server.
  • Start a local web server that serves the files from that directory.
    • Enable JWT authentication on this server - thus requiring a bearer token to access the web resources
      • The point of this is to prove we can authenticate safely in this process.
  • Start the Apache Tika Grpc Server
    • Configured with a tika-config XML file tailored to our needs.
  • Provide both Java and Go clients that can open a gRPC connection to the Apache Tika gRPC service, stream the list of HTTP links for the documents into the service, and obtain the parsed output.
    • Show various configurations of the number of parallel worker threads in play.
  • The gRPC server will use mutual TLS authentication.
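As a rough illustration of the bearer-token requirement above, the following sketch builds an HTTP GET request that carries a JWT in the `Authorization` header, using the JDK's `java.net.http` API. It is not how the Tika HTTP fetcher is configured; the class name, the URL, and the token value are placeholders assumed for this example.

```java
import java.net.URI;
import java.net.http.HttpRequest;

final class BearerFetchSketch {

    // Builds a GET request for a document URL that presents the JWT as a bearer token.
    static HttpRequest buildFetchRequest(String url, String jwt) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("Authorization", "Bearer " + jwt)
                .GET()
                .build();
    }

    public static void main(String[] args) {
        // Placeholder URL and token; a real token would come from the JWT setup on the web server.
        HttpRequest request = buildFetchRequest("http://localhost:8080/docs/a.pdf", "example-token");
        System.out.println(request.method() + " " + request.uri());
        System.out.println(request.headers().firstValue("Authorization").orElse("(none)"));
    }
}
```

Sending the request with `HttpClient.newHttpClient().send(...)` against the local web server should succeed with a valid token and be rejected without one, which is the behavior this example setup is meant to demonstrate.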

Java Bi-Directional Streaming Example

A Java Tika gRPC server with an HTTP fetcher is started, and a Tika gRPC client opens a bidirectional stream and processes a batch of files that need parsing.

https://github.com/apache/tika/blob/tika-grpc-3x-features/tika-pipes/tika-grpc/src/test/java/org/apache/tika/pipes/grpc/PipesBiDirectionalStreamingIntegrationTest.java

Go Bi-Directional Streaming Example

TODO

Build Tika gRPC on Docker

You can build a Docker image for this using the following example:

https://github.com/apache/tika/tree/tika-grpc-3x-features/tika-pipes/tika-grpc/example-dockerfile

The example Dockerfile only shows a non-TLS setup so far; a TLS example will be added soon.

