Prerequisites

DockerHub Repository and Access

To release the Apache Tika Docker image on DockerHub, you need access to the apache/tika repository. Access is controlled by the ASF Infra team and can be requested through an INFRA JIRA ticket. Make sure to tag the ticket with the Docker label.

tika-docker repo

This repository contains the Dockerfiles used to create the minimal and full images for Apache Tika. It also contains helper examples and configurations.

General Information

Image Types

There are two image types:

  • Minimal - containing just Apache Tika and its base dependencies (i.e. Java)
  • Full - containing Apache Tika, its dependencies, as well as Tesseract and GDAL.

The Dockerfile for each image is in the correspondingly named directory and is the only asset used to publish that image.

Docker Compose Files

There are a number of Docker Compose files to allow users to quickly test certain scenarios:

  • Recognising and Captioning Video and Images with TensorFlow REST (see here)
  • Enriching Academic PDF Parsing with Grobid REST (see here)
  • OCR of PDF or Images with Tesseract including a Custom Configuration (see here)
  • Named Entity Recognition (see here)

These different scenarios use the corresponding configuration in the sample-configs directory.

Neither these Docker Compose YML files nor the sample configurations are used for publishing Apache Tika's Docker image. They only provide examples of complex configurations.

An example of using these is provided here.

docker-tool.sh

This shell file is a helper script used to simplify the building, testing and publication of the images.

It provides the following options:

  • build - to build a minimal and full image of the passed in version
  • test - to verify that the built image starts and returns its version number
  • publish - to publish the image on DockerHub (only for those who have access to the DockerHub repo)
  • latest - to tag the supplied version built locally as latest on DockerHub.
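
As an illustration of the kind of check the test option performs, here is a minimal sketch of a manual smoke test. It is not the script's actual implementation: the helper name version_matches is hypothetical, and the image name, port 9998 and /version endpoint in the commented usage are assumptions about how the image is run.

```shell
#!/bin/sh
# Hypothetical helper, not part of docker-tool.sh.
# Succeeds when the text returned by the server mentions the expected Tika version.
version_matches() {
    # $1: response body, $2: expected Tika version (e.g. 2.5.0)
    case "$1" in
        *"$2"*) return 0 ;;
        *)      return 1 ;;
    esac
}

# Assumed manual usage (needs Docker and curl, so it is left commented out).
# The image name, port 9998 and the /version endpoint are assumptions:
#   docker run -d --rm --name tika-smoke -p 9998:9998 apache/tika:2.5.0.1
#   sleep 10
#   version_matches "$(curl -s http://localhost:9998/version)" 2.5.0 && echo "test passed"
#   docker stop tika-smoke
```

The comparison is a plain substring match, which is enough for a quick sanity check but would also accept a longer version string that merely contains the expected one.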

republish-images.sh

This shell file was used to republish the older images when the Dockerfile was updated. It is redundant now but kept in the repo in case something similar needs to be done in the future.

Release Process

  1. Update the README.md's Available Tags section
  2. Update the TAG version in .env to be X.Y.Z.Q+1
  3. Update the version in .travis.yml to be X.Y.Z.Q+1 X.Y.Z
  4. Update CHANGES.md to include this release, its changes and the release date
  5. Test the release as in the example below
  6. Commit the changes
  7. To release a new version of Apache Tika on DockerHub, follow the steps below (replacing 2.5.0 with the version number you wish to publish). As of 2.5.0, Docker images are versioned separately, even when they are based on the same Tika version, so the Docker tag might be 2.5.0.1 for Tika version 2.5.0. In the commands, the first version is the Docker version, and the second version in the build command is the Tika version.
$ git clone https://github.com/apache/tika-docker && cd tika-docker
$ ./docker-tool.sh build 2.5.0.1 2.5.0
$ ./docker-tool.sh test 2.5.0.1

# If the test passed, publish the image and tag it as latest:
$ ./docker-tool.sh publish 2.5.0.1
$ ./docker-tool.sh latest 2.5.0.1

  8. If everything worked, tag the last commit:
$ git tag -a 2.5.0.1 -m "New release for 2.5.0.1"
$ git push --tags
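
The versioning rule used in the steps above (Docker tag X.Y.Z.Q for Tika version X.Y.Z, with Q incremented for each new image) can be checked before building. The following is a minimal sketch under that assumption. The function names next_docker_tag and tag_matches_tika are hypothetical and do not exist in the tika-docker repo.

```shell
#!/bin/sh
# Hypothetical helpers; neither function exists in the tika-docker repo.

# Print the next Docker tag by incrementing the last (Q) component,
# e.g. 2.5.0.1 becomes 2.5.0.2.
next_docker_tag() {
    base=${1%.*}   # everything before the last dot
    q=${1##*.}     # the last component
    echo "${base}.$((q + 1))"
}

# Succeed only if the Docker tag ($1) is the Tika version ($2)
# followed by exactly one numeric component.
tag_matches_tika() {
    case "$1" in
        "$2".*) suffix=${1#"$2".} ;;
        *)      return 1 ;;
    esac
    case "$suffix" in
        ''|*[!0-9]*) return 1 ;;
        *)           return 0 ;;
    esac
}

# Possible usage before the build step (left commented out):
#   tag_matches_tika 2.5.0.1 2.5.0 && ./docker-tool.sh build 2.5.0.1 2.5.0
```

Guarding the build command this way catches a mismatched pair of versions, such as a Docker tag of 2.5.1.1 passed alongside Tika 2.5.0, before any image is built or published.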