Overview of the 'tika-eval' Module

This page offers a first draft of the documentation for the tika-eval module, which was recently added to Tika 1.15-SNAPSHOT.

The module offers insight into the output of a single extraction tool and enables comparisons between tools. It was designed to help develop Tika, but it can be used to evaluate other tools as well.

As part of Tika's periodic regression testing, we run this module against ~3 million files. However, as currently designed, it will not scale to hundreds of millions of files. Patches are welcome!

Background

There are many tools for extracting text from various file formats, and even within a single tool there are usually countless parameters that can be tweaked. The goal of 'tika-eval' is to allow developers to quickly compare the output of:

  1. Two different tools
  2. Two versions of the same tool ("Should we upgrade? Or are there problems with the newer version?")
  3. Two runs with the same tool but with different settings ("Does increasing the DPI for OCR improve extraction? Let's try two runs, one with 200 DPI and one with 300")
  4. Different tools against a gold standard

In addition to this "comparison mode", there is also plenty of information one can get from looking at a profile of a single run.

Some basic metrics apply to both the "comparison" and "profiling" modes; see TikaEvalMetrics for a discussion of them.

Quick Start Usage

NOTE: tika-eval will not overwrite the contents of the database you specify in Profile or Compare mode. Add -drop to the command line to drop tables if you are reusing the database.

The following assumes that you are using the default local H2 database. To connect tika-eval to your own database via JDBC, see TikaEvalJdbc.

Single Output from One Tool (Profile)

  1. Create a directory of extract files that mirrors your input directory. These files may be UTF-8 text files with '.txt' appended to the original file's name, or they may be the RecursiveParserWrapper's '.json' representation produced by tika-app's '-J -t' option.

  2. Profile the directory of extracts and create a local H2 database:
    • java -jar tika-eval.X.Y.jar Profile -extractDir json -db profiledb

  3. Write reports from the database:
    • java -jar tika-eval.X.Y.jar Report -db profiledb

You'll have a directory of .xlsx reports under the "reports" directory.
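The mirroring convention in step 1 can be sketched in a few lines of Python (an illustration only, not part of tika-eval; the function name is hypothetical). It maps an input file to the path its '.txt' extract is expected at:

```python
from pathlib import Path

def extract_path(input_root: Path, extract_root: Path, doc: Path) -> Path:
    """Map an input file to its mirrored extract file by appending '.txt'."""
    rel = doc.relative_to(input_root)
    return extract_root / rel.parent / (rel.name + ".txt")

# e.g. input/reports/a.pdf is expected at extracts/reports/a.pdf.txt
p = extract_path(Path("input"), Path("extracts"), Path("input/reports/a.pdf"))
```

The same convention applies to '.json' extracts, with '.json' appended instead of '.txt'.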

Comparing Output from Two Tools/Settings (Compare)

  1. Create two directories of extract files that mirror your input directory. These files may be UTF-8 text files with '.txt' appended to the original file's name, or they may be the RecursiveParserWrapper's '.json' representation produced by tika-app's '-J -t' option.

  2. Compare the extract directory A with extract directory B and write results to a local H2 database:
    • java -jar tika-eval.X.Y.jar Compare -extractsA tika_1_14 -extractsB tika_1_15 -db comparisondb

  3. Write reports from the database:
    • java -jar tika-eval.X.Y.jar Report -db comparisondb

You'll have a directory of .xlsx reports under the "reports" directory.

Investigating the Database

  1. Fire up the H2 localhost server:
    • java -jar tika-eval.X.Y.jar StartDB -- this calls java -cp ... org.h2.tools.Console -web

  2. Navigate a browser to http://localhost:8082 and enter the JDBC connection string (the 'jdbc:h2:' prefix followed by the full path to your db file):

    • jdbc:h2:/C:/users/someone/mystuff/tika-eval/comparisondb

If your reaction is: "You call this a database?!", please open tickets and contribute to improving the structure.

See TikaEvalDbDesign for more information on the underlying structure of the database.

More detailed usage

Evaluating Success via Common Words

In the absence of ground truth, it is often helpful to count the number of common words that were extracted (see TikaEvalMetrics for a discussion of this).

"Common words" are specified per language in the "resources/commonwords" directory. Each file is named for the language code, e.g. 'en', and each file is a UTF-8 text file with one word per line.

The token processor runs language identification against the content and then selects the appropriate set of common words for its counts. If there is no common-words file for a language, it backs off to the default list, which is currently hardcoded to 'en'.

Make sure that your common words have gone through the same analysis chain as specified by the Common Words analyzer in 'analyzers.json'!
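The lookup-and-count logic described above can be sketched as follows. This is a Python illustration, not tika-eval's actual implementation; only the one-word-per-line file format and the 'en' fallback come from the description above, and it ignores the analyzer chain entirely:

```python
import tempfile
from pathlib import Path

def load_common_words(lang: str, commonwords_dir: Path, default: str = "en") -> set:
    """Load the common-words list for a language code, backing off to the default."""
    path = commonwords_dir / lang
    if not path.is_file():
        path = commonwords_dir / default  # no list for this language: back off
    return set(path.read_text(encoding="utf-8").splitlines())

def common_word_count(tokens, common_words) -> int:
    """Count how many tokens appear in the common-words set."""
    return sum(1 for t in tokens if t in common_words)

# Demo: a commonwords dir that contains only an 'en' list.
demo = Path(tempfile.mkdtemp())
(demo / "en").write_text("the\nof\nand\n", encoding="utf-8")
words = load_common_words("fr", demo)  # no 'fr' file, so this falls back to 'en'
count = common_word_count(["le", "the", "of", "chat"], words)
```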

Reading Extracts

alterExtract

Let's say you want to compare the output of Tika to another tool that extracts text. You happen to have a directory of .json files for Tika and a directory of UTF-8 .txt files from the other tool.

  1. If the other tool extracts embedded content, you'd want to concatenate all the content within Tika's .json file for a fair comparison:
    • java -jar tika-eval.X.Y.jar Compare -extractsA tika_1_14 -extractsB tika_1_15 -db comparisondb -alterExtract concatenate_content

  2. If the other tool does not extract embedded content, you'd only want to look at the first metadata object (representing the container file) in the .json file:
    • java -jar tika-eval.X.Y.jar Compare -extractsA tika_1_14 -extractsB tika_1_15 -db comparisondb -alterExtract first_only

Min/Max Extract Size

You may find that some extracts are too big to fit in memory; in that case, use -maxExtractSize <maxBytes>. Alternatively, you may want to focus only on extracts that are greater than a minimum length: -minExtractSize <minBytes>.
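The effect of these two thresholds amounts to a simple byte-length filter, sketched below in Python (an illustration of the idea, not tika-eval's exact behavior):

```python
def within_size_limits(num_bytes: int, min_bytes=None, max_bytes=None) -> bool:
    """Decide whether an extract's byte length falls within the configured bounds."""
    if min_bytes is not None and num_bytes < min_bytes:
        return False  # too small: skip
    if max_bytes is not None and num_bytes > max_bytes:
        return False  # too big to process: skip
    return True

keep = within_size_limits(5_000, min_bytes=100, max_bytes=1_000_000)
```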

Reports

The tika-eval module comes with a set of default reports. However, you might want to generate your own. Each report is specified by SQL and a few other settings in an XML file. See comparison-reports.xml and profile-reports.xml to get a sense of the syntax.
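Conceptually, each report pairs a name with a SQL statement to run against the database. The toy Python/sqlite3 sketch below shows that idea only; the real module runs against H2, writes .xlsx files, and uses its own XML syntax (the attribute names, table, and columns here are made up):

```python
import sqlite3
import xml.etree.ElementTree as ET

REPORTS_XML = """
<reports>
  <report reportName="file_counts"
          sql="SELECT mime, COUNT(*) AS cnt FROM profiles GROUP BY mime"/>
</reports>
"""

def run_reports(conn, reports_xml):
    """Run each report's SQL and collect its rows, keyed by report name."""
    results = {}
    for r in ET.fromstring(reports_xml).iter("report"):
        results[r.get("reportName")] = conn.execute(r.get("sql")).fetchall()
    return results

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE profiles (mime TEXT)")
conn.executemany("INSERT INTO profiles VALUES (?)",
                 [("application/pdf",), ("application/pdf",), ("text/plain",)])
results = run_reports(conn, REPORTS_XML)
```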

To specify your own reports on the command line, use -rf (report file).

If you'd like to write the reports to a root directory other than 'reports', specify that with -rd (report directory).

Again, see TikaEvalDbDesign for more information on the underlying structure of the database.

TikaEval (last edited 2017-03-03 17:31:40 by TimothyAllison)