Design of the Tika Eval Database

Background

The tika-eval module was initially designed to work with the output of Tika's RecursiveParserWrapper (-J -t in tika-app, or /rmeta/text in tika-server). The basic idea is that each input file yields a list of Metadata objects. The first Metadata object in the list holds the information for the outer/container input file; if there are attachments or embedded files, a Metadata object for each is added to the list. The extracted content is stored under the "X-TIKA:content" key of each Metadata object.

In the following discussion, we use the term "container" to refer to the initial input file, and we use "file" to refer either to the container file or to an embedded object. An "extract" is the file that contains the extracted metadata/content.
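As a minimal sketch of that structure, the snippet below reads one extract as a JSON list and separates the container's metadata from that of its embedded files. The sample extract, its values, and the helper name split_extract are made up for illustration; only the "X-TIKA:content" key and the container-first ordering come from the description above.

```python
import json

CONTENT_KEY = "X-TIKA:content"  # key named in the description above

# Hypothetical extract: one container with a single embedded file.
SAMPLE_EXTRACT = json.dumps([
    {"Content-Type": "application/zip", CONTENT_KEY: ""},
    {"Content-Type": "text/plain", CONTENT_KEY: "hello world"},
])


def split_extract(extract_text):
    """Return (container_metadata, embedded_metadata_list) for one extract.

    Illustrative helper, not part of tika-eval. The first list entry is
    treated as the container; every later entry is an embedded file.
    """
    records = json.loads(extract_text)
    if not records:
        return None, []
    return records[0], records[1:]


container, embedded = split_extract(SAMPLE_EXTRACT)

# One content string per file (container first), in list order.
contents = [m.get(CONTENT_KEY, "") for m in [container] + embedded]
```

Because one extract can hold any number of entries, each container maps to one or more files, which is the mapping the table design has to accommodate.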

Because of this one-to-many mapping from container to files, the initial flat-file design did not work well, and the table structure quickly became fairly complex.

Tables

Profile

Comparison

Nearly all of the above exist with '_A' or '_B' appended to the table name. The ID in PROFILES_A matches the ID in PROFILES_B. If a given container file contained a different number of attachments in A than in B, or if the code couldn't determine which attachment in A maps to which attachment in B, there can be incorrect double entries between the two tables.
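The shared ID makes it possible to line up the two sides with a join. The sketch below is only an illustration under stated assumptions: it uses an in-memory SQLite database as a stand-in for the real tika-eval database, and the NUM_TOKENS column and all row values are hypothetical. Only the table names PROFILES_A and PROFILES_B and the matching ID come from the text above.

```python
import sqlite3

# In-memory SQLite stand-in; NUM_TOKENS and the rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE PROFILES_A (ID INTEGER PRIMARY KEY, NUM_TOKENS INTEGER);
    CREATE TABLE PROFILES_B (ID INTEGER PRIMARY KEY, NUM_TOKENS INTEGER);
    INSERT INTO PROFILES_A VALUES (1, 100), (2, 40), (3, 7);
    INSERT INTO PROFILES_B VALUES (1, 120), (2, 40);
""")

# Files present on both sides, paired by the shared ID.
paired = conn.execute("""
    SELECT a.ID, a.NUM_TOKENS, b.NUM_TOKENS
    FROM PROFILES_A a
    JOIN PROFILES_B b ON a.ID = b.ID
    ORDER BY a.ID
""").fetchall()

# Files in A with no counterpart in B, e.g. an attachment found only in A.
only_in_a = conn.execute("""
    SELECT a.ID
    FROM PROFILES_A a
    LEFT JOIN PROFILES_B b ON a.ID = b.ID
    WHERE b.ID IS NULL
""").fetchall()
```

Listing the unmatched IDs alongside the paired rows is one way to spot containers whose attachment counts differ between A and B before trusting a row-by-row comparison.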

In addition to the profiling tables, there is also:

Columns

CONTENTS

CONTENT_COMPARISONS