Design of the Tika Eval Database

Background

The tika-eval module was initially designed to work with the output of Tika's RecursiveParserWrapper (the -J -t options in tika-app, or /rmeta/text in tika-server). The basic idea is that each input file yields a list of Metadata objects. The first Metadata object in the list contains the information from the outer/container input file; if there are attachments or embedded files, a Metadata object for each of them is added to the list. The extracted content is stored under the "X-TIKA:content" key in each Metadata object.

In the following discussion, we use the term "container" to refer to the initial input file, and we use "file" to refer either to the container file or to an embedded object. An "extract" is the file that contains the extracted metadata/content.
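The layout of an extract can be sketched in a few lines of Python. This is only an illustration, assuming the extract is serialized as a JSON array of metadata objects; the sample file names, the "resourceName" values, and the helper functions split_extract and content_lengths are made up for this example and are not part of tika-eval.

```python
import json

# Illustrative extract: a JSON list with one metadata object per file.
# The file names and "resourceName" values are invented for this sketch.
SAMPLE_EXTRACT = json.dumps([
    {"resourceName": "report.zip", "X-TIKA:content": ""},
    {"resourceName": "chapter1.docx", "X-TIKA:content": "First chapter text"},
    {"resourceName": "figure.png", "X-TIKA:content": ""},
])

CONTENT_KEY = "X-TIKA:content"


def split_extract(extract_json):
    """Return (container, embedded_files) from a JSON extract string."""
    metadata_list = json.loads(extract_json)
    if not metadata_list:
        raise ValueError("extract contains no metadata objects")
    # The first metadata object describes the container;
    # any remaining objects describe embedded files or attachments.
    return metadata_list[0], metadata_list[1:]


def content_lengths(extract_json):
    """Return the extracted content length for every file, container first."""
    container, embedded = split_extract(extract_json)
    return [len(m.get(CONTENT_KEY, "")) for m in [container] + embedded]


container, embedded = split_extract(SAMPLE_EXTRACT)
print(container["resourceName"], len(embedded), content_lengths(SAMPLE_EXTRACT))
```

In this sample, one container produces three files in the sense used above: the container itself plus two embedded objects.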

Because of this one-to-many mapping (one container, potentially many files), the table structure became fairly complex fairly quickly. The initial flat-file design could not represent it well.
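To make the one-to-many relationship concrete, here is a minimal sketch of how a container and its files might be split across two tables. The table names (containers, files) and all column names are hypothetical and chosen only for illustration; they are not the actual tika-eval schema. SQLite is used only because it ships with Python.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema: one row per container, one row per file.
# Each file row points back to its container via container_id.
conn.executescript("""
    CREATE TABLE containers (
        container_id INTEGER PRIMARY KEY,
        file_path TEXT NOT NULL
    );
    CREATE TABLE files (
        file_id INTEGER PRIMARY KEY,
        container_id INTEGER NOT NULL REFERENCES containers(container_id),
        is_embedded INTEGER NOT NULL,
        content_length INTEGER NOT NULL
    );
""")

conn.execute("INSERT INTO containers VALUES (1, 'report.zip')")
conn.executemany(
    "INSERT INTO files VALUES (?, ?, ?, ?)",
    [
        (1, 1, 0, 0),   # the container file itself
        (2, 1, 1, 18),  # first embedded file
        (3, 1, 1, 0),   # second embedded file
    ],
)

# Aggregate per container: number of files and total extracted content length.
rows = conn.execute("""
    SELECT c.file_path, COUNT(*), SUM(f.content_length)
    FROM containers c
    JOIN files f ON f.container_id = c.container_id
    GROUP BY c.container_id
""").fetchall()
print(rows)
```

A single flat file would have to either repeat the container-level columns on every file row or pack a variable number of files into one row, which is the kind of difficulty that motivates separate tables.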

Tables

Profile

Comparison

Columns

CONTENTS

CONTENT_COMPARISONS

TikaEvalDbDesign (last edited 2017-03-03 17:32:57 by TimothyAllison)