Connecting tika-eval to your own db via JDBC

As default, tika-eval uses an in-memory H2 database, specified by -db on the commandline.

However, you can also specify a jdbc connection string with -jdbc.

NOTE: We will try to keep the SQL used in Profile and Compare simple enough to apply generally through jdbc. However, we cannot support all the flavors of SQL in our default Report tool. If you choose to use -jdbc, you are on your own for translating the report SQL to your dialect.

So far, I've only tested the jdbc connection with postgresql, derby in memory and H2 in memory. Please open a JIRA issue if you find problems with Profile or Compare with other flavors of SQL.

Examples

First, make sure to put the appropriate db driver jar on your classpath. In the following, we assume that you have the driver jar and the tika-eval.jar in bin/.

Basic Profile

    java -cp "bin/*" org.apache.tika.eval.TikaEvalCLI Profile -jdbc "jdbc:postgresql:tika_eval?user=user&password=superSecret" -extracts extracts

Basic Profile with Custom Table Prefixes

Now, let's say you want to profile 3 separate runs of OCR with different settings, say quality=1.0, quality=0.5 and quality=0.1. You'll need to prefix your tables with a different prefix for each run.

    java -cp "bin/*" org.apache.tika.eval.TikaEvalCLI Profile -jdbc "jdbc:postgresql:tika_eval?user=user&password=superSecret" -extracts extracts10 -tablePrefix o10
    java -cp "bin/*" org.apache.tika.eval.TikaEvalCLI Profile -jdbc "jdbc:postgresql:tika_eval?user=user&password=superSecret" -extracts extracts05 -tablePrefix o05
    java -cp "bin/*" org.apache.tika.eval.TikaEvalCLI Profile -jdbc "jdbc:postgresql:tika_eval?user=user&password=superSecret" -extracts extracts01 -tablePrefix o01

This will leave you with 24 tables, 7 for each run and then 3 reference tables.

Then, let's say you want to compare the number of "common tokens" in each table for each container file. Unfortunately, the "ids" are unique per run, so you can really only focus on the container files, and you must join them by the file path.

So, step 1 is to build indices on the file path:

    create unique index path01_idx on o01_containers(file_path);
    create unique index path05_idx on o05_containers(file_path);
    create unique index path10_idx on o10_containers(file_path);

Then, join on the file_path to get the paths, and then add in the contents tables for each run:

    select o10c.file_path,
        o10ct.num_common_tokens,
        o05ct.num_common_tokens,
        o01ct.num_common_tokens

    from o10_containers o10c
        left join o05_containers o05c on o05c.file_path=o10c.file_path
	left join o01_containers o01c on o01c.file_path=o10c.file_path

        left join o10_contents o10ct on o10c.container_id=o10ct.id
        left join o05_contents o05ct on o05c.container_id=o05ct.id
        left join o01_contents o01ct on o01c.container_id=o01ct.id
  • No labels