tika-eval metrics

To be filled in.

Profiling Metrics

Common Words

For a given language, count the number of "common words" extracted. If your documents generally contain natural language (e.g., not just parts lists or numbers), then the number of common words extracted divided by the number of alphabetic words can offer some insight into how "languagey" the extracted text is.
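As a rough sketch of that ratio (the method name and parameters here are hypothetical, not tika-eval's actual API), the calculation amounts to:

    // Hypothetical helper: commonWordCount and alphabeticWordCount would come from
    // the token-counting step described under Implementation Details below.
    public static double commonWordRatio(long commonWordCount, long alphabeticWordCount) {
        return alphabeticWordCount == 0 ? 0.0 : (double) commonWordCount / alphabeticWordCount;
    }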

Tilman Hausherr originally recommended this metric for comparing the output of different versions of PDFBox. For our initial collaboration with PDFBox, we found a list of common English words and removed those with fewer than four characters. The intuition is that if tool A extracts 500 common words from a document but tool B extracts 1,000, that is some evidence that tool B may have done a better job.
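As an illustration only (the file name, one-word-per-line format, and class name are assumptions, not the actual tika-eval resources), building such a list might look like:

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.Set;
    import java.util.stream.Collectors;

    public class CommonWordsList {
        // Load one word per line and keep only words of four or more characters.
        public static Set<String> load(String path) throws IOException {
            return Files.readAllLines(Paths.get(path), StandardCharsets.UTF_8).stream()
                    .map(String::trim)
                    .filter(w -> w.length() >= 4)
                    .collect(Collectors.toSet());
        }
    }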

Implementation Details

For now, we've set up an Analyzer chain in Lucene that:

But wait, what's a word in non-whitespace languages (e.g., Chinese, Japanese)? We've followed the common practice for non-whitespace languages of tokenizing into overlapping bigrams. This is linguistically abhorrent, but it is mildly useful, if inaccurate, for our purposes.
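A minimal sketch of such an analyzer, assuming a recent Lucene (5.x/6.x) and using StandardTokenizer plus CJKBigramFilter (the class names and exact chain here are an illustration, not necessarily the chain tika-eval uses):

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.cjk.CJKBigramFilter;
    import org.apache.lucene.analysis.core.LowerCaseFilter;
    import org.apache.lucene.analysis.standard.StandardTokenizer;

    public class BigramAwareAnalyzer extends Analyzer {
        @Override
        protected TokenStreamComponents createComponents(String fieldName) {
            // StandardTokenizer emits CJK characters as single-character tokens...
            Tokenizer tokenizer = new StandardTokenizer();
            TokenStream stream = new LowerCaseFilter(tokenizer);
            // ...which CJKBigramFilter then joins into overlapping bigrams,
            // i.e., the "tokenize bigrams" approach described above.
            stream = new CJKBigramFilter(stream);
            return new TokenStreamComponents(tokenizer, stream);
        }
    }

The tokens produced by an analyzer along these lines can then be matched against the common-words list and counted.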

Benefits: Easy to implement.

Risks:

Comparison Metrics
