For a given language, count the number of "common words" extracted. If you can assume that your documents generally contain natural language (e.g., not just parts lists or numbers), then the number of common words extracted divided by the number of alphabetic words may offer some insight into how "languagey" the extracted text is.
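As a minimal sketch of that ratio (not the tika-eval implementation), the following Python assumes a regex-based notion of "alphabetic word" and a tiny, hypothetical common-words set; `common_word_ratio`, `ALPHABETIC`, and `COMMON` are names invented for illustration.

```python
import re

# Runs of word characters excluding digits and underscore
# (approximately "runs of Unicode letters"); an assumption of this sketch.
ALPHABETIC = re.compile(r"[^\W\d_]+")


def common_word_ratio(text, common_words):
    """Return (# common words) / (# alphabetic words), or 0.0 if there are no words."""
    tokens = [t.lower() for t in ALPHABETIC.findall(text)]
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in common_words)
    return hits / len(tokens)


# Hypothetical toy list; a real list would hold thousands of entries.
COMMON = {"this", "that", "with", "from", "have", "text"}

clean = "This text should have words that come from real language"
garbled = "Th1s t#xt sh0uld hv wrds tht cm frm rl lngg"

clean_ratio = common_word_ratio(clean, COMMON)
garbled_ratio = common_word_ratio(garbled, COMMON)
```

With this toy list, half of the alphabetic tokens in `clean` are common words, while none of the fragments in `garbled` are, which is the kind of gap the metric is meant to expose.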
Tilman Hausherr originally recommended this metric for comparing the output from different versions of PDFBox. For our initial collaboration with PDFBox, we found a list of common English words and removed those that had fewer than four characters. The intuition is that if tool A extracts 500 common words but tool B extracts 1,000, that is some evidence that tool B may have done a better job.
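A rough sketch of that comparison, assuming a hypothetical seven-word list and two made-up extraction outputs; `load_common_words` and `count_common_words` are illustrative helpers, not part of PDFBox or Tika.

```python
import re


def load_common_words(words, min_length=4):
    """Lowercase the list and drop entries shorter than min_length characters."""
    return {w.lower() for w in words if len(w) >= min_length}


def count_common_words(text, common_words):
    """Count the alphabetic tokens in text that appear in common_words."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return sum(1 for t in tokens if t in common_words)


# Hypothetical raw list; "the" and "and" are removed by the length filter.
raw_list = ["the", "and", "that", "with", "have", "this", "from"]
common = load_common_words(raw_list)

# Made-up outputs for the same page: tool A split words apart, tool B did not.
tool_a = "Th e rep ort sh ows th at sales ha ve grown"
tool_b = "The report shows that sales have grown with this change"

count_a = count_common_words(tool_a, common)
count_b = count_common_words(tool_b, common)
```

Here `count_b` exceeds `count_a`, which hints that tool B's output is closer to natural language. The counts are only evidence, not proof, that one extraction is better.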
For now, we've set up an Analyzer chain in Lucene that:
- Filters out tokens that don't contain an alphabetic or ideographic character.
- Maps URLs to "url" and emails to "email" (we don't want to penalize documents with URLs and emails).
- Requires that a token be at least 4 characters long unless it is composed entirely of CJK characters.
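The three rules above can be approximated outside Lucene as follows. This is a plain-Python sketch, not the actual Analyzer chain: the URL and email patterns, the code point ranges treated as "CJK", and the decision to keep the "url" and "email" placeholders regardless of length are all assumptions made for illustration.

```python
import re

# Deliberately simple patterns; assumptions of this sketch, not Lucene's rules.
URL = re.compile(r"^(?:https?://|www\.)\S+$", re.IGNORECASE)
EMAIL = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

# Assumed "CJK" ranges: kana, CJK Extension A, CJK Unified Ideographs, Hangul syllables.
CJK_RANGES = ((0x3040, 0x30FF), (0x3400, 0x4DBF), (0x4E00, 0x9FFF), (0xAC00, 0xD7AF))


def is_cjk(token):
    """True if every character of a non-empty token falls in an assumed CJK range."""
    return bool(token) and all(
        any(lo <= ord(ch) <= hi for lo, hi in CJK_RANGES) for ch in token
    )


def filter_tokens(tokens, min_length=4):
    kept = []
    for token in tokens:
        if URL.match(token):
            kept.append("url")      # placeholder kept regardless of length (assumption)
        elif EMAIL.match(token):
            kept.append("email")
        elif not any(ch.isalpha() for ch in token):
            continue                # no alphabetic or ideographic character
        elif len(token) >= min_length or is_cjk(token):
            kept.append(token.lower())
    return kept


sample = (
    "Contact admin@example.com or see https://example.com "
    "for part 12345 and the report 東京"
).split()
filtered = filter_tokens(sample)
# filtered == ["contact", "email", "url", "part", "report", "東京"]
```

Short function words ("or", "see", "for") and the purely numeric token are dropped, while the two-character CJK token survives the length check.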
But wait, what's a word in languages that don't separate words with whitespace (e.g., Chinese/Japanese)? We've followed the common practice for such languages of tokenizing into character bigrams. This is linguistically abhorrent, but it is mildly useful, if inaccurate, for our purposes.
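To illustrate what bigram tokenization does, and why it is linguistically crude, here is a small sketch; `cjk_bigrams` is a hypothetical helper that takes an unbroken run of CJK characters and is not the Lucene filter itself.

```python
def cjk_bigrams(run):
    """Return overlapping two-character tokens; a lone character is returned as-is."""
    if len(run) < 2:
        return [run] if run else []
    return [run[i:i + 2] for i in range(len(run) - 1)]


bigrams = cjk_bigrams("東京都庁")
# bigrams == ["東京", "京都", "都庁"]
```

The middle bigram, 京都 (Kyoto), is a real word that the text never actually mentions: the source string is "Tokyo Metropolitan Government Office". Spurious tokens like this are the inaccuracy that bigram counting trades for simplicity.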
Benefits: Easy to implement.
Drawbacks:
- If an OCR engine relies solely on dictionary lookup and does not allow for out-of-vocabulary terms, the generated text will contain only known words, and the "common words" score will be misleadingly high. Yes, the text contains known words, but they might not reflect the correct text.
- If a document contains part numbers or other non-natural-language tokens, then this metric will not accurately reflect success.
- Multilingual documents can make interpretation difficult. If the language identification component "detects" English even though the majority of the document is in Chinese, this metric will be misleading.