...

When text is extracted for a given document, we run automatic language detection (thank you, OpenNLP!) on the string and then count the number of "common tokens" for the detected language in the extracted text. We then divide the number of common tokens by the total number of alphabetic tokens extracted. This gives us the percentage of common words or its inverse, 1 - (commonTokens/alphabeticTokens), the Out of Vocabulary (OOV) statistic. This is some indication of how "languagey" the extracted text is.
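
The calculation described above can be sketched in a few lines of Python. This is only an illustration, not the actual implementation: the `COMMON_WORDS` table, the `oov` function name, and the letters-only tokenization rule are assumptions made for the example, and the language code is assumed to come from a separate detection step that is not shown.

```python
import re

# Hypothetical, tiny common-word list for illustration only; a real list
# would hold far more entries for each detected language.
COMMON_WORDS = {
    "en": {"the", "of", "and", "to", "in", "is", "a", "that", "for", "it"},
}

# Assumed tokenization rule: a token is a run of letters (no digits, no underscore).
ALPHABETIC_TOKEN = re.compile(r"[^\W\d_]+")


def oov(text, lang):
    """Return 1 - (commonTokens / alphabeticTokens), or None if there are no alphabetic tokens."""
    tokens = [t.lower() for t in ALPHABETIC_TOKEN.findall(text)]
    if not tokens:
        # No alphabetic tokens: the ratio is undefined, so no score is reported.
        return None
    common = COMMON_WORDS.get(lang, set())
    common_tokens = sum(1 for t in tokens if t in common)
    return 1.0 - common_tokens / len(tokens)


# Language detection is assumed to have already returned "en" for this text.
score = oov("The cat is in the hat", "en")
```

In this sketch a low score suggests the text looks like ordinary prose in the detected language, while a score near 1.0 suggests garbled or non-linguistic output. Returning `None` for text without alphabetic tokens is a design choice of the example, not something the description above specifies.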

...
