...

When text is extracted for a given document, we run automatic language detection (thank you, OpenNLP!) on the string and then count the number of common words for the detected language in the extracted text. We then divide the number of "common tokens" by the total number of alphabetic words extracted.  This gives us the proportion of common words or its inverse, 1 - (commonTokens/alphabeticTokens), the Out of Vocabulary (OOV) statistic.  This is some indication of how "languagey" the extracted text is.
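
The calculation described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual implementation: the function name `oov`, the tiny `COMMON_WORDS_EN` list, and the regex-based tokenizer are all hypothetical, and the language detection step is skipped (the sketch assumes the text has already been identified as English).

```python
import re

# Hypothetical, deliberately tiny common-word list for English.
# A real list would hold many thousands of entries per language.
COMMON_WORDS_EN = frozenset({"the", "and", "of", "to", "is", "in", "a", "that"})

# Assumption: an "alphabetic token" is a run of Unicode letters
# (word characters excluding digits and underscore).
_ALPHABETIC = re.compile(r"[^\W\d_]+")


def oov(text, common_words):
    """Return 1 - (commonTokens / alphabeticTokens), or None if there
    are no alphabetic tokens to divide by."""
    tokens = [t.lower() for t in _ALPHABETIC.findall(text)]
    if not tokens:
        return None
    common_tokens = sum(1 for t in tokens if t in common_words)
    return 1.0 - common_tokens / len(tokens)


if __name__ == "__main__":
    # 4 of the 6 alphabetic tokens are in the common-word list.
    print(oov("the cat is in the hat", COMMON_WORDS_EN))
    # No token is in the list, as might happen with garbled extraction.
    print(oov("xqzv bbrt", COMMON_WORDS_EN))
```

A value near 0 suggests that most tokens are recognizable words in the detected language; a value near 1 suggests that the extracted text is garbled or that the wrong language was detected.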

...