...

We gathered the top 30k most common words for ~120 languages from Wikipedia or the Leipzig corpus.  These word lists are available in the common_tokens directory.

When text is extracted for a given document, we run automatic language detection (thank you, OpenNLP!) on the string and then count the number of common words for that detected language in the extracted text.  We then divide the number of "common tokens" by the total number of alphabetic tokens extracted.  This gives us a percentage of common words, or its inverse (1 - (commonTokens/alphabeticTokens)), the Out of Vocabulary (OOV) statistic.  This is some indication of how "languagey" the extracted text is.
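Roughly, the computation looks like the following sketch (the class name, token filter, and common-word set here are illustrative assumptions, not the actual tika-eval code):

```java
import java.util.Locale;
import java.util.Set;

public class OOVExample {

    /**
     * Computes the Out of Vocabulary statistic:
     * 1 - (commonTokens / alphabeticTokens).
     * Assumes the common-word set is lower-cased.
     */
    public static double oov(String[] tokens, Set<String> commonWords) {
        long alphabetic = 0;
        long common = 0;
        for (String token : tokens) {
            // Count only tokens made up entirely of letters ("alphabetic words").
            if (!token.isEmpty() && token.chars().allMatch(Character::isLetter)) {
                alphabetic++;
                if (commonWords.contains(token.toLowerCase(Locale.ROOT))) {
                    common++;
                }
            }
        }
        if (alphabetic == 0) {
            return 1.0; // no alphabetic tokens at all: fully out of vocabulary
        }
        return 1.0 - ((double) common / alphabetic);
    }

    public static void main(String[] args) {
        Set<String> commonWords = Set.of("the", "quick", "brown", "fox");
        String[] tokens = {"The", "qu1ck", "brown", "fôx", "jumps"};
        // 4 alphabetic tokens, 2 of them common -> OOV = 0.50
        System.out.printf("OOV = %.2f%n", oov(tokens, commonWords));
    }
}
```

A high OOV score suggests the extracted text is mostly junk for that detected language; a low score suggests it reads like natural text.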

...