Implementing a strategy first proposed by Benedetto, Caglioti, and Loreto [1], CompLangDetector uses the zlib compression library by way of java.util.zip.Deflater [2] to provide a simple, elegant means of language detection.
A parallel multilingual corpus, in this case versions of the UN Universal Declaration of Human Rights [3], is used to provide "fingerprints" of various languages. Each "fingerprint" is compressed and the size of the compressed artefact is noted.
A candidate for language detection is then appended to each "fingerprint". The resulting object is compressed and the size of the compressed artefact is noted.
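The detected language is the one whose "fingerprint" grows least when the candidate is appended: for each supported language i (1 to n), the difference comp(fingerprint(language_i) + candidate) - comp(fingerprint(language_i)) is calculated, and the language with the smallest difference is selected.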
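The steps above can be sketched roughly as follows. This is a minimal illustration, not CompLangDetector's actual code: the class name CompLangSketch, the methods compressedSize and detect, and the language keys are all hypothetical. The two "fingerprints" are short stand-in strings based on Article 1 of the Declaration, whereas a real fingerprint would presumably be a full translation. Only java.util.zip.Deflater is taken from the description above.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.Deflater;

// Hypothetical sketch of compression-based language detection.
final class CompLangSketch {

    // Size in bytes of the zlib-compressed UTF-8 encoding of the text.
    static int compressedSize(String text) {
        byte[] input = text.getBytes(StandardCharsets.UTF_8);
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        try {
            deflater.setInput(input);
            deflater.finish();
            byte[] buffer = new byte[8192];
            int total = 0;
            // Drain the compressor; only the byte count is needed.
            while (!deflater.finished()) {
                total += deflater.deflate(buffer);
            }
            return total;
        } finally {
            deflater.end(); // release native zlib resources
        }
    }

    // Returns the key of the fingerprint whose compressed size grows least
    // when the candidate is appended, or null if there are no fingerprints.
    static String detect(Map<String, String> fingerprints, String candidate) {
        String best = null;
        int bestDelta = Integer.MAX_VALUE;
        for (Map.Entry<String, String> entry : fingerprints.entrySet()) {
            int base = compressedSize(entry.getValue());
            int combined = compressedSize(entry.getValue() + candidate);
            int delta = combined - base;
            if (delta < bestDelta) { // ties keep the earlier language
                bestDelta = delta;
                best = entry.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Stand-in fingerprints (assumption: far shorter than a real corpus).
        Map<String, String> fingerprints = new LinkedHashMap<>();
        fingerprints.put("en", "All human beings are born free and equal in dignity and rights. "
                + "They are endowed with reason and conscience and should act towards "
                + "one another in a spirit of brotherhood.");
        fingerprints.put("nl", "Alle mensen worden vrij en gelijk in waardigheid en rechten geboren. "
                + "Zij zijn begiftigd met verstand en geweten, en behoren zich jegens "
                + "elkander in een geest van broederschap te gedragen.");
        System.out.println(detect(fingerprints,
                "Everyone has the right to life, liberty and security of person."));
    }
}
```

In this sketch the fingerprint is recompressed on every call for simplicity. Since the size of each compressed fingerprint does not depend on the candidate, an implementation could compute it once and cache it.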
Note: The current implementation only supports detection of English, Dutch, French, German and Spanish. Because detection relies on differences in compressed size, which amount to only a few bytes for short inputs, CompLangDetector cannot currently detect the language of shorter texts reliably. N-gram-based profiling to increase reliability is planned.
Footnotes
1. BENEDETTO, Dario; CAGLIOTI, Emanuele; LORETO, Vittorio. Language trees and zipping. Physical Review Letters, 2002, vol. 88, no. 4, p. 048702. (DOI: https://doi.org/10.1103/PhysRevLett.88.048702)
2. https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/util/zip/Deflater.html
3. https://www.ohchr.org/en/human-rights/universal-declaration/universal-declaration-human-rights/about-universal-declaration-human-rights-translation-project