CompLangDetector

Implementing a strategy first proposed by Benedetto, Caglioti, and Loreto,¹ CompLangDetector uses the zlib compression library by way of java.util.zip.Deflater² to provide a simple, elegant means of language detection.

A parallel multilingual corpus, in this case versions of the UN Universal Declaration of Human Rights,³ is used to provide "fingerprints" of various languages. Each "fingerprint" is compressed and the size of the compressed artefact is noted.

A candidate for language detection is then appended to each "fingerprint". The resulting object is compressed and the size of the compressed artefact is noted.
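
The two measurements described so far can be sketched with `java.util.zip.Deflater` as below. This is a minimal illustration under stated assumptions, not the project's actual code: the class name `CompressedSizeSketch`, the method names `compressedSize` and `sizeIncrease`, the UTF-8 encoding, and the `BEST_COMPRESSION` level are all hypothetical choices, and a single article of the English declaration stands in for a full fingerprint.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

// Hypothetical helper class; names and settings are assumptions for illustration.
final class CompressedSizeSketch {

    /** Returns the number of bytes Deflater emits for the UTF-8 encoding of the text. */
    static int compressedSize(String text) {
        byte[] input = text.getBytes(StandardCharsets.UTF_8);
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        try {
            deflater.setInput(input);
            deflater.finish();                      // no further input will follow
            byte[] buffer = new byte[8192];
            int total = 0;
            while (!deflater.finished()) {
                total += deflater.deflate(buffer);  // count the bytes, discard the data
            }
            return total;
        } finally {
            deflater.end();                         // release native zlib resources
        }
    }

    /** Extra compressed bytes caused by appending the candidate to the fingerprint. */
    static int sizeIncrease(String fingerprint, String candidate) {
        return compressedSize(fingerprint + candidate) - compressedSize(fingerprint);
    }

    public static void main(String[] args) {
        // Stand-in fingerprint: one article only; a real fingerprint would be the whole text.
        String fingerprint = "All human beings are born free and equal in dignity and rights. "
                + "They are endowed with reason and conscience and should act towards "
                + "one another in a spirit of brotherhood.";
        String candidate = "Everyone has the right to life, liberty and security of person.";

        System.out.println("comp(fingerprint)             = " + compressedSize(fingerprint));
        System.out.println("comp(fingerprint + candidate) = " + compressedSize(fingerprint + candidate));
        System.out.println("increase                      = " + sizeIncrease(fingerprint, candidate));
    }
}
```

Only the byte count matters here, so the compressed output itself is thrown away after each `deflate` call.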

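The detected language is the one whose "fingerprint" grows least when the candidate is appended, that is, the languageᵢ among language₁ … languageₙ that minimizes comp(fingerprint(languageᵢ) + candidate) − comp(fingerprint(languageᵢ)). A candidate written in the same language as a "fingerprint" shares words and letter sequences with it, so the compressor needs fewer additional bytes to encode it.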
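
The selection step can be sketched as a simple minimum search over per-language size differences. The class name `LeastDifferenceSketch`, the method `detect`, and the map-based inputs are hypothetical and chosen for illustration; the byte counts in `main` are made up and are not measurements from the project.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of the "least difference" rule; not the project's actual code.
final class LeastDifferenceSketch {

    /**
     * baselineSizes maps language -> comp(fingerprint),
     * combinedSizes maps language -> comp(fingerprint + candidate).
     * Returns the language with the smallest difference, or null if none can be compared.
     */
    static String detect(Map<String, Integer> baselineSizes, Map<String, Integer> combinedSizes) {
        String best = null;
        int bestDelta = Integer.MAX_VALUE;
        for (Map.Entry<String, Integer> entry : baselineSizes.entrySet()) {
            Integer combined = combinedSizes.get(entry.getKey());
            if (combined == null) {
                continue;                       // no combined measurement for this language
            }
            int delta = combined - entry.getValue();
            if (delta < bestDelta) {            // strict '<': the first language wins a tie
                bestDelta = delta;
                best = entry.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Made-up sizes in bytes, purely illustrative.
        Map<String, Integer> baseline = new LinkedHashMap<>();
        baseline.put("en", 4000);
        baseline.put("fr", 4300);
        baseline.put("de", 4200);

        Map<String, Integer> combined = new LinkedHashMap<>();
        combined.put("en", 4120);               // difference 120
        combined.put("fr", 4460);               // difference 160
        combined.put("de", 4370);               // difference 170

        System.out.println(detect(baseline, combined)); // prints "en"
    }
}
```

Taking the maps as input keeps the decision rule separate from the compression step, so the rule can be checked with fixed numbers.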

Note: The current implementation supports detection of English, Dutch, French, German and Spanish only. Because the method depends on differences in compressed size, and a short candidate adds only a few bytes to any "fingerprint", CompLangDetector cannot currently detect the language of shorter texts reliably. N-gram-based profiling is planned to increase reliability.

¹ BENEDETTO, Dario; CAGLIOTI, Emanuele; LORETO, Vittorio. Language trees and zipping. Physical Review Letters, 2002, vol. 88, no. 4, 048702. DOI: https://doi.org/10.1103/PhysRevLett.88.048702
² https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/util/zip/Deflater.html
³ https://www.ohchr.org/en/human-rights/universal-declaration/universal-declaration-human-rights/about-universal-declaration-human-rights-translation-project
