This is a refined and re-implemented version of the archived plugin for ElasticSearch elasticsearch-langdetect, which itself builds upon the original work by Nakatani Shuyo, found at https://github.com/shuyo/language-detection. The aforementioned implementation by Nakatani Shuyo serves as the default language detection component within Apache Solr.
The library leverages an n-gram probabilistic model, utilizing n-grams of sizes ranging from 1 to 3 (inclusive), alongside a Bayesian classifier (the Naive Bayes classification algorithm, see LanguageDetector#detectBlock(String)) that incorporates various normalization techniques and feature sampling methods.
The precision is over 99% for 72 languages. See the following PR description to read about the benchmarks done by @yanirs: jprante/elasticsearch-langdetect#69
The current version of the library introduces several enhancements compared to previous implementations, which may offer improvements in efficiency and performance under specific conditions.
For clarity, I'm linking these enhancements to the original implementation with examples:
- Eliminating unnecessary ArrayList resizing during n-gram extraction from the input string. In the current implementation, the ArrayList is pre-allocated based on the estimated number of n-grams, thereby reducing the overhead caused by element copying during resizing. See the original code here.
- Removing per-character normalization at runtime. In the current implementation, instead of normalizing characters during execution, all 65,535 Unicode BMP characters are pre-normalized into a char[] array, making runtime normalization a simple array lookup (both optimizations are illustrated in the sketch after this list). See the original code here.
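To make these two optimizations concrete, here is a minimal, self-contained sketch (with hypothetical names, not the library's actual code) that pre-sizes the n-gram list and replaces per-character normalization with a lookup into a pre-computed table:

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Illustrative sketch only (not the library's actual code): extracting 1..3-grams
 * with a pre-sized ArrayList and a pre-computed normalization lookup table.
 */
class NGramSketch {

  private static final int MIN_NGRAM = 1;
  private static final int MAX_NGRAM = 3;

  // Hypothetical pre-computed table: one normalized char per BMP code unit,
  // so runtime normalization becomes a single array lookup.
  private static final char[] NORMALIZED_BMP = buildNormalizationTable();

  private static char[] buildNormalizationTable() {
    final char[] table = new char[Character.MAX_VALUE + 1];
    for (int c = 0; c <= Character.MAX_VALUE; c++) {
      // Example normalization rule: fold whitespace-like characters to a single space.
      table[c] = Character.isWhitespace((char) c) ? ' ' : (char) c;
    }
    return table;
  }

  static List<String> extractNGrams(final String text) {
    // Pre-size the list to roughly text.length() n-grams per n-gram size,
    // which avoids repeated internal array copying while the list grows.
    final List<String> nGrams = new ArrayList<>(text.length() * MAX_NGRAM);
    final StringBuilder normalized = new StringBuilder(text.length());
    for (int i = 0; i < text.length(); i++) {
      normalized.append(NORMALIZED_BMP[text.charAt(i)]);
    }
    for (int start = 0; start < normalized.length(); start++) {
      for (int size = MIN_NGRAM; size <= MAX_NGRAM; size++) {
        if (start + size <= normalized.length()) {
          nGrams.add(normalized.substring(start, start + size));
        }
      }
    }
    return nGrams;
  }
}
```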
The following is a list of the ISO 639-1 language codes supported by the library:
Language | Flag | Country | ISO 639-1 |
---|---|---|---|
Afrikaans | 🇿🇦 | South Africa | af |
Albanian | 🇦🇱 | Albania | sq |
Amharic | 🇪🇹 | Ethiopia | am |
Arabic | 🇦🇪 | UAE | ar |
Armenian | 🇦🇲 | Armenia | hy |
Azerbaijani | 🇦🇿 | Azerbaijan | az |
Bangla | 🇧🇩 | Bangladesh | bn |
Basque | 🇪🇸 | Spain | eu |
Breton | 🇫🇷 | France | br |
Bulgarian | 🇧🇬 | Bulgaria | bg |
Catalan | 🇪🇸 | Spain | ca |
Chinese (China) | 🇨🇳 | China | zh-cn |
Chinese (Taiwan) | 🇹🇼 | Taiwan | zh-tw |
Croatian | 🇭🇷 | Croatia | hr |
Czech | 🇨🇿 | Czech Republic | cs |
Danish | 🇩🇰 | Denmark | da |
Dutch | 🇳🇱 | Netherlands | nl |
English | 🇺🇸 | United States | en |
Estonian | 🇪🇪 | Estonia | et |
Filipino | 🇵🇭 | Philippines | tl |
Finnish | 🇫🇮 | Finland | fi |
French | 🇫🇷 | France | fr |
Georgian | 🇬🇪 | Georgia | ka |
German | 🇩🇪 | Germany | de |
Greek | 🇬🇷 | Greece | el |
Gujarati | 🇮🇳 | India | gu |
Hebrew | 🇮🇱 | Israel | he |
Hindi | 🇮🇳 | India | hi |
Hungarian | 🇭🇺 | Hungary | hu |
Indonesian | 🇮🇩 | Indonesia | id |
Irish | 🇮🇪 | Ireland | ga |
Italian | 🇮🇹 | Italy | it |
Japanese | 🇯🇵 | Japan | ja |
Kannada | 🇮🇳 | India | kn |
Kazakh | 🇰🇿 | Kazakhstan | kk |
Korean | 🇰🇷 | South Korea | ko |
Kyrgyz | 🇰🇬 | Kyrgyzstan | ky |
Latvian | 🇱🇻 | Latvia | lv |
Lithuanian | 🇱🇹 | Lithuania | lt |
Luxembourgish | 🇱🇺 | Luxembourg | lb |
Macedonian | 🇲🇰 | North Macedonia | mk |
Malayalam | 🇮🇳 | India | ml |
Marathi | 🇮🇳 | India | mr |
Mongolian | 🇲🇳 | Mongolia | mn |
Nepali | 🇳🇵 | Nepal | ne |
Norwegian | 🇳🇴 | Norway | no |
Persian | 🇮🇷 | Iran | fa |
Polish | 🇵🇱 | Poland | pl |
Portuguese | 🇵🇹 | Portugal | pt |
Punjabi | 🇮🇳 | India | pa |
Romanian | 🇷🇴 | Romania | ro |
Russian | 🇷🇺 | Russia | ru |
Serbian | 🇷🇸 | Serbia | sr |
Sinhala | 🇱🇰 | Sri Lanka | si |
Slovak | 🇸🇰 | Slovakia | sk |
Slovenian | 🇸🇮 | Slovenia | sl |
Somali | 🇸🇴 | Somalia | so |
Spanish | 🇪🇸 | Spain | es |
Swahili | 🇹🇿 | Tanzania | sw |
Swedish | 🇸🇪 | Sweden | sv |
Tajik | 🇹🇯 | Tajikistan | tg |
Tamil | 🇮🇳 | India | ta |
Telugu | 🇮🇳 | India | te |
Thai | 🇹🇭 | Thailand | th |
Tibetan | 🇨🇳 | China | bo |
Tigrinya | 🇪🇷 | Eritrea | ti |
Turkish | 🇹🇷 | Turkey | tr |
Ukrainian | 🇺🇦 | Ukraine | uk |
Urdu | 🇵🇰 | Pakistan | ur |
Vietnamese | 🇻🇳 | Vietnam | vi |
Welsh | 🇬🇧 | United Kingdom | cy |
Yiddish | 🇮🇱 | Israel | yi |
The following model parameters, defined in src/main/resources/model/parameters.json, can be configured via ENV variables to modify language detection behavior at runtime.
Use with caution. You don't need to modify the default settings. This list is provided just for the sake of completeness. Before modifying the model parameters, you should study the source code (see LanguageDetector#detectBlock(String)) to familiarize yourself with probabilistic matching using the Naive Bayes classification algorithm with character n-grams. See also Ted Dunning, Statistical Identification of Language, 1994.
Name | Configured by the ENV variable | Description |
---|---|---|
baseFrequency | LANGUAGE_DETECT_BASE_FREQUENCY | Default: 10000 |
iterationLimit | LANGUAGE_DETECT_ITERATION_LIMIT | Safeguard to break loop. Default: 10000 |
numberOfTrials | LANGUAGE_DETECT_NUMBER_OF_TRIALS | Number of trials (affects CPU usage). Default: 7 |
alpha | LANGUAGE_DETECT_ALPHA | Naive Bayes classifier smoothing parameter to prevent zero probabilities and improve the robustness of the classifier. Default: 0.5 |
alphaWidth | LANGUAGE_DETECT_ALPHA_WIDTH | The width of smoothing. Default: 0.05 |
convergenceThreshold | LANGUAGE_DETECT_CONVERGENCE_THRESHOLD | Detection is terminated when normalized probability exceeds this threshold. Default: 0.99999 |
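For orientation only, the following simplified sketch (hypothetical names and signatures, not the library's actual LanguageDetector implementation) shows how a trial-based Naive Bayes loop might consume these parameters:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Random;

/**
 * Simplified, illustrative sketch of a trial-based Naive Bayes detection loop
 * driven by the parameters above; hypothetical names, not the library's code.
 */
class DetectionLoopSketch {

  static double[] detect(
      final List<String> nGrams,
      final Map<String, double[]> nGramProbabilitiesPerLanguage, // n-gram -> P(n-gram | language)
      final int languageCount,
      final double alpha, // LANGUAGE_DETECT_ALPHA
      final double alphaWidth, // LANGUAGE_DETECT_ALPHA_WIDTH
      final int baseFrequency, // LANGUAGE_DETECT_BASE_FREQUENCY
      final int numberOfTrials, // LANGUAGE_DETECT_NUMBER_OF_TRIALS
      final int iterationLimit, // LANGUAGE_DETECT_ITERATION_LIMIT
      final double convergenceThreshold) { // LANGUAGE_DETECT_CONVERGENCE_THRESHOLD
    final double[] averaged = new double[languageCount];
    if (nGrams.isEmpty()) {
      return averaged;
    }
    final Random random = new Random();

    for (int trial = 0; trial < numberOfTrials; trial++) {
      // Each trial jitters the smoothing slightly (alphaWidth) and starts
      // from a uniform prior over all candidate languages.
      final double smoothing = alpha + random.nextGaussian() * alphaWidth;
      final double[] probabilities = new double[languageCount];
      Arrays.fill(probabilities, 1.0 / languageCount);

      for (int iteration = 0; iteration < iterationLimit; iteration++) {
        // Sample a random n-gram and multiply in its per-language probabilities,
        // smoothed so an unseen n-gram never zeroes out a language.
        final String nGram = nGrams.get(random.nextInt(nGrams.size()));
        final double[] nGramProbabilities = nGramProbabilitiesPerLanguage.get(nGram);
        if (nGramProbabilities == null) {
          continue;
        }
        final double weight = smoothing / baseFrequency;
        for (int language = 0; language < languageCount; language++) {
          probabilities[language] *= weight + nGramProbabilities[language];
        }
        // Periodically normalize and stop early once the best language is certain enough.
        if (iteration % 5 == 0 && normalize(probabilities) > convergenceThreshold) {
          break;
        }
      }
      normalize(probabilities);
      for (int language = 0; language < languageCount; language++) {
        averaged[language] += probabilities[language] / numberOfTrials;
      }
    }
    return averaged;
  }

  /** Normalizes in place and returns the maximum probability. */
  private static double normalize(final double[] probabilities) {
    double sum = 0.0;
    for (final double p : probabilities) {
      sum += p;
    }
    double max = 0.0;
    for (int i = 0; i < probabilities.length; i++) {
      probabilities[i] /= sum;
      max = Math.max(max, probabilities[i]);
    }
    return max;
  }
}
```

The jittered smoothing per trial, the iteration cap, and the periodic normalization with an early exit correspond to the roles of alpha, alphaWidth, numberOfTrials, iterationLimit, and convergenceThreshold described in the table above.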
Furthermore, the library offers a highly accurate CJK language detection mode specifically designed for short strings where there can be a mix of CJK/Latin/Numeric characters.
The library bypasses the performance bottlenecks of traditional machine learning or n-gram based solutions, which are ill-suited for such limited / mixed text. By directly iterating over characters, the library efficiently identifies CJK script usage, enabling rapid and precise language classification. This direct character analysis is significantly faster and simpler for short texts, avoiding the complexities of statistical models.
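As an illustration of this direct character analysis (a sketch only, not the library's actual implementation), detecting the presence of CJK scripts in a short string can be as simple as inspecting each code point's Unicode script:

```java
import java.lang.Character.UnicodeScript;

/**
 * Illustrative sketch (hypothetical, not the library's actual code) of direct
 * character iteration to spot CJK scripts in short, mixed strings.
 */
class CjkScanSketch {

  static boolean containsCjk(final String text) {
    for (int i = 0; i < text.length(); ) {
      final int codePoint = text.codePointAt(i);
      final UnicodeScript script = UnicodeScript.of(codePoint);
      if (script == UnicodeScript.HAN
          || script == UnicodeScript.HIRAGANA
          || script == UnicodeScript.KATAKANA
          || script == UnicodeScript.HANGUL) {
        return true;
      }
      i += Character.charCount(codePoint);
    }
    return false;
  }
}
```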
Search language detection can be used programmatically in your own code
The API is fairly straightforward and allows you to configure the language detector via a builder. The public API of the library never returns null.
The following is a reasonable configuration:
final LanguageDetectionSettings languageDetectionSettings =
LanguageDetectionSettings
.fromIsoCodes639_1("en,ja,es,fr,de,it,zh-cn") // or: en, ja, es, fr, de, it, zh-cn
.withClassifyChineseAsJapanese()
.build();
final LanguageDetectionOrchestrator orchestrator = new LanguageDetectionOrchestrator(languageDetectionSettings);
final Language language = orchestrator.detect("languages are awesome");
final String languageCode = language.getIsoCode639_1();
final double probability = language.getProbability();
In some classification tasks, you may already know that your language data is not written in the Latin script, such as with languages that use different alphabets. In these situations, the accuracy of language detection can improve by either excluding unrelated languages from the process or by focusing specifically on the languages that are relevant:
.fromAllIsoCodes639_1()
- Default: N/A
- Description: Enables the library to perform language detection for all the supported languages (see the table above) by their ISO 639-1 codes
.fromIsoCodes639_1(String)
- Default: N/A
- Description: Enables the library to perform language detection for specific languages by the ISO 639-1 codes
LanguageDetectionSettings
.fromAllIsoCodes639_1()
.build();
LanguageDetectionSettings
.fromIsoCodes639_1("en,ja,es,fr,de,it,zh-cn")
.build();
.withMaxTextChars(Integer)
- Default: 3,000. The default limit is set to 3,000 characters (this corresponds to roughly a 2- to 3-page document). For comparison, in Solr, the default maximum text length is set to 20,000 characters.
- Description: Restricts the maximum number of characters from the input text that will be processed for language detection by the library. This functionality is valuable because the library does not need to analyze the entire document to accurately detect the language; a sufficient portion of the text is often enough to achieve reliable results.
LanguageDetectionSettings
.fromIsoCodes639_1("en,ja,es,fr,de,it,zh-cn")
.withMaxTextChars(3000)
.build();
.withoutSanitizeForSearch()
- Default: true (perform input sanitization for search). By default, the library sanitizes short input strings for search purposes by removing file extensions from any part of the text and filtering out Solr boolean operators (AND, NOT, and OR), as these elements are irrelevant to language detection.
- Description: Invoking the API bypasses this sanitization process for short input strings, allowing the text to be processed without such modifications.
LanguageDetectionSettings
.fromIsoCodes639_1("en,ja,es,fr,de,it,zh-cn")
.withoutSanitizeForSearch()
.build();
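For a rough idea of what such sanitization could look like, here is an illustrative sketch with hypothetical regular expressions; the library's actual rules may differ:

```java
import java.util.regex.Pattern;

/**
 * Illustrative sketch of the kind of search-input sanitization described above
 * (hypothetical regexes, not the library's actual rules).
 */
class SearchSanitizationSketch {

  // Example: strip file extensions such as ".pdf" or ".gz" from tokens.
  private static final Pattern FILE_EXTENSION = Pattern.compile("\\.[A-Za-z0-9]{1,4}\\b");

  // Example: drop standalone Solr boolean operators.
  private static final Pattern SOLR_OPERATORS = Pattern.compile("\\b(AND|OR|NOT)\\b");

  static String sanitizeForSearch(final String input) {
    final String withoutExtensions = FILE_EXTENSION.matcher(input).replaceAll("");
    final String withoutOperators = SOLR_OPERATORS.matcher(withoutExtensions).replaceAll(" ");
    return withoutOperators.trim().replaceAll("\\s+", " ");
  }
}
```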
.withClassifyChineseAsJapanese()
- Default: false (does not classify Chinese text as Japanese)
- Description: Invoking this API enables the classification of Kanji-only text (text containing only Chinese characters, without any Japanese Hiragana or Katakana characters) or mixed text containing both Latin and Kanji characters as Japanese. This functionality is particularly important when Japanese identification must be prioritized. As such, this config option aims to optimize for more accurate language detection to minimize the misclassification of Japanese text. Additionally, this approach proves useful when identifying the language of very short strings.
LanguageDetectionSettings
.fromIsoCodes639_1("en,ja,es,fr,de,it,zh-cn")
.withClassifyChineseAsJapanese()
.build();
.withMininumCertainty(Float)
- Default: 0.1f. Specifies a certainty threshold value between 0...1.
- Description: The library requires that the language identification probability surpass a predefined threshold for any detected language. If the probability falls short of this threshold, the library systematically filters out those languages, excluding them from the results.
Please be aware that the .withMininumCertainty(Float)
method cannot be used in conjunction with the .withTopLanguageMininumCertainty(Float, String)
method (explained in the next section). The setting that is applied last during the configuration process will take priority.
LanguageDetectionSettings
.fromIsoCodes639_1("en,ja,es,fr,de,it,zh-cn")
.withMininumCertainty(0.65f)
.build();
.withTopLanguageMininumCertainty(Float, String)
- Default: Not set. Specifies a certainty threshold value between 0...1 and a fallback language ISO 639-1 code.
- Description: The language identification probability must exceed the threshold value for the top detected language. If this threshold is not met, the library defaults to the configured ISO 639-1 fallback code, treating it as the top and sole detected language.
Please be aware that the .withTopLanguageMininumCertainty(Float, String)
method cannot be used in conjunction with the .withMininumCertainty(Float)
method (explained in the previous section). The setting that is applied last during the configuration process will take priority.
LanguageDetectionSettings
.fromIsoCodes639_1("en,ja,es,fr,de,it,zh-cn")
.withTopLanguageMininumCertainty(0.65f, "en")
.build();
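To illustrate the precedence rule above (with hypothetical threshold values), when both thresholds are configured in one builder chain, the call applied last takes priority:

```java
// .withTopLanguageMininumCertainty(..) is applied last, so it takes priority
// over the earlier .withMininumCertainty(..) call.
LanguageDetectionSettings
    .fromIsoCodes639_1("en,ja,es,fr,de,it,zh-cn")
    .withMininumCertainty(0.1f)
    .withTopLanguageMininumCertainty(0.65f, "en")
    .build();
```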
- The plugin keeps Java 11 source compatibility at the moment
- At least JDK 11
Before your first commit, run this command in the root project directory:
cp pre-commit .git/hooks
If you forget to do this, there is a Gradle task defined in build.gradle that installs the hook for you.
The plugin uses Gradle as a build system.
For a list of all the available Gradle tasks, run the following command:
./gradlew tasks
Building and packaging can be done with the following command:
./gradlew build
The sources will be auto-formatted using Google Java Format upon each commit. But, should there be a need to format manually, run the following command:
./gradlew googleJavaFormat
To run unit tests, run the following command:
./gradlew test
The classification accuracy analysis helps to improve our understanding of how the library performs on texts of various lengths and types; see src/accuracyTest/java/io/github/azagniotov/language/LanguageDetectorAccuracyTest.java
To run the classification accuracy tests and generate an accuracy report CSV, run the following command:
./gradlew clean accuracyTest
The generated report will be found under build/reports/accuracy/accuracy-report-<UNIX_TIMESTAMP>.csv