Language Detection


This is a refined and re-implemented version of the archived plugin for ElasticSearch elasticsearch-langdetect, which itself builds upon the original work by Nakatani Shuyo, found at https://github.com/shuyo/language-detection. The aforementioned implementation by Nakatani Shuyo serves as the default language detection component within Apache Solr.

Table of Contents

  • About this library
  • Enhancements over past implementations
  • Supported ISO 639-1 codes
  • Model parameters
  • Quick detection of CJK languages
  • How to use?
  • Local development
  • Testing

About this library

The library leverages an n-gram probabilistic model, using n-grams of sizes ranging from 1 to 3 (inclusive), alongside a Bayesian classifier (the Naive Bayes classification algorithm, see LanguageDetector#detectBlock(String)) that incorporates various normalization techniques and feature sampling methods.
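
As a minimal illustration (not the library's actual code), the feature extraction step can be thought of as collecting every character n-gram of size 1 to 3 from the input; the classifier then scores these features against each language's n-gram frequency profile:

// Illustrative sketch only: collects all character n-grams of sizes 1 to 3 from the input.
static List<String> extractNGrams(final String text) {
  final List<String> nGrams = new ArrayList<>();
  for (int start = 0; start < text.length(); start++) {
    for (int size = 1; size <= 3 && start + size <= text.length(); size++) {
      nGrams.add(text.substring(start, start + size));
    }
  }
  return nGrams;
}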

The precision is over 99% for 72 languages. See the following PR description to read about the benchmarks done by @yanirs: jprante/elasticsearch-langdetect#69

Enhancements over past implementations

The current version of the library introduces several enhancements compared to previous implementations, which may offer improvements in efficiency and performance under specific conditions.

For clarity, I'm linking these enhancements to the original implementation with examples (a rough code sketch of both ideas follows this list):

  1. Eliminating unnecessary ArrayList resizing during n-gram extraction from the input string. In the current implementation, the ArrayList is pre-allocated based on the estimated number of n-grams, thereby reducing the overhead caused by element copying during resizing. See the original code here.

  2. Removing per-character normalization at runtime. In the current implementation, instead of normalizing characters during execution, all 65,535 Unicode BMP characters are pre-normalized into a char[] array, making runtime normalization a simple array lookup. See the original code here.
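
As a rough sketch of both ideas (the names below are hypothetical, not the library's internals): the n-gram list is sized up front from the input length, and runtime normalization becomes a single read from a table that is built once:

// Illustrative sketch of the two enhancements above; normalizeChar(char) is a hypothetical helper.
static final char[] NORMALIZED_BMP = new char[Character.MAX_VALUE + 1];
static {
  // Built once: every BMP character is normalized ahead of time.
  for (int ch = 0; ch <= Character.MAX_VALUE; ch++) {
    NORMALIZED_BMP[ch] = normalizeChar((char) ch);
  }
}

static List<String> toNormalizedNGrams(final String text) {
  // A text of length N yields at most 3 * N n-grams of sizes 1..3, so the list is
  // pre-allocated and never copies its backing array while elements are appended.
  final List<String> nGrams = new ArrayList<>(text.length() * 3);
  final StringBuilder normalized = new StringBuilder(text.length());
  for (int index = 0; index < text.length(); index++) {
    normalized.append(NORMALIZED_BMP[text.charAt(index)]); // runtime normalization is an array lookup
  }
  // ... n-gram assembly over the normalized characters, as in the extraction sketch above ...
  return nGrams;
}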

Supported ISO 639-1 codes

The following is a list of the ISO 639-1 language codes supported by the library:

Language Flag Country ISO 639-1
Afrikaans   🇿🇦   South Africa af
Albanian   🇦🇱   Albania sq
Amharic   🇪🇹   Ethiopia am
Arabic   🇦🇪   UAE ar
Armenian   🇦🇲   Armenia hy
Azerbaijani   🇦🇿   Azerbaijan az
Bangla   🇧🇩   Bangladesh bn
Basque   🇪🇸   Spain eu
Breton   🇫🇷   France br
Bulgarian   🇧🇬   Bulgaria bg
Catalan   🇪🇸   Spain ca
Chinese (China)   🇨🇳   China zh-cn
Chinese (Taiwan)   🇹🇼   Taiwan zh-tw
Croatian   🇭🇷   Croatia hr
Czech   🇨🇿   Czech Republic cs
Danish   🇩🇰   Denmark da
Dutch   🇳🇱   Netherlands nl
English   🇺🇸   United States en
Estonian   🇪🇪   Estonia et
Filipino   🇵🇭   Philippines tl
Finnish   🇫🇮   Finland fi
French   🇫🇷   France fr
Georgian   🇬🇪   Georgia ka
German   🇩🇪   Germany de
Greek   🇬🇷   Greece el
Gujarati   🇮🇳   India gu
Hebrew   🇮🇱   Israel he
Hindi   🇮🇳   India hi
Hungarian   🇭🇺   Hungary hu
Indonesian   🇮🇩   Indonesia id
Irish   🇮🇪   Ireland ga
Italian   🇮🇹   Italy it
Japanese   🇯🇵   Japan ja
Kannada   🇮🇳   India kn
Kazakh   🇰🇿   Kazakhstan kk
Korean   🇰🇷   South Korea ko
Kyrgyz   🇰🇬   Kyrgyzstan ky
Latvian   🇱🇻   Latvia lv
Lithuanian   🇱🇹   Lithuania lt
Luxembourgish   🇱🇺   Luxembourg lb
Macedonian   🇲🇰   North Macedonia mk
Malayalam   🇮🇳   India ml
Marathi   🇮🇳   India mr
Mongolian   🇲🇳   Mongolia mn
Nepali   🇳🇵   Nepal ne
Norwegian   🇳🇴   Norway no
Persian   🇮🇷   Iran fa
Polish   🇵🇱   Poland pl
Portuguese   🇵🇹   Portugal pt
Punjabi   🇮🇳   India pa
Romanian   🇷🇴   Romania ro
Russian   🇷🇺   Russia ru
Serbian   🇷🇸   Serbia sr
Sinhala   🇱🇰   Sri Lanka si
Slovak   🇸🇰   Slovakia sk
Slovenian   🇸🇮   Slovenia sl
Somali   🇸🇴   Somalia so
Spanish   🇪🇸   Spain es
Swahili   🇹🇿   Tanzania sw
Swedish   🇸🇪   Sweden sv
Tajik   🇹🇯   Tajikistan tg
Tamil   🇮🇳   India ta
Telugu   🇮🇳   India te
Thai   🇹🇭   Thailand th
Tibetan   🇨🇳   China bo
Tigrinya   🇪🇷   Eritrea ti
Turkish   🇹🇷   Turkey tr
Ukrainian   🇺🇦   Ukraine uk
Urdu   🇵🇰   Pakistan ur
Vietnamese   🇻🇳   Vietnam vi
Welsh   🇬🇧   United Kingdom cy
Yiddish   🇮🇱   Israel yi

Model parameters

The following model parameters, defined in src/main/resources/model/parameters.json, can be configured via ENV variables to modify language detection behavior at runtime.

Use with caution. You don't need to modify the default settings; this list is provided for the sake of completeness. Before modifying the model parameters, you should study the source code (see LanguageDetector#detectBlock(String)) to familiarize yourself with probabilistic matching using the Naive Bayes classification algorithm with character n-grams; a simplified sketch of that detection loop follows the parameter table below. See also Ted Dunning, Statistical Identification of Language, 1994.

Name Configured by the ENV variable Description
baseFrequency LANGUAGE_DETECT_BASE_FREQUENCY Default: 10000
iterationLimit LANGUAGE_DETECT_ITERATION_LIMIT Safeguard to break loop. Default: 10000
numberOfTrials LANGUAGE_DETECT_NUMBER_OF_TRIALS Number of trials (affects CPU usage). Default: 7
alpha LANGUAGE_DETECT_ALPHA Naive Bayes classifier smoothing parameter to prevent zero probabilities and improve the robustness of the classifier. Default: 0.5
alphaWidth LANGUAGE_DETECT_ALPHA_WIDTH The width of smoothing. Default: 0.05
convergenceThreshold LANGUAGE_DETECT_CONVERGENCE_THRESHOLD Detection is terminated when normalized probability exceeds this threshold. Default: 0.99999
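
To make the parameters above more concrete, here is a simplified, illustrative sketch of a Naive Bayes detection loop in the spirit of Nakatani Shuyo's original algorithm. It is not the library's actual LanguageDetector#detectBlock(String) implementation, and the names languageCount, wordLangProbMap and normalizeAndFindMax() are hypothetical:

// Illustrative sketch only. numberOfTrials, alpha, alphaWidth, baseFrequency, iterationLimit
// and convergenceThreshold correspond to the parameters in the table above;
// languageCount, wordLangProbMap and normalizeAndFindMax() are hypothetical names.
double[] detectBlock(final List<String> nGrams, final Map<String, double[]> wordLangProbMap) {
  final double[] finalProbabilities = new double[languageCount];
  final Random random = new Random();
  for (int trial = 0; trial < numberOfTrials; trial++) {
    // Each trial starts from a uniform prior and a slightly perturbed smoothing alpha.
    final double[] probabilities = new double[languageCount];
    Arrays.fill(probabilities, 1.0 / languageCount);
    final double smoothedAlpha = alpha + random.nextGaussian() * alphaWidth;
    for (int iteration = 0; iteration < iterationLimit; iteration++) {
      final String nGram = nGrams.get(random.nextInt(nGrams.size()));
      final double[] nGramProbabilities = wordLangProbMap.get(nGram);
      if (nGramProbabilities == null) {
        continue; // n-gram unseen in the trained model
      }
      final double weight = smoothedAlpha / baseFrequency;
      for (int language = 0; language < languageCount; language++) {
        probabilities[language] *= weight + nGramProbabilities[language];
      }
      // Periodically re-scale so the probabilities sum to 1 and stop once one language dominates.
      if (iteration % 5 == 0 && normalizeAndFindMax(probabilities) > convergenceThreshold) {
        break;
      }
    }
    for (int language = 0; language < languageCount; language++) {
      finalProbabilities[language] += probabilities[language] / numberOfTrials;
    }
  }
  return finalProbabilities;
}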

Quick detection of CJK languages

Furthermore, the library offers a highly accurate CJK language detection mode specifically designed for short strings where there can be a mix of CJK/Latin/Numeric characters.

The library bypasses the performance bottlenecks of traditional machine learning or n-gram based solutions, which are ill-suited for such limited / mixed text. By directly iterating over characters, the library efficiently identifies CJK script usage, enabling rapid and precise language classification. This direct character analysis is significantly faster and simpler for short texts, avoiding the complexities of statistical models.
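
As an illustration of direct character analysis (a hedged sketch, not the library's actual code), the JDK's Character.UnicodeScript alone is enough to tell whether a short string contains, for example, Japanese Hiragana or Katakana:

// Illustrative sketch: detect Japanese kana by iterating over code points directly,
// with no statistical model involved. Not the library's actual implementation.
static boolean containsJapaneseKana(final String text) {
  for (int offset = 0; offset < text.length(); ) {
    final int codePoint = text.codePointAt(offset);
    final Character.UnicodeScript script = Character.UnicodeScript.of(codePoint);
    if (script == Character.UnicodeScript.HIRAGANA || script == Character.UnicodeScript.KATAKANA) {
      return true;
    }
    offset += Character.charCount(codePoint);
  }
  return false;
}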

How to use?

Language detection can be used programmatically in your own code

Basic usage

The API is fairly straightforward and allows you to configure the language detector via a builder. The public API of the library never returns null.

The following is a reasonable configuration:

final LanguageDetectionSettings languageDetectionSettings =
  LanguageDetectionSettings
    .fromIsoCodes639_1("en,ja,es,fr,de,it,zh-cn") // or: en, ja, es, fr, de, it, zh-cn
    .withClassifyChineseAsJapanese()
    .build();

final LanguageDetectionOrchestrator orchestrator = new LanguageDetectionOrchestrator(languageDetectionSettings);
final Language language = orchestrator.detect("languages are awesome");

final String languageCode = language.getIsoCode639_1(); // ISO 639-1 code of the top detected language
final double probability = language.getProbability();   // detection probability of the top detected language

Back to top

Methods to build the LanguageDetectionSettings

Configuring ISO 639-1 codes

In some classification tasks, you may already know that your language data is not written in the Latin script, such as with languages that use different alphabets. In these situations, the accuracy of language detection can improve by either excluding unrelated languages from the process or by focusing specifically on the languages that are relevant:

.fromAllIsoCodes639_1()

  • Default: N/A
  • Description: Enables the library to perform language detection for all the 72 supported languages by their ISO 639-1 codes

.fromIsoCodes639_1(String)

  • Default: N/A
  • Description: Enables the library to perform language detection for specific languages by the ISO 639-1 codes
LanguageDetectionSettings
    .fromAllIsoCodes639_1()
    .build();

LanguageDetectionSettings
    .fromIsoCodes639_1("en,ja,es,fr,de,it,zh-cn")
    .build();

Back to top

Maximum text chars

.withMaxTextChars(Integer)

  • Default: 3,000. The default limit is set to 3,000 characters (this corresponds to roughly a 2- to 3-page document). For comparison, in Solr, the default maximum text length is set to 20,000 characters.
  • Description: Restricts the maximum number of characters from the input text that will be processed for language detection by the library. This functionality is valuable because the library does not need to analyze the entire document to accurately detect the language; a sufficient portion of the text is often enough to achieve reliable results.
LanguageDetectionSettings
    .fromIsoCodes639_1("en,ja,es,fr,de,it,zh-cn")
    .withMaxTextChars(3000)
    .build();

Back to top

Skipping input sanitization for search

.withoutSanitizeForSearch()

  • Default: true (perform input sanitization for search). By default, the library sanitizes short input strings for search purposes by removing file extensions from any part of the text and filtering out Solr boolean operators (AND, NOT, and OR), as these elements are irrelevant to language detection.
  • Description: Invoking this API bypasses the sanitization process for short input strings, allowing the text to be processed without such modifications.
LanguageDetectionSettings
    .fromIsoCodes639_1("en,ja,es,fr,de,it,zh-cn")
    .withoutSanitizeForSearch()
    .build();

Back to top

Classify any Chinese content as Japanese

.withClassifyChineseAsJapanese()

  • Default: false (does not classify Chinese text as Japanese)
  • Description: Invoking this API enables the classification of Kanji-only text (text containing only Chinese characters, without any Japanese Hiragana or Katakana characters) or mixed text containing both Latin and Kanji characters as Japanese. This functionality is particularly important when Japanese identification must be prioritized. As such, this config option aims to optimize for more accurate language detection to minimize the misclassification of Japanese text. Additionally, this approach proves useful when identifying the language of very short strings.
LanguageDetectionSettings
    .fromIsoCodes639_1("en,ja,es,fr,de,it,zh-cn")
    .withClassifyChineseAsJapanese()
    .build();

Back to top

General minimum detection certainty

.withMininumCertainty(Float)

  • Default: 0.1f. Specifies a certainty threshold value between 0...1.
  • Description: The library requires that the language identification probability surpass a predefined threshold for any detected language. If the probability falls short of this threshold, the library systematically filters out those languages, excluding them from the results.

Please be aware that the .withMininumCertainty(Float) method cannot be used in conjunction with the .withTopLanguageMininumCertainty(Float, String) method (explained in the next section). The setting that is applied last during the configuration process will take priority.

LanguageDetectionSettings
    .fromIsoCodes639_1("en,ja,es,fr,de,it,zh-cn")
    .withMininumCertainty(0.65f)
    .build();

Back to top

Minimum detection certainty for top language with a fallback

.withTopLanguageMininumCertainty(Float, String)

  • Default: Not set. Specifies a certainty threshold value between 0...1 and a fallback language ISO 639-1 code.
  • Description: The language identification probability must exceed the threshold value for the top detected language. If this threshold is not met, the library defaults to the configured ISO 639-1 fallback code, treating it as the top and sole detected language.

Please be aware that the .withTopLanguageMininumCertainty(Float, String) method cannot be used in conjunction with the .withMininumCertainty(Float) method (explained in the previous section). The setting that is applied last during the configuration process will take priority.

LanguageDetectionSettings
    .fromIsoCodes639_1("en,ja,es,fr,de,it,zh-cn")
    .withTopLanguageMininumCertainty(0.65f, "en")
    .build();

Back to top

Local development

System requirements

  • The library keeps Java 11 source compatibility at the moment
  • At least JDK 11

Pre-commit Hook

Before your first commit, run this command in the root project directory:

cp pre-commit .git/hooks

If you forget to do this, there is a Gradle task defined in build.gradle that installs the hook for you.

Back to top

Build system

The project uses Gradle as its build system.

List of Gradle tasks

For list of all the available Gradle tasks, run the following command:

./gradlew tasks

Building

Building and packaging can be done with the following command:

./gradlew build

Formatting

The sources will be auto-formatted using the Google Java format upon each commit. But should there be a need to format manually, run the following command:

./gradlew googleJavaFormat

Back to top

Testing

Unit tests

To run unit tests, run the following command:

./gradlew test

Back to top

Classification accuracy analysis

The classification accuracy analysis helps to improve our understanding of how the library performs on texts of various lengths and types, see src/accuracyTest/java/io/github/azagniotov/language/LanguageDetectorAccuracyTest.java

To run the classification accuracy tests and generate an accuracy report CSV, run the following command:

./gradlew clean accuracyTest

The generated report will be found under build/reports/accuracy/accuracy-report-<UNIX_TIMESTAMP>.csv

Back to top
