Language Detection


This is a refined and re-implemented version of the archived plugin for ElasticSearch elasticsearch-langdetect, which itself builds upon the original work by Nakatani Shuyo, found at https://github.com/shuyo/language-detection. The aforementioned implementation by Nakatani Shuyo serves as the default language detection component within Apache Solr.

Table of Contents

  • About this library
  • Enhancements over past implementations
  • Supported ISO 639-1 codes
  • Model parameters
  • Quick detection of CJK languages
  • How to use?
  • Local development
  • Testing

About this library

The library leverages an n-gram probabilistic model, using n-grams of sizes ranging from 1 to 3 (inclusive), alongside a Bayesian classifier (the Naive Bayes classification algorithm, see LanguageDetector#detectBlock(String)) that incorporates various normalization techniques and feature sampling methods.
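
As a minimal illustration (not the library's actual code), the feature extraction step can be thought of as collecting every character n-gram of size 1 to 3 from the input; the classifier then scores these features against each language's n-gram frequency profile:

// Illustrative sketch only: collects all character n-grams of sizes 1 to 3 from the input.
static List<String> extractNGrams(final String text) {
  final List<String> nGrams = new ArrayList<>();
  for (int start = 0; start < text.length(); start++) {
    for (int size = 1; size <= 3 && start + size <= text.length(); size++) {
      nGrams.add(text.substring(start, start + size));
    }
  }
  return nGrams;
}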

The precision is over 99% for 72 languages. See the following PR description to read about the benchmarks done by @yanirs: jprante/elasticsearch-langdetect#69

Enhancements over past implementations

The current version of the library introduces several enhancements compared to previous implementations, which may offer improvements in efficiency and performance under specific conditions.

For clarity, I'm linking these enhancements to the original implementation with examples (a rough code sketch of both ideas follows this list):

  1. Eliminating unnecessary ArrayList resizing during n-gram extraction from the input string. In the current implementation, the ArrayList is pre-allocated based on the estimated number of n-grams, thereby reducing the overhead caused by element copying during resizing. See the original code here.

  2. Removing per-character normalization at runtime. In the current implementation, instead of normalizing characters during execution, all 65,535 Unicode BMP characters are pre-normalized into a char[] array, making runtime normalization a simple array lookup. See the original code here.
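
As a rough sketch of both ideas (the names below are hypothetical, not the library's internals): the n-gram list is sized up front from the input length, and runtime normalization becomes a single read from a table that is built once:

// Illustrative sketch of the two enhancements above; normalizeChar(char) is a hypothetical helper.
static final char[] NORMALIZED_BMP = new char[Character.MAX_VALUE + 1];
static {
  // Built once: every BMP character is normalized ahead of time.
  for (int ch = 0; ch <= Character.MAX_VALUE; ch++) {
    NORMALIZED_BMP[ch] = normalizeChar((char) ch);
  }
}

static List<String> toNormalizedNGrams(final String text) {
  // A text of length N yields at most 3 * N n-grams of sizes 1..3, so the list is
  // pre-allocated and never copies its backing array while elements are appended.
  final List<String> nGrams = new ArrayList<>(text.length() * 3);
  final StringBuilder normalized = new StringBuilder(text.length());
  for (int index = 0; index < text.length(); index++) {
    normalized.append(NORMALIZED_BMP[text.charAt(index)]); // runtime normalization is an array lookup
  }
  // ... n-gram assembly over the normalized characters, as in the extraction sketch above ...
  return nGrams;
}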

Supported ISO 639-1 codes

The following is a list of the ISO 639-1 language codes supported by the library:

Language Flag Country ISO 639-1
Afrikaans   🇿🇦   South Africa af
Albanian   🇦🇱   Albania sq
Amharic   🇪🇹   Ethiopia am
Arabic   🇦🇪   UAE ar
Armenian   🇦🇲   Armenia hy
Azerbaijani   🇦🇿   Azerbaijan az
Bangla   🇧🇩   Bangladesh bn
Basque   🇪🇸   Spain eu
Breton   🇫🇷   France br
Bulgarian   🇧🇬   Bulgaria bg
Catalan   🇪🇸   Spain ca
Chinese (China)   🇨🇳   China zh-cn
Chinese (Taiwan)   🇹🇼   Taiwan zh-tw
Croatian   🇭🇷   Croatia hr
Czech   🇨🇿   Czech Republic cs
Danish   🇩🇰   Denmark da
Dutch   🇳🇱   Netherlands nl
English   🇺🇸   United States en
Estonian   🇪🇪   Estonia et
Filipino   🇵🇭   Philippines tl
Finnish   🇫🇮   Finland fi
French   🇫🇷   France fr
Georgian   🇬🇪   Georgia ka
German   🇩🇪   Germany de
Greek   🇬🇷   Greece el
Gujarati   🇮🇳   India gu
Hebrew   🇮🇱   Israel he
Hindi   🇮🇳   India hi
Hungarian   🇭🇺   Hungary hu
Indonesian   🇮🇩   Indonesia id
Irish   🇮🇪   Ireland ga
Italian   🇮🇹   Italy it
Japanese   🇯🇵   Japan ja
Kannada   🇮🇳   India kn
Kazakh   🇰🇿   Kazakhstan kk
Korean   🇰🇷   South Korea ko
Kyrgyz   🇰🇬   Kyrgyzstan ky
Latvian   🇱🇻   Latvia lv
Lithuanian   🇱🇹   Lithuania lt
Luxembourgish   🇱🇺   Luxembourg lb
Macedonian   🇲🇰   North Macedonia mk
Malayalam   🇮🇳   India ml
Marathi   🇮🇳   India mr
Mongolian   🇲🇳   Mongolia mn
Nepali   🇳🇵   Nepal ne
Norwegian   🇳🇴   Norway no
Persian   🇮🇷   Iran fa
Polish   🇵🇱   Poland pl
Portuguese   🇵🇹   Portugal pt
Punjabi   🇮🇳   India pa
Romanian   🇷🇴   Romania ro
Russian   🇷🇺   Russia ru
Serbian   🇷🇸   Serbia sr
Sinhala   🇱🇰   Sri Lanka si
Slovak   🇸🇰   Slovakia sk
Slovenian   🇸🇮   Slovenia sl
Somali   🇸🇴   Somalia so
Spanish   🇪🇸   Spain es
Swahili   🇹🇿   Tanzania sw
Swedish   🇸🇪   Sweden sv
Tajik   🇹🇯   Tajikistan tg
Tamil   🇮🇳   India ta
Telugu   🇮🇳   India te
Thai   🇹🇭   Thailand th
Tibetan   🇨🇳   China bo
Tigrinya   🇪🇷   Eritrea ti
Turkish   🇹🇷   Turkey tr
Ukrainian   🇺🇦   Ukraine uk
Urdu   🇵🇰   Pakistan ur
Vietnamese   🇻🇳   Vietnam vi
Welsh   🇬🇧   United Kingdom cy
Yiddish   🇮🇱   Israel yi

Model parameters

The following model parameters, defined in src/main/resources/model/parameters.json, can be configured via ENV variables to modify language detection behavior at runtime.

Use with caution. You don't need to modify the default settings; this list is provided for the sake of completeness. Before modifying the model parameters, you should study the source code (see LanguageDetector#detectBlock(String)) to familiarize yourself with probabilistic matching using the Naive Bayes classification algorithm with character n-grams; a simplified sketch of that detection loop follows the parameter table below. See also Ted Dunning, Statistical Identification of Language, 1994.

Name Configured by the ENV variable Description
baseFrequency LANGUAGE_DETECT_BASE_FREQUENCY Default: 10000
iterationLimit LANGUAGE_DETECT_ITERATION_LIMIT Safeguard to break loop. Default: 10000
numberOfTrials LANGUAGE_DETECT_NUMBER_OF_TRIALS Number of trials (affects CPU usage). Default: 7
alpha LANGUAGE_DETECT_ALPHA Naive Bayes classifier smoothing parameter to prevent zero probabilities and improve the robustness of the classifier. Default: 0.5
alphaWidth LANGUAGE_DETECT_ALPHA_WIDTH The width of smoothing. Default: 0.05
convergenceThreshold LANGUAGE_DETECT_CONVERGENCE_THRESHOLD Detection is terminated when normalized probability exceeds this threshold. Default: 0.99999
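
To make the parameters above more concrete, here is a simplified, illustrative sketch of a Naive Bayes detection loop in the spirit of Nakatani Shuyo's original algorithm. It is not the library's actual LanguageDetector#detectBlock(String) implementation, and the names languageCount, wordLangProbMap and normalizeAndFindMax() are hypothetical:

// Illustrative sketch only. numberOfTrials, alpha, alphaWidth, baseFrequency, iterationLimit
// and convergenceThreshold correspond to the parameters in the table above;
// languageCount, wordLangProbMap and normalizeAndFindMax() are hypothetical names.
double[] detectBlock(final List<String> nGrams, final Map<String, double[]> wordLangProbMap) {
  final double[] finalProbabilities = new double[languageCount];
  final Random random = new Random();
  for (int trial = 0; trial < numberOfTrials; trial++) {
    // Each trial starts from a uniform prior and a slightly perturbed smoothing alpha.
    final double[] probabilities = new double[languageCount];
    Arrays.fill(probabilities, 1.0 / languageCount);
    final double smoothedAlpha = alpha + random.nextGaussian() * alphaWidth;
    for (int iteration = 0; iteration < iterationLimit; iteration++) {
      final String nGram = nGrams.get(random.nextInt(nGrams.size()));
      final double[] nGramProbabilities = wordLangProbMap.get(nGram);
      if (nGramProbabilities == null) {
        continue; // n-gram unseen in the trained model
      }
      final double weight = smoothedAlpha / baseFrequency;
      for (int language = 0; language < languageCount; language++) {
        probabilities[language] *= weight + nGramProbabilities[language];
      }
      // Periodically re-scale so the probabilities sum to 1 and stop once one language dominates.
      if (iteration % 5 == 0 && normalizeAndFindMax(probabilities) > convergenceThreshold) {
        break;
      }
    }
    for (int language = 0; language < languageCount; language++) {
      finalProbabilities[language] += probabilities[language] / numberOfTrials;
    }
  }
  return finalProbabilities;
}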

Quick detection of CJK languages

Furthermore, the library offers a highly accurate CJK language detection mode specifically designed for short strings where there can be a mix of CJK/Latin/Numeric characters.

The library bypasses the performance bottlenecks of traditional machine learning or n-gram based solutions, which are ill-suited for such limited / mixed text. By directly iterating over characters, the library efficiently identifies CJK script usage, enabling rapid and precise language classification. This direct character analysis is significantly faster and simpler for short texts, avoiding the complexities of statistical models.
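
As an illustration of direct character analysis (a hedged sketch, not the library's actual code), the JDK's Character.UnicodeScript alone is enough to tell whether a short string contains, for example, Japanese Hiragana or Katakana:

// Illustrative sketch: detect Japanese kana by iterating over code points directly,
// with no statistical model involved. Not the library's actual implementation.
static boolean containsJapaneseKana(final String text) {
  for (int offset = 0; offset < text.length(); ) {
    final int codePoint = text.codePointAt(offset);
    final Character.UnicodeScript script = Character.UnicodeScript.of(codePoint);
    if (script == Character.UnicodeScript.HIRAGANA || script == Character.UnicodeScript.KATAKANA) {
      return true;
    }
    offset += Character.charCount(codePoint);
  }
  return false;
}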

How to use?

Language detection can be used programmatically in your own code

Basic usage

The API is fairly straightforward and allows you to configure the language detector via a builder. The public API of the library never returns null.

The following is a reasonable configuration:

final LanguageDetectionSettings languageDetectionSettings =
  LanguageDetectionSettings
    .fromIsoCodes639_1("en,ja,es,fr,de,it,zh-cn") // or: en, ja, es, fr, de, it, zh-cn
    .withClassifyChineseAsJapanese()
    .build();

final LanguageDetectionOrchestrator orchestrator = new LanguageDetectionOrchestrator(languageDetectionSettings);
final Language language = orchestrator.detect("languages are awesome");

final String languageCode = language.getIsoCode639_1(); // ISO 639-1 code of the top detected language
final double probability = language.getProbability();   // detection probability of the top detected language

Back to top

Methods to build the LanguageDetectionSettings

Configuring ISO 639-1 codes

In some classification tasks, you may already know that your language data is not written in the Latin script, such as with languages that use different alphabets. In these situations, the accuracy of language detection can improve by either excluding unrelated languages from the process or by focusing specifically on the languages that are relevant:

.fromAllIsoCodes639_1()

  • Default: N/A
  • Description: Enables the library to perform language detection for all the 72 supported languages by their ISO 639-1 codes

.fromIsoCodes639_1(String)

  • Default: N/A
  • Description: Enables the library to perform language detection for specific languages by the ISO 639-1 codes
LanguageDetectionSettings
    .fromAllIsoCodes639_1()
    .build();

LanguageDetectionSettings
    .fromIsoCodes639_1("en,ja,es,fr,de,it,zh-cn")
    .build();

Back to top

Maximum text chars

.withMaxTextChars(Integer)

  • Default: 3,000. The default limit is set to 3,000 characters (this corresponds to roughly a 2- to 3-page document). For comparison, in Solr, the default maximum text length is set to 20,000 characters.
  • Description: Restricts the maximum number of characters from the input text that will be processed for language detection by the library. This functionality is valuable because the library does not need to analyze the entire document to accurately detect the language; a sufficient portion of the text is often enough to achieve reliable results.
LanguageDetectionSettings
    .fromIsoCodes639_1("en,ja,es,fr,de,it,zh-cn")
    .withMaxTextChars(3000)
    .build();

Back to top

Skipping input sanitization for search

.withoutSanitizeForSearch()

  • Default: true (perform input sanitization for search). By default, the library sanitizes short input strings for search purposes by removing file extensions from any part of the text and filtering out Solr boolean operators (AND, NOT, and OR), as these elements are irrelevant to language detection.
  • Description: Invoking this API bypasses the sanitization process for short input strings, allowing the text to be processed without such modifications.
LanguageDetectionSettings
    .fromIsoCodes639_1("en,ja,es,fr,de,it,zh-cn")
    .withoutSanitizeForSearch()
    .build();

Back to top

Classify any Chinese content as Japanese

.withClassifyChineseAsJapanese()

  • Default: false (does not classify Chinese text as Japanese)
  • Description: Invoking this API enables the classification of Kanji-only text (text containing only Chinese characters, without any Japanese Hiragana or Katakana characters) or mixed text containing both Latin and Kanji characters as Japanese. This functionality is particularly important when Japanese identification must be prioritized. As such, this config option aims to optimize for more accurate language detection to minimize the misclassification of Japanese text. Additionally, this approach proves useful when identifying the language of very short strings.
LanguageDetectionSettings
    .fromIsoCodes639_1("en,ja,es,fr,de,it,zh-cn")
    .withClassifyChineseAsJapanese()
    .build();

Back to top

General minimum detection certainty

.withMininumCertainty(Float)

  • Default: 0.1f. Specifies a certainty threshold value between 0...1.
  • Description: The library requires that the language identification probability surpass a predefined threshold for any detected language. If the probability falls short of this threshold, the library systematically filters out those languages, excluding them from the results.

Please be aware that the .withMininumCertainty(Float) method cannot be used in conjunction with the .withTopLanguageMininumCertainty(Float, String) method (explained in the next section). The setting that is applied last during the configuration process will take priority.

LanguageDetectionSettings
    .fromIsoCodes639_1("en,ja,es,fr,de,it,zh-cn")
    .withMininumCertainty(0.65f)
    .build();

Back to top

Minimum detection certainty for top language with a fallback

.withTopLanguageMininumCertainty(Float, String)

  • Default: Not set. Specifies a certainty threshold value between 0...1 and a fallback language ISO 639-1 code.
  • Description: The language identification probability must exceed the threshold value for the top detected language. If this threshold is not met, the library defaults to the configured ISO 639-1 fallback code, treating it as the top and sole detected language.

Please be aware that the .withTopLanguageMininumCertainty(Float, String) method cannot be used in conjunction with the .withMininumCertainty(Float) method (explained in the previous section). The setting that is applied last during the configuration process will take priority.

LanguageDetectionSettings
    .fromIsoCodes639_1("en,ja,es,fr,de,it,zh-cn")
    .withTopLanguageMininumCertainty(0.65f, "en")
    .build();

Back to top

Local development

System requirements

  • The library keeps Java 11 source compatibility at the moment
  • At least JDK 11

Pre-commit Hook

Before your first commit, run this command in the root project directory:

cp pre-commit .git/hooks

If you forget to do this, there is a Gradle task defined in build.gradle that installs the hook for you.

Back to top

Build system

The project uses Gradle as its build system.

List of Gradle tasks

For list of all the available Gradle tasks, run the following command:

./gradlew tasks

Building

Building and packaging can be done with the following command:

./gradlew build

Formatting

The sources will be auto-formatted using the Google Java format upon each commit. But should there be a need to format manually, run the following command:

./gradlew googleJavaFormat

Back to top

Testing

Unit tests

To run unit tests, run the following command:

./gradlew test

Back to top

Classification accuracy analysis

The classification accuracy analysis helps to improve our understanding of how the library performs on texts of various lengths and types, see src/accuracyTest/java/io/github/azagniotov/language/LanguageDetectorAccuracyTest.java

To run the classification accuracy tests and generate an accuracy report CSV, run the following command:

./gradlew clean accuracyTest

The generated report will be found under build/reports/accuracy/accuracy-report-<UNIX_TIMESTAMP>.csv

Back to top
