4 labeled samples per language + 3 min training = language detector with 99% F1 😎
This is an implementation of automatic language identification based on XLM-RoBERTa [1]. We support the following 20 languages:
Arabic (ar), Bulgarian (bg), German (de), Modern Greek (el), English (en), Spanish (es), French (fr), Hindi (hi),
Italian (it), Japanese (ja), Dutch (nl), Polish (pl), Portuguese (pt), Russian (ru), Swahili (sw), Thai (th),
Turkish (tr), Urdu (ur), Vietnamese (vi), and Chinese (zh)
- python 3.8
- PyTorch 1.11.0
- transformers 4.18.1
- numpy 1.21.5
Please download our trained model from here and put it under the ./results/ directory.
Our method can perform sentence-level language identification. Here we give an example:
for the document ./example/example.txt
with multiple sentences,
...
綺麗にCDが収納できるからとても良い!
Fonctionne très bien
Love, love, love this! Made cutting my diamonds and triangles a breeze and corners were sharp and precise!
翻译的很差,语句和逻辑不通,耐着性子好几次,实在是读不下去。
...
use the following command:
bash run.sh
The generated file ./example/example_pred.txt
gives the predicted language for each sentence.
...
Japanese:綺麗にCDが収納できるからとても良い!
French:Fonctionne très bien
English:Love, love, love this! Made cutting my diamonds and triangles a breeze and corners were sharp and precise!
Chinese:翻译的很差,语句和逻辑不通,耐着性子好几次,实在是读不下去。
...
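For reference, here is a minimal sketch of what the prediction step could look like. The actual logic lives in run.sh; the sketch assumes the checkpoint under ./results/ is a standard XLM-RoBERTa sequence classifier whose id2label maps class indices to language names.

```python
# Minimal prediction sketch (illustrative only; run.sh wraps the real script).
# Assumption: ./results/ contains a standard sequence-classification checkpoint.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./results/")
model = AutoModelForSequenceClassification.from_pretrained("./results/")
model.eval()

with open("./example/example.txt", encoding="utf-8") as f:
    sentences = [line.strip() for line in f if line.strip()]

with torch.no_grad():
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    pred_ids = model(**batch).logits.argmax(dim=-1).tolist()

with open("./example/example_pred.txt", "w", encoding="utf-8") as f:
    for sentence, pred in zip(sentences, pred_ids):
        f.write(f"{model.config.id2label[pred]}:{sentence}\n")
```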
We use the language identification dataset hosted on the Hugging Face Hub to train and evaluate our model. It is a collection of 90k samples, each consisting of a text passage and its corresponding language label. The dataset was created by collecting data from three sources: the Multilingual Amazon Reviews Corpus, XNLI, and STSb Multi MT.
'labels': 'fr', 'text': 'Conforme à la description, produit pratique.'
'labels': 'zh', 'text': '有句话说,懂得很多道理,但是仍然过不好这一生。'
'labels': 'en', 'text': 'It was very over priced.'
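The samples can be loaded with the Hugging Face `datasets` library; the dataset ID below is only a placeholder for the dataset linked above.

```python
# Illustrative loading snippet; replace the placeholder ID with the dataset above.
from datasets import load_dataset

dataset = load_dataset("language-identification-dataset-id")  # hypothetical ID
print(dataset["train"][0])
# e.g. {'labels': 'fr', 'text': 'Conforme à la description, produit pratique.'}
```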
The statistics are listed in the table below:
| #train | #val | #test |
|---|---|---|
| 3,500 x 20 = 70,000 | 500 x 20 = 10,000 | 500 x 20 = 10,000 |
We provide two methods based on XLM-RoBERTa:
- fine-tuning: a simple classifier on top of XLM-RoBERTa.
- prompt-tuning: we append the template "The language of this sentence is [MASK]" to the sentence and predict the language by having the model fill in [MASK] with the corresponding language name (see the sketch below).
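For illustration, here is a rough sketch of the prompt-based prediction idea, scoring language-name verbalizers with XLM-RoBERTa's masked-LM head. The actual prompt-tuning lives in train.sh; the verbalizer choice and the single-subword scoring below are simplifying assumptions.

```python
# Rough sketch of prompt-based scoring (illustrative; not the train.sh code).
# [MASK] in the template corresponds to the tokenizer's mask token (<mask>).
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForMaskedLM.from_pretrained("xlm-roberta-base")
model.eval()

languages = ["English", "French", "Japanese", "Chinese"]  # subset for illustration
# Simplification: score each language name by the logit of its first subword.
verbalizer_ids = [tokenizer(name, add_special_tokens=False).input_ids[0]
                  for name in languages]

def predict_language(sentence: str) -> str:
    prompt = f"{sentence} The language of this sentence is {tokenizer.mask_token}."
    inputs = tokenizer(prompt, return_tensors="pt")
    mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
    with torch.no_grad():
        logits = model(**inputs).logits[0, mask_pos]
    return languages[int(logits[verbalizer_ids].argmax())]

print(predict_language("Fonctionne très bien"))
```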
To train our model, run the following command in the root directory:
bash train.sh
The experiments can be conducted on one GPU with 24GB of memory.
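For orientation, the fine-tuning baseline could be sketched roughly as below using the `transformers` Trainer. The dataset ID, split names, and hyperparameters are illustrative assumptions, not the exact settings behind the reported numbers.

```python
# Rough sketch of the fine-tuning baseline (train.sh implements the real setup).
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

LANGS = ["ar", "bg", "de", "el", "en", "es", "fr", "hi", "it", "ja",
         "nl", "pl", "pt", "ru", "sw", "th", "tr", "ur", "vi", "zh"]
label2id = {lang: i for i, lang in enumerate(LANGS)}

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=len(LANGS),
    label2id=label2id, id2label={i: lang for lang, i in label2id.items()})

def preprocess(batch):
    enc = tokenizer(batch["text"], truncation=True, max_length=128)
    enc["label"] = [label2id[lang] for lang in batch["labels"]]
    return enc

dataset = load_dataset("language-identification-dataset-id")  # hypothetical ID
encoded = dataset.map(preprocess, batched=True, remove_columns=["labels", "text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="./results", num_train_epochs=3,
                           per_device_train_batch_size=32, learning_rate=2e-5),
    train_dataset=encoded["train"],      # split names assumed
    eval_dataset=encoded["validation"],
    tokenizer=tokenizer,                 # enables dynamic padding in the collator
)
trainer.train()
```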
We also conduct few-shot experiments, using K labeled instances per language for training and validation, respectively. The K-shot data can be generated automatically with the following command:
bash generate_k_shot_data.sh
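Conceptually, the K-shot subsets are just K labeled examples sampled per language from the training split. A minimal illustrative helper (not the script itself) might look like this:

```python
# Illustrative K-shot sampler; generate_k_shot_data.sh does the real work.
import random
from collections import defaultdict

def sample_k_shot(examples, k, seed=42):
    """examples: iterable of dicts with 'labels' (language code) and 'text'."""
    by_lang = defaultdict(list)
    for example in examples:
        by_lang[example["labels"]].append(example)
    rng = random.Random(seed)
    subset = []
    for items in by_lang.values():
        subset.extend(rng.sample(items, k))
    return subset

# e.g. a 4-shot training set: 4 x 20 languages = 80 labeled sentences
# few_shot_train = sample_k_shot(dataset["train"], k=4)
```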
The experimental results (F1, %) are shown in the table below.
| Method | K=1 | K=2 | K=4 | K=8 | Full |
|---|---|---|---|---|---|
| fine-tuning | 18.6 | 46.9 | 98.0 | 99.3 | 99.6 |
| prompt-tuning | 95.5 | 98.5 | 99.4 | 99.5 | 99.7 |
It can be observed that prompt-tuning is much more effective in the extremely low-resource settings.
Though effective, our method can only detect one language per sentence. In future work, we would like to address language identification for code-mixed text [2]. As shown in the figure below, our rough idea is to perform token-level classification.
Due to the lack of code-mixed data, we have not yet implemented this model. However, we have found an off-the-shelf tool that supports the automatic generation of grammatically valid synthetic code-mixed data, so we can generate the data we need and, hopefully, train the model.
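For concreteness, here is a rough sketch of that token-level idea built on XLM-RoBERTa's token-classification head. It is not part of the released code, and everything below is hypothetical.

```python
# Hypothetical sketch: tag each token with a language instead of labeling
# the whole sentence (the head below is randomly initialized, i.e. untrained).
from transformers import AutoModelForTokenClassification, AutoTokenizer

LANGS = ["ar", "bg", "de", "el", "en", "es", "fr", "hi", "it", "ja",
         "nl", "pl", "pt", "ru", "sw", "th", "tr", "ur", "vi", "zh"]

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForTokenClassification.from_pretrained(
    "xlm-roberta-base", num_labels=len(LANGS))

inputs = tokenizer("I really like this 小说", return_tensors="pt")
per_subword_pred = model(**inputs).logits.argmax(dim=-1)  # one label per subword
```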
[1] Conneau A, Khandelwal K, Goyal N, et al. Unsupervised cross-lingual representation learning at scale. In Proceedings of ACL, 2020.
[2] Zhang Y, Riesa J, Gillick D, et al. A fast, compact, accurate model for language identification of codemixed text. In Proceedings of EMNLP, 2018.