4 labeled samples per language + 3 min training = language detector with 99% F1 😎
This is an implementation of automatic language identification based on XLM-RoBERTa [1]. We support the following 20 languages:
Arabic (ar), Bulgarian (bg), German (de), Modern Greek (el), English (en), Spanish (es), French (fr), Hindi (hi),
Italian (it), Japanese (ja), Dutch (nl), Polish (pl), Portuguese (pt), Russian (ru), Swahili (sw), Thai (th),
Turkish (tr), Urdu (ur), Vietnamese (vi), and Chinese (zh)
- python 3.8
- PyTorch 1.11.0
- transformers 4.18.1
- numpy 1.21.5
Please download our trained model from here and put it under the ./results/ directory.
Our method can perform sentence-level language identification. Here we give an example:
for the document ./example/example.txt
with multiple sentences,
...
綺麗にCDが収納できるからとても良い!
Fonctionne très bien
Love, love, love this! Made cutting my diamonds and triangles a breeze and corners were sharp and precise!
翻译的很差,语句和逻辑不通,耐着性子好几次,实在是读不下去。
...
use the following command:
bash run.sh
The generated file ./example/example_pred.txt
gives the predicted language for each sentence.
...
Japanese:綺麗にCDが収納できるからとても良い!
French:Fonctionne très bien
English:Love, love, love this! Made cutting my diamonds and triangles a breeze and corners were sharp and precise!
Chinese:翻译的很差,语句和逻辑不通,耐着性子好几次,实在是读不下去。
...
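For reference, here is a minimal sketch of what the prediction step could look like. The actual logic lives in run.sh; the sketch assumes the checkpoint under ./results/ is a standard XLM-RoBERTa sequence classifier whose id2label maps class indices to language names.

```python
# Minimal prediction sketch (illustrative only; run.sh wraps the real script).
# Assumption: ./results/ contains a standard sequence-classification checkpoint.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./results/")
model = AutoModelForSequenceClassification.from_pretrained("./results/")
model.eval()

with open("./example/example.txt", encoding="utf-8") as f:
    sentences = [line.strip() for line in f if line.strip()]

with torch.no_grad():
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    pred_ids = model(**batch).logits.argmax(dim=-1).tolist()

with open("./example/example_pred.txt", "w", encoding="utf-8") as f:
    for sentence, pred in zip(sentences, pred_ids):
        f.write(f"{model.config.id2label[pred]}:{sentence}\n")
```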
We use the language identification dataset hosted on the Hugging Face Hub to train and evaluate our model. It is a collection of 90k samples, each consisting of a text passage and its corresponding language label. The dataset was created by collecting data from three sources: the Multilingual Amazon Reviews Corpus, XNLI, and STSb Multi MT.
'labels': 'fr', 'text': 'Conforme à la description, produit pratique.'
'labels': 'zh', 'text': '有句话说,懂得很多道理,但是仍然过不好这一生。'
'labels': 'en', 'text': 'It was very over priced.'
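The samples can be loaded with the Hugging Face `datasets` library; the dataset ID below is only a placeholder for the dataset linked above.

```python
# Illustrative loading snippet; replace the placeholder ID with the dataset above.
from datasets import load_dataset

dataset = load_dataset("language-identification-dataset-id")  # hypothetical ID
print(dataset["train"][0])
# e.g. {'labels': 'fr', 'text': 'Conforme à la description, produit pratique.'}
```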
The statistics are listed in the table below:
| #train | #val | #test |
|---|---|---|
| 3,500 x 20 = 70,000 | 500 x 20 = 10,000 | 500 x 20 = 10,000 |
We provide two methods based on XLM-RoBERTa:
- fine-tuning: a simple classifier on top of XLM-RoBERTa.
- prompt-tuning: we append the template "The language of this sentence is [MASK]" to the sentence and predict the language by having the model fill in [MASK] with the corresponding language name (see the sketch below).
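For illustration, here is a rough sketch of the prompt-based prediction idea, scoring language-name verbalizers with XLM-RoBERTa's masked-LM head. The actual prompt-tuning lives in train.sh; the verbalizer choice and the single-subword scoring below are simplifying assumptions.

```python
# Rough sketch of prompt-based scoring (illustrative; not the train.sh code).
# [MASK] in the template corresponds to the tokenizer's mask token (<mask>).
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForMaskedLM.from_pretrained("xlm-roberta-base")
model.eval()

languages = ["English", "French", "Japanese", "Chinese"]  # subset for illustration
# Simplification: score each language name by the logit of its first subword.
verbalizer_ids = [tokenizer(name, add_special_tokens=False).input_ids[0]
                  for name in languages]

def predict_language(sentence: str) -> str:
    prompt = f"{sentence} The language of this sentence is {tokenizer.mask_token}."
    inputs = tokenizer(prompt, return_tensors="pt")
    mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
    with torch.no_grad():
        logits = model(**inputs).logits[0, mask_pos]
    return languages[int(logits[verbalizer_ids].argmax())]

print(predict_language("Fonctionne très bien"))
```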
To train our model, run the following command in the root directory:
bash train.sh
The experiments can be conducted on one GPU with 24GB of memory.
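For orientation, the fine-tuning baseline could be sketched roughly as below using the `transformers` Trainer. The dataset ID, split names, and hyperparameters are illustrative assumptions, not the exact settings behind the reported numbers.

```python
# Rough sketch of the fine-tuning baseline (train.sh implements the real setup).
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

LANGS = ["ar", "bg", "de", "el", "en", "es", "fr", "hi", "it", "ja",
         "nl", "pl", "pt", "ru", "sw", "th", "tr", "ur", "vi", "zh"]
label2id = {lang: i for i, lang in enumerate(LANGS)}

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=len(LANGS),
    label2id=label2id, id2label={i: lang for lang, i in label2id.items()})

def preprocess(batch):
    enc = tokenizer(batch["text"], truncation=True, max_length=128)
    enc["label"] = [label2id[lang] for lang in batch["labels"]]
    return enc

dataset = load_dataset("language-identification-dataset-id")  # hypothetical ID
encoded = dataset.map(preprocess, batched=True, remove_columns=["labels", "text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="./results", num_train_epochs=3,
                           per_device_train_batch_size=32, learning_rate=2e-5),
    train_dataset=encoded["train"],      # split names assumed
    eval_dataset=encoded["validation"],
    tokenizer=tokenizer,                 # enables dynamic padding in the collator
)
trainer.train()
```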
We also conduct few-shot experiments, using K labeled instances per language for training and validation, respectively. The K-shot data can be generated automatically with the following command:
bash generate_k_shot_data.sh
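Conceptually, the K-shot subsets are just K labeled examples sampled per language from the training split. A minimal illustrative helper (not the script itself) might look like this:

```python
# Illustrative K-shot sampler; generate_k_shot_data.sh does the real work.
import random
from collections import defaultdict

def sample_k_shot(examples, k, seed=42):
    """examples: iterable of dicts with 'labels' (language code) and 'text'."""
    by_lang = defaultdict(list)
    for example in examples:
        by_lang[example["labels"]].append(example)
    rng = random.Random(seed)
    subset = []
    for items in by_lang.values():
        subset.extend(rng.sample(items, k))
    return subset

# e.g. a 4-shot training set: 4 x 20 languages = 80 labeled sentences
# few_shot_train = sample_k_shot(dataset["train"], k=4)
```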
The experimental results (F1, %) are shown in the table below.
| Method | K=1 | K=2 | K=4 | K=8 | Full |
|---|---|---|---|---|---|
| fine-tuning | 18.6 | 46.9 | 98.0 | 99.3 | 99.6 |
| prompt-tuning | 95.5 | 98.5 | 99.4 | 99.5 | 99.7 |
It can be observed that prompt-tuning is much more effective in the extremely low-resource settings.
Though effective, our method can only detect one language per sentence. In future work, we would like to address language identification for code-mixed text [2]. As shown in the figure below, our rough idea is to perform token-level classification.
Due to the lack of code-mixed data, we have not yet implemented this model. However, we have found an off-the-shelf tool that supports the automatic generation of grammatically valid synthetic code-mixed data, so we can generate the data we need and, hopefully, train the model.
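For concreteness, here is a rough sketch of that token-level idea built on XLM-RoBERTa's token-classification head. It is not part of the released code, and everything below is hypothetical.

```python
# Hypothetical sketch: tag each token with a language instead of labeling
# the whole sentence (the head below is randomly initialized, i.e. untrained).
from transformers import AutoModelForTokenClassification, AutoTokenizer

LANGS = ["ar", "bg", "de", "el", "en", "es", "fr", "hi", "it", "ja",
         "nl", "pl", "pt", "ru", "sw", "th", "tr", "ur", "vi", "zh"]

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForTokenClassification.from_pretrained(
    "xlm-roberta-base", num_labels=len(LANGS))

inputs = tokenizer("I really like this 小说", return_tensors="pt")
per_subword_pred = model(**inputs).logits.argmax(dim=-1)  # one label per subword
```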
[1] Conneau A, Khandelwal K, Goyal N, et al. Unsupervised cross-lingual representation learning at scale. In Proceedings of ACL, 2020.
[2] Zhang Y, Riesa J, Gillick D, et al. A fast, compact, accurate model for language identification of codemixed text. In Proceedings of EMNLP, 2018.