Role-Play Datasets in Multiple Languages
Roleplaying is very important in the AI era. We have been role-playing as doctors, engineers, and other professionals since we were young. Now, as technology grows more powerful, roleplaying has become an important part of our lives.
In a large language model (LLM), roleplaying can bring empathy, which results in more engagement with the user. That is why roleplaying needs to be brought to the world of language models.
However, most of the world's languages are low-resource and are not well supported by open-source LLMs. Fine-tuning these LLMs is a way to bring the technology to local communities.
Datasets for fine-tuning a language model for roleplaying are very rare. For this reason, I, Min Si Thu, created datasets in multiple languages for role-play fine-tuning.
The base dataset is the GPTeacher role-play dataset by teknium1, which can be found under this link and is released under the MIT License. The dataset was then translated into each of the respective languages. The translation process is powered by Google Translate, using the Cloud Translation API.
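The translation step can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the exact script used to build these datasets: the field names `instruction`, `input`, and `response` are assumed from the GPTeacher format and should be checked against the actual files, and the helper names `make_google_translate_fn` and `translate_records` are hypothetical. The Google-backed translator needs the `google-cloud-translate` package and configured credentials, so the translator is passed in as a function and can be swapped for a stand-in when working offline.

```python
from typing import Callable, Dict, Iterable, List, Sequence

# Assumed field names; verify them against the actual GPTeacher JSON files.
TEXT_FIELDS = ("instruction", "input", "response")

# A translator takes (text, target language code) and returns translated text.
TranslateFn = Callable[[str, str], str]


def make_google_translate_fn() -> TranslateFn:
    """Build a translator backed by the Cloud Translation API (v2 client).

    Hypothetical helper: requires google-cloud-translate and credentials.
    """
    # Imported lazily so the rest of this sketch runs without the package.
    from google.cloud import translate_v2 as translate

    client = translate.Client()

    def translate_text(text: str, target_lang: str) -> str:
        result = client.translate(
            text,
            target_language=target_lang,
            source_language="en",
            format_="text",  # plain text, so the output is not HTML-escaped
        )
        return result["translatedText"]

    return translate_text


def translate_records(
    records: Iterable[Dict[str, str]],
    target_lang: str,
    translate_fn: TranslateFn,
    fields: Sequence[str] = TEXT_FIELDS,
) -> List[Dict[str, str]]:
    """Return copies of the records with the given text fields translated."""
    translated = []
    for record in records:
        new_record = dict(record)  # leave the original record untouched
        for field in fields:
            value = record.get(field)
            # Skip missing or empty fields to avoid needless API calls.
            if isinstance(value, str) and value.strip():
                new_record[field] = translate_fn(value, target_lang)
        translated.append(new_record)
    return translated


if __name__ == "__main__":
    # Offline stand-in translator; replace with make_google_translate_fn().
    def fake_translate(text: str, lang: str) -> str:
        return f"[{lang}] {text}"

    sample = [{"instruction": "Act as a doctor.", "input": "", "response": "Hello."}]
    print(translate_records(sample, "my", fake_translate))
```

Passing the translator in as an argument keeps the record-handling logic independent of any one translation service, which also makes it easy to re-run the same loop for each language code in the list.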
To the best of my knowledge, these datasets could be the very first role-play datasets for most of the low-resource languages listed below.
Datasets are available for the following languages; they can be found in the Hugging Face collections.
- Burmese (my)
- Lao (lo)
- Khmer (khm)
- Malay (ms)
- Vietnamese (vi)
- Thai (th)
- Hindi (hi)
- Indonesian (id)
- Filipino (fil)
- Bengali (bn)
- Afrikaans (af)
- Albanian (sq)
- Amharic (am)
- Georgian (ka)
- Irish (ga)
- Zulu (zu)
- Serbian (sr)
- Kinyarwanda (rw)
- Somali (so)
- Kurdish (ku)
- Hausa (ha)
- Icelandic (is)
- Nepali (ne)
- Panjabi/Punjabi (pa)
- Tamil (ta)
- Yiddish (yi)
- Hebrew (he)
- Azerbaijani (az)
- Kazakh (kk)
- Cebuano (ceb)
- Turkish (tr)
- Finnish (fin)
- Czech (cs)
- Norwegian (no)
- Mongolian (mn)
- Lithuanian (lt)
For more information, contact Min Si Thu.