Prepare SentencePiece (T5, Llama2) and byte-level (GPT2, RoBERTa) BPE tokenizers on Malaysian texts (Jawi, Melayu, Manglish, Mandarin, Tamil), trained on the datasets below (a sketch for dumping them into one corpus follows the list),
- https://huggingface.co/datasets/malaysia-ai/dedup-text-dataset
- https://huggingface.co/datasets/mesolitica/translated-code-instructions-122k
- https://huggingface.co/datasets/mesolitica/translated-unnatural_code_instructions_20M
- https://huggingface.co/datasets/mesolitica/translated-python-evol-instruct-51k
- https://huggingface.co/datasets/mesolitica/google-translate-ms-pa
- https://huggingface.co/datasets/mesolitica/google-translate-ms-ta
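A minimal sketch of streaming these datasets into a single training corpus. The `train` split, default config, and `text` column are assumptions, check each dataset card for the actual schema,

```python
from datasets import load_dataset

datasets = [
    'malaysia-ai/dedup-text-dataset',
    'mesolitica/translated-code-instructions-122k',
    # ... and the rest of the list above
]

with open('combined.txt', 'w') as fopen:
    for name in datasets:
        # streaming avoids materializing the full dataset on disk
        rows = load_dataset(name, split='train', streaming=True)
        for row in rows:
            # `text` column name is an assumption
            fopen.write(row['text'].strip() + '\n')
```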
- Load the SentencePiece tokenizer,

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('malaysia-ai/sentencepiece-tokenizer')
tokenizer.encode('husein comel')  # Malay
tokenizer.encode('husein cute')  # English / Manglish
tokenizer.encode('حسين چوميل')  # Jawi
tokenizer.encode('侯赛因很可爱')  # Mandarin
```
- Load the byte-level BPE tokenizer,

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('malaysia-ai/bpe-tokenizer')
tokenizer.encode('husein comel')  # Malay
tokenizer.encode('husein cute')  # English / Manglish
tokenizer.encode('حسين چوميل')  # Jawi
tokenizer.encode('侯赛因很可爱')  # Mandarin
tokenizer.encode('ஹுசைன் அழகாக இருக்கிறார்')  # Tamil
```
- Train SentencePiece,

```bash
python3 train-sentencepiece.py
```

When training SentencePiece,

- Always partition long texts, because the trainer skips sentences longer than its `max_sentence_length`; see the sketch below.

We use Standard_HB60-15rs to train.
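A minimal sketch of the SentencePiece training step with partitioning. The vocab size, model type, character coverage, and 4096-character partition length are assumptions, not the repo's actual settings,

```python
import sentencepiece as spm

def partition(text, max_len=4096):
    # split long documents into fixed-size chunks so the trainer
    # does not skip them for exceeding `max_sentence_length`
    return [text[i:i + max_len] for i in range(0, len(text), max_len)]

with open('combined.txt') as fopen, open('partitioned.txt', 'w') as fout:
    for line in fopen:
        for chunk in partition(line.strip()):
            fout.write(chunk + '\n')

spm.SentencePieceTrainer.train(
    input='partitioned.txt',
    model_prefix='sentencepiece',
    vocab_size=32000,
    model_type='bpe',  # T5 uses unigram, Llama2 uses BPE
    character_coverage=0.9999,  # keep rare Jawi / Mandarin / Tamil characters
)
```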
- Train BPE,

```bash
python3 train-bpe.py
```

We use Standard_HB60-15rs to train.
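A minimal sketch of the byte-level BPE training step using HuggingFace `tokenizers`. The vocab size, minimum frequency, special tokens, and output directory are assumptions, not the repo's actual settings,

```python
import os

from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=['combined.txt'],
    vocab_size=32000,
    min_frequency=2,
    special_tokens=['<s>', '</s>', '<pad>', '<unk>', '<mask>'],
)

# byte-level BPE never produces out-of-vocabulary tokens: any script
# (Jawi, Mandarin, Tamil, ...) falls back to raw bytes
os.makedirs('bpe-tokenizer', exist_ok=True)
tokenizer.save_model('bpe-tokenizer')
```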