This repository provides a data processing pipeline for large language model training. It consists of four stages: initial data cleaning, near deduplication, exact deduplication, and a second round of data cleaning. The data cleaning stages are especially optimized for South-East Asian languages (e.g., Thai).
Install the packages and download the models for data cleaning. Here we only download the models for English, Chinese, Thai, Vietnamese, Indonesian, Malay, and Lao. You can add more languages by modifying the --used_language_ids parameter. The full language list can be found here.
pip install -r requirements.txt
mkdir lm_resource
wget https://huggingface.co/datasets/sail/sailcraft_lm_resource/resolve/main/lid.176.bin -P lm_resource
python code/data_cleaning/download_sentencepiece_kenlm_models.py --used_language_ids en zh th vi id ms lo --output_dir_path lm_resource
Install Rust, which the exact deduplication stage requires. Refer to this guidance for more details.
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
. "$HOME/.cargo/env"
We sample 1,000 lines from the cc100 Indonesian subset for a preliminary analysis.
Execute the script by running:
bash run_example.sh
Upon successful execution, you should observe the following logs indicating the processing stages:
Counting lines in cleaned data output: 987
Counting lines in near deduplication output: 974
Counting lines in exact deduplication output: 963
Counting lines in final output: 949
This output confirms the sequential filtering and deduplication stages of the dataset.
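To make the stage-by-stage reduction concrete, the short sketch below computes how many lines each stage removed, using the counts from the log above and the 1,000-line sample size. The function and variable names are illustrative only and are not part of the repository.

```python
# Line counts copied from the example log above; 1000 is the sample size.
stage_counts = [
    ("input sample", 1000),
    ("initial data cleaning", 987),
    ("near deduplication", 974),
    ("exact deduplication", 963),
    ("second data cleaning", 949),
]


def stage_report(counts):
    """Return (stage, lines_removed, fraction_of_input_kept) for each stage."""
    total = counts[0][1]
    report = []
    # Pair each stage with the one before it to get the per-stage difference.
    for (_, before), (name, after) in zip(counts, counts[1:]):
        report.append((name, before - after, after / total))
    return report


for name, removed, kept in stage_report(stage_counts):
    print(f"{name}: removed {removed} lines, {kept:.1%} of the input remains")
```

On this sample, each stage removes between 11 and 14 lines, and 949 of the 1,000 input lines reach the final output.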
The final output can be accessed at data/data_output/final_output/sample/data_clean.jsonl.
To integrate your own dataset into the project, follow these steps:
- Prepare Your Dataset: Place your dataset file, named ALIAS.jsonl, in the ./data/data_input/ directory.
- Configure Script Variables: Adjust the ALIAS and LANGUAGE variables in the ./run_example.sh script to correspond with your dataset details.
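The first step can be sketched as follows, assuming each line of ALIAS.jsonl is a JSON object that holds the document under a text key. That field name, the helper write_jsonl, and the alias my_corpus are assumptions made for illustration; check the repository's sample input for the actual schema before relying on this.

```python
import json
from pathlib import Path


def write_jsonl(documents, alias, input_dir="data/data_input", text_field="text"):
    """Write one JSON object per line to <input_dir>/<alias>.jsonl and return the path.

    text_field="text" is an assumed field name, not confirmed by the repository.
    """
    out_dir = Path(input_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f"{alias}.jsonl"
    with out_path.open("w", encoding="utf-8") as f:
        for doc in documents:
            # ensure_ascii=False keeps non-Latin scripts (Thai, Lao, ...) readable.
            # json.dumps escapes embedded newlines, so each record stays on one line.
            f.write(json.dumps({text_field: doc}, ensure_ascii=False) + "\n")
    return out_path
```

Calling write_jsonl(["Halo dunia."], alias="my_corpus") would create data/data_input/my_corpus.jsonl, after which ALIAS in run_example.sh would be set to my_corpus.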
Ensure proper configuration of the processes by setting the following parameters:
- Data Cleaning: Set the parameters for each filter. Detailed configuration can be found here.
- Near Deduplication: Specify the number of permutations to use in MinHash by referring to the example here.
- Exact Deduplication: Set the length of the substrings to be identified as duplicates, as shown in the example here.
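To give a feel for what the number of permutations controls in near deduplication, here is a minimal, generic MinHash sketch. It is not the repository's implementation: the parameter name num_perm, the character 5-gram shingling, and the use of salted BLAKE2b hashes to stand in for random permutations are all assumptions chosen for illustration.

```python
import hashlib


def shingles(text, n=5):
    """Return the set of character n-grams in text (the whole text if shorter than n)."""
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}


def minhash_signature(items, num_perm=128):
    """Compute a MinHash signature: the minimum hash value under each salted hash."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # one salt per simulated permutation
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for item in items
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Estimate Jaccard similarity as the fraction of matching signature positions."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


doc_a = "Selamat pagi, apa kabar hari ini?"
doc_b = "Selamat pagi, apa kabar hari ini??"  # near duplicate of doc_a
doc_c = "Harga beras naik minggu lalu."       # unrelated text

sig_a, sig_b, sig_c = (minhash_signature(shingles(d)) for d in (doc_a, doc_b, doc_c))
print("a vs b:", estimated_jaccard(sig_a, sig_b))
print("a vs c:", estimated_jaccard(sig_a, sig_c))
```

More permutations give a less noisy similarity estimate at the cost of longer signatures and more hashing, which is the trade-off this parameter governs.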
- For data cleaning, check the code/data_cleaning/filtering_logs for each filter.
- Run code/exact_dedup/scripts/count_topk_occurrences.py to obtain the top-k occurrences.
python code/exact_dedup/scripts/count_topk_occurrences.py \
--data_alias sample \
--split train \
--top_k_number 100 \
--threshold 2 \
--cache_dir cache/exact_dedup_cache
This script displays the top 100 most frequent text spans that occur more than twice in the dataset.
Count | Span |
---|---|
4 | 'pernah disentuh oleh manusia sebelum mereka (penghuni-penghuni surga yang menjadi suami mereka) dan' |
4 | 'k pernah disentuh oleh manusia sebelum mereka (penghuni-penghuni surga yang menjadi suami mereka) da' |
4 | 'nah disentuh oleh manusia sebelum mereka (penghuni-penghuni surga yang menjadi suami mereka) dan tid' |
4 | 'sentuh oleh manusia sebelum mereka (penghuni-penghuni surga yang menjadi suami mereka) dan tidak pul' |
4 | 'uh oleh manusia sebelum mereka (penghuni-penghuni surga yang menjadi suami mereka) dan tidak pula ol' |
4 | 'ah disentuh oleh manusia sebelum mereka (penghuni-penghuni surga yang menjadi suami mereka) dan tida' |
4 | 'ak pernah disentuh oleh manusia sebelum mereka (penghuni-penghuni surga yang menjadi suami mereka) d' |
4 | 'ernah disentuh oleh manusia sebelum mereka (penghuni-penghuni surga yang menjadi suami mereka) dan t' |
3 | 'manusia sebelum mereka (penghuni-penghuni surga yang menjadi suami mereka) dan tidak pula oleh jin.' |
3 | 'tidak pernah disentuh oleh manusia sebelum mereka (penghuni-penghuni surga yang menjadi suami mereka' |
3 | 'tidak pernah disentuh oleh manusia sebelum mereka (penghuni-penghuni surga yang menjadi suami merek' |
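The idea behind such a report can be illustrated with a brute-force sketch that counts every fixed-length substring and keeps the most frequent ones. This is a stand-in written for illustration, not the logic of count_topk_occurrences.py; the function name, the span_length parameter, and the "count greater than threshold" rule (taken from the "more than twice" description above) are assumptions.

```python
from collections import Counter


def count_topk_spans(documents, span_length, top_k=100, threshold=2):
    """Return up to top_k (span, count) pairs for spans occurring more than threshold times."""
    counts = Counter()
    for doc in documents:
        # Slide a window of span_length characters over the document.
        for start in range(len(doc) - span_length + 1):
            counts[doc[start:start + span_length]] += 1
    frequent = [(span, c) for span, c in counts.items() if c > threshold]
    # Most frequent first; ties broken alphabetically for a stable order.
    frequent.sort(key=lambda pair: (-pair[1], pair[0]))
    return frequent[:top_k]


docs = ["abcdef xyz", "abcdef 123", "abcdef", "zzz abcdef"]
for span, count in count_topk_spans(docs, span_length=6):
    print(count, repr(span))
```

Because every window position is counted separately, one long repeated passage yields many overlapping spans that differ only by a small shift, which is likely why the table above lists several near-identical entries.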
Thanks to the contributors of the following projects:
If you use this repository or the Sailor models, please cite:
@inproceedings{dou-etal-2024-sailor,
title = "Sailor: Open Language Models for South-{E}ast {A}sia",
author = "Dou, Longxu and Liu, Qian and Zeng, Guangtao and Guo, Jia and Zhou, Jiahui and Mao, Xin and Jin, Ziqi and Lu, Wei and Lin, Min",
booktitle = "Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: System Demonstrations",
year = "2024",
}
If you have any questions, please raise an issue on our GitHub repository or contact [email protected].