This is an attempt to distribute a heavy workload across multiple Google Cloud VMs through a Google Instance Group, utilizing 1000+ CPUs.
This is useful for large-scale data processing, e.g. NLP corpus pre-processing.
This repo contains an example of creating Chinese X English ELECTRA training data.
- Spin up VMs automatically - use a Google Instance Template to spawn worker VMs
- No race conditions - fetch/save data through a Google Cloud Storage bucket
- Easy to update scripts - update scripts by uploading them to the bucket, without recreating images
- Auto-deletion of VMs to avoid extra charges - each VM deletes itself after finishing its job
spawn.py
- A work generator, responsible for locking a task and spawning a consumer

convert_to_bytes.py
- A work consumer, responsible for consuming one task in its entire lifetime

update.sh
- Updates the libraries, if any (optional)
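The exact locking scheme lives in spawn.py; conceptually, a race-free lock can be taken directly through the bucket with a GCS generation precondition, so no separate coordination service is needed. A minimal sketch, assuming a hypothetical `locks/` prefix and task name:

```bash
# Try to create the lock object only if it does not exist yet.
# The precondition makes the copy fail when another worker already
# created it, so exactly one worker wins each task.
if echo "locked" | gsutil -h "x-goog-if-generation-match:0" \
    cp - gs://<bucket-name>/locks/task-000.lock 2>/dev/null; then
  echo "acquired task-000"
else
  echo "task-000 already taken, moving on"
fi
```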
Suppose we want to run a script on N machines in Google Cloud.
- Setup a VM with everything installed
  - Libraries (e.g. Huggingface tokenizers)
  - Scripts to be run on each VM (e.g. spawn.py, generate.py)

  and create an image for this VM, as sketched below
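A minimal sketch of the image step, assuming placeholder VM/image names and zone:

```bash
# Stop the VM so the boot disk is in a consistent state,
# then turn the disk into a reusable image
gcloud compute instances stop worker-base --zone=us-central1-a
gcloud compute images create worker-image \
    --source-disk=worker-base \
    --source-disk-zone=us-central1-a
```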
- Prepare the data to be processed in the Google Cloud Storage bucket, e.g. `gs://<bucket-name>/data`, as shown below
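For example, a local corpus directory can be uploaded in parallel (the local path is a placeholder):

```bash
# -m parallelizes the transfer; -r recurses into subdirectories
gsutil -m cp -r /path/to/local/data gs://<bucket-name>/data
```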
- Sync your code to the Google Cloud Storage bucket
BUCKET_NAME=<bucket-name>
REPO_NAME=generate_data
CODE_PATH=/path/to/local/codes/$REPO_NAME
gsutil rsync -x '\.git.*' $CODE_PATH gs://$BUCKET_NAME/gits/$REPO_NAME
- Create an instance template
  - Configure the hardware (e.g. # of cores)
  - Configure a Service Account for access to the Cloud API (use the project service account to allow editing objects in the Google Cloud Storage bucket)
  - Allow network traffic (only needed if the VM is a web server receiving requests)
  - Attach the following startup script; example commands for creating the template and the instance group are sketched after it
#! /bin/bash
USER=<username>
BUCKET_NAME=<bucket-name>
REPO_NAME=generate_data
# The command that runs for every worker
COMMAND="bash update.sh && /opt/conda/bin/python spawn.py $BUCKET_NAME"
# Update codes from bucket
ROOT=/home/$USER
REPO=$ROOT/$REPO_NAME
gsutil rsync gs://$BUCKET_NAME/gits/$REPO_NAME $REPO
# *************************** #
# ******* No edit below ***** #
# *************************** #
LOG_PATH=$ROOT/log.txt
export PATH="/usr/local/bin:/usr/bin:/bin"
function fail { echo "$1" >&2; exit 1; }
# retry: run a command up to $max times, sleeping $delay seconds between attempts
function retry {
  local n=1; local max=50; local delay=3
  while true; do
    "$@" && break || {
      if [[ $n -lt $max ]]; then
        ((n++)); echo "Command failed. Attempt $n/$max:"
        sleep $delay
      else
        fail "The command has failed after $n attempts."
      fi
    }
  done
}
cd $ROOT
touch $LOG_PATH
chmod 777 $LOG_PATH
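# Escape double quotes in $COMMAND so it survives being embedded in the quoted strings below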
COMMAND=${COMMAND//\"/\\\"}
run="echo \"run\" >> $LOG_PATH && \
cd $ROOT &>> $LOG_PATH && \
sleep 2 && \
cd $REPO &>> $LOG_PATH && \
$COMMAND &>> $LOG_PATH; \
export NAME=\$(curl -X GET http://metadata.google.internal/computeMetadata/v1/instance/name -H 'Metadata-Flavor: Google') && \
export ZONE=\$(curl -X GET http://metadata.google.internal/computeMetadata/v1/instance/zone -H 'Metadata-Flavor: Google') && \
gcloud --quiet compute instances delete \\\$NAME --zone=\\\$ZONE"
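# Escape the quotes in $run so it can be embedded inside the $func string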
run=${run//\"/\\\"}
func="echo \"start\" >> $LOG_PATH && function fail { echo \$1 >\&2; exit 1;};function retry { local n=1;local max=100;local delay=2; while true; do \"\$@\" && break || { if [[ \$n -lt \$max ]]; then ((n++)); echo \"Command failed. Attempt \$n/\$max:\"; sleep \$delay; else fail \"The command has failed after \$n attempts.\"; fi }; done; }; echo \"func\" >> $LOG_PATH && retry tmux new -d -s 2 \"$run\";"
echo "$func" > script.sh
retry tmux new -d -s 1 "bash script.sh 2>> $LOG_PATH" &>> $LOG_PATH
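With the startup script saved locally (e.g. as startup.sh), the template and a group of N workers can be created roughly as follows; the machine type, names, and `--size` are placeholder assumptions:

```bash
# Create the template from the custom image, attaching the startup
# script and a service-account scope that can edit the bucket
gcloud compute instance-templates create worker-template \
    --machine-type=n1-standard-16 \
    --image=worker-image \
    --metadata-from-file=startup-script=startup.sh \
    --scopes=cloud-platform

# Spin up N identical workers from the template
gcloud compute instance-groups managed create worker-group \
    --zone=us-central1-a \
    --template=worker-template \
    --size=64
```

Note that a managed instance group recreates deleted VMs to keep its target size, so the self-deletion at the end of the startup script pairs best with resizing the group down (or using an unmanaged group) once all tasks are locked.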
- I was unable to make the VM clone/pull from the GitHub repo at startup, and I do not know why ¯\\\_(ツ)\_/¯ Does anyone know how to make GitHub work?
- Since the startup script is run as `root`, it uses a different Python from the one I get when I SSH into the Compute Engine instance, so I simply specify the exact Python binary (`/opt/conda/bin/python`)
ELECTRA | Small | Base | Large |
---|---|---|---|
bsz | 128 | 256 | 2048 |
seq length | 128 | 512 | 512 |
steps | 1 M | 766 K | 400 K |
# ex. | 128 M | 196 M | 819 M |
# tokens | 16.4 B | 100 B | 419 B |
# tokens (actual) | 3.3 B | 33 B | 33 B |

(# ex. = bsz × steps, and # tokens = # ex. × seq length; e.g. for Base, 256 × 766 K ≈ 196 M examples and 196 M × 512 ≈ 100 B tokens.)
BERT | Small | Base | Large |
---|---|---|---|
# tokens (actual) | - | 3.3 B | 3.3 B |