This guide will walk you through the steps to get started with Levanter on TPU VMs.
First you need gcloud installed and configured. You can find instructions for that here, or, if you're a conda person, you can just run `conda install -c conda-forge google-cloud-sdk`.
Second, you need to follow some steps to enable Cloud TPU VM. You can follow Google's directions here but the gist of it is you need to enable the TPU API and the Compute Engine API. You can do this by running:
```bash
gcloud components install alpha
gcloud services enable tpu.googleapis.com
gcloud config set account your-email-account
gcloud config set project your-project
```
You can follow more steps there to get used to things like creating instances and such, but we'll only discuss the most important details here.
You may also need to create an SSH key and add it to your Google Cloud account. TODO
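If you don't already have a key, running any `gcloud ... ssh` command will offer to generate one for you, or you can create one yourself. A minimal sketch, assuming you want the default key path that the spin-up scripts expect:

```bash
# create a key at the default path gcloud uses for Compute Engine / TPU VM ssh
ssh-keygen -t rsa -f ~/.ssh/google_compute_engine -C "$(whoami)"
```

gcloud should propagate the public key to your project the first time you ssh (unless your org uses OS Login).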
An important thing to know about TPU VMs is that they are not a single machine (for more than a v3-8). Instead, they are a collection of workers that are all connected to the same TPU pod. Each worker manages a set of 8 TPUs. This means that you can't just run a single process on a TPU VM instance, you need to run a distributed process, and you can't just set up one machine, but a whole cluster. We have some scripts to help with this.
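For example, you can check how many workers your slice has (assuming `$NAME` and `$ZONE` are set to your instance's name and zone):

```bash
# each externalIp line corresponds to one worker; a v3-32 has 4 workers of 8 TPUs each
gcloud compute tpus tpu-vm describe $NAME --zone $ZONE | grep -c 'externalIp:'
```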
You can use `infra/spin-up-vm.sh` to create a TPU VM instance. In addition to creating the instance, it will set up the venv on each worker and clone the repo to `~/levanter/`.
For Public Users:
```bash
bash infra/spin-up-vm.sh <name> -z <zone> -t <type> [--preemptible]
```
Defaults are:
- `zone`: `us-east1-d`
- `type`: `v3-32`
- `preemptible`: `false`
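For example, to create a preemptible v3-32 in the default zone (the name here is just an example):

```bash
bash infra/spin-up-vm.sh my-tpu -z us-east1-d -t v3-32 --preemptible
```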
For Stanford CRFM Users:
Stanford CRFM folks can pass a different setup script to `infra/spin-up-vm.sh` to get our NFS automounted:

```bash
bash infra/spin-up-vm.sh <name> -z <zone> -t <type> [--preemptible] -s infra/setup-tpu-vm-nfs.sh
```

In addition to creating the instance, this will also mount the `/files/` NFS share on all workers, which has a good venv and a copy of the repo.
Notes:
- This uploads setup scripts via scp. If the ssh key that you used for Google Cloud requires a passphrase, or your ssh key path is not `~/.ssh/google_compute_engine`, you will need to modify the script.
- The command will spam you with a lot of output, sorry.
- If you use a preemptible instance, you probably want to use the "babysitting" script that automatically re-creates the VM. That's explained down below in the "Running Levanter GPT-2" section.
Here are some useful commands for interacting with the workers once the instance is up:

```bash
# ssh into worker 0
gcloud compute tpus tpu-vm ssh $name --zone us-east1-d --worker=0
# run a command on all workers
gcloud compute tpus tpu-vm ssh $name --zone us-east1-d --worker=all --command="echo hello"
# copy a file to all workers
gcloud compute tpus tpu-vm scp my_file $name:path/to/file --zone us-east1-d --worker=all
# copy a file to just worker 0
gcloud compute tpus tpu-vm scp my_file $name:path/to/file --zone us-east1-d --worker=0
# copy a file back from worker 0
gcloud compute tpus tpu-vm scp $name:path/to/file my_file --zone us-east1-d --worker=0
```
Now that you have a TPU VM instance, you can follow the [Running Levanter] steps, but here are a few shortcuts:
```bash
gcloud compute tpus tpu-vm ssh $NAME --zone $ZONE --worker=all --command 'WANDB_API_KEY=... levanter/infra/launch.sh python levanter/examples/gpt2_example.py --config_path levanter/config/gpt2_small.yaml --trainer.checkpointer.base_path gs://<somewhere>'
```
`launch.sh` will run the command in the background and redirect stdout and stderr to a log file in the home directory on each worker.
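To check on a backgrounded run, you can ssh into one worker and look at the log. The exact log filename is chosen by `launch.sh`, so this sketch just finds it by listing recently modified files in the home directory:

```bash
# find the most recently modified files (the log should be near the top)
gcloud compute tpus tpu-vm ssh $NAME --zone $ZONE --worker=0 --command 'ls -lt ~ | head'
# then follow it, substituting the filename you found above
gcloud compute tpus tpu-vm ssh $NAME --zone $ZONE --worker=0 --command 'tail -f ~/<logfile>'
```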
`run.sh`, shown below, instead writes output to the terminal; you should use tmux or something similar for long-running jobs with this version. It's mostly for debugging.
```bash
gcloud compute tpus tpu-vm ssh $NAME --zone $ZONE --worker=all --command 'WANDB_API_KEY=... levanter/infra/run.sh python levanter/examples/gpt2_example.py --config_path levanter/config/gpt2_small.yaml --trainer.checkpointer.base_path gs://<somewhere>'
```
If you are using a preemptible TPU VM, you probably want to use the "babysitting" script that automatically re-creates the VM. This is because preemptible instances can be preempted at any time and are always killed within 24 hours. The babysitting script handles both the creation of the node and the running of a job, relaunches the TPU VM if it gets preempted, and keeps running the command (and relaunching) until the command exits successfully.
Running in this mode is a bit more complex because you need to set a unique run id and (ideally unique) run name for your run, which would otherwise be generated for you by WandB.
You can run it like this:
```bash
infra/babysit-tpu-vm <name> -z <zone> -t <type> [--preemptible] -s infra/setup-tpu-vm-nfs.sh -- \
    WANDB_API_KEY=... levanter/infra/run.sh python levanter/examples/gpt2_example.py --config_path levanter/config/gpt2_small.yaml
```
That `--` is important! It separates the spin-up args from the running args. Also, you should never use `launch.sh` with `babysit`, because `launch.sh`'s nohup exits immediately with exit code 0, so babysit would think the job had finished successfully.
If you want to run your own config, we suggest you start from one of the existing configs. Then, if you're not using an NFS server or similar, you should upload your config to GCS:
```bash
gsutil cp my_config.yaml gs://my_bucket/my_config.yaml
```
Afterwards, you can use the config directly from the TPU VM instance, e.g.:
```bash
infra/babysit-tpu-vm <name> -z <zone> -t <type> [--preemptible] -s infra/setup-tpu-vm-nfs.sh -- \
    WANDB_API_KEY=... levanter/infra/run.sh python levanter/examples/gpt2_example.py --config_path gs://my_bucket/my_config.yaml \
    --trainer.wandb.id rrr --trainer.wandb.name zzz --trainer.checkpointer.base_path gs://path/to/checkpoints/
```
The `--config_path` argument can be a local path, a GCS path, or any URL loadable by fsspec. `--trainer.wandb.id` must be unique to use WandB, and `--trainer.wandb.name` is a human-readable name for the run. The name also determines where checkpoints are written (specifically `gs://path/to/checkpoints/${RUN_NAME}`), so you should probably use a unique name.
With this configuration (unless `trainer.load_checkpoint` is false), Levanter will automatically try to load the latest checkpoint if it exists.
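Since the run id must be unique but should stay the same across babysit relaunches, one simple approach is to generate it once up front and splice it into the command. A sketch (the id scheme here is arbitrary; pick whatever you like):

```bash
# generate the id once; babysit reuses the exact same command (and thus the same id) on every relaunch
RUN_ID="gpt2-small-$(date +%Y%m%d-%H%M%S)"
infra/babysit-tpu-vm <name> -z <zone> -t <type> [--preemptible] -s infra/setup-tpu-vm-nfs.sh -- \
    WANDB_API_KEY=... levanter/infra/run.sh python levanter/examples/gpt2_example.py \
    --config_path gs://my_bucket/my_config.yaml \
    --trainer.wandb.id $RUN_ID --trainer.wandb.name $RUN_ID \
    --trainer.checkpointer.base_path gs://path/to/checkpoints/
```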
Tokenizers are also loaded via fsspec, so you can use the same trick to load them from GCS if you have a custom tokenizer, or you can use an HF tokenizer.
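For example, you could upload a locally saved tokenizer to GCS and point your config at it (hypothetical bucket and paths; the exact tokenizer field name follows the example configs):

```bash
# upload the tokenizer files so every worker can read them via fsspec
gsutil cp -r my_tokenizer gs://my_bucket/my_tokenizer
```

and then set the tokenizer entry in your training config to `gs://my_bucket/my_tokenizer` (or to an HF hub name like `gpt2`).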
If you get a permission denied error on `/files`, you probably need to run `sudo chmod -R a+rw /files/whatever` on the TPU VM instance. This is because the TPU VM instance sets a different UID/GID for the user on each worker, so you need to make sure that the permissions are set correctly. These periodically get messed up. A umask would probably fix this. (TODO!)
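Since the ownership problem shows up independently on each worker, it's usually easiest to fix it everywhere at once, e.g.:

```bash
gcloud compute tpus tpu-vm ssh $NAME --zone $ZONE --worker=all --command 'sudo chmod -R a+rw /files/<wherever>'
```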
Git doesn't like doing operations in a directory that is owned by root or that has too funky of permissions. If you get a git error, you probably need to add a safe directory on your workers:
```bash
gcloud compute tpus tpu-vm ssh $NAME --zone $ZONE --worker=all --command 'git config --global --add safe.directory /files/<wherever>'
```
If you get the warning `No GPU/TPU found, falling back to CPU.`, then something else might be using the TPU, like a zombie python process. You can kill all python processes with:

```bash
gcloud compute tpus tpu-vm ssh $NAME --zone $ZONE --worker=all --command 'pkill python'
```
If that fails, you can try rebooting the TPU VM instance. This is harder than it sounds, because you can't just reboot the instance from the console or with a special command. You have to do it with ssh:
```bash
gcloud compute tpus tpu-vm ssh $NAME --zone $ZONE --worker=all --command 'sudo reboot'
```
and then you have to ctrl-c this after about 10 seconds. Otherwise, gcloud will think the command failed and will try again, and get stuck in a loop forever. (You can ctrl-c it at any point after 10 seconds.)
I (@dlwh) personally like to use pdsh instead of gcloud to run commands on all workers. It doesn't have the reboot issue, and it seems to work better for long-lived jobs and such. You can install it with `sudo apt-get install pdsh`.
You can then get the ips for your machines like so:
```bash
gcloud compute tpus tpu-vm describe --zone us-east1-d $name | awk '/externalIp: (.*)/ {print $2}' > my-hosts
```
Then you can run a command on all workers like so:
```bash
pdsh -R ssh -w ^my-hosts 'echo hello'
```
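Once the hosts file exists, the same sorts of commands shown above work through pdsh too, for example:

```bash
# kill stray python processes on every worker without going through gcloud
pdsh -R ssh -w ^my-hosts 'pkill python'
```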