This guide aims to be a comprehensive reference on best practices for distributed training: diagnosing errors, fully utilizing all available resources, and answering questions such as:
- How do I update a single-GPU training/fine-tuning script to run on multiple GPUs or multiple nodes? (See the sketch after this list.)
- How do I diagnose hangs or errors that happen during training?
- My model/optimizer is too big for a single GPU - how do I train/fine-tune it on my cluster?
- How do I schedule/launch training on a cluster?
- How do I scale my hyperparameters when increasing the number of workers?
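As a preview of the first question above, here is a minimal sketch of what moving a training step onto multiple GPUs with PyTorch's `DistributedDataParallel` and `torchrun` can look like. It uses a toy linear model and random data purely for illustration; it is not the guide's actual `train_llm.py`.

```python
# Launch with (e.g. single node, 8 GPUs): torchrun --nproc_per_node=8 train.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for every worker process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # A toy model and random batches stand in for your real ones; with a real
    # dataset you would also give the DataLoader a DistributedSampler.
    model = torch.nn.Linear(128, 128).to(local_rank)
    model = DistributedDataParallel(model, device_ids=[local_rank])
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for _ in range(10):
        x = torch.randn(32, 128, device=local_rank)
        loss = model(x).square().mean()
        loss.backward()  # DDP all-reduces gradients across ranks here
        optimizer.step()
        optimizer.zero_grad()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```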
Best practices for logging to stdout/stderr and to wandb are also included, since logging is vital for diagnosing and debugging training runs on a cluster.
This guide is organized into sequential chapters, each containing a `README.md` and a `train_llm.py` script. The README discusses the changes introduced in that chapter and goes into more detail.
Each of the training scripts trains a causal language model (i.e. a GPT-style model).
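As a rough illustration of what those scripts build toward, a single causal-LM training step with the Hugging Face transformers API might look like the sketch below; the `gpt2` checkpoint and the toy batch are stand-ins rather than the guide's actual configuration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# "gpt2" is a small stand-in checkpoint, not necessarily what train_llm.py uses.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

batch = tokenizer(["Distributed training lets you scale out."], return_tensors="pt")
# For causal language modeling the labels are the input ids; the model shifts
# them internally so each position predicts the next token.
outputs = model(**batch, labels=batch["input_ids"])
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()
```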
```bash
git clone https://github.com/LambdaLabsML/distributed-training-guide.git
cd distributed-training-guide
python3 -m venv venv
source venv/bin/activate
python -m pip install -U pip
pip install -U setuptools wheel
pip install -r requirements.txt
```
This tutorial uses wandb as an experiment tracker.

```bash
wandb login
```
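For context, a typical wandb workflow inside a training script looks roughly like the sketch below; the project name and the synthetic loss values are placeholders, not this guide's actual configuration.

```python
import math
import wandb

# Placeholder project name and config; replace with your own.
wandb.init(project="distributed-training-guide", config={"lr": 1e-5})
for step in range(100):
    fake_loss = math.exp(-step / 50)  # stand-in for a real training loss
    wandb.log({"train/loss": fake_loss}, step=step)
wandb.finish()
```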