Replies: 2 comments
1 GPU training in FP32 - got very close numbers to the above. Had to reduce the batch size to 8 though because it ran out of memory.
---
A lot of changes went into llm.c recently, so here are the current launch commands.

llm.c:

make train_gpt2cu PRECISION=FP32
# ./train_gpt2cu \
mpirun -np 4 ./train_gpt2cu \
-i "dev/data/fineweb10B/fineweb_train_*.bin" \
-j "dev/data/fineweb10B/fineweb_val_*.bin" \
-v 250 -s 20000 -g 144 \
-h 1 \
-b 16 -t 1024 \
-d 524288 \
-r 0 \
-z 0 \
-c 0.1 \
-l 0.0006 \
-q 0.0 \
-u 700 \
-e "d12" PyTorch # python train_gpt2.py \
torchrun --standalone --nproc_per_node=4 train_gpt2.py \
--input_bin "dev/data/fineweb10B/fineweb_train_*.bin" \
--input_val_bin "dev/data/fineweb10B/fineweb_val_*.bin" \
--val_loss_every 250 \
--sample_every 0 \
--write_tensors 0 \
--model d12 \
--batch_size 16 \
--sequence_length 1024 \
--total_batch_size 524288 \
--dtype float32 \
--compile 0 \
--tensorcores 1 \
--flash 0 \
--num_iterations 18865 \
--weight_decay 0.1 \
--zero_stage 0 \
--learning_rate 0.0006 \
--warmup_iters 700 \
--learning_rate_decay_frac 0.0 \
  --overfit_single_batch 0

Both right now print something along the lines of:
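Separately, a quick sanity check of how the flags above fit together. This arithmetic is not from the original comment; it just decodes -b, -t, -d and --num_iterations, assuming the 4-process launch shown in the mpirun/torchrun lines:

```python
# Back-of-envelope check of the 4-process run above (values taken from the flags shown).
batch_size = 16            # -b / --batch_size (per GPU)
seq_len = 1024             # -t / --sequence_length
num_procs = 4              # mpirun -np 4 / --nproc_per_node=4
total_batch_size = 524288  # -d / --total_batch_size, tokens per optimizer step
num_iterations = 18865     # --num_iterations

tokens_per_forward = batch_size * seq_len * num_procs      # 65,536 tokens per micro-step
grad_accum_steps = total_batch_size // tokens_per_forward  # 8 micro-steps per update
total_tokens = total_batch_size * num_iterations           # ~9.89B tokens over the run

print(grad_accum_steps, f"{total_tokens / 1e9:.2f}B tokens")
```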
---
llm.c is starting to get to the point where we can start doing nice and serious "production" pretraining runs, i.e. actually a nice run. I wanted to start a thread where I will be comparing llm.c and PyTorch training, with a focus on CORRECTNESS (i.e. the two training runs matching), ignoring speed considerations in this thread.
So let's begin.
1 GPU training in FP32
In llm.c, launch the training run:
This is (pre)training on FineWeb, from scratch, GPT-2 (124M) (-e "d12"), batch size 16, seqlen 1024, total desired batch size 131072, so gradient accumulation to meet it is 8 (calculated internally), learning rate 1e-4, turning off warmup and learning rate decay (-u 0 -q 1.0), turning off weight decay (-c 0.0), 50 iterations, validate and generate every 200 steps for 144 steps (i.e. effectively off for a 50-iteration run).

Equivalent run in PyTorch:
In case you don't want to preprocess all of fineweb10B, you can also run tinystories; just pass in "dev/data/tinystories/TinyStories_train.bin" for the -i, -j, and --input_bin flags instead. You'll get different numbers of course, but you should be able to see the two match.

The repository master is currently at this commit.
This is on an A100 40GB. So here is what we get. TL;DR: the two match remarkably closely, within ~1e-3 or less in loss and gradient norm, after 50 iterations of training:

LEFT: llm.c. RIGHT: PyTorch

This gives nice confidence for correctness and the match to PyTorch!
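If you want to repeat the comparison, the "within ~1e-3" claim is easy to check mechanically once you log the per-step losses from both runs. A minimal sketch, with hypothetical placeholder values standing in for the real logs:

```python
# Compare per-step training losses from the two runs (values below are hypothetical).
llmc_losses  = [10.9847, 10.6372, 10.3041]   # parsed from llm.c stdout
torch_losses = [10.9851, 10.6369, 10.3045]   # parsed from PyTorch stdout

max_diff = max(abs(a - b) for a, b in zip(llmc_losses, torch_losses))
print(f"max |loss difference| over {len(llmc_losses)} steps: {max_diff:.2e}")
assert max_diff < 1e-3, "runs diverge beyond the expected tolerance"
```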
1 GPU training in BF16
To run in bfloat16, make the following adjustments. Compile llm.c with

make train_gpt2cu USE_CUDNN=1

instead, and on the PyTorch side use --dtype bfloat16 --flash 1 to use the dtype and to turn on flash attention.
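Conceptually, those two PyTorch flags amount to autocasting the forward pass to bfloat16 and routing attention through the fused scaled-dot-product kernel. A minimal sketch of the idea (not the actual train_gpt2.py code):

```python
import torch
import torch.nn.functional as F

# Sketch of what --dtype bfloat16 and --flash 1 correspond to conceptually:
# run the forward pass under bfloat16 autocast, and use the fused
# scaled-dot-product attention kernel instead of a hand-rolled softmax.
q = torch.randn(1, 12, 1024, 64, device="cuda")  # (batch, heads, seq, head_dim)
k = torch.randn(1, 12, 1024, 64, device="cuda")
v = torch.randn(1, 12, 1024, 64, device="cuda")

with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    y = F.scaled_dot_product_attention(q, k, v, is_causal=True)

print(y.shape)  # torch.Size([1, 12, 1024, 64])
```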
multiGPU

If you have 4 GPUs, you can now achieve the exact same results in llm.c as follows:
I've added -z 1 to use ZeRO-1 (optimizer sharding), which saves GPU memory. This is optional and you can take it out. But basically, because the total batch size remains identical, llm.c will simply decrease the number of gradient accumulation steps; everything else stays the same, it just goes faster.

In PyTorch, we'd now want to launch as follows:
But unfortunately this does not work and produces different results. I am tracking this down now, but I just wanted to post a log of where things are for now. In any case, encouraging!!
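To make the gradient accumulation point above concrete, here is the relation the post describes, as a back-of-envelope sketch (not code from the repo), using the FP32 run's numbers:

```python
# total_batch_size (tokens per optimizer step) is held fixed; llm.c derives the
# gradient accumulation steps from it, the micro-batch size, and the GPU count.
def grad_accum_steps(total_batch_size, batch_size, seq_len, num_gpus):
    tokens_per_micro_step = batch_size * seq_len * num_gpus
    assert total_batch_size % tokens_per_micro_step == 0
    return total_batch_size // tokens_per_micro_step

# FP32 run above: total batch 131072 tokens, -b 16, -t 1024
print(grad_accum_steps(131072, 16, 1024, num_gpus=1))  # 8, as stated in the post
print(grad_accum_steps(131072, 16, 1024, num_gpus=4))  # 2, fewer accumulation steps on 4 GPUs
```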