Replies: 2 comments
1 GPU training in FP32 - got very close numbers to the above. Had to reduce the batch size to 8 though because it ran out of memory.
---
A lot of changes went into llm.c recently, so here are the current launch commands.

llm.c:

make train_gpt2cu PRECISION=FP32
# ./train_gpt2cu \
mpirun -np 4 ./train_gpt2cu \
-i "dev/data/fineweb10B/fineweb_train_*.bin" \
-j "dev/data/fineweb10B/fineweb_val_*.bin" \
-v 250 -s 20000 -g 144 \
-h 1 \
-b 16 -t 1024 \
-d 524288 \
-r 0 \
-z 0 \
-c 0.1 \
-l 0.0006 \
-q 0.0 \
-u 700 \
-e "d12" PyTorch # python train_gpt2.py \
torchrun --standalone --nproc_per_node=4 train_gpt2.py \
--input_bin "dev/data/fineweb10B/fineweb_train_*.bin" \
--input_val_bin "dev/data/fineweb10B/fineweb_val_*.bin" \
--val_loss_every 250 \
--sample_every 0 \
--write_tensors 0 \
--model d12 \
--batch_size 16 \
--sequence_length 1024 \
--total_batch_size 524288 \
--dtype float32 \
--compile 0 \
--tensorcores 1 \
--flash 0 \
--num_iterations 18865 \
--weight_decay 0.1 \
--zero_stage 0 \
--learning_rate 0.0006 \
--warmup_iters 700 \
--learning_rate_decay_frac 0.0 \
  --overfit_single_batch 0

Both right now print something along the lines of:
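Separately, a quick sanity check of how the flags above fit together. This arithmetic is not from the original comment; it just decodes -b, -t, -d and --num_iterations, assuming the 4-process launch shown in the mpirun/torchrun lines:

```python
# Back-of-envelope check of the 4-process run above (values taken from the flags shown).
batch_size = 16            # -b / --batch_size (per GPU)
seq_len = 1024             # -t / --sequence_length
num_procs = 4              # mpirun -np 4 / --nproc_per_node=4
total_batch_size = 524288  # -d / --total_batch_size, tokens per optimizer step
num_iterations = 18865     # --num_iterations

tokens_per_forward = batch_size * seq_len * num_procs      # 65,536 tokens per micro-step
grad_accum_steps = total_batch_size // tokens_per_forward  # 8 micro-steps per update
total_tokens = total_batch_size * num_iterations           # ~9.89B tokens over the run

print(grad_accum_steps, f"{total_tokens / 1e9:.2f}B tokens")
```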
---
llm.c is starting to get to the point where we can start doing nice and serious "production" pretraining runs, i.e. actually a nice run. I wanted to start a thread where I will be comparing llm.c and PyTorch training, with a focus on CORRECTNESS (i.e. the two training runs matching), ignoring speed considerations in this thread.
So let's begin.
1 GPU training in FP32
In llm.c, launch the training run:
This is (pre)training on FineWeb, from scratch, GPT-2 (124M) (-e "d12"), batch size 16, seqlen 1024, total desired batch size 131072, so gradient accumulation to meet it is 8 (calculated internally), learning rate 1e-4, turning off warmup and learning rate decay (-u 0 -q 1.0), turning off weight decay (-c 0.0), 50 iterations, validate and generate every 200 steps for 144 steps (i.e. effectively off for a 50-iteration run).

Equivalent run in PyTorch:
In case you don't want to preprocess all of fineweb10B, you can also run tinystories; just pass in "dev/data/tinystories/TinyStories_train.bin" for the -i, -j, and --input_bin flags instead. You'll get different numbers of course, but you should be able to see the two match.

The repository master is currently at this commit.
This is on an A100 40GB. So here is what we get. TL;DR: the two match remarkably closely, within ~1e-3 or less in loss and gradient norm, after 50 iterations of training:

LEFT: llm.c. RIGHT: PyTorch

This gives nice confidence for correctness and the match to PyTorch!
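If you want to repeat the comparison, the "within ~1e-3" claim is easy to check mechanically once you log the per-step losses from both runs. A minimal sketch, with hypothetical placeholder values standing in for the real logs:

```python
# Compare per-step training losses from the two runs (values below are hypothetical).
llmc_losses  = [10.9847, 10.6372, 10.3041]   # parsed from llm.c stdout
torch_losses = [10.9851, 10.6369, 10.3045]   # parsed from PyTorch stdout

max_diff = max(abs(a - b) for a, b in zip(llmc_losses, torch_losses))
print(f"max |loss difference| over {len(llmc_losses)} steps: {max_diff:.2e}")
assert max_diff < 1e-3, "runs diverge beyond the expected tolerance"
```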
1 GPU training in BF16
To run in bfloat16, make the following adjustments. Compile llm.c with

make train_gpt2cu USE_CUDNN=1

instead, and on the PyTorch side use --dtype bfloat16 --flash 1 to use the dtype and to turn on flash attention.
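Conceptually, those two PyTorch flags amount to autocasting the forward pass to bfloat16 and routing attention through the fused scaled-dot-product kernel. A minimal sketch of the idea (not the actual train_gpt2.py code):

```python
import torch
import torch.nn.functional as F

# Sketch of what --dtype bfloat16 and --flash 1 correspond to conceptually:
# run the forward pass under bfloat16 autocast, and use the fused
# scaled-dot-product attention kernel instead of a hand-rolled softmax.
q = torch.randn(1, 12, 1024, 64, device="cuda")  # (batch, heads, seq, head_dim)
k = torch.randn(1, 12, 1024, 64, device="cuda")
v = torch.randn(1, 12, 1024, 64, device="cuda")

with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    y = F.scaled_dot_product_attention(q, k, v, is_causal=True)

print(y.shape)  # torch.Size([1, 12, 1024, 64])
```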
multiGPU

If you have 4 GPUs, you can now achieve the exact same results in llm.c as follows:
I've added -z 1 to use ZeRO-1 (optimizer sharding), which saves GPU memory. This is optional and you can take it out. But basically, because the total batch size remains identical, llm.c will simply decrease the number of gradient accumulation steps; everything else stays the same, it just goes faster.

In PyTorch, we'd now want to launch as follows:
But unfortunately this does not work and produces different results. I am tracking this down now, but I just wanted to post a log of where things are for now. In any case, encouraging!!
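To make the gradient accumulation point above concrete, here is the relation the post describes, as a back-of-envelope sketch (not code from the repo), using the FP32 run's numbers:

```python
# total_batch_size (tokens per optimizer step) is held fixed; llm.c derives the
# gradient accumulation steps from it, the micro-batch size, and the GPU count.
def grad_accum_steps(total_batch_size, batch_size, seq_len, num_gpus):
    tokens_per_micro_step = batch_size * seq_len * num_gpus
    assert total_batch_size % tokens_per_micro_step == 0
    return total_batch_size // tokens_per_micro_step

# FP32 run above: total batch 131072 tokens, -b 16, -t 1024
print(grad_accum_steps(131072, 16, 1024, num_gpus=1))  # 8, as stated in the post
print(grad_accum_steps(131072, 16, 1024, num_gpus=4))  # 2, fewer accumulation steps on 4 GPUs
```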