A speedrun on consumer grade cards? #29

fzyzcjy · 2024-11-22T12:13:32Z

Hi, thanks for the great repo! I would appreciate a speedrun on consumer cards, e.g. the RTX 4090. Since the model has 125M params, it should fit in the RTX 4090's 24GB of memory with a conventional training setup, so it is trainable.

KellerJordan · 2024-11-25T01:11:06Z

A suggestion: to reduce memory use, you could run with a shorter sequence length.

fzyzcjy · 2024-11-25T01:54:07Z

I think so, thanks :) I'm just wondering whether there could be a speedrun like the current great one but focused on RTX 4090 time, because many more people have consumer-grade cards than H100s.

naoro · 2024-11-25T05:24:14Z

I think Google Colab speedruns would also be awesome.
That would greatly commoditize research and experimentation.

fzyzcjy · 2024-11-25T05:26:00Z

That looks interesting! (But I guess it may be too hard to get it to run in an acceptable time...)

alexjc · 2024-11-25T10:47:58Z

Realistically, a single-card speedrun would need a smaller model too; otherwise it's too slow to experiment.

Thinking:

n_layer = 8
n_embd = 512

The training sequence length was a variable in the last speedrun; for evaluation, would it be fine to use the whole document clamped to a maximum window size of 1024?
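To get a feel for how much smaller the proposed model would be, here is a rough parameter estimate. It is only a sketch: it assumes a standard GPT-2 style decoder with tied embeddings and a 4x MLP expansion, a vocabulary of 50257 tokens, and a 12-layer, 768-wide baseline for the 125M model. None of these are stated in the thread, and the function name `approx_gpt_params` is made up for illustration.

```python
def approx_gpt_params(n_layer: int, n_embd: int, vocab_size: int = 50257) -> int:
    """Rough parameter count for a GPT-2 style decoder with tied embeddings.

    Each block holds about 4*d^2 attention weights and 8*d^2 MLP weights
    (4x expansion). Biases, norms and position embeddings are ignored.
    """
    block_params = 12 * n_embd * n_embd
    embedding_params = vocab_size * n_embd
    return n_layer * block_params + embedding_params


# Assumed baseline shape (GPT-2 small) versus the shape proposed above.
baseline = approx_gpt_params(n_layer=12, n_embd=768)
proposed = approx_gpt_params(n_layer=8, n_embd=512)

print(f"baseline ~{baseline / 1e6:.0f}M params, proposed ~{proposed / 1e6:.0f}M params")
```

Under these assumptions the proposed shape comes out at well under half the baseline size, with the embedding table making up roughly half of the smaller model.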

fzyzcjy · 2024-11-25T11:07:30Z

It seems the H100 has about 2000 TFLOPS for bf16 tensor cores, while the 4090 has about 330 TFLOPS. Thus 8xH100 for 5 minutes ≈ 1x4090 for 4 hours, which doesn't look bad!

The major problem seems to be that there is only 24GB of memory... so we may not be able to apply some optimizations.
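The 4-hour estimate can be checked with a few lines of arithmetic. This sketch takes the peak TFLOPS figures exactly as quoted in the comment and assumes runtime scales inversely with peak throughput, which ignores memory bandwidth, interconnect and real utilization. The helper name `equivalent_minutes` is hypothetical.

```python
# Peak throughput figures as quoted in the comment (not independently verified).
H100_TFLOPS = 2000.0
RTX4090_TFLOPS = 330.0


def equivalent_minutes(src_gpus: int, src_tflops: float, src_minutes: float,
                       dst_gpus: int, dst_tflops: float) -> float:
    """Scale a runtime between setups, assuming time is inversely
    proportional to aggregate peak TFLOPS (a strong simplification)."""
    total_work = src_gpus * src_tflops * src_minutes
    return total_work / (dst_gpus * dst_tflops)


minutes = equivalent_minutes(8, H100_TFLOPS, 5.0, 1, RTX4090_TFLOPS)
print(f"1x4090 estimate: {minutes:.0f} min (~{minutes / 60:.1f} h)")
```

This is a ceiling-to-ceiling comparison only; a measured run can land well away from it in either direction.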

alexjc · 2024-11-25T11:11:17Z

4h feels quite high for a speed run. Too hard to test ideas, no?

Working on some memory optimizations now, should help a lot...

fzyzcjy · 2024-11-25T11:19:21Z

Surely faster would be great! But if impossible, then 4h is better than nothing :(

naoro · 2024-11-25T11:33:22Z

> It seems the H100 has about 2000 TFLOPS for bf16 tensor cores, while the 4090 has about 330 TFLOPS. Thus 8xH100 for 5 minutes ≈ 1x4090 for 4 hours, which doesn't look bad!
>
> The major problem seems to be that there is only 24GB of memory... so we may not be able to apply some optimizations.

The A100 has 40GB and costs about 10 USD for ~12 hours, with about the same TFLOPS as the 4090.
So about 3.3 USD per 4-hour run. Not bad, I'd think.

fzyzcjy · 2024-11-25T11:48:18Z

If we optimize for cost, the 4090 is much cheaper per hour than the A100 while having the same TFLOPS. So as long as we manage to fit in 24GB, maybe we can bring the cost down further.

KellerJordan · 2024-11-25T22:40:30Z

A note: The current cost per run on an 8xH100 is about $1.90 (since it's about $3/hr for SXM H100s)

Personally, when I don't feel like spending that much, I go back to speedrunning CIFAR-10. But I understand that might not be so interesting to everyone

fzyzcjy · 2024-11-26T00:01:49Z

It looks like a 4090 is about $0.3/hr, so 4hr = $1.2, which is a bit cheaper. Moreover, some people own 4090s at home (e.g. many people in r/LocalLlama, me, etc.), while fewer people seem to buy A100s/H100s for home use, and running a card you own is much cheaper than renting from a cloud.

lapp0 · 2024-11-26T00:31:37Z

I'm also interested in this variant.

Considering the long runtime, perhaps it makes sense to compete to minimize validation loss within a 1 hour run?

KellerJordan · 2024-11-26T03:23:10Z

I would guess that halving the sequence length (and going to batch size 16) will allow the run to fit into 24GB of memory without impacting performance very much. Or quartering it, if that still doesn't fit.

lapp0 · 2024-11-26T08:15:23Z

I achieved a validation loss < 3.28 in a little under two hours with a few tweaks.

@KellerJordan are you interested in hosting a 1x4090 variant of the competition in this repo? If so, I'll submit a PR for 4090/train_gpt2_4090.py and 4090/run_4090.sh and update the readme.

fzyzcjy · 2024-11-26T08:23:26Z

@lapp0 That looks great - $0.3/hr x 2hr = $0.6, which is about 3x cheaper than $1.9 (8xH100). Looking forward to your code!
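For reference, the per-run costs quoted so far can be put side by side. This is a sketch using only the prices mentioned in the thread; reading "$3/hr" as a per-GPU rate is an assumption, and `run_cost` is a hypothetical helper.

```python
def run_cost(hourly_rate_usd: float, hours: float, n_gpus: int = 1) -> float:
    """Rental cost of one run, assuming simple per-GPU hourly billing."""
    return hourly_rate_usd * hours * n_gpus


cost_4090_4h = run_cost(0.3, 4.0)           # the original 4-hour estimate
cost_4090_2h = run_cost(0.3, 2.0)           # the ~2-hour run reported above
cost_a100_4h = run_cost(10.0 / 12.0, 4.0)   # "10 USD for ~12 hours"

# The quoted $1.90 per 8xH100 run, with $3/hr read as a per-GPU rate
# (assumption), implies this wall-clock time per run:
h100_quoted_cost = 1.90
implied_minutes = h100_quoted_cost / (8 * 3.0) * 60

print(f"1x4090, 4h: ${cost_4090_4h:.2f}")
print(f"1x4090, 2h: ${cost_4090_2h:.2f}")
print(f"1xA100, 4h: ${cost_a100_4h:.2f}")
print(f"8xH100: ${h100_quoted_cost:.2f} (~{implied_minutes:.2f} min per run)")
```

Owned hardware changes the picture again, since only electricity is paid per run.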

KellerJordan · 2024-11-28T02:25:32Z

I'd prefer to allow experimentation on 4090s, but still time the final speedrun on 8xH100, like it is now. I'm happy to help with timing for 4090 runs that look promising. That way, the benchmark doesn't encourage techniques which are specific to 1x4090 (e.g., using a much smaller batch size).

lapp0 · 2024-11-29T03:23:59Z

Ah, without smaller batch size I get 2 hours 10 minutes.

https://gist.github.com/lapp0/2740a03a637ec926cf0eea90e541a0a6

The only changes necessary for a 130-minute run that is effectively identical to the 8xH100 run are:

* `batch_size: int = 16`
* `sequence_length: int = 32 * 1024`
* `y = flex_attention(..., kernel_options={"BLOCK_M": 64, "BLOCK_N": 64, "BLOCK_M1": 32, "BLOCK_N1": 64, "BLOCK_M2": 64, "BLOCK_N2": 32})`
* update `run.sh` with `--nproc_per_node=1`

You can improve runtime to 90 minutes, at the cost of deviating from the 8xH100 learning dynamics, by setting `batch_size` to 1 and tweaking the learning rates.

I'll submit a PR for this setting as an experimentation tool; however, I'm really interested in a 4090 competition variant. Please let me know if someone is interested in hosting, otherwise I might create a fork myself :)

fzyzcjy · 2024-11-29T03:47:24Z

> however I'm really interested in a 4090 competition variant. Please let me know if someone is interested in hosting, otherwise I might create a fork myself :)

I am also quite interested in it and happy to host it and see it being improved :)

banyan-god · 2024-11-30T18:51:49Z

514e8a53-74e0-4d77-a61e-53a416f3ec3a.txt
I was able to reproduce it on 4x 4090 in ~31 minutes:
step:1750/1750 val_loss:3.2783 train_time:1889410ms step_avg:1085.87ms
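A small sketch for pulling the numbers out of a log line like the one above. The regular expression is written against this single line only; it is not taken from the training script, and other log lines may have a different layout.

```python
import re

# The final log line exactly as posted above.
LOG_LINE = "step:1750/1750 val_loss:3.2783 train_time:1889410ms step_avg:1085.87ms"

# Pattern inferred from this one line (an assumption about the log format).
pattern = re.compile(
    r"step:(?P<step>\d+)/(?P<total>\d+) "
    r"val_loss:(?P<val_loss>[\d.]+) "
    r"train_time:(?P<train_ms>\d+)ms "
    r"step_avg:(?P<step_avg_ms>[\d.]+)ms"
)

match = pattern.match(LOG_LINE)
if match is None:
    raise ValueError("log line does not match the assumed format")

val_loss = float(match["val_loss"])
train_minutes = int(match["train_ms"]) / 60_000  # milliseconds to minutes

print(f"val_loss={val_loss}, train_time={train_minutes:.1f} min")
```

The parsed time agrees with the "~31 minutes" figure, and the loss is under the 3.28 mark mentioned earlier in the thread.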

banyan-god · 2024-12-04T00:03:00Z

> A note: The current cost per run on an 8xH100 is about $1.90 (since it's about $3/hr for SXM H100s)
>
> Personally, when I don't feel like spending that much, I go back to speedrunning CIFAR-10. But I understand that might not be so interesting to everyone

@KellerJordan where do you rent your 8xH100?

LakshyAAAgrawal · 2024-12-04T12:24:26Z

> Ah, without smaller batch size I get 2 hours 10 minutes.
>
> https://gist.github.com/lapp0/2740a03a637ec926cf0eea90e541a0a6
>
> The only changes necessary for a 130-minute run that is effectively identical to the 8xH100 run are:
>
> * `batch_size: int = 16`
> * `sequence_length: int = 32 * 1024`
> * `y = flex_attention(..., kernel_options={"BLOCK_M": 64, "BLOCK_N": 64, "BLOCK_M1": 32, "BLOCK_N1": 64, "BLOCK_M2": 64, "BLOCK_N2": 32})`
> * update `run.sh` with `--nproc_per_node=1`
>
> You can improve runtime to 90 minutes, at the cost of deviating from the 8xH100 learning dynamics, by setting `batch_size` to 1 and tweaking the learning rates.
>
> I'll submit a PR for this setting as an experimentation tool; however, I'm really interested in a 4090 competition variant. Please let me know if someone is interested in hosting, otherwise I might create a fork myself :)

Hey, can you please tell me whether two runs, one with bsz=8 and the other with bsz=16, are still comparable in terms of the number of tokens seen during training, everything else being fixed?

lapp0 · 2024-12-06T18:51:03Z

> Hey, can you please tell me whether two runs, one with bsz=8 and the other with bsz=16, are still comparable in terms of the number of tokens seen during training, everything else being fixed?

This is true if you scale the sequence length down by the same factor, e.g. these two runs are equivalent:

H100:

torchrun --standalone --nproc_per_node=8 train_gpt2.py \
    --train.batch_size 8 --train.sequence_length 65536  # default values

4090:

torchrun --standalone --nproc_per_node=1 train_gpt2.py --gpt.flex_kernel_consumer True \
    --train.batch_size 16 --train.sequence_length 32768

The only caveat is that each sequence in the batch splits one sample at its boundary, so halving the sequence length makes such splits twice as frequent.
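The equivalence is easy to verify by counting tokens per optimizer step. The sketch below assumes `batch_size` counts sequences across all devices, so tokens per step is simply `batch_size * sequence_length`; the 1750-step count is taken from the log line posted earlier in the thread, and `tokens_per_step` is a hypothetical helper.

```python
def tokens_per_step(batch_size: int, sequence_length: int) -> int:
    """Tokens consumed per optimizer step, assuming batch_size is the
    number of sequences summed over all devices (an assumption)."""
    return batch_size * sequence_length


h100_tokens = tokens_per_step(batch_size=8, sequence_length=65536)
rtx4090_tokens = tokens_per_step(batch_size=16, sequence_length=32768)

# Step count from the 4x4090 log posted earlier in the thread.
NUM_STEPS = 1750
total_tokens = h100_tokens * NUM_STEPS

print(f"H100: {h100_tokens} tokens/step, 4090: {rtx4090_tokens} tokens/step")
print(f"total over {NUM_STEPS} steps: {total_tokens:,} tokens")
```

Both settings see the same number of tokens per step, so a fixed step count gives the same token budget for either run.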

mosamdabhi · 2024-12-11T05:39:40Z

> A note: The current cost per run on an 8xH100 is about $1.90 (since it's about $3/hr for SXM H100s)
>
> Personally, when I don't feel like spending that much, I go back to speedrunning CIFAR-10. But I understand that might not be so interesting to everyone

@KellerJordan Have you thought about doing something similar for ImageNet-1K, to quickly run the same kinds of checks you are doing here but for vision?

lapp0 linked a pull request Nov 29, 2024 that will close this issue

Run on RTX 4090 #38

Open
