Run on RTX 4090 #38

lapp0 · 2024-11-29T08:12:58Z

Changes to make the code run on RTX 4090 / 3090.

Fixes #29

Runs in 2 hours 3 minutes, Runs range from 3.275 to 3.285. This finished at 3.2817,

These settings are intended to replicate the training dynamics on 8xH100, not to be optimal. This is accomplished by halving the sequence length and doubling the batch size. You can train in ~90 minutes by setting batch_size of 1.

#38 (comment)

KellerJordan · 2024-11-30T01:31:08Z

Thanks for your work. I will have to think about this and understand the options before merging this.

One thing I would merge immediately would be a reproducible log generated by this. It can be put in the records folder and linked to in the README, to help people who wanna run on 4090. I might prefer that, since it would avoid adding options to the code.

lapp0 · 2024-11-30T16:55:33Z

Perhaps a reasonable alternative is

1. Expose hyperparameters as command arguments via argparse
1. Document the command which enables 4090 trains

vgoklani · 2024-12-02T16:15:56Z

@lapp0 Hey what does do? Thanks!

        flex_kernel_options={
            "BLOCK_M": 64, "BLOCK_N": 64,  # forward
            "BLOCK_M1": 32, "BLOCK_N1": 64, "BLOCK_M2": 64, "BLOCK_N2": 32  # backwards
        }

Presumably you are passing values to the Triton kernels, but how did you come up with those values? Thanks!

lapp0 · 2024-12-02T20:43:46Z

@vgoklani I stole the values from pytorch/pytorch#133254 (comment)

banyan-god · 2024-12-02T22:55:10Z

Perhaps a reasonable alternative is

* 1. Expose hyperparameters as command arguments via argparse

* 2. Document the command which enables 4090 trains

definitely prefer this over how it is implemented now, especially if you want to use multiple 4090 and lets say we want to keep the context length same but change the batch size instead .

lapp0 · 2024-12-04T18:41:58Z

Updated code and docs. You can now specify GPTConfig and Hyperparameters as command line arguments, e.g.

torchrun --standalone --nproc_per_node=4 train_gpt2.py \
    --gpt.flex_kernel_consumer True --train.sequence_length 32768 --train.batch_size 16

Rendered Docs

Unrelated: added n_intermediate to GPTConfig which defaults to n_embd * 4

KellerJordan · 2024-12-05T07:28:47Z

I linked to this from the README. I don't think I'll merge it because I don't want to add any arguments to train_gpt2.py.

lapp0 · 2024-12-05T18:40:06Z

Perhaps we could have train_gpt2.py be

def train(model_config: GPTConfig, args: Hyperparameters):
    ...
    
if __name__ == "__main__":
    train(GPTConfig(), Hyperparameters())

This is equivalent to the current code in master except it puts the trainer in a function, parameterized by the model config and training arguments, cleaning the code and making it easier to employ hyperparameter search tools.

Then we can create train_gpt2_cli.py which imports train from train_gpt2.py and handles arg parsing.

vak · 2024-12-13T12:08:45Z

@lapp0 many thanks for your fork!

I've started your fork successfully on a single RTX 4090 with minor version upgrades and a tiny fix in log-output writing:

➜  modded-nanogpt git:(pr-38) ✗ git diff
diff --git a/Dockerfile b/Dockerfile
index 4ca8600..57fcced 100644
--- a/Dockerfile
+++ b/Dockerfile
@@ -1,7 +1,7 @@
-FROM nvidia/cuda:12.6.2-cudnn-devel-ubuntu24.04
+FROM nvidia/cuda:12.6.3-cudnn-devel-ubuntu24.04
 
 ENV DEBIAN_FRONTEND=noninteractive
-ENV PYTHON_VERSION=3.12.7
+ENV PYTHON_VERSION=3.12.8
 ENV PATH=/usr/local/bin:$PATH
 
 RUN apt update && apt install -y --no-install-recommends build-essential libssl-dev zlib1g-dev \
@@ -27,7 +27,7 @@ WORKDIR /modded-nanogpt
 RUN python -m pip install --upgrade pip && \
     pip install -r requirements.txt
 
-RUN pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu124 --upgrade
+RUN pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu126 --upgrade
 
 CMD ["bash"]
 ENTRYPOINT []
diff --git a/run.sh b/run.sh
index 7bb09f4..04e486d 100755
--- a/run.sh
+++ b/run.sh
@@ -1 +1 @@
-torchrun --standalone --nproc_per_node=8 train_gpt2.py
+torchrun --standalone --nproc_per_node=1 train_gpt2.py
diff --git a/train_gpt2.py b/train_gpt2.py
index 6f7f74e..35b1d87 100644
--- a/train_gpt2.py
+++ b/train_gpt2.py
@@ -437,7 +437,7 @@ def train(model_config: GPTConfig, args: Hyperparameters):
             with open(logfile, "a") as f:
                 if not logonly:
                     print(s)
-                f.write(s+'\n')
+                f.write(str(s)+'\n')
     # log information about the hardware/software environment this is running on
     # and print the full `nvidia-smi` to file
     print0(f"Running pytorch {torch.version.__version__} compiled for CUDA {torch.version.cuda}\nnvidia-smi:")

However I am running out of GPU memory. Probably because I use single GPU setup?

If you have an idea what parameters to tweak to fit in a single RTX 4090 24GB, please let me know,
thank you!

lapp0 · 2024-12-16T07:39:23Z

@vak you can always decrease sequence length further and increase batch size in the same ratio.

lapp0 · 2025-01-04T19:35:55Z

With torch==2.6.0.dev20241231 you don't need to specify the kernel_options. Simply change the halve and double the sequence length and batch size respectively.

BradKML · 2025-01-24T08:53:00Z

@vak and @lapp0 now I am curious how and where to do this (in the case of running these models on a laptop GPU), it is still those two places (halving sequence length, doubling batch size)?

KellerJordan added 30 commits October 21, 2024 04:14

.

3fc7613

Merge branch 'master' of https://github.com/KellerJordan/modded-nanogpt

179023e

Update README.md

df1bc7a

.

5280949

Merge branch 'master' of https://github.com/KellerJordan/modded-nanogpt

48fdba5

Update README.md

d9249c6

Update README.md

35ddcd8

Update README.md

8c8e223

Update README.md

6143123

Update README.md

948c146

Update README.md

1bb361b

.

f1311a6

.

e5d0f4c

.

5984843

.

fc41694

.

a7a064a

.

f0bc125

.

3b74220

.

ffa1868

Update README.md

3ba77ea

Update README.md

0db7cb4

Update README.md

4e83467

.

e319979

.

c24926d

.

32a8bbf

.

7843fc6

.

ceff3ea

.

6c7773a

Update README.md

48d9509

Update README.md

90964a9

KellerJordan and others added 10 commits November 25, 2024 23:08

.

5384d30

.

beb8368

.

2b24502

Update README.md

b51050a

Merge branch 'master' of https://github.com/KellerJordan/modded-nanogpt

013a194

.

341c860

.

c857b1a

.

b9e4d52

Update train_gpt2.py

9e35b93

run on RTX 4090

ab5d1a2

expose training and model parameters as command line arguments

a77f7eb

lapp0 added 2 commits December 4, 2024 13:44

update incorrect batch size for 4090 in docs

2a0d1ba

print configs once

0b0ca81

lapp0 closed this Dec 16, 2024

lapp0 reopened this Dec 16, 2024

KellerJordan force-pushed the master branch from 04855dc to 5fb62cc Compare January 4, 2025 09:39

kschmid mentioned this pull request Jan 4, 2025

Invalid device ordinal #61

Closed

lapp0 closed this Jan 4, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Run on RTX 4090 #38

Run on RTX 4090 #38

lapp0 commented Nov 29, 2024 •

edited

Loading

KellerJordan commented Nov 30, 2024 •

edited

Loading

lapp0 commented Nov 30, 2024

vgoklani commented Dec 2, 2024 •

edited

Loading

lapp0 commented Dec 2, 2024

banyan-god commented Dec 2, 2024

lapp0 commented Dec 4, 2024 •

edited

Loading

KellerJordan commented Dec 5, 2024

lapp0 commented Dec 5, 2024 •

edited

Loading

vak commented Dec 13, 2024

lapp0 commented Dec 16, 2024

lapp0 commented Jan 4, 2025

BradKML commented Jan 24, 2025 •

edited

Loading

Run on RTX 4090 #38

Run on RTX 4090 #38

Conversation

lapp0 commented Nov 29, 2024 • edited Loading

KellerJordan commented Nov 30, 2024 • edited Loading

lapp0 commented Nov 30, 2024

vgoklani commented Dec 2, 2024 • edited Loading

lapp0 commented Dec 2, 2024

banyan-god commented Dec 2, 2024

lapp0 commented Dec 4, 2024 • edited Loading

KellerJordan commented Dec 5, 2024

lapp0 commented Dec 5, 2024 • edited Loading

vak commented Dec 13, 2024

lapp0 commented Dec 16, 2024

lapp0 commented Jan 4, 2025

BradKML commented Jan 24, 2025 • edited Loading

lapp0 commented Nov 29, 2024 •

edited

Loading

KellerJordan commented Nov 30, 2024 •

edited

Loading

vgoklani commented Dec 2, 2024 •

edited

Loading

lapp0 commented Dec 4, 2024 •

edited

Loading

lapp0 commented Dec 5, 2024 •

edited

Loading

BradKML commented Jan 24, 2025 •

edited

Loading