[LLM] llm.c training for GPT 2 #3611

Merged
merged 32 commits into from
May 31, 2024

@Michaelvll Michaelvll commented May 29, 2024

This PR adds a reproducible task YAML for karpathy/llm.c#481.

Blocked by skypilot-org/skypilot-catalog#71 and #3610, because the g++ version and C++ libraries are more recent on the Ubuntu distribution on GCP.

We now use a Docker image to work around the g++ version issue.

Future TODO:

  • Make bucket more general
  • Add to AI gallery
  • Fix the disconnection issue with the Ubuntu image (switched to a Docker image instead)

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
    • Only keep GCP: sky launch -c gpt2 llm/gpt-2/gpt2-train.yaml --use-spot --env BUCKET_NAME=gpt2-data-skypilot
    • Only keep lambda: sky launch -c gpt2 llm/gpt-2/gpt2-train.yaml --use-spot --env BUCKET_NAME=gpt2-data-skypilot
    • Only keep aws: sky launch --gpus A10g llm/gpt-2/gpt2-train.yaml --env BUCKET_NAME=gpt2-data-skypilot
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: conda deactivate; bash -i tests/backward_compatibility_tests.sh

@romilbhardwaj romilbhardwaj left a comment

This is awesome @Michaelvll! 🔥 Left some comments

llm/gpt-2/README.md (outdated, resolved)
llm/gpt-2/README.md (outdated, resolved)
llm/gpt-2/gpt2-train.yaml (outdated, resolved)
llm/gpt-2/gpt2-train.yaml (resolved)
llm/gpt-2/gpt2-data.yaml (outdated, resolved)
llm/gpt-2/gpt2-train.yaml (outdated, resolved)

After the data is processed, you can then train the model on a GPU VM with 8 A100 GPUs (replace `your-bucket-name` with your bucket name):

```bash
sky launch -c gpt2-train --detach-setup gpt2-train.yaml --env BUCKET_NAME=your-bucket-name
```

Collaborator

Since cost + time seems like a big motivation behind this work ("Reproducing GPT-2 (124M) in llm.c in 90 minutes for $20"), should we mention that here? Perhaps we can show the optimizer output?

Collaborator Author

Good point. Added the comparison in the sentence. How does it look to you?

```bash
wget https://raw.githubusercontent.com/skypilot-org/skypilot/master/llm/gpt-2/gpt2-train.yaml
```

## Data processing

Collaborator

nit: I like the fact that we have different YAMLs for pre-processing and training, but one concern is it may alienate lambda-only, azure-only or fluidstack-only users since they won't have any cloud object store access to write preprocessed data to.

If it's not too complicated, can we add a one-shot YAML that does all pre-processing and training in a single YAML?

Collaborator Author

Good point! I just separated it into two sections: one for a combined YAML and another for the pipeline. Wdyt?
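
For the pipeline variant discussed in this thread, chaining the two task YAMLs could look roughly like the sketch below. It is an illustration, not the README's actual text: it assumes `gpt2-data.yaml` accepts the same `BUCKET_NAME` environment variable as `gpt2-train.yaml`, and the `run` helper, the `DRY_RUN` variable, the `gpt2_pipeline` function, and the `gpt2-data` cluster name are hypothetical names introduced here. By default it only prints the commands.

```bash
#!/usr/bin/env bash
# Sketch of the two-stage pipeline: preprocess into a bucket, then train from it.
set -euo pipefail

# Hypothetical helper: prints the command unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

gpt2_pipeline() {
  local bucket="$1"
  # Stage 1: data processing. Assumes gpt2-data.yaml reads BUCKET_NAME;
  # the cluster name gpt2-data is an assumption.
  run sky launch -y -c gpt2-data gpt2-data.yaml --env BUCKET_NAME="$bucket"
  # Stage 2: training, mirroring the command quoted from the README diff.
  run sky launch -y -c gpt2-train --detach-setup gpt2-train.yaml --env BUCKET_NAME="$bucket"
}
```

Calling `gpt2_pipeline your-bucket-name` prints both `sky launch` commands; setting `DRY_RUN=0` would execute them in order, so training starts only after the data step returns successfully.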

```bash
sky launch -c gpt2-train --detach-setup gpt2-train.yaml --gpus A100 --env BUCKET_NAME=your-bucket-name
```

Collaborator

nit: if we have graphs similar to karpathy's, it will be nice to put them here :) No worries if not.

image

Collaborator Author

Added the loss for training. The eval figure requires an additional dependency, which I unfortunately did not install before the current training run.

@Michaelvll Michaelvll requested a review from romilbhardwaj May 31, 2024 01:25

@romilbhardwaj romilbhardwaj left a comment

Thanks @Michaelvll!

@Michaelvll Michaelvll merged commit ea5aa50 into master May 31, 2024
20 checks passed
@Michaelvll Michaelvll deleted the gpt-2 branch May 31, 2024 07:21