[LLM] llm.c training for GPT 2 #3611
This is awesome @Michaelvll! 🔥 Left some comments
After the data is processed, you can then train the model on a GPU VM with 8 A100 GPUs (replace `your-bucket-name` with your bucket name):

```bash
sky launch -c gpt2-train --detach-setup gpt2-train.yaml --env BUCKET_NAME=your-bucket-name
```
Since cost + time seems like a big motivation behind this work ("Reproducing GPT-2 (124M) in llm.c in 90 minutes for $20"), should we mention that here? Perhaps we can show the optimizer output?
Good point. Added the comparison in the sentence. How does it look to you?
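For context on the cost and time figures quoted in this thread, here is a rough back-of-the-envelope sketch. It assumes the "90 minutes for $20" figure refers to a single VM with 8 A100 GPUs, as in the README sentence under review; the variable names are illustrative, and real prices vary by cloud, region, and spot availability.

```bash
#!/usr/bin/env bash
# Figures quoted in the review comment: 90 minutes of training for $20.
total_cost_usd=20
runtime_min=90
# Assumption: the run uses one VM with 8 A100 GPUs, as in the README text.
num_gpus=8

# Implied hourly price of the whole 8-GPU VM: cost / (minutes / 60).
node_hourly=$(awk -v c="$total_cost_usd" -v m="$runtime_min" \
  'BEGIN { printf "%.2f", c / (m / 60) }')

# Implied hourly price per GPU.
gpu_hourly=$(awk -v n="$node_hourly" -v g="$num_gpus" \
  'BEGIN { printf "%.2f", n / g }')

echo "node: \$${node_hourly}/h, per GPU: \$${gpu_hourly}/h"
```

Comparing these implied rates with the hourly prices shown by the SkyPilot optimizer at launch time gives a quick sanity check on whether a given cloud can match the quoted budget.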
llm/gpt-2/README.md
Outdated
```bash
wget https://github.com/skypilot-org/skypilot/blob/master/llm/gpt-2/gpt2-train.yaml
```

## Data processing
nit: I like the fact that we have different YAMLs for pre-processing and training, but one concern is that it may alienate Lambda-only, Azure-only, or FluidStack-only users, since they won't have any cloud object store access to write preprocessed data to.
If it's not too complicated, can we add a one-shot YAML that does all pre-processing and training in a single file?
Good point! Just separated it into two different sections: one for a combined YAML, another for the pipeline. Wdyt?
```bash
sky launch -c gpt2-train --detach-setup gpt2-train.yaml --gpu A100 --env BUCKET_NAME=your-bucket-name
```
Co-authored-by: Romil Bhardwaj <[email protected]>
Thanks @Michaelvll!
This PR adds a reproducible task YAML for karpathy/llm.c#481.
Blocked by skypilot-org/skypilot-catalog#71 and #3610, as the g++ version and C++ libraries are more recent on the Ubuntu distribution on GCP. We now use a Docker image to solve the g++ version issue.
Future TODO:
Tested (run the relevant ones):
bash format.sh
sky launch -c gpt2 llm/gpt-2/gpt2-train.yaml --use-spot --env BUCKET_NAME=gpt2-data-skypilot
sky launch --gpus A10g llm/gpt-2/gpt2-train.yaml --env BUCKET_NAME=gpt2-data-skypilot
pytest tests/test_smoke.py
pytest tests/test_smoke.py::test_fill_in_the_name
conda deactivate; bash -i tests/backward_compatibility_tests.sh