
Deleting Conda/Python as a dependency entirely to dramatically decrease "latency to step" #482

karpathy opened this issue May 28, 2024 · 4 comments


@karpathy (Owner) commented May 28, 2024

Following up on this tweet; copy-pasting it here and creating an Issue as a TODO.

"""
The thing that makes this a bit complicated right now is the start latency. What bloats up the setup time is the dataset and its tokenization, which is all done in Python right now. Installing huggingface datasets, downloading FineWeb 10B, and tokenizing it currently takes ~1 hr. I think I have to look into precomputing all of this and just saving the final .bin files (20GB) of tokens somewhere (S3 or so?). You could imagine fetching data shards asynchronously while the training starts. This would completely eliminate any Python dependency.

The next slightly annoying thing is cuDNN, which is a 2GB download and installation just to get the flash attention kernel, and it takes ~1.5 minutes to compile. But NVIDIA reached out and mentioned they are trying to bring this down a lot.

In principle, the code should compile and run roughly instantaneously.
"""

TLDR I think I'll pre-tokenize FineWeb100B with the GPT-2 tokenizer, zip up the .bin shards, and put them up somewhere (e.g. S3?). Then we could just download, unzip, and directly train without any Python involvement at all.
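As a rough illustration of the "fetch shards asynchronously while training" idea, a minimal Python sketch (in practice this could just be a curl/wget loop in the run script, to keep Python out entirely). The URL prefix and filename scheme are placeholders, not a real bucket layout:

```python
import os
import threading
import urllib.request

BASE_URL = "https://example-bucket.s3.amazonaws.com/fineweb100B"  # placeholder bucket
LOCAL_DIR = "dev/data/fineweb100B"                                # placeholder directory

def prefetch_shard(idx):
    # download shard idx in the background if it is not already on disk
    name = f"fineweb_train_{idx:06d}.bin"  # hypothetical naming scheme
    dest = os.path.join(LOCAL_DIR, name)
    if not os.path.exists(dest):
        urllib.request.urlretrieve(f"{BASE_URL}/{name}", dest)

os.makedirs(LOCAL_DIR, exist_ok=True)
# kick off the download of the next shard while the current one trains
threading.Thread(target=prefetch_shard, args=(1,), daemon=True).start()
```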

TODO think through a bit.

@karpathy (Owner, Author) commented:

FineWeb100B is 1010 files total; these are raw .bin shards of 100M tokens each.

  • Each is 191MB raw
  • Zipped, each is 134MB

134MB * 1010 files = 135,340MB ≈ 135GB
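(As a sanity check, 191MB per shard is consistent with storing 100M tokens as raw uint16 ids, which the GPT-2 vocab fits into: 100e6 tokens * 2 bytes ≈ 190.7MiB. A minimal sketch of reading one shard under that raw-uint16 assumption; the filename is a placeholder, and if the shards carry a header the read would need an offset:)

```python
import numpy as np

# assumes the shard is a flat array of uint16 GPT-2 token ids with no header
tokens = np.fromfile("fineweb_train_000001.bin", dtype=np.uint16)
assert tokens.size == 100_000_000  # 100M tokens ~= 191MiB at 2 bytes each
```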

@banyan-god commented:

Have you played with the streaming parameter?
load_dataset("HuggingFaceFW/fineweb", name="CC-MAIN-2024-10", split='train', streaming=False, num_proc=28)
I was going to use it, but I have already downloaded 500GB of files.
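For reference, here is what the streaming path looks like; note that, as far as I know, num_proc is not supported together with streaming=True, since streaming returns a lazy IterableDataset instead of downloading files to disk:

```python
from datasets import load_dataset

# streaming=True yields an IterableDataset: documents arrive lazily over the network
ds = load_dataset("HuggingFaceFW/fineweb", name="CC-MAIN-2024-10",
                  split="train", streaming=True)
for example in ds:
    text = example["text"]  # tokenize/process each document as it streams in
    break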

@karpathy (Owner, Author) commented:

(I used streaming originally, but then started getting errors in the tokenization workers when a request randomly failed, so I took it out.)
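A sketch of one way to paper over those transient failures, retrying the stream with backoff. This assumes restarting iteration from the top is acceptable, which it generally is not for exact one-pass training, and is part of why dropping streaming is simpler:

```python
import time

def robust_stream(ds, max_retries=5):
    # retry the streaming iterator on transient network errors (sketch only;
    # restarting from the top can duplicate documents already seen)
    for attempt in range(max_retries):
        try:
            for example in ds:
                yield example
            return
        except Exception:
            time.sleep(2 ** attempt)  # exponential backoff before retrying
    raise RuntimeError("streaming failed after retries")
```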

@banyan-god commented:

I do something like this. It's not very efficient, since I am encoding on the fly, but I am planning to implement a thread that tokenizes and buffers the data so it is readily available: https://github.com/banyan-god/llama2.c/blob/master/finewebllama2.py
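A minimal sketch of that producer/consumer idea, assuming a GPT-2 tokenizer via tiktoken and the streaming dataset from above; this is an illustration of the approach, not the code in the linked file:

```python
import queue
import threading

import tiktoken
from datasets import load_dataset

enc = tiktoken.get_encoding("gpt2")
buf = queue.Queue(maxsize=1024)  # bounded buffer so the producer cannot run ahead unbounded

def producer():
    ds = load_dataset("HuggingFaceFW/fineweb", name="CC-MAIN-2024-10",
                      split="train", streaming=True)
    for example in ds:
        buf.put(enc.encode_ordinary(example["text"]))  # blocks while the buffer is full

threading.Thread(target=producer, daemon=True).start()

# the training loop pulls pre-tokenized documents without waiting on the tokenizer
tokens = buf.get()
```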
