
Deleting Conda/Python as a dependency entirely to dramatically decrease "latency to step" #482

karpathy opened this issue May 28, 2024 · 4 comments


@karpathy (Owner) commented May 28, 2024

Following up on this tweet; copy-pasting it here and creating an Issue as a TODO.

"""
The thing that makes this a bit complicated right now is the start latency. What bloats up the setup time is the dataset and its tokenization, which is all done in Python right now. Installing huggingface datasets, downloading FineWeb 10B, and tokenizing it currently takes ~1 hr. I think I have to look into precomputing all of this and just saving the final .bin files (20GB) of tokens somewhere (S3 or so?). You could imagine fetching data shards asynchronously while the training starts. This would completely eliminate any Python dependency.

The next slightly annoying thing is cuDNN, which is a 2GB download and installation just to get the flash attention kernel, and it takes ~1.5 minutes to compile. But NVIDIA reached out and mentioned they are trying to bring this down a lot.

In principle, the code should compile and run roughly instantaneously.
"""

TLDR I think I'll pre-tokenize FineWeb100B with the GPT-2 tokenizer, zip up the .bin shards, and put them up somewhere (e.g. S3?). Then we could just download, unzip, and directly train without any Python involvement at all.
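As a rough illustration of the "fetch shards asynchronously while training" idea, a minimal Python sketch (in practice this could just be a curl/wget loop in the run script, to keep Python out entirely). The URL prefix and filename scheme are placeholders, not a real bucket layout:

```python
import os
import threading
import urllib.request

BASE_URL = "https://example-bucket.s3.amazonaws.com/fineweb100B"  # placeholder bucket
LOCAL_DIR = "dev/data/fineweb100B"                                # placeholder directory

def prefetch_shard(idx):
    # download shard idx in the background if it is not already on disk
    name = f"fineweb_train_{idx:06d}.bin"  # hypothetical naming scheme
    dest = os.path.join(LOCAL_DIR, name)
    if not os.path.exists(dest):
        urllib.request.urlretrieve(f"{BASE_URL}/{name}", dest)

os.makedirs(LOCAL_DIR, exist_ok=True)
# kick off the download of the next shard while the current one trains
threading.Thread(target=prefetch_shard, args=(1,), daemon=True).start()
```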

TODO think through a bit.

@karpathy (Owner, Author) commented:

FineWeb100B is 1010 files total; these are raw .bin shards of 100M tokens each.

  • Each is 191MB raw
  • Zipped, each is 134MB

134MB * 1010 files = 135,340MB ≈ 135GB
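(As a sanity check, 191MB per shard is consistent with storing 100M tokens as raw uint16 ids, which the GPT-2 vocab fits into: 100e6 tokens * 2 bytes ≈ 190.7MiB. A minimal sketch of reading one shard under that raw-uint16 assumption; the filename is a placeholder, and if the shards carry a header the read would need an offset:)

```python
import numpy as np

# assumes the shard is a flat array of uint16 GPT-2 token ids with no header
tokens = np.fromfile("fineweb_train_000001.bin", dtype=np.uint16)
assert tokens.size == 100_000_000  # 100M tokens ~= 191MiB at 2 bytes each
```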

@banyan-god commented:

Have you played with the streaming parameter?
load_dataset("HuggingFaceFW/fineweb", name="CC-MAIN-2024-10", split='train', streaming=False, num_proc=28)
I was going to use it, but I have already downloaded 500GB of files.
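For reference, here is what the streaming path looks like; note that, as far as I know, num_proc is not supported together with streaming=True, since streaming returns a lazy IterableDataset instead of downloading files to disk:

```python
from datasets import load_dataset

# streaming=True yields an IterableDataset: documents arrive lazily over the network
ds = load_dataset("HuggingFaceFW/fineweb", name="CC-MAIN-2024-10",
                  split="train", streaming=True)
for example in ds:
    text = example["text"]  # tokenize/process each document as it streams in
    break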

@karpathy (Owner, Author) commented:

(I used streaming originally, but then started getting errors in the tokenization workers when a request randomly failed, so I took it out.)
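A sketch of one way to paper over those transient failures, retrying the stream with backoff. This assumes restarting iteration from the top is acceptable, which it generally is not for exact one-pass training, and is part of why dropping streaming is simpler:

```python
import time

def robust_stream(ds, max_retries=5):
    # retry the streaming iterator on transient network errors (sketch only;
    # restarting from the top can duplicate documents already seen)
    for attempt in range(max_retries):
        try:
            for example in ds:
                yield example
            return
        except Exception:
            time.sleep(2 ** attempt)  # exponential backoff before retrying
    raise RuntimeError("streaming failed after retries")
```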

@banyan-god commented:

I do something like this. It's not very efficient, since I am encoding on the fly, but I am planning to implement a thread that tokenizes and buffers the data so it is readily available: https://github.com/banyan-god/llama2.c/blob/master/finewebllama2.py
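A minimal sketch of that producer/consumer idea, assuming a GPT-2 tokenizer via tiktoken and the streaming dataset from above; this is an illustration of the approach, not the code in the linked file:

```python
import queue
import threading

import tiktoken
from datasets import load_dataset

enc = tiktoken.get_encoding("gpt2")
buf = queue.Queue(maxsize=1024)  # bounded buffer so the producer cannot run ahead unbounded

def producer():
    ds = load_dataset("HuggingFaceFW/fineweb", name="CC-MAIN-2024-10",
                      split="train", streaming=True)
    for example in ds:
        buf.put(enc.encode_ordinary(example["text"]))  # blocks while the buffer is full

threading.Thread(target=producer, daemon=True).start()

# the training loop pulls pre-tokenized documents without waiting on the tokenizer
tokens = buf.get()
```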
