Following up on this tweet, copy-pasting it here and creating an Issue as a TODO.
"""
The thing that makes this a bit complicated right now is the start latency. What bloats up the setup time is the dataset and its tokenization, which is all done in Python. Installing huggingface datasets, downloading FineWeb 10B, and tokenizing it currently takes ~1 hr. I think I have to look into precomputing all of this and just saving the final .bin files (20GB) of tokens somewhere (S3 or so?). You could imagine fetching data shards asynchronously while training starts. This would completely eliminate any Python dependency.
The next slightly annoying thing is cuDNN, which is a 2GB download and installation, just to get the flash attention kernel. And it takes ~1.5 minutes to compile. But NVIDIA reached out and mentioned they are trying to bring this down a lot.
In principle, the code should compile and run roughly instantaneously.
"""
TLDR I think I'll pre-tokenize FineWeb100B with GPT-2 tokenizer, zip up the .bin shards, and put them up somewhere (e.g. S3?). And then we could just download, unzip, and directly train without any Python involvement at all.
TODO think through a bit.
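For illustration, a rough sketch of the pre-tokenization step, assuming tiktoken for the GPT-2 tokenizer and raw uint16 tokens per .bin shard; the FineWeb config name, shard size, and file naming are placeholders, and a real pipeline would probably also write a small header so the consumer can validate the files:

```python
import numpy as np
import tiktoken
from datasets import load_dataset

SHARD_SIZE = 100_000_000  # tokens per shard (placeholder; pick whatever keeps .bin files manageable)

enc = tiktoken.get_encoding("gpt2")
eot = enc.eot_token  # <|endoftext|> id, used as a document delimiter

# Config name is an assumption; FineWeb exposes sampled subsets on the Hub.
ds = load_dataset("HuggingFaceFW/fineweb", name="sample-100BT", split="train", streaming=True)

buf = np.empty(SHARD_SIZE, dtype=np.uint16)  # GPT-2 vocab (50257) fits in uint16
filled, shard_idx = 0, 0

def flush(n, idx):
    # Raw uint16 tokens; a real pipeline would likely prepend a small header.
    buf[:n].tofile(f"fineweb_train_{idx:06d}.bin")

for doc in ds:
    for tok in [eot] + enc.encode_ordinary(doc["text"]):
        buf[filled] = tok
        filled += 1
        if filled == SHARD_SIZE:
            flush(filled, shard_idx)
            shard_idx += 1
            filled = 0

if filled:
    flush(filled, shard_idx)
```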
Have you played with the streaming parameter? load_dataset("HuggingFaceFW/fineweb", name="CC-MAIN-2024-10", split='train', streaming=False, num_proc=28)
I was going to use it, but I have already downloaded 500GB of files.
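For reference, the streaming variant would look roughly like this; it iterates lazily over HTTP instead of materializing the full download on disk first (num_proc is omitted since it applies to the download/prepare step that streaming skips):

```python
from datasets import load_dataset

# streaming=True returns an IterableDataset that yields documents lazily,
# so nothing has to be fully downloaded up front.
ds = load_dataset("HuggingFaceFW/fineweb", name="CC-MAIN-2024-10",
                  split="train", streaming=True)

for example in ds:
    text = example["text"]
    # ... tokenize / process each document as it arrives ...
    break  # just showing the first document here
```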