Surpassing(?) FineWeb-edu with the DCLM-Baseline Dataset #664
robbiegwald started this conversation in Show and tell
Replies: 1 comment
-
Thank you so much for testing this dataset, robbie. I also wondered whether DCLM could surpass FineWeb; it's good to know that it might not be worth using over fineweb-edu.
-
Hey all!
I recently discovered the 4T-token dclm-baseline-1.0 dataset, which claims to outperform fineweb-edu, so I ran a small training run to see for myself.
DCLM is a benchmark that tests data-filtering methods by training LLMs in a standardized way. The team behind it used techniques similar to the FW team's (training a data-quality classifier) to filter their Common Crawl-derived pool down to 4T tokens; these 4T tokens are the dclm-baseline-1.0 dataset. They claim significant performance improvements: they trained a 7B-parameter model that attained similar performance to Llama 3 8B on just 2.6T tokens, and they showed it significantly outperforming fineweb-edu:

[figure: benchmark comparison of DCLM-Baseline against fineweb-edu and other datasets, from the DCLM paper]
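For context, here's a minimal sketch of what that kind of classifier-based filtering looks like, assuming a fastText quality classifier along the lines of the one the DCLM paper describes; the model path, label name, and threshold below are placeholders, not their actual values:

```python
# Illustrative sketch of classifier-based quality filtering (NOT the DCLM
# team's actual code). Assumes a fastText classifier trained to emit a
# "__label__hq" label for high-quality documents; path and threshold are made up.
import fasttext

model = fasttext.load_model("quality_classifier.bin")  # placeholder path

def keep_document(text: str, threshold: float = 0.5) -> bool:
    # fastText's predict() rejects newlines, so flatten the document first
    labels, probs = model.predict(text.replace("\n", " "))
    hq_prob = probs[0] if labels[0] == "__label__hq" else 1.0 - probs[0]
    return hq_prob >= threshold
```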
The first problem I ran into with this dataset is that it's distributed as compressed JSON. I figured it would be easier to convert part of it to parquet and upload that to HF than to rewrite the data-prep scripts. I converted several hundred billion tokens' worth (estimated from the file sizes), then randomly selected 10 billion tokens from that. You can find this parquetified dataset here.
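The conversion itself is straightforward; here's a minimal sketch of the kind of script involved, assuming the shards are zstd-compressed JSONL files where each record carries a "text" field (the file names here are illustrative, not the dataset's actual shard names):

```python
# Minimal JSONL -> Parquet conversion sketch (not the exact script used).
# Assumes zstd-compressed JSONL shards where each record has a "text" field.
import io
import json

import pyarrow as pa
import pyarrow.parquet as pq
import zstandard as zstd

def jsonl_zst_to_parquet(src_path: str, dst_path: str) -> None:
    texts = []
    with open(src_path, "rb") as f:
        # stream-decompress line by line so the raw shard never has to
        # be fully decompressed in memory at once
        reader = zstd.ZstdDecompressor().stream_reader(f)
        for line in io.TextIOWrapper(reader, encoding="utf-8"):
            texts.append(json.loads(line)["text"])
    pq.write_table(pa.table({"text": texts}), dst_path)

# e.g. jsonl_zst_to_parquet("shard_00000000.jsonl.zst", "shard_00000000.parquet")
```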
Once it was in parquet format, it was as simple as substituting it for fineweb-edu in the fineweb data-prep script (and removing the name variable):
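A hedged sketch of that swap (the "before" line is roughly what the fineweb data-prep script does, and the repo id for the parquet re-upload is a placeholder; the link above has the real one):

```python
from datasets import load_dataset

# before (approximately what the fineweb data-prep script does):
# fw = load_dataset("HuggingFaceFW/fineweb-edu", name=remote_name, split="train")

# after (placeholder repo id for the parquet re-upload; no `name` needed):
fw = load_dataset("robbiegwald/dclm-baseline-1.0-parquet", split="train")
```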
After a pretty uneventful 5-hour training run on a 4x4090 rig, I got these results:

[figure: training curves for the DCLM run compared to the fineweb-edu baseline]
This left me a little disappointed, but it wasn't too surprising: the FW team found that filtering for even higher quality improved performance on academic benchmarks but hurt it on more "plain English" benchmarks like HellaSwag. The real question is whether performance improves on other benchmarks, and by how much.
So I converted the model to safetensors (the dclm model can be found here) and ran it through a wide array of evals using Eleuther's eval harness, comparing it to the fineweb-edu model found here. Keep in mind that these are all small, rudimentary models, so the evals are a little different from what they would be for large, instruction-tuned models:

[table: lm-evaluation-harness results for the DCLM and fineweb-edu models]
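For anyone wanting to reproduce this, here's a minimal sketch of running the harness via its Python API (v0.4+); the model repo id and task list are illustrative, not the exact configuration I used:

```python
# Sketch of an lm-evaluation-harness run; the repo id and task list
# are placeholders, not the exact configuration from this post.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=robbiegwald/gpt2-dclm-baseline",  # placeholder repo id
    tasks=["hellaswag", "arc_easy", "piqa", "winogrande"],
    batch_size=8,
)
print(results["results"])
```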
It appears that the performance isn't as good. This is on a very small model, so I have no reason to doubt their numbers for larger models, but it's disappointing nonetheless. Still, it achieves similar performance to fineweb-edu while offering significantly more tokens (4T). Unfortunately it isn't the unquestioned best dataset I was hoping for, but it's still a welcome addition to the open-source community. As a disclaimer, I'm very new to this, so please let me know if I did anything wrong, or if you have any theories as to why it performs this way.
To close, I want to thank Andrej and everyone else working on this project. It's seriously incredible that we can train LLMs on such low-cost GPU hardware, and so quickly. This obviously wouldn't have been possible without this project, and I've learned so much (and had so much fun) just playing with it. So again, thank you everyone!
-robbie