Reproducing the OctoCoder model #17
Comments
I think this is the exact dataset we used for OctoCoder: https://huggingface.co/datasets/bigcode/guanaco-commits Yes, we used LoRA for OctoCoder. cc @ArmelRandy
Hi @Muennighoff, @ArmelRandy:
We did not find a significant difference between LoRA and full fine-tuning, so we used LoRA for all experiments. Sorry for that. I have added the above as a note in
Hi @Muennighoff,
The above dataset contains 13K samples. However, from the paper it seems ~23K samples were used for training OctoCoder. Am I missing something?
For OctoCoder, we use OASST + CommitPackFT, so 8,587 + 5,000 ~ 13,000 samples.
Thanks! :)
Great! Appreciate the response. Could you also clarify the environments used in evaluation? We are seeing discrepancies of up to 10% between the paper and our eval results on OctoCoder. Perhaps you could specify the build versions of the languages? I see you just specify the latest stable Rust in the code, for example.
Sure, these are the evaluation versions: Python:
Also, HumanEval performance is noisy, as there are only 164 samples per task per subset. You may find that a different seed or a checkpoint from a different step could make up for that 10% relative difference on Python HumanEvalSynthesize. Other people have been able to reproduce the results, someone even got
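To get a rough feel for how noisy a 164-problem benchmark is, one can treat each problem as an independent pass/fail trial and compute the binomial standard error of the pass rate. This is only a back-of-the-envelope sketch: the pass rate below is a hypothetical placeholder, not a number from the paper, and real pass@1 estimates average over several generations per problem, which this ignores.

```python
import math

def pass_rate_std_error(pass_rate: float, n_problems: int = 164) -> float:
    """Binomial standard error of a pass rate measured on n_problems problems."""
    return math.sqrt(pass_rate * (1.0 - pass_rate) / n_problems)

# Hypothetical pass rate chosen only for illustration (not taken from the paper).
hypothetical_pass_rate = 0.45
se = pass_rate_std_error(hypothetical_pass_rate)

print(f"standard error: {se * 100:.1f} points")
print(f"relative to the pass rate: {se / hypothetical_pass_rate * 100:.1f}%")
```

Under these assumptions a single standard error is already a few percentage points, which is why a different seed or checkpoint can plausibly move the score by a similar amount.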
@Muennighoff From the Appendix, there is a description that "OCTOCODER was trained for 35 steps with a sequence length of 2048".
Note that it's 2.2 million total fine-tuning tokens due to the batch size of 32. The steps and sequence length are correct. You usually do not need many tokens for instruction tuning; see e.g. the below graph from prior work
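The token count follows from steps × batch size × sequence length. A quick check, assuming every sequence in every batch fills the full 2048-token window (the variable names are illustrative only):

```python
# Figures quoted in this thread: 35 steps, batch size 32, sequence length 2048.
steps = 35
batch_size = 32
seq_len = 2048

# Assumption: every sequence is packed or padded to the full sequence length.
total_tokens = steps * batch_size * seq_len

print(f"total tokens: {total_tokens:,} (about {total_tokens / 1e6:.2f}M)")
```

This gives 2,293,760 tokens, in line with the roughly 2.2 million quoted above.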
@Muennighoff How did you decide on the number of samples from CommitPackFT to use during fine-tuning? I.e., where did the 5K number come from? Your graph above seems to indicate increased performance for the BLOOMZ model during fine-tuning well into the 100s of millions of tokens, and I've seen other fine-tunings of Llama-2 using training sets that vary from ~5K all the way up to ~80K, for similar-ish tasks. I am curious what insights/experiences you used to come up with 5K.
The 5K was mostly arbitrary. Our filtered OASST dataset had around 5K so we just decided to fix it at 5K for CommitPackFT, too. You can probably use more. You are right that perf improves into the 100s of millions for BLOOMZ; mT0 seems to saturate earlier. It could be that fine-tuning OctoCoder for longer would lead to better performance.
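For anyone trying to reproduce a fixed-size subset like the 5K one discussed here, a seeded draw keeps the selection repeatable. The sketch below is an assumption about how one might do it, not the procedure used for OctoCoder: the seed, the helper name `sample_subset`, and the stand-in data are all hypothetical.

```python
import random

def sample_subset(examples, k=5000, seed=0):
    """Return k examples drawn without replacement, reproducibly for a given seed."""
    rng = random.Random(seed)  # local RNG so the global random state is untouched
    return rng.sample(examples, k)

# Stand-in data; a real run would pass the CommitPackFT examples instead.
pool = [f"example-{i}" for i in range(20_000)]
subset = sample_subset(pool, k=5000, seed=0)

print(len(subset), "examples selected")
```

Calling `sample_subset` again with the same pool and seed returns the same subset, so sharing the seed alongside the data is enough for others to recover the selection.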
Hello, I have a few questions about OctoCoder.
For this part in the paper:
Could you please provide the exact training data and the launch script to fine-tune StarCoder into OctoCoder?
Or, the seeds that you used for selecting 5,000 instructions from CommitPackFT?
As a second question, were OctoCoder and the results in the paper produced using finetuning/starcoder/finetune.py with LoRA/peft?
Thanks!
Btw, fantastic results @Muennighoff and team :)