Make CodeEval respect device_eval_batch_size #2969
Conversation
Overall LGTM. The metric thing is a little gross, but the current design doesn't afford any other options that I can see, and I think it's safe, so I'm good with it.
Force-pushed from 9069663 to 31782da.
Could you include in the PR description evidence that results don't change before and after this PR (assuming you carefully set the hparams to be the same)? Or, if that isn't possible for some reason, at least a number on a popular model that we can match to something. Otherwise, LGTM; I will approve once you have that.
Added experiments on public models to the PR description.
I think we have a couple of tests that have the right intention but are testing the wrong thing:
The idea is to make sure that inputs aren't overly left padded. The issue is that the ICL dataset maps using [...]. To be fair, the test is testing for the ideal behaviour; the issue is that our ICL code doesn't do that, and that contribution feels out of scope for this PR.
Reworked some of the tests. However, I'd like to know whether we can remove unnecessary left padding on a per-batch basis.
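The per-batch trimming suggested above can be sketched as follows. This is a hypothetical helper, not composer's actual API: it assumes rows are left-padded to the dataset-wide maximum length with a known pad token id, and drops only the leading pad columns that every row in the batch shares.

```python
PAD_ID = 0  # hypothetical pad token id; composer would read this off the tokenizer

def trim_left_padding(batch):
    """Drop leading all-pad columns shared by every row in the batch.

    `batch` is a list of equal-length token-id rows left-padded to the
    dataset-wide maximum length. Any prefix column that is PAD_ID in
    every row carries no information for this batch and can be removed,
    shrinking the sequence length the model actually sees.
    """
    if not batch:
        return batch
    # Count leading pad tokens in each row.
    leading = [next((i for i, tok in enumerate(row) if tok != PAD_ID), len(row))
               for row in batch]
    trim = min(leading)  # only trim what is padding in *every* row
    return [row[trim:] for row in batch]
```

For example, `trim_left_padding([[0, 0, 5, 6], [0, 7, 8, 9]])` trims one shared pad column and returns `[[0, 5, 6], [7, 8, 9]]`; the remaining per-row padding is preserved so rows stay aligned.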
@josejg I think it looks good to me and you are right that the tests were simply incorrect before. Are you saying you want to trim the extra left padding from each individual batch (since padding is determined based on the full dataset as opposed to each individual batch)?
🚢
What does this PR do?
Re-implementation of our CodeEval so that it respects the device eval batch size.
What issue(s) does this change relate to?
Since the previous implementation silently overrode device_eval_batch_size to be generations_per_sample, there are two clear benefits to the rewrite:
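The fix described above can be sketched with a hypothetical helper (names are illustrative, not composer's actual internals): rather than forcing the batch size to equal generations_per_sample, each prompt is repeated generations_per_sample times and the expanded list is then batched by the user-configured device_eval_batch_size.

```python
def expand_and_batch(prompts, generations_per_sample, device_eval_batch_size):
    """Repeat each prompt `generations_per_sample` times, then split the
    expanded list into batches of at most `device_eval_batch_size`.

    This respects the configured eval batch size instead of silently
    overriding it, so a sample's generations may span several batches.
    """
    expanded = [p for p in prompts for _ in range(generations_per_sample)]
    return [expanded[i:i + device_eval_batch_size]
            for i in range(0, len(expanded), device_eval_batch_size)]
```

For example, `expand_and_batch(["a", "b"], 3, 4)` yields `[["a", "a", "a", "b"], ["b", "b"]]`: six generation requests packed into batches of four, independent of generations_per_sample.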
Regression testing
These are some regression tests comparing the PR code against the previous implementation (plus a patch to use temperature 0.2, which was not possible before @maxisawesome's foundry PR, which had not been merged at the time I ran these experiments). These experiments use generations_per_sample=20, so some variance is expected.
Before submitting
Did you run pre-commit on your change? (See the pre-commit section of prerequisites.)