Add xnli task #258
Conversation
Hey @yongzx! Thanks for putting this together.
Related Issue: #199
Codecov Report
```
@@            Coverage Diff             @@
##           master     #258      +/-   ##
==========================================
+ Coverage   95.35%   95.60%   +0.24%
==========================================
  Files          48       47       -1
  Lines        3986     3825     -161
==========================================
- Hits         3801     3657     -144
+ Misses        185      168      -17
```
Continue to review full report at Codecov.
…into xnli (Conflicts: lm_eval/tasks/xnli.py)
How are the "Question" and "Answer" strings in the various languages obtained? I know there were some issues with the same thing earlier in a different task: #200 (review)
These were sourced from a mix of Google Translate, Yandex Translate, and WordReference. I'll try to confirm them with native speakers (maybe I can find some in the Discord server).
```python
ll_true = rf.loglikelihood_rolling(ctx.replace("[MASK]", "Yes"))[0]
ll_neither = rf.loglikelihood_rolling(ctx.replace("[MASK]", "Also"))[0]
ll_false = rf.loglikelihood_rolling(ctx.replace("[MASK]", "No"))[0]
```
Remove the `[0]`s, see #262.
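For context, a minimal sketch of why the `[0]` indexing is likely wrong here: in the harness, `loglikelihood` results are `(loglikelihood, is_greedy)` pairs, so callers index `[0]`, while `loglikelihood_rolling` results are a bare float. The two functions below are stand-ins with dummy values, not the real `rf` API.

```python
def loglikelihood(context, continuation):
    """Stand-in: returns a (loglikelihood, is_greedy) pair, like the harness."""
    return (-4.2, True)

def loglikelihood_rolling(text):
    """Stand-in: returns a bare float for the whole-sequence loglikelihood."""
    return -17.5

# Tuple result: indexing [0] is required to get the score.
ll_cond = loglikelihood("premise, right?", " Yes")[0]

# Bare float result: indexing [0] would raise a TypeError.
ll_roll = loglikelihood_rolling("premise, right? Yes")
```

So dropping the `[0]`s only matters because of this difference in return shape between the two request types.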
I think this will be fine to merge after we get the translations sorted out; once we do, we can also copy the new translations over to #200 and merge that too. It might be worth asking BigScience to help with this: they seem to have the organizational capacity and breadth of multilingual contributors necessary.
see above
Looks like there are a few new things that need to be addressed.
lm_eval/models/gpt2.py
```diff
@@ -100,4 +100,4 @@ def _model_generate(self, context, max_length, eos_token_id):

 # for backwards compatibility
-GPT2LM = HFLM
+GPT2LM = HFLM
```
Revert this change to keep the end-of-file newline.
```python
# False = contradiction
# Neither = neutral
return " " + [self.TRUE, self.NEITHER, self.FALSE][doc['label']]
```
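As a hedged reconstruction of how that target mapping fits together: XNLI labels 0/1/2 denote entailment/neutral/contradiction, verbalized as "Yes"/"Also"/"No". `XNLITaskSketch` and its attribute values are illustrative stand-ins for the task class, not the actual implementation.

```python
class XNLITaskSketch:
    # Assumed per-language verbalizers; English shown here.
    TRUE = "Yes"       # entailment
    NEITHER = "Also"   # neutral
    FALSE = "No"       # contradiction

    def doc_to_target(self, doc):
        # XNLI label ids: 0 = entailment, 1 = neutral, 2 = contradiction.
        # The leading space joins the answer onto the prompt context.
        return " " + [self.TRUE, self.NEITHER, self.FALSE][doc["label"]]

task = XNLITaskSketch()
target = task.doc_to_target({"label": 0})  # " Yes"
```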
What's the reasoning behind the prompt format change from the common natural language inference style we had before?
Two reasons:
- This follows the prompt format in the XGLM paper.
- The previous prompt format yields ~33.3% accuracy for all languages, i.e. no better than random guessing over the three classes. The new prompt format reproduces the results of the XGLM paper.
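A minimal sketch of the prompt construction being discussed, assuming the `[MASK]`-placeholder template quoted in the snippets in this thread; `build_context` is a hypothetical helper, not the task's actual method, and the exact template wording is an approximation of the XGLM paper's format.

```python
def build_context(premise, hypothesis):
    # "[MASK]" is later replaced by one of the candidate answer words.
    return f"{premise}, right? [MASK], {hypothesis}"

ctx = build_context("The dog is sleeping", "The animal is resting")
candidates = {
    "entailment": ctx.replace("[MASK]", "Yes"),
    "neutral": ctx.replace("[MASK]", "Also"),
    "contradiction": ctx.replace("[MASK]", "No"),
}
# The model then scores each candidate string and the highest-scoring
# one is taken as the prediction.
```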
```python
ll_true = rf.loglikelihood_rolling(ctx.replace("[MASK]", "Yes"))[0]
ll_neither = rf.loglikelihood_rolling(ctx.replace("[MASK]", "Also"))[0]
ll_false = rf.loglikelihood_rolling(ctx.replace("[MASK]", "No"))[0]
```
Why use `rf.loglikelihood_rolling` as opposed to `rf.loglikelihood`?
I think `rf.loglikelihood_rolling` computes the loglikelihood of the entire sentence, and the XGLM paper uses the loglikelihood of the entire sentence to evaluate XNLI.
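A toy illustration of the distinction described above: conditional loglikelihood scores only the continuation given the context, while a "rolling" loglikelihood sums token scores over the entire sequence. The scoring function here is a dummy assigning -1.0 per token; real scores come from the model.

```python
def token_logprobs(tokens):
    # Dummy per-token log-probabilities: -1.0 each.
    return [-1.0 for _ in tokens]

def conditional_ll(context_tokens, continuation_tokens):
    # Only the continuation tokens contribute to the score.
    return sum(token_logprobs(continuation_tokens))

def rolling_ll(tokens):
    # Every token in the full sequence contributes.
    return sum(token_logprobs(tokens))

ctx = ["The", "dog", "sleeps", ",", "right", "?"]
cont = ["Yes"]
# conditional: 1 scored token; rolling over ctx + cont: 7 scored tokens.
```

So the two request types can rank the three candidate answers differently, since the rolling variant also folds in the (shared and unshared) context tokens.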
Add the following task names:
When I run
I get the output