
Add xnli task #258

Closed

wants to merge 16 commits into from

Conversation

@yongzx yongzx commented Jan 22, 2022

Add the following task names:

  • "xnli_all_languages"
  • "xnli_ar"
  • "xnli_bg"
  • "xnli_de"
  • "xnli_el"

When I run

python3 -m scripts.write_out \
    --output_base_path ~/Desktop \
    --tasks xnli_de \
    --sets train \
    --num_fewshot 0 \
    --num_examples 1

I get the following output:

!!@@##@@!! -- Example 0
Konzeptionell cream abschöpfen hat zwei grundlegende Dimensionen - Produkt und Geographie .
Question: Produkt und Geographie sind das , was creme abschöpfen Arbeit macht . True, False, or Neither?
Answer:
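
For illustration, a minimal, self-contained sketch of the per-language naming/factory pattern behind task names like these; XNLIBase and create_xnli_tasks are hypothetical placeholders (the real tasks would subclass the harness's Task base):

# hypothetical sketch, not the PR's actual code
LANGS = ["ar", "bg", "de", "el"]  # subset listed above

class XNLIBase:
    LANG = None  # two-letter XNLI language code, set per subclass

def create_xnli_tasks():
    # e.g. {"xnli_de": <subclass with LANG="de">, ...};
    # "xnli_all_languages" would cover every XNLI language at once
    return {f"xnli_{lang}": type(f"XNLI_{lang}", (XNLIBase,), {"LANG": lang})
            for lang in LANGS}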

@jon-tow jon-tow self-requested a review January 22, 2022 07:25
Member

@jon-tow jon-tow left a comment

Hey @yongzx! Thanks for putting this together.

lm_eval/tasks/__init__.py (outdated, resolved)
lm_eval/tasks/xnli.py (outdated, resolved)
Member
jon-tow commented Jan 22, 2022

Related Issue: #199

codecov bot commented Jan 22, 2022

Codecov Report

Merging #258 (e59075d) into master (fcaea0e) will increase coverage by 0.24%.
The diff coverage is 100.00%.

❗ Current head e59075d differs from pull request most recent head 7e89586. Consider uploading reports for the commit 7e89586 to get more accurate results

@@            Coverage Diff             @@
##           master     #258      +/-   ##
==========================================
+ Coverage   95.35%   95.60%   +0.24%     
==========================================
  Files          48       47       -1     
  Lines        3986     3825     -161     
==========================================
- Hits         3801     3657     -144     
+ Misses        185      168      -17     
Impacted Files Coverage Δ
lm_eval/tasks/__init__.py 88.57% <100.00%> (-0.17%) ⬇️
lm_eval/tasks/xnli.py 100.00% <100.00%> (ø)
lm_eval/tasks/asdiv.py 98.41% <0.00%> (-1.59%) ⬇️
lm_eval/tasks/lambada.py 92.85% <0.00%> (-0.17%) ⬇️
lm_eval/tasks/lambada_multilingual.py 94.00% <0.00%> (-0.12%) ⬇️
lm_eval/tasks/hendrycks_test.py 93.44% <0.00%> (-0.11%) ⬇️
lm_eval/tasks/coqa.py 91.66% <0.00%> (-0.10%) ⬇️
lm_eval/tasks/translation.py 93.05% <0.00%> (-0.10%) ⬇️
lm_eval/tasks/wsc273.py 94.20% <0.00%> (-0.09%) ⬇️
... and 35 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Contributor
leogao2 commented Feb 13, 2022

How are the "Question" and "Answer" strings in the various languages obtained? I know there were some issues with the same thing earlier in a different task: #200 (review)

Member
jon-tow commented Feb 13, 2022

> How are the "Question" and "Answer" strings in the various languages obtained? I know there were some issues with the same thing earlier in a different task: #200 (review)

These were sourced from a mix of Google Translate, Yandex Translate, and WordReference. I'll try to confirm them with native speakers (maybe I can find some in the Discord server).

lm_eval/base.py (outdated, resolved)
Comment on lines +75 to +77
ll_true = rf.loglikelihood_rolling(ctx.replace("[MASK]", "Yes"))[0]
ll_neither = rf.loglikelihood_rolling(ctx.replace("[MASK]", "Also"))[0]
ll_false = rf.loglikelihood_rolling(ctx.replace("[MASK]", "No"))[0]
Contributor

Remove [0]s, see #262
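
As a sketch, the suggested change would simply drop the trailing [0] indexing and keep the rolling-loglikelihood requests as returned (exact semantics depend on how #262 is resolved):

ll_true = rf.loglikelihood_rolling(ctx.replace("[MASK]", "Yes"))
ll_neither = rf.loglikelihood_rolling(ctx.replace("[MASK]", "Also"))
ll_false = rf.loglikelihood_rolling(ctx.replace("[MASK]", "No"))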

lm_eval/base.py (outdated, resolved)
lm_eval/evaluator.py (outdated, resolved)
lm_eval/base.py (outdated, resolved)
lm_eval/models/__init__.py (outdated, resolved)
Contributor
leogao2 commented Feb 13, 2022

I think this will be fine to merge after we get the translations sorted out; once we do, we can also copy the new translations over to #200 and merge that too.

I think it might be worth asking BigScience to help with this; they seem to have the organizational capacity and breadth of multilingual contributors necessary.

Contributor

@leogao2 leogao2 left a comment

see above

Member

@jon-tow jon-tow left a comment

Looks like there are a few new things that need to be addressed.

@@ -100,4 +100,4 @@ def _model_generate(self, context, max_length, eos_token_id):


# for backwards compatibility
-GPT2LM = HFLM
+GPT2LM = HFLM
\ No newline at end of file
Member

Revert this change to keep the end of file newline.

# False = contradiction
# Neither = neutral
return " " + [self.TRUE, self.NEITHER, self.FALSE][doc['label']]

Member

What's the reasoning behind the prompt format change from the common natural language inference style we had before?

Author

Two reasons:

  1. It follows the prompt format used in the XGLM paper.
  2. The previous prompt format yields ~33.3% accuracy for all languages, i.e. no better than random guessing on a three-way task. The new format reproduces the results reported in the XGLM paper.
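
For reference, a minimal sketch of what that XGLM-style template can look like; the dataset field names and exact wording below are assumptions, and the "Yes" / "Also" / "No" answer words would be replaced by per-language translations:

def doc_to_text(self, doc):
    # XGLM-style XNLI prompt: "<premise>, right? [MASK], <hypothesis>",
    # where [MASK] is later substituted with Yes / Also / No and the whole
    # sentence is scored
    return doc["premise"] + ", right? [MASK], " + doc["hypothesis"]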

lm_eval/models/__init__.py (outdated, resolved)
lm_eval/base.py (outdated, resolved)
Comment on lines +75 to +77
ll_true = rf.loglikelihood_rolling(ctx.replace("[MASK]", "Yes"))[0]
ll_neither = rf.loglikelihood_rolling(ctx.replace("[MASK]", "Also"))[0]
ll_false = rf.loglikelihood_rolling(ctx.replace("[MASK]", "No"))[0]
Member

Why use rf.loglikelihood_rolling as opposed to rf.loglikelihood?

Author

I think rf.loglikelihood_rolling computes the log-likelihood of the entire sentence, which matches how the XGLM paper evaluates XNLI (scoring the full sentence rather than just a continuation).
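
As a rough sketch of the difference, assuming the harness's usual request idioms (the " True" continuation below is just an illustrative placeholder):

# rf.loglikelihood(ctx, continuation): log-probability of only the
# continuation (" True") given the context
req_cont = rf.loglikelihood(ctx, " True")
# rf.loglikelihood_rolling(text): log-probability of the entire text, here
# the full template with the answer word already substituted in
req_full = rf.loglikelihood_rolling(ctx.replace("[MASK]", "Yes"))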

@yongzx yongzx requested a review from StellaAthena as a code owner May 27, 2022 15:17
@yongzx yongzx closed this by deleting the head repository Sep 28, 2022