
Add xnli task #258

Closed

wants to merge 16 commits into from

Conversation

@yongzx yongzx commented Jan 22, 2022

Add the following task names:

  • "xnli_all_languages"
  • "xnli_ar"
  • "xnli_bg"
  • "xnli_de"
  • "xnli_el"

When I run

python3 -m scripts.write_out \
    --output_base_path ~/Desktop \
    --tasks xnli_de \
    --sets train \
    --num_fewshot 0 \
    --num_examples 1

I get the following output:

!!@@##@@!! -- Example 0
Konzeptionell cream abschöpfen hat zwei grundlegende Dimensionen - Produkt und Geographie .
Question: Produkt und Geographie sind das , was creme abschöpfen Arbeit macht . True, False, or Neither?
Answer:
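
For illustration, a minimal, self-contained sketch of the per-language naming/factory pattern behind task names like these; XNLIBase and create_xnli_tasks are hypothetical placeholders (the real tasks would subclass the harness's Task base):

# hypothetical sketch, not the PR's actual code
LANGS = ["ar", "bg", "de", "el"]  # subset listed above

class XNLIBase:
    LANG = None  # two-letter XNLI language code, set per subclass

def create_xnli_tasks():
    # e.g. {"xnli_de": <subclass with LANG="de">, ...};
    # "xnli_all_languages" would cover every XNLI language at once
    return {f"xnli_{lang}": type(f"XNLI_{lang}", (XNLIBase,), {"LANG": lang})
            for lang in LANGS}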

@jon-tow jon-tow self-requested a review January 22, 2022 07:25
Member

@jon-tow jon-tow left a comment

Hey @yongzx! Thanks for putting this together.

lm_eval/tasks/__init__.py (outdated, resolved)
lm_eval/tasks/xnli.py (outdated, resolved)
Member
jon-tow commented Jan 22, 2022

Related Issue: #199

codecov bot commented Jan 22, 2022

Codecov Report

Merging #258 (e59075d) into master (fcaea0e) will increase coverage by 0.24%.
The diff coverage is 100.00%.

❗ Current head e59075d differs from pull request most recent head 7e89586. Consider uploading reports for the commit 7e89586 to get more accurate results

@@            Coverage Diff             @@
##           master     #258      +/-   ##
==========================================
+ Coverage   95.35%   95.60%   +0.24%     
==========================================
  Files          48       47       -1     
  Lines        3986     3825     -161     
==========================================
- Hits         3801     3657     -144     
+ Misses        185      168      -17     
Impacted Files Coverage Δ
lm_eval/tasks/__init__.py 88.57% <100.00%> (-0.17%) ⬇️
lm_eval/tasks/xnli.py 100.00% <100.00%> (ø)
lm_eval/tasks/asdiv.py 98.41% <0.00%> (-1.59%) ⬇️
lm_eval/tasks/lambada.py 92.85% <0.00%> (-0.17%) ⬇️
lm_eval/tasks/lambada_multilingual.py 94.00% <0.00%> (-0.12%) ⬇️
lm_eval/tasks/hendrycks_test.py 93.44% <0.00%> (-0.11%) ⬇️
lm_eval/tasks/coqa.py 91.66% <0.00%> (-0.10%) ⬇️
lm_eval/tasks/translation.py 93.05% <0.00%> (-0.10%) ⬇️
lm_eval/tasks/wsc273.py 94.20% <0.00%> (-0.09%) ⬇️
... and 35 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Contributor
leogao2 commented Feb 13, 2022

How are the "Question" and "Answer" strings in the various languages obtained? I know there were some issues with the same thing earlier in a different task: #200 (review)

Member
jon-tow commented Feb 13, 2022

> How are the "Question" and "Answer" strings in the various languages obtained? I know there were some issues with the same thing earlier in a different task: #200 (review)

These were sourced from a mix of Google Translate, Yandex Translate, and WordReference. I'll try to confirm them with native speakers (maybe I can find some in the Discord server).

lm_eval/base.py (outdated, resolved)
Comment on lines +75 to +77
ll_true = rf.loglikelihood_rolling(ctx.replace("[MASK]", "Yes"))[0]
ll_neither = rf.loglikelihood_rolling(ctx.replace("[MASK]", "Also"))[0]
ll_false = rf.loglikelihood_rolling(ctx.replace("[MASK]", "No"))[0]
Contributor

Remove [0]s, see #262
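
As a sketch, the suggested change would simply drop the trailing [0] indexing and keep the rolling-loglikelihood requests as returned (exact semantics depend on how #262 is resolved):

ll_true = rf.loglikelihood_rolling(ctx.replace("[MASK]", "Yes"))
ll_neither = rf.loglikelihood_rolling(ctx.replace("[MASK]", "Also"))
ll_false = rf.loglikelihood_rolling(ctx.replace("[MASK]", "No"))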

lm_eval/base.py (outdated, resolved)
lm_eval/evaluator.py (outdated, resolved)
lm_eval/base.py (outdated, resolved)
lm_eval/models/__init__.py (outdated, resolved)
Contributor
leogao2 commented Feb 13, 2022

I think this will be fine to merge after we get the translations sorted out; once we do, we can also copy the new translations over to #200 and merge that too.

I think it might be worth asking BigScience to help with this; they seem to have the organizational capacity and breadth of multilingual contributors necessary.

Contributor

@leogao2 leogao2 left a comment

see above

Member

@jon-tow jon-tow left a comment

Looks like there are a few new things that need to be addressed.

@@ -100,4 +100,4 @@ def _model_generate(self, context, max_length, eos_token_id):


# for backwards compatibility
-GPT2LM = HFLM
+GPT2LM = HFLM
\ No newline at end of file
Member

Revert this change to keep the end of file newline.

# False = contradiction
# Neither = neutral
return " " + [self.TRUE, self.NEITHER, self.FALSE][doc['label']]

Member

What's the reasoning behind the prompt format change from the common natural language inference style we had before?

Author

Two reasons:

  1. It follows the prompt format used in the XGLM paper.
  2. The previous prompt format yields ~33.3% accuracy for all languages, i.e. no better than random guessing on a three-way task. The new format reproduces the results reported in the XGLM paper.
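
For reference, a minimal sketch of what that XGLM-style template can look like; the dataset field names and exact wording below are assumptions, and the "Yes" / "Also" / "No" answer words would be replaced by per-language translations:

def doc_to_text(self, doc):
    # XGLM-style XNLI prompt: "<premise>, right? [MASK], <hypothesis>",
    # where [MASK] is later substituted with Yes / Also / No and the whole
    # sentence is scored
    return doc["premise"] + ", right? [MASK], " + doc["hypothesis"]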

lm_eval/models/__init__.py (outdated, resolved)
lm_eval/base.py (outdated, resolved)
Comment on lines +75 to +77
ll_true = rf.loglikelihood_rolling(ctx.replace("[MASK]", "Yes"))[0]
ll_neither = rf.loglikelihood_rolling(ctx.replace("[MASK]", "Also"))[0]
ll_false = rf.loglikelihood_rolling(ctx.replace("[MASK]", "No"))[0]
Member

Why use rf.loglikelihood_rolling as opposed to rf.loglikelihood?

Author

I think rf.loglikelihood_rolling computes the log-likelihood of the entire sentence, which matches how the XGLM paper evaluates XNLI (scoring the full sentence rather than just a continuation).
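
As a rough sketch of the difference, assuming the harness's usual request idioms (the " True" continuation below is just an illustrative placeholder):

# rf.loglikelihood(ctx, continuation): log-probability of only the
# continuation (" True") given the context
req_cont = rf.loglikelihood(ctx, " True")
# rf.loglikelihood_rolling(text): log-probability of the entire text, here
# the full template with the answer word already substituted in
req_full = rf.loglikelihood_rolling(ctx.replace("[MASK]", "Yes"))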

@yongzx yongzx requested a review from StellaAthena as a code owner May 27, 2022 15:17
@yongzx yongzx closed this by deleting the head repository Sep 28, 2022