
Add tasks for German Embedding Evaluation #214

Merged

Conversation


@guenthermi guenthermi commented Jan 24, 2024

We added a few tasks to evaluate German embedding models:

PawsX (https://huggingface.co/datasets/paws-x): This dataset contains pairs of an original text and a paraphrased version of it. The labels indicate whether the pair is a true paraphrase or whether the two texts differ in meaning. We add this as a pair classification task to MTEB.

MIRACL (https://huggingface.co/datasets/miracl/miracl): This is originally a retrieval dataset. It contains a corpus of passages from Wikipedia articles and human-annotated relevance judgements for some of them. We only took passages for which relevance judgements are available, since those are suitable for an MTEB re-ranking task.
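The filtering step described above can be sketched roughly like this (a minimal illustration with toy data; `filter_corpus`, the dict layouts, and the variable names are assumptions for illustration, not the PR's actual code):

```python
# Hypothetical sketch: keep only corpus passages that appear in the
# human-annotated relevance judgements (qrels).
def filter_corpus(corpus: dict, qrels: dict) -> dict:
    # qrels maps query_id -> {doc_id: relevance}
    judged = {doc_id for per_query in qrels.values() for doc_id in per_query}
    return {doc_id: text for doc_id, text in corpus.items() if doc_id in judged}

corpus = {"d1": "passage one", "d2": "passage two", "d3": "passage three"}
qrels = {"q1": {"d1": 1, "d3": 0}}
print(sorted(filter_corpus(corpus, qrels)))  # ['d1', 'd3']
```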

GerDaLIR (https://github.com/lavis-nlp/GerDaLIR): GerDaLIR is a legal information retrieval dataset created from the Open Legal Data platform with the same name.

GermanDPR (https://www.deepset.ai/germanquad): GermanDPR is a passage retrieval task for question answering, i.e. it contains questions associated with passages that contain the relevant answer.

XMarket (https://xmrec.github.io/): This is originally an e-commerce product dataset which contains titles and descriptions of products associated with very fine-grained categories. We created a category-to-product-description retrieval task out of it.

GermanSTSBenchmarkSTS (https://github.com/t-systems-on-site-services-gmbh/german-STSbenchmark): This is an STS dataset created by T-Systems. It contains translations of English STS datasets.


I also updated the script in scripts/run_mteb_german.py to include those additional tasks, as well as the existing German tasks for clustering and STS.
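As a quick sketch of how the new tasks could be run together (the task identifiers below are assumptions based on the PR description; scripts/run_mteb_german.py has the authoritative list):

```python
# Hypothetical list of the newly added German task names; check
# scripts/run_mteb_german.py for the exact identifiers used in MTEB.
GERMAN_TASKS = [
    "PawsX",                  # pair classification
    "MIRACLReranking",        # re-ranking
    "GerDaLIR",               # legal retrieval
    "GermanDPR",              # QA passage retrieval
    "XMarket",                # category-to-product retrieval
    "GermanSTSBenchmarkSTS",  # STS
]

# Sketch of the evaluation loop (requires `pip install mteb` and a model):
# from mteb import MTEB
# from sentence_transformers import SentenceTransformer
# model = SentenceTransformer("intfloat/multilingual-e5-base")
# MTEB(tasks=GERMAN_TASKS).run(model, output_folder="results/de")
print(len(GERMAN_TASKS))  # 6
```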

@guenthermi guenthermi force-pushed the feat-contribute-german-tasks branch from 94e8aec to d92ba05 Compare January 24, 2024 09:55
@guenthermi guenthermi force-pushed the feat-contribute-german-tasks branch from 21371f1 to 375cfd2 Compare January 24, 2024 14:52
@guenthermi
Member Author

We also prepared a cross-lingual English/German retrieval task, WikiCLIR. However, it is pretty large (>200k queries and more than 1M documents), so the evaluation is not really feasible. We reduced the number of queries, but it still takes a lot of time, and the results might not be that comparable because of the reduced query set. Therefore, I excluded it from this PR, but if you think it is useful, we can also add it:
https://github.com/jina-ai/mteb-de/blob/main/mteb/tasks/Retrieval/WikiCLIRRetrieval.py
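The query reduction mentioned above could look something like this (a hypothetical helper with a fixed seed for reproducibility, not the code actually used for WikiCLIR):

```python
import random

def subsample_queries(queries: dict, k: int, seed: int = 42) -> dict:
    """Keep a random subset of at most k queries so the retrieval
    evaluation stays feasible (hypothetical helper)."""
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    keep = rng.sample(sorted(queries), min(k, len(queries)))
    return {qid: queries[qid] for qid in keep}

queries = {f"q{i}": f"query text {i}" for i in range(200_000)}
small = subsample_queries(queries, 1_000)
print(len(small))  # 1000
```

One caveat, as noted in the comment: scores computed on a subsampled query set are not directly comparable to scores on the full set.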

@guenthermi guenthermi marked this pull request as ready for review January 24, 2024 16:30
@guenthermi
Member Author

This also seems to be related to #183.

@yvesloy

yvesloy commented Jan 25, 2024

Do you already have some results on what embedding model works best for German atm?

@guenthermi
Member Author

Do you already have some results on what embedding model works best for German atm?

I have shared some results here:
https://twitter.com/michael_g_u/status/1747293709849227618

@guenthermi
Member Author

@Muennighoff is it possible to assign someone to review this PR?

@Muennighoff Muennighoff self-requested a review January 29, 2024 08:34
@Muennighoff
Contributor


Taking a look! Also invited you to the org so you can request reviews in the future 👍
Amazing work on this!

@Muennighoff Muennighoff left a comment
LGTM amazing work!

"category": "s2s",
"type": "PairClassification",
"eval_splits": ["test"],
"eval_langs": ["de"],

Can we also add its other langs?

Suggested change
"eval_langs": ["de"],
"eval_langs": ["de", "en", "es", "fr", "ja", "ko", "zh"],

mteb/tasks/Reranking/__init__.py (outdated, resolved)
mteb/tasks/STS/__init__.py (outdated, resolved)
mteb/tasks/Reranking/MIRACLReranking.py (outdated, resolved)
@Muennighoff
Contributor

Let's merge?

@guenthermi
Member Author

Let's merge?

It didn't work out with just adding more languages for PawsX, so I had to adjust the code in AbsTaskPairClassification to support multilingual datasets. If it looks ok to you, we can merge it.
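The multilingual pattern described here can be sketched as follows (a toy illustration; the function names and data layout are assumptions, not the actual AbsTaskPairClassification code):

```python
# Hypothetical sketch: run the monolingual evaluation once per
# language split and collect scores keyed by language code.
LANGS = ["de", "en", "es", "fr", "ja", "ko", "zh"]

def evaluate_multilingual(splits: dict, evaluate_split) -> dict:
    # splits maps language code -> that language's test data;
    # evaluate_split is the existing monolingual evaluation function.
    return {lang: evaluate_split(splits[lang]) for lang in LANGS if lang in splits}

# toy demonstration with a dummy metric
splits = {lang: [0.5, 1.0] for lang in LANGS}
scores = evaluate_multilingual(splits, lambda data: sum(data) / len(data))
print(sorted(scores))
```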

I evaluated some multilingual e5 models on the PawsX task:

| Model | de | en | es | fr | ja | ko | zh |
|---|---|---|---|---|---|---|---|
| e5-multilingual-small | 0.538849 | 0.548835 | 0.537453 | 0.556763 | 0.493585 | 0.512998 | 0.54783 |
| e5-multilingual-base | 0.544398 | 0.540388 | 0.550975 | 0.569298 | 0.497352 | 0.511751 | 0.556259 |
| e5-multilingual-large | 0.575009 | 0.595169 | 0.563678 | 0.585035 | 0.49574 | 0.516579 | 0.57002 |

@Muennighoff
Contributor

Yes, I think it's good. The same modification is also in this PR: #174

One thing that may be confusing is that there's now PawsX as a pair classification task while there's also PAWSX for STS (Chinese only). I think PawsX makes more sense for pair classification, so I suggest evaluation mostly focus on the pair classification setting.
I think we need to leave the Chinese STS one in for now as it's part of the Chinese leaderboard, but we can phase it out at some point.

@guenthermi guenthermi merged commit 9aba9ee into embeddings-benchmark:main Jan 29, 2024
3 checks passed
KennethEnevoldsen added a commit that referenced this pull request Apr 11, 2024
Added 6x2 points for guenthermi for datasets and 1 point to Muennighoff for review.

I have not accounted for bonus points as I am not sure what was available at the time.
KennethEnevoldsen added a commit that referenced this pull request Apr 11, 2024
* docs: Added missing points for #214

Added 6x2 points for guenthermi for datasets and 1 point to Muennighoff for review.

I have not accounted for bonus points as I am not sure what was available at the time.

* docs: added point for #197

Added 2 points for rasdani and 2 bonus points for the first German retrieval task (I believe). Added one point for each of the reviewers.

* docs: added points for #116

This includes 6 points for 3 datasets to slvnwhrl, +2 for the first German clustering task; also added points for reviews.

* Added points for #134 cmteb

This includes 29 datasets (38 points) and 6x2 bonus points (12 points) for the 6 task-language combinations which were not previously included.

All the points are attributed to @staoxiao, though we can split them if needed.

We also added points for review.

* docs: Added points for #137 polish

This includes points for 12 datasets (24) across 4 tasks (8). These points are given to rafalposwiata, plus one point for review.

* docs: Added points for #27 (spanish)

These include 9 datasets (18 points) across 4 new tasks (8) for Spanish.

Points are given to violenil as the contributor, and one point for reviewers. Points can be split up if needed.

* docs: Added points for #224

Added 2 points for the dataset. I could imagine that I might have missed some bonus points as well. Also added one point for review.

* docs: Added points for #210 (korean)

This includes 3 datasets (6 points) across 1 new task (+2 bonus) for Korean. Also added 1 point for reviewers.

* Add contributor

---------

Co-authored-by: Niklas Muennighoff <[email protected]>