
Add tasks for German Embedding Evaluation #214

Merged

Conversation


@guenthermi guenthermi commented Jan 24, 2024

We added a few tasks to evaluate German embedding models:

PawsX (https://huggingface.co/datasets/paws-x): This dataset contains pairs of an original text and a paraphrased version of it. The labels indicate whether the pair is a true paraphrase or whether the two texts differ in meaning. We add this as a pair classification task to MTEB.

MIRACL (https://huggingface.co/datasets/miracl/miracl): This is originally a retrieval dataset. It contains a corpus of passages from Wikipedia articles and human-annotated relevance judgements for some of them. We only took passages for which relevance judgements are available, since those are suitable for an MTEB re-ranking task.
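The filtering step described above can be sketched roughly like this (a minimal illustration with toy data; `filter_corpus`, the dict layouts, and the variable names are assumptions for illustration, not the PR's actual code):

```python
# Hypothetical sketch: keep only corpus passages that appear in the
# human-annotated relevance judgements (qrels).
def filter_corpus(corpus: dict, qrels: dict) -> dict:
    # qrels maps query_id -> {doc_id: relevance}
    judged = {doc_id for per_query in qrels.values() for doc_id in per_query}
    return {doc_id: text for doc_id, text in corpus.items() if doc_id in judged}

corpus = {"d1": "passage one", "d2": "passage two", "d3": "passage three"}
qrels = {"q1": {"d1": 1, "d3": 0}}
print(sorted(filter_corpus(corpus, qrels)))  # ['d1', 'd3']
```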

GerDaLIR (https://github.com/lavis-nlp/GerDaLIR): GerDaLIR is a legal information retrieval dataset created from the Open Legal Data platform with the same name.

GermanDPR (https://www.deepset.ai/germanquad): GermanDPR is a passage retrieval task for question answering, i.e. it contains questions associated with passages that contain the relevant answer.

XMarket (https://xmrec.github.io/): This is originally an e-commerce product dataset which contains titles and descriptions of products associated with very fine-grained categories. We created a category-to-product-description retrieval task out of it.

GermanSTSBenchmarkSTS (https://github.com/t-systems-on-site-services-gmbh/german-STSbenchmark): This is an STS dataset created by T-Systems. It contains translations of English STS datasets.


I also updated the script in scripts/run_mteb_german.py to include those additional tasks, as well as the existing German tasks for clustering and STS.
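As a quick sketch of how the new tasks could be run together (the task identifiers below are assumptions based on the PR description; scripts/run_mteb_german.py has the authoritative list):

```python
# Hypothetical list of the newly added German task names; check
# scripts/run_mteb_german.py for the exact identifiers used in MTEB.
GERMAN_TASKS = [
    "PawsX",                  # pair classification
    "MIRACLReranking",        # re-ranking
    "GerDaLIR",               # legal retrieval
    "GermanDPR",              # QA passage retrieval
    "XMarket",                # category-to-product retrieval
    "GermanSTSBenchmarkSTS",  # STS
]

# Sketch of the evaluation loop (requires `pip install mteb` and a model):
# from mteb import MTEB
# from sentence_transformers import SentenceTransformer
# model = SentenceTransformer("intfloat/multilingual-e5-base")
# MTEB(tasks=GERMAN_TASKS).run(model, output_folder="results/de")
print(len(GERMAN_TASKS))  # 6
```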

@guenthermi guenthermi force-pushed the feat-contribute-german-tasks branch from 94e8aec to d92ba05 Compare January 24, 2024 09:55
@guenthermi guenthermi force-pushed the feat-contribute-german-tasks branch from 21371f1 to 375cfd2 Compare January 24, 2024 14:52
@guenthermi
Member Author

We also prepared a cross-lingual English/German retrieval task, WikiCLIR. However, it is pretty large (>200k queries and more than 1M documents), so the evaluation is not really feasible. We reduced the number of queries, but it still takes a lot of time, and the results might not be that comparable because of the reduced query set. Therefore, I excluded it from this PR, but if you think it is useful, we can also add it:
https://github.com/jina-ai/mteb-de/blob/main/mteb/tasks/Retrieval/WikiCLIRRetrieval.py
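The query reduction mentioned above could look something like this (a hypothetical helper with a fixed seed for reproducibility, not the code actually used for WikiCLIR):

```python
import random

def subsample_queries(queries: dict, k: int, seed: int = 42) -> dict:
    """Keep a random subset of at most k queries so the retrieval
    evaluation stays feasible (hypothetical helper)."""
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    keep = rng.sample(sorted(queries), min(k, len(queries)))
    return {qid: queries[qid] for qid in keep}

queries = {f"q{i}": f"query text {i}" for i in range(200_000)}
small = subsample_queries(queries, 1_000)
print(len(small))  # 1000
```

One caveat, as noted in the comment: scores computed on a subsampled query set are not directly comparable to scores on the full set.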

@guenthermi guenthermi marked this pull request as ready for review January 24, 2024 16:30
@guenthermi
Member Author

This also seems to be related to #183.

@yvesloy

yvesloy commented Jan 25, 2024

Do you already have some results on what embedding model works best for German atm?

@guenthermi
Member Author

Do you already have some results on what embedding model works best for German atm?

I have shared some results here:
https://twitter.com/michael_g_u/status/1747293709849227618

@guenthermi
Member Author

@Muennighoff is it possible to assign someone to review this PR?

@Muennighoff Muennighoff self-requested a review January 29, 2024 08:34
@Muennighoff
Contributor


Taking a look! Also invited you to the org so you can request reviews in the future 👍
Amazing work on this!

@Muennighoff Muennighoff left a comment
LGTM amazing work!

"category": "s2s",
"type": "PairClassification",
"eval_splits": ["test"],
"eval_langs": ["de"],

Can we also add its other langs?

Suggested change
"eval_langs": ["de"],
"eval_langs": ["de", "en", "es", "fr", "ja", "ko", "zh"],

mteb/tasks/Reranking/__init__.py (outdated, resolved)
mteb/tasks/STS/__init__.py (outdated, resolved)
mteb/tasks/Reranking/MIRACLReranking.py (outdated, resolved)
@Muennighoff
Contributor

Let's merge?

@guenthermi
Member Author

Let's merge?

It didn't work out with just adding more languages for PawsX, so I had to adjust the code in AbsTaskPairClassification to support multilingual datasets. If it looks ok to you, we can merge it.
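The multilingual pattern described here can be sketched as follows (a toy illustration; the function names and data layout are assumptions, not the actual AbsTaskPairClassification code):

```python
# Hypothetical sketch: run the monolingual evaluation once per
# language split and collect scores keyed by language code.
LANGS = ["de", "en", "es", "fr", "ja", "ko", "zh"]

def evaluate_multilingual(splits: dict, evaluate_split) -> dict:
    # splits maps language code -> that language's test data;
    # evaluate_split is the existing monolingual evaluation function.
    return {lang: evaluate_split(splits[lang]) for lang in LANGS if lang in splits}

# toy demonstration with a dummy metric
splits = {lang: [0.5, 1.0] for lang in LANGS}
scores = evaluate_multilingual(splits, lambda data: sum(data) / len(data))
print(sorted(scores))
```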

I evaluated some multilingual e5 models on the PawsX task:

| Model | de | en | es | fr | ja | ko | zh |
|---|---|---|---|---|---|---|---|
| e5-multilingual-small | 0.538849 | 0.548835 | 0.537453 | 0.556763 | 0.493585 | 0.512998 | 0.54783 |
| e5-multilingual-base | 0.544398 | 0.540388 | 0.550975 | 0.569298 | 0.497352 | 0.511751 | 0.556259 |
| e5-multilingual-large | 0.575009 | 0.595169 | 0.563678 | 0.585035 | 0.49574 | 0.516579 | 0.57002 |

@Muennighoff
Contributor

Yes, I think it's good. The same modification is also in this PR: #174

One thing that may be confusing is that there's now PawsX as a pair classification task while there's also PAWSX for STS (Chinese only). I think PawsX makes more sense for pair classification, so I suggest evaluation mostly focus on the pair classification setting.
I think we need to leave the Chinese STS one in for now as it's part of the Chinese leaderboard, but we can phase it out at some point.

@guenthermi guenthermi merged commit 9aba9ee into embeddings-benchmark:main Jan 29, 2024
3 checks passed
KennethEnevoldsen added a commit that referenced this pull request Apr 11, 2024
Added 6x2 points for guenthermi for datasets and 1 point to Muennighoff for review.

I have not accounted for bonus points as I am not sure what was available at the time.
KennethEnevoldsen added a commit that referenced this pull request Apr 11, 2024
* docs: Added missing points for #214

Added 6x2 points for guenthermi for datasets and 1 point to Muennighoff for review.

I have not accounted for bonus points as I am not sure what was available at the time.

* docs: added point for #197

Added 2 points for rasdani and 2 bonus points for the first German retrieval task (I believe). Added one point for each of the reviewers.

* docs: added points for #116

This includes 6 points for 3 datasets to slvnwhrl, +2 for the first German clustering task; also added points for reviews.

* Added points for #134 cmteb

This includes 29 datasets (38 points) and 6x2 bonus points (12 points) for the 6 task-language combinations which were not previously included.

All the points are attributed to @staoxiao, though we can split them if needed.

We also added points for review.

* docs: Added points for #137 polish

This includes points for 12 datasets (24) across 4 tasks (8). These points are given to rafalposwiata, plus one point for review.

* docs: Added points for #27 (spanish)

These include 9 datasets (18 points) across 4 new tasks (8) for Spanish.

Points are given to violenil as the contributor, and one point for reviewers. Points can be split up if needed.

* docs: Added points for #224

Added 2 points for the dataset. I could imagine that I might have missed some bonus points as well. Also added one point for review.

* docs: Added points for #210 (korean)

This includes 3 datasets (6 points) across 1 new task (+2 bonus) for Korean. Also added 1 point for reviewers.

* Add contributor

---------

Co-authored-by: Niklas Muennighoff <[email protected]>