Add tasks for German Embedding Evaluation #214
Conversation
Co-authored-by: Saba Sturua <[email protected]>
Fixes mismatch between description and HuggingFace dataset
We also prepared a cross-lingual English/German retrieval task: WikiCLIR. However, it is pretty large (>200k queries and more than 1M documents), so the evaluation is not really feasible. We reduced the number of queries, but it still takes a lot of time, and the results might not be that comparable because of the reduced query set. Therefore, I excluded it from this PR, but if you think it is useful, we can also add it.
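A reduced query set like the one described above can be drawn reproducibly with a fixed seed, so repeated runs at least stay comparable with each other. A minimal sketch (function name and data are illustrative, not from the actual WikiCLIR preprocessing):

```python
import random

def subsample_queries(query_ids, k, seed=42):
    # A fixed seed yields a deterministic subset, so repeated runs are
    # comparable with each other (though not with full-set results).
    rng = random.Random(seed)
    return sorted(rng.sample(sorted(query_ids), k))

# Dummy query IDs, not real WikiCLIR data:
all_queries = {f"q{i}" for i in range(5000)}
subset = subsample_queries(all_queries, k=100)
```

Sorting before sampling makes the result independent of set iteration order, which varies between Python processes.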
Seems also to be related to #183
Do you already have some results on which embedding model works best for German at the moment?
I have shared some results here:
@Muennighoff is it possible to assign someone to review this PR?
Taking a look! Also invited you to the org so you can request reviews in the future 👍
LGTM amazing work!
"category": "s2s",
"type": "PairClassification",
"eval_splits": ["test"],
"eval_langs": ["de"],
Can we also add its other langs?
- "eval_langs": ["de"],
+ "eval_langs": ["de", "en", "es", "fr", "ja", "ko", "zh"],
Co-authored-by: Niklas Muennighoff <[email protected]>
…a-ai/mteb-de into feat-contribute-german-tasks
Let's merge?
It didn't work out with just adding more languages for PawsX, so I had to adjust the code in AbsTaskPairClassification to support multilingual datasets. If it looks ok to you, we can merge it. I evaluated some multilingual e5 models on the PawsX task:
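The multilingual adjustment boils down to evaluating each language subset separately and reporting scores keyed by language code. A rough sketch of that pattern (illustrative, not the actual AbsTaskPairClassification code):

```python
def evaluate_multilingual(task_data, evaluate_fn):
    # Run the evaluation once per language subset and collect the
    # scores in a dict keyed by language code.
    return {lang: evaluate_fn(split) for lang, split in task_data.items()}

# Dummy per-language splits: lists of (label, similarity) pairs,
# where label 1 means the pair is a true paraphrase.
data = {
    "de": [(1, 0.9), (0, 0.2), (1, 0.8)],
    "en": [(1, 0.95), (0, 0.4)],
}

def accuracy_at_half(pairs, threshold=0.5):
    correct = sum((sim >= threshold) == bool(label) for label, sim in pairs)
    return correct / len(pairs)

scores = evaluate_multilingual(data, accuracy_at_half)
```

The per-language dict mirrors how multilingual MTEB results are typically reported, with one entry per entry in `eval_langs`.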
Yes, I think it's good; the same modification is also in this PR: #174. One thing that may be confusing is that there's now PawsX as PairCLF while there's also PAWSX for STS (Chinese only). I think PawsX makes more sense for PairCLF, so I suggest evaluation to mostly focus on the PairCLF setting.
Added 6x2 points for guenthermi for datasets and 1 point to Muennighoff for review. I have not accounted for bonus points as I am not sure what was available at the time.
* docs: Added missing points for #214
  Added 6x2 points for guenthermi for datasets and 1 point to Muennighoff for review. I have not accounted for bonus points as I am not sure what was available at the time.
* docs: added point for #197
  Added 2 points for rasdani and 2 bonus points for the first German retrieval (I believe). Added one point for each of the reviewers.
* docs: added points for #116
  This includes 6 points for 3 datasets to slvnwhrl, +2 for the first German clustering task. Also added points for reviews.
* Added points for #134 cmteb
  This includes 29 datasets (38 points) and 6x2 bonus points (12 points) for the 6 task-by-language combinations which were not previously included. All the points are attributed to @staoxiao, though we can split them if needed. We also added points for review.
* docs: Added points for #137 polish
  This includes points for 12 datasets (24) across 4 tasks (8). These points are given to rafalposwiata, and then one point for review.
* docs: Added points for #27 (spanish)
  These include 9 datasets (18 points) across 4 new tasks (8) for Spanish. Points are given to violenil as the contributor, and one point for reviewers. Points can be split up if needed.
* docs: Added points for #224
  Added 2 points for the dataset. I could imagine that I might have missed some bonus points as well. Also added one point for review.
* docs: Added points for #210 (korean)
  This includes 3 datasets (6 points) across 1 new task (+2 bonus) for Korean. Also added 1 point for reviewers.
* Add contributor

Co-authored-by: Niklas Muennighoff <[email protected]>
We added a few tasks to evaluate German embedding models:
PawsX (https://huggingface.co/datasets/paws-x): This dataset contains pairs of an original text and a paraphrased version of it. The labels indicate whether the pair is a real paraphrase or whether the content has a different meaning. We add this as a pair classification task to MTEB.
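Pair-classification scores in this setting are typically computed by embedding both texts, taking their cosine similarity, and reporting accuracy at the best similarity threshold. A self-contained sketch of that metric (dummy similarities, not actual PawsX scores or the exact MTEB implementation):

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def best_threshold_accuracy(sims, labels):
    # Pair-classification accuracy is usually reported at the best
    # similarity threshold; scanning the observed similarities as
    # candidate thresholds is sufficient for that.
    return max(
        sum((s >= t) == bool(y) for s, y in zip(sims, labels)) / len(labels)
        for t in sims
    )

# Dummy sentence-pair similarities and paraphrase labels (1 = paraphrase):
sims = [0.91, 0.12, 0.83, 0.40]
labels = [1, 0, 1, 0]
acc = best_threshold_accuracy(sims, labels)
```

Scanning only the observed similarities works because accuracy can only change at those values.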
MIRACL (https://huggingface.co/datasets/miracl/miracl): This is originally a retrieval dataset. It contains a corpus of passages from Wikipedia articles and human-annotated relevance judgements for some of them. We only took passages where relevance judgements are available, since those are suitable for an MTEB re-ranking task.
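Turning a retrieval dataset into a re-ranking task amounts to keeping only the judged passages per query and splitting them into positives and negatives. A minimal sketch of that conversion (the function name and the dummy data schema are illustrative, not the real MIRACL format):

```python
def to_reranking_samples(queries, qrels, corpus):
    # Keep only queries that have judged passages; split those passages
    # into positives (relevance > 0) and negatives (relevance == 0).
    samples = []
    for qid, judgements in qrels.items():
        positives = [corpus[pid] for pid, rel in judgements.items() if rel > 0]
        negatives = [corpus[pid] for pid, rel in judgements.items() if rel == 0]
        if positives:
            samples.append({"query": queries[qid],
                            "positive": positives,
                            "negative": negatives})
    return samples

# Dummy data, not the real MIRACL schema:
queries = {"q1": "Wer erfand das Telefon?", "q2": "unjudged query"}
corpus = {"p1": "Bell erfand das Telefon.", "p2": "Ein anderes Thema."}
qrels = {"q1": {"p1": 1, "p2": 0}}
samples = to_reranking_samples(queries, qrels, corpus)
```

Queries without any judged positive passage are dropped, since a re-ranking sample needs at least one known-relevant candidate.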
GerDaLIR (https://github.com/lavis-nlp/GerDaLIR): GerDaLIR is a legal information retrieval dataset created from the Open Legal Data platform with the same name.
GermanDPR (https://www.deepset.ai/germanquad): GermanDPR is a passage retrieval task for question answering, i.e. it contains questions associated with passages that contain the relevant answer.
XMarket (https://xmrec.github.io/): This is originally an e-commerce product dataset which contains titles and descriptions of products associated with very fine-grained categories. We created a category-to-product-description retrieval task out of it.
GermanSTSBenchmarkSTS (https://github.com/t-systems-on-site-services-gmbh/german-STSbenchmark): This dataset is an STS dataset created by T-Systems. It contains translations of English STS datasets.
I also updated the script in
scripts/run_mteb_german.py
to include those additional tasks, as well as the existing German tasks for clustering and STS.
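The task list assembled by such a script can be sketched as follows. This is a hypothetical outline, not the actual contents of scripts/run_mteb_german.py; the new task names come from the descriptions above, the registered MTEB names may differ slightly, and the placeholders stand in for the pre-existing clustering/STS tasks:

```python
# Tasks added in this PR (names as described above; the registered
# MTEB task names may differ slightly):
NEW_GERMAN_TASKS = [
    "PawsX",
    "MIRACL",
    "GerDaLIR",
    "GermanDPR",
    "XMarket",
    "GermanSTSBenchmarkSTS",
]

# Placeholders for the pre-existing German clustering and STS tasks:
EXISTING_GERMAN_TASKS = ["<existing clustering tasks>", "<existing STS tasks>"]

# The benchmark script would run a model over the combined list.
TASK_LIST_GERMAN = NEW_GERMAN_TASKS + EXISTING_GERMAN_TASKS
```

Keeping the new and pre-existing tasks in separate lists makes it easy to benchmark only the additions from this PR.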