docs: Update results #51

Merged
merged 4 commits into from
Oct 2, 2024
1 change: 1 addition & 0 deletions .gitignore
@@ -171,4 +171,5 @@ model2vec_models
results/*
counts/*
results_old/*
local/*
lightning_logs/*
44 changes: 25 additions & 19 deletions README.md
@@ -96,9 +96,10 @@ Model2vec creates a small, fast, and powerful model that outperforms other static

It works by passing a vocabulary through a sentence transformer model, then reducing the dimensionality of the resulting embeddings using PCA, and finally weighting the embeddings using Zipf weighting. During inference, we simply take the mean of all token embeddings occurring in a sentence.
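The pipeline above (embed the vocabulary, reduce with PCA, apply Zipf weighting, mean-pool at inference) can be sketched in a few lines of numpy. This is a toy illustration, not Model2vec's actual implementation: the vocabulary and dimensions are made up, the embeddings are random stand-ins for sentence-transformer outputs, and the exact weighting formula may differ from what the library uses.

```python
import numpy as np

# Toy stand-in: in Model2vec, the rows of `token_embeddings` would come from
# passing each vocabulary item through a sentence transformer.
rng = np.random.default_rng(0)
vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4, "dog": 5}
token_embeddings = rng.normal(size=(len(vocab), 16))  # (vocab_size, dim)

# 1. Reduce dimensionality with PCA (here via SVD on the centered matrix).
target_dim = 4
centered = token_embeddings - token_embeddings.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
reduced = centered @ vt[:target_dim].T  # (vocab_size, target_dim)

# 2. Zipf-style weighting: assuming the vocab is sorted by frequency, the
# token at rank r gets weight log(1 + r), down-weighting very frequent
# tokens. The exact scheme Model2vec uses may differ.
ranks = np.arange(1, len(vocab) + 1)
weighted = reduced * np.log1p(ranks)[:, None]

# 3. Inference: mean of the embeddings of the tokens present in the sentence.
def encode(tokens):
    ids = [vocab[t] for t in tokens if t in vocab]
    return weighted[ids].mean(axis=0)

print(encode(["the", "cat", "sat"]).shape)  # (4,)
```

Because everything is precomputed, encoding a sentence is just an index lookup and a mean, which is where the speed advantage over a full transformer forward pass comes from.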

Model2vec has 3 modes:
- **Output**: behaves much like a real sentence transformer, i.e., it uses a subword tokenizer and simply encodes all wordpieces in its vocab. This is really quick to create (30 seconds on a CPU), very small (30 MB in float32), but might be less performant on some tasks.
- **Vocab (word level)**: creates a word-level tokenizer and only encodes words that are in the vocabulary. This is a bit slower to create and results in a larger model, but might be more performant on some tasks. Note that this model can go out of vocabulary, which might be a problem if your domain is very noisy.
- **Vocab (subword)**: a combination of the two methods above. In this mode, you can pass your own vocabulary, but it also uses the subword vocabulary to create representations for words not in the passed vocabulary.

## Main Features

@@ -224,20 +225,24 @@ print(make_leaderboard(task_scores))
|------------------------|-------------|-----------------------------------------------------------------------|----------------|-----------------------|--------------|
| [M2V_base_glove](https://huggingface.co/minishlab/M2V_base_glove) | English | Flagship embedding model based on GloVe vocab. | GloVe | [bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5) | 102M |
| [M2V_base_output](https://huggingface.co/minishlab/M2V_base_output) | English | Flagship embedding model based on bge-base-en-v1.5 vocab. Uses a subword tokenizer. | Output | [bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5) | 7.5M |
| [M2V_base_glove_subword](https://huggingface.co/minishlab/M2V_base_glove_subword) | English | Flagship embedding model based on bge-base-en-v1.5 + glove vocab. Uses a subword tokenizer. | Output + GloVe | [bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5) | 103M |
| [M2V_multilingual_output](https://huggingface.co/minishlab/M2V_multilingual_output) | Multilingual | Flagship multilingual embedding model based on LaBSE vocab. Uses a subword tokenizer. | Output | [LaBSE](https://huggingface.co/sentence-transformers/LaBSE) | 471M |
## Results

### Main Results

Model2Vec is evaluated on MTEB, as well as two additional tasks: [PEARL](https://github.com/tigerchen52/PEARL) (a phrase representation task) and WordSim (a collection of _word_ similarity tasks). The results are shown in the table below.

| Model                  | Avg (All) | Avg (MTEB) | Class | Clust | PairClass | Rank  | Ret   | STS   | Sum   | PEARL | WordSim |
|------------------------|-----------|------------|-------|-------|-----------|-------|-------|-------|-------|-------|---------|
| all-MiniLM-L6-v2       | 56.08     | 56.09      | 62.62 | 41.94 | 82.37     | 58.04 | 41.95 | 78.90 | 30.81 | 60.83 | 49.91   |
| M2V_base_glove_subword | 49.06     | 46.69      | 61.27 | 30.03 | 74.71     | 49.15 | 27.16 | 69.09 | 30.08 | 56.82 | 57.99   |
| M2V_base_glove         | 48.58     | 47.60      | 61.35 | 30.52 | 75.34     | 48.50 | 29.26 | 70.31 | 31.50 | 50.28 | 54.29   |
| M2V_base_output        | 46.79     | 45.34      | 61.25 | 25.58 | 74.90     | 47.63 | 26.14 | 68.58 | 29.20 | 54.02 | 49.18   |
| GloVe_300d             | 42.84     | 42.36      | 57.31 | 27.66 | 72.48     | 43.30 | 22.78 | 61.90 | 28.81 | 45.65 | 43.05   |
| WL256*                 | 48.88     | 49.36      | 58.98 | 33.34 | 74.00     | 52.03 | 33.12 | 73.34 | 29.05 | 48.81 | 45.16   |


<details>
<summary> Task Abbreviations </summary>
@@ -259,21 +264,22 @@ For readability, the MTEB task names are abbreviated as follows:

In addition to the MTEB evaluation, we evaluate Model2Vec on a number of classification datasets. These are used as additional evidence to avoid overfitting to the MTEB dataset and to benchmark the speed of the model. The results are shown in the table below.
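As a sketch of how such a benchmark works, the toy example below embeds two synthetic "classes" of sentences (random vectors around different centroids standing in for real sentence embeddings) and fits a simple nearest-centroid classifier. The actual evaluation presumably uses the real datasets (SST2, IMDB, TREC, AG News) and a stronger classifier; this only illustrates the embed-then-classify recipe.

```python
import numpy as np

# Synthetic stand-in for sentence embeddings from two classes: unit-variance
# clusters around centroids at 0.0 and 2.0 in an 8-dimensional space.
rng = np.random.default_rng(1)
dim, n_per_class = 8, 50
X = np.vstack([rng.normal(loc=0.0, size=(n_per_class, dim)),
               rng.normal(loc=2.0, size=(n_per_class, dim))])
y = np.array([0] * n_per_class + [1] * n_per_class)

def nearest_centroid_fit(X, y):
    # One centroid per class: the mean embedding of that class's examples.
    return np.stack([X[y == c].mean(axis=0) for c in np.unique(y)])

def nearest_centroid_predict(centroids, X):
    # Assign each embedding to the class of the closest centroid.
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
    return d.argmin(axis=1)

centroids = nearest_centroid_fit(X, y)
acc = (nearest_centroid_predict(centroids, X) == y).mean()
print(f"accuracy: {acc:.2f}")
```

Because the classifier on top is cheap, end-to-end throughput on such benchmarks is dominated by embedding speed, which is what the sentences-per-second comparison below measures.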

| Model | Average | SST2 | IMDB | TREC | AG News |
|:-----------------------|:-------:|:------:|:-----:|:------:|:-------:|
| bge-base-en-v1.5 | 90.00 | 91.54 | 91.88 | 85.16 | 91.45 |
| all-MiniLM-L6-v2 | 84.10 | 83.95 | 81.36 | 81.31 | 89.77 |
| M2V_base_output | 82.23 | 80.92 | 84.56 | 75.27 | 88.17 |
| M2V_base_glove_subword | 81.95 | 82.84 | 85.96 | 70.51 | 88.49 |
| M2V_base_glove | 80.76 | 83.07 | 85.24 | 66.12 | 88.61 |
| WL256 | 78.48 | 76.88 | 80.12 | 69.23 | 87.68 |
| GloVe_300d | 77.77 | 81.68 | 84.00 | 55.67 | 89.71 |

As can be seen, Model2Vec models outperform the WL256 model on all classification tasks and the GloVe model on average, and are competitive with the all-MiniLM-L6-v2 model while being much faster.

The figure below shows the relationship between the number of sentences per second and the average classification score. The circle sizes correspond to the number of parameters in the models (larger = more parameters).
This plot shows that the Model2Vec models are much faster than the other models, while still being competitive in terms of classification performance with the [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) model.

| ![Speed vs. accuracy](assets/images/speed_vs_accuracy_v2.png) |
|:--:|
|*Figure: The average accuracy over all classification datasets plotted against sentences per second. The circle size indicates model size.*|

Binary file added assets/images/speed_vs_accuracy_v2.png