setfit usage #83
-
Hi @davidberenstein1957!

This is something we also tried previously, and we came to the same conclusion. SetFit works much better with very little data since it essentially uses the strength of the fine-tuned (Transformer) model. With Model2Vec, you don't have that luxury anymore since the model is fully static.

However, the m2v encoder should be much faster, which is the main draw. I tested it with your code example by benchmarking the time for encoding the entire training set (timing `embedding = model.model_body.encode(dataset["train"]["sentence"])` before and after replacing the encode with Model2Vec), which gives 23.9 seconds for the SetFit model vs. 0.53 seconds for the Model2Vec model, so roughly a 45x increase in speed.

In this case, you essentially trade off ~14 percentage points of accuracy (68% to 54%) for a 45x speedup. The exact trade-off differs per use case and task, but we believe the speedup is substantial enough to enable use cases that are not possible with existing models, while still maintaining decent performance.
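For reference, here is a minimal sketch of that timing comparison. It assumes SST-2 as the dataset (matching the `dataset["train"]["sentence"]` column in the snippet above) and uses illustrative checkpoints; the exact models used in the benchmark are not specified in this thread.

```python
import time

from datasets import load_dataset
from model2vec import StaticModel
from setfit import SetFitModel

# Assumed dataset: SST-2, which has a "sentence" column matching the snippet above.
dataset = load_dataset("sst2")
sentences = dataset["train"]["sentence"]

# Time the SetFit body (a SentenceTransformer) on the full training set.
setfit_model = SetFitModel.from_pretrained(
    "sentence-transformers/paraphrase-mpnet-base-v2"  # illustrative checkpoint
)
start = time.perf_counter()
setfit_embeddings = setfit_model.model_body.encode(sentences)
print(f"SetFit encode: {time.perf_counter() - start:.2f}s")

# Time a static Model2Vec model on the same sentences.
m2v_model = StaticModel.from_pretrained("minishlab/potion-base-8M")  # illustrative
start = time.perf_counter()
m2v_embeddings = m2v_model.encode(sentences)
print(f"Model2Vec encode: {time.perf_counter() - start:.2f}s")
```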
-
Yes, I agree it could be worth it. Just to note: I initially fine-tune the model, then convert it to a static model, and on top of the static model I then fine-tune the classification head, which I would have expected to work slightly better for a 2-label use case (I did not look into class imbalance).
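A minimal sketch of that pipeline, assuming the fine-tuned SetFit body was already saved to a local path and using a scikit-learn logistic regression as the classification head; the checkpoint path and dataset here are hypothetical, not the exact setup from this thread:

```python
from datasets import load_dataset
from model2vec.distill import distill
from sklearn.linear_model import LogisticRegression

# 1. Distill the fine-tuned transformer body into a static Model2Vec model.
#    "path/to/finetuned-setfit-body" is a hypothetical local checkpoint.
static_model = distill(model_name="path/to/finetuned-setfit-body", pca_dims=256)

# 2. Encode the training set with the static model.
dataset = load_dataset("sst2")  # assumed dataset with a "sentence" column
X_train = static_model.encode(dataset["train"]["sentence"])
y_train = dataset["train"]["label"]

# 3. Fine-tune only the classification head on top of the static embeddings.
head = LogisticRegression(max_iter=1000).fit(X_train, y_train)
```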
-
I played around a bit, but sadly the performance is too poor.