feat: Add option to use MultipleNegativesRankingLoss for EmbeddingRetriever training with sentence-transformers #3164
Conversation
LGTM
Hey @bglearning, I just realized that …
Hi @vblagoje,

The change of default on this EmbeddingRetriever is primarily motivated by supporting the more straightforward use case of directly training the retriever, without pseudo-labeling (through GPL) or some other intermediate process to come up with the scores. That was the line of reasoning.

But yeah, I realized I missed updating the GPL tutorial to explicitly set the loss. We could update the GPL tutorial or update the default; either way seems okay to me.
Add option to use MultipleNegativesRankingLoss for EmbeddingRetriever training with sentence-transformers.
Related Issues
Proposed Changes:
Expose an option `train_loss` on `EmbeddingRetriever.train`. It is propagated to `_SentenceTransformersEmbeddingEncoder.train` and determines the loss that is used. It also influences the training data format that can be used/is expected. So the API is:
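A minimal sketch of how the proposed API could look, based only on the description above. The class and method names (`EmbeddingRetriever`, `_SentenceTransformersEmbeddingEncoder`) come from the PR text; the loss names and expected data fields are assumptions for illustration, not Haystack's actual implementation.

```python
# Hypothetical sketch of the proposed API; bodies are stand-ins, not real Haystack code.

# Assumed mapping from loss name to the training-data fields it expects
# (illustrative; the real mapping lives in retriever/_losses.py).
SUPPORTED_LOSSES = {
    "mnrl": ("question", "pos_doc"),
    "margin_mse": ("question", "pos_doc", "neg_doc", "score"),
}


class _SentenceTransformersEmbeddingEncoder:
    def train(self, training_data, train_loss="mnrl", **kwargs):
        # Validate the loss name, then (in the real code) build the
        # corresponding sentence-transformers loss and run training.
        if train_loss not in SUPPORTED_LOSSES:
            raise ValueError(f"Unsupported train_loss: {train_loss!r}")
        return train_loss


class EmbeddingRetriever:
    def __init__(self):
        self.embedding_encoder = _SentenceTransformersEmbeddingEncoder()

    def train(self, training_data, train_loss="mnrl", **kwargs):
        # train_loss is simply propagated to the underlying encoder.
        return self.embedding_encoder.train(
            training_data, train_loss=train_loss, **kwargs
        )


retriever = EmbeddingRetriever()
used_loss = retriever.train(training_data=[], train_loss="mnrl")
print(used_loss)  # mnrl
```

Keeping `train_loss` as a string (rather than accepting loss objects) keeps the public API small while still allowing the internal loss implementation to change later.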
How did you test it?
A Colab notebook with a sample run. Qualitative comparison against the baseline model.
Notes for the reviewer
Some possible aspects up for discussion:

- Whether to accept `sentence-transformers` losses directly instead of a string (might even be able to say any loss in there can be used as long as the data format is correct, but that would be a big jump). Plus it might tie us to keep doing so (which may or may not be a problem). With the string we have the flexibility to later swap things out internally.
- Whether to document the mapping to the underlying `sentence-transformers` losses. But probably not needed at this stage.
- Where to document the option: the `EmbeddingRetriever` docstring, which I went with.

Checklist