Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add option to use MultipleNegativesRankingLoss for EmbeddingRetriever training with sentence-transformers #3164

Merged
merged 4 commits into from
Sep 12, 2022

Conversation

bglearning
Copy link
Contributor

@bglearning bglearning commented Sep 5, 2022

Add option to use MultipleNegativesRankingLoss for EmbeddingRetriever training with sentence-transformers.

Related Issues

Proposed Changes:

Expose an option train_loss on EmbeddingRetriever.train. It is propagated to _SentenceTransformersEmbeddingEncoder.train and determines the loss that's used. Also influences the training data format that can be used/is expected.

So the API is:

embedding_retriever = EmbeddingRetriever(...)
training_data = [{"question": ..., "pos_doc": ...}, ...]
# Loss can be specified when calling `train`. Options currently: 'mnrl' or 'margin_mse'
embedding_retriever.train(training_data=training_data, train_loss='mnrl') 

How did you test it?

A Colab notebook with a sample run. Qualitative comparison against the baseline model.

Notes for the reviewer

Some possible aspects up for discussion

  • I exposed the loss option as a string.
    • Another option could be to directly accept sentence-transformers losses (might even be able to say can use any loss in there as long as the data format is correct but would be a big jump). Plus might tie us to keep doing so (which may or may not be a problem). With the string we have the flexibility to later internally swap things out.
    • We could also construct our own enums or probably better yet loss classes which are thin wrappers around the sentence-transformers losses. But probably not needed at this stage.
  • Slightly tricky to provide users with list of possible losses. I guess best place is in the EmbeddingRetriever docstring, which I went with.
  • Any unit or integration tests needed? Any other experiments needed?

Checklist

Add option to use MultipleNegativesRankingLoss for EmbeddingRetriever
training with sentence-transformers
@bglearning bglearning force-pushed the 3136-multiple-negatives-ranking-loss branch from 86b32a1 to b72788d Compare September 6, 2022 12:52
@bglearning bglearning marked this pull request as ready for review September 6, 2022 17:52
@bglearning bglearning requested review from a team as code owners September 6, 2022 17:52
@bglearning bglearning requested review from bogdankostic and vblagoje and removed request for a team September 6, 2022 17:52
haystack/nodes/retriever/_embedding_encoder.py Outdated Show resolved Hide resolved
haystack/nodes/retriever/dense.py Outdated Show resolved Hide resolved
@bglearning bglearning force-pushed the 3136-multiple-negatives-ranking-loss branch from 177cc24 to 82af1be Compare September 7, 2022 12:05
Copy link
Contributor

@bogdankostic bogdankostic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@bglearning bglearning merged commit 21aedc6 into main Sep 12, 2022
@bglearning bglearning deleted the 3136-multiple-negatives-ranking-loss branch September 12, 2022 07:38
brandenchan pushed a commit that referenced this pull request Sep 21, 2022
…riever training with sentence-transformers (#3164)

* Add option to use MultipleNegativesRankingLoss

Add option to use MultipleNegativesRankingLoss for EmbeddingRetriever
training with sentence-transformers

* Move out losses into separate retriever/_losses.py module

* Remove unused import in retriever/_losses.py

* Apply documentation suggestions from code review

Co-authored-by: Agnieszka Marzec <[email protected]>

Co-authored-by: Agnieszka Marzec <[email protected]>
@vblagoje
Copy link
Member

Hey @bglearning, I just realized that mnrl is now the default loss for GPL rather than margin_mse! Why make a change if the original paper used margin_mse?

@bglearning
Copy link
Contributor Author

bglearning commented Oct 28, 2022

Hi @vblagoje,

The default change on this EmbeddingRetriever is primarily motivated to support the more straightforward use-case of directly training the retriever without the pseudo-labeling (through GPL) or some other intermediate process to come up with the scores.

So line of reasoning:

  • User has some data (just query and pos-doc) and they want to start training directly (default case). mnrl (the "simpler" loss, margin_mse needs neg-doc and scores)
  • User also wants to do GPL (or some other method) to come up with the pseudo-labels before training the model. User sets the loss to margin_mse.

But ya, realized I missed updating the GPL tutorial to explicitly set the loss to margin_mse which is definitely an issue.
And also generally should have highlighted the change in default as a big change as it's an API change.

We could update the GPL tutorial or update the default. Either way seems okay to me.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add MultipleNegativesRankingLoss for EmbeddingRetriever training with sentence-transformers
4 participants