Skip to content

Latest commit

 

History

History
44 lines (35 loc) · 2.6 KB

File metadata and controls

44 lines (35 loc) · 2.6 KB

Mixture-of-Experts-Sentence-Similarity

This repository serves as the official code base for the paper Contrastive Learning and Mixture of Experts Enables Precise Vector Embeddings

Logan Hallee, Rohan Kapur, Arjun Patel, Jason P. Gleghorn, and Bohdan Khomtchouk

Preprint: Contrastive Learning and Mixture of Experts Enables Precise Vector Embeddings

Peer review: preparing submission

Data and models

Huggingface

Main findings

  • Extending BERT models with N experts copied from their MLP section is highly effective for fine-tuning on downstream tasks, including multitask or multidomain data.
  • N experts are almost as effective as N individual models trained on N domains for sentence similarity tasks, we have tried up to five.
  • Enforced routing of experts can be handled with added special tokens for sentence-wise routing or designed token type IDs for token-wise routing, even when the router is a single linear layer. Enforced routing can also be accomplished by passing a list of desired indices. Mutual information based loss with top-k outputs also works well to correlate expert activation with specific types of data.
  • Cocitation networks are highly effective for gathering similar niche papers.
  • Using dot product with a learned temperature may be a more effective contrastive loss than standard Multiple Negatives Ranking loss.

Surprising findings

  • Small BERT models are not more effective with N experts, perhaps due to small shared attention layers. Our data supports that this threshold may be roughly 100 million parameters.
  • MoE extension with a SINGLE MoE layer gets 85% of the full MoE extension on average.
  • Token-wise routing is better than sentence-wise routing in general, even for sentence-wise tasks.

Applications of this work

  • Better vector databases / retrieval augmentation
  • Extending any sufficiently large BERT model with N experts for N tasks.
  • Vocabulary extension of BERT models with N experts for N vocabularies.

Coming soon

Please cite

@article{hallee2024contrastive,
      title={Contrastive Learning and Mixture of Experts Enables Precise Vector Embeddings}, 
      author={Logan Hallee and Rohan Kapur and Arjun Patel and Jason P. Gleghorn and Bohdan Khomtchouk},
      year={2024},
      eprint={2401.15713},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}

and upvote on Huggingface!