support for Matryoshka embedding models #932

mattf · 2025-02-03T16:22:27Z

🚀 Describe the new functionality needed

add a dimensions parameter to /v1/inference/embeddings.

💡 Why is this needed? What if we don't build it?

Matryoshka embedding models allow the user to control the number of embedding dimensions. for instance, an embedding model that can produce 8192 dimensions may be able to constrain its output to 1024 or 512 dimensions.

this reduction in dimension impacts storage costs, e.g. storing 512 dims instead of 8192 is a 16x reduction in storage cost.
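A minimal sketch of the arithmetic above, plus what dimension reduction looks like on the client side. It assumes float32 storage (4 bytes per value) and relies on the fact that Matryoshka models are trained so that a prefix of the full vector is itself a usable embedding. `truncate_embedding` and `storage_bytes` are hypothetical helper names for illustration, not part of any llama-stack API; the request in this issue is for the server to do this via a parameter.

```python
import math


def truncate_embedding(vec, dims):
    """Keep the first `dims` components and L2-renormalize them.

    Only meaningful for Matryoshka-style models; for other models a
    prefix of the vector is not expected to be a good embedding.
    """
    if not 0 < dims <= len(vec):
        raise ValueError(f"dims must be in 1..{len(vec)}, got {dims}")
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return list(head)  # nothing to normalize
    return [x / norm for x in head]


def storage_bytes(num_vectors, dims, bytes_per_value=4):
    """Raw vector storage, assuming float32 values (an assumption)."""
    return num_vectors * dims * bytes_per_value


full = storage_bytes(1_000_000, 8192)
small = storage_bytes(1_000_000, 512)
print(full // small)  # 16
```

The ratio depends only on the dimension counts, so the 16x figure holds regardless of the number of vectors or the value width, as long as both are the same on each side.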

dimension reduction often lowers application accuracy.

allowing applications to find their own storage vs. accuracy trade-off is an important feature.

Other thoughts

No response

ashwinb · 2025-02-06T20:41:45Z

This also makes sense. I will update the API to account for this parameter.

# What does this PR do?

add /v1/inference/embeddings implementation to NVIDIA provider

**open topics** -

- *asymmetric models*. NeMo Retriever includes asymmetric models, which are models that embed differently depending on whether the input is destined for storage or lookup against storage. the /v1/inference/embeddings api does not allow the user to indicate the type of embedding to perform. see #934
- *truncation*. embedding models typically have a limited context window, e.g. 1024 tokens is common though newer models have 8k windows. when the input is larger than this window the endpoint cannot perform its designed function. two options:
  0. return an error so the user can reduce the input size and retry;
  1. perform truncation for the user and proceed (common strategies are left or right truncation).

  many users encounter context window size limits and will struggle to write reliable programs. this struggle is especially acute without access to the model's tokenizer. the /v1/inference/embeddings api does not allow the user to delegate truncation policy. see #933
- *dimensions*. "Matryoshka" embedding models are available. they allow users to control the number of embedding dimensions the model produces. this is a critical feature for managing storage constraints. embeddings of 1024 dimensions that achieve 95% recall for an application may not be worth the storage cost if 512 dimensions can achieve 93% recall. controlling embedding dimensions allows applications to determine their recall and storage tradeoffs. the /v1/inference/embeddings api does not allow the user to control the output dimensions. see #932

## Test Plan

- `llama stack run llama_stack/templates/nvidia/run.yaml`
- `LLAMA_STACK_BASE_URL=http://localhost:8321 pytest -v tests/client-sdk/inference/test_embedding.py --embedding-model baai/bge-m3`

## Sources

Please link relevant resources if necessary.

## Before submitting

- [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- [x] Ran pre-commit to handle lint / formatting issues.
- [x] Read the [contributor guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md), Pull Request section?
- [ ] Updated relevant documentation.
- [x] Wrote necessary unit or integration tests.

---------

Co-authored-by: Ashwin Bharambe <[email protected]>
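The two truncation options from the open topics above (return an error, or truncate and proceed) can be sketched as a single policy switch. This is an illustration only: `apply_truncation` and the policy names `"error"`, `"left"`, and `"right"` are hypothetical, not the API that was eventually adopted. Here `"right"` is assumed to mean dropping tokens from the end and `"left"` dropping tokens from the start, and the input is assumed to be already tokenized.

```python
def apply_truncation(tokens, max_tokens, policy="error"):
    """Fit `tokens` into a context window of `max_tokens`.

    policy="error": refuse oversized input so the caller can shrink it.
    policy="right": drop tokens from the end, keeping the beginning.
    policy="left":  drop tokens from the start, keeping the end.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if len(tokens) <= max_tokens:
        return list(tokens)  # already fits, nothing to do
    if policy == "error":
        raise ValueError(
            f"input has {len(tokens)} tokens, window is {max_tokens}"
        )
    if policy == "right":
        return list(tokens[:max_tokens])
    if policy == "left":
        return list(tokens[-max_tokens:])
    raise ValueError(f"unknown truncation policy: {policy!r}")


tokens = list(range(10))  # stand-in for real token ids
print(apply_truncation(tokens, 4, "right"))  # [0, 1, 2, 3]
print(apply_truncation(tokens, 4, "left"))   # [6, 7, 8, 9]
```

Doing this on the server side matters because, as noted above, the client often has no access to the model's tokenizer and so cannot count tokens reliably.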

We need to support:

- asymmetric embedding models (#934)
- truncation policies (#933)
- varying dimensional output (#932)

## Test Plan

```bash
$ cd llama_stack/providers/tests/inference
$ pytest -s -v -k fireworks test_embeddings.py \
   --inference-model nomic-ai/nomic-embed-text-v1.5 --env EMBEDDING_DIMENSION=784
$ pytest -s -v -k together test_embeddings.py \
   --inference-model togethercomputer/m2-bert-80M-8k-retrieval --env EMBEDDING_DIMENSION=784
$ pytest -s -v -k ollama test_embeddings.py \
   --inference-model all-minilm:latest --env EMBEDDING_DIMENSION=784
```

mattf mentioned this issue Feb 3, 2025

feat(providers): add NVIDIA Inference embedding provider and tests #935

Merged

5 tasks

ashwinb self-assigned this Feb 6, 2025

hardikjshah added this to the v0.1.4 milestone Feb 12, 2025

ashwinb linked a pull request Feb 21, 2025 that will close this issue

feat(api): Add options for supporting various embedding models #1192

Merged

ashwinb mentioned this issue Feb 21, 2025

feat(api): Add options for supporting various embedding models #1192

Merged

ashwinb closed this as completed in #1192 Feb 21, 2025
