This module leverages chromaDB to cache embeddings so they can be reused. It stores embeddings in chromaDB and fetches them back on later calls.
In a nutshell:
- it creates a vector store with the model name as the collection name
- it encodes any sentence with the specified embedding function through the encode() method
- among the sentences passed to encode(), it calls the model (or the API) only for those whose embeddings are not already available in chromaDB
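The caching behaviour described above can be pictured with a minimal sketch. It is not the ChromaCache implementation: a plain dict stands in for the chromaDB collection, and `encode_with_cache` and `fake_embed` are hypothetical names introduced only for illustration.

```python
from typing import Callable

Vector = list[float]


def encode_with_cache(
    sentences: list[str],
    cache: dict[str, Vector],
    embed: Callable[[list[str]], list[Vector]],
) -> list[Vector]:
    """Return one embedding per sentence, calling `embed` only for cache misses."""
    # Deduplicate while keeping order, so each missing sentence is embedded once.
    missing = list(dict.fromkeys(s for s in sentences if s not in cache))
    if missing:
        for sentence, vector in zip(missing, embed(missing)):
            cache[sentence] = vector
    return [cache[s] for s in sentences]


calls: list[list[str]] = []


def fake_embed(batch: list[str]) -> list[Vector]:
    """Hypothetical stand-in for a model or API call; records what it was asked for."""
    calls.append(list(batch))
    return [[float(len(s))] for s in batch]


store: dict[str, Vector] = {}  # assumption: a dict plays the role of the collection
encode_with_cache(["my sentence", "my other sentence"], store, fake_embed)
encode_with_cache(["my sentence", "a new sentence"], store, fake_embed)
print(calls)  # [['my sentence', 'my other sentence'], ['a new sentence']]
```

On the second call only "a new sentence" reaches the embedding function, because "my sentence" is already in the store.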
Here are the installation steps:
- If you haven't already, clone this repository.
- Activate your Python environment (or shell).
- From inside this repository, run
pip install .
to install this repo as a package.
- Still inside this repository, run
pip install fairseq git+https://github.com/liyaodev/fairseq.git
TODO: Add instructions for installing as a PyPI package
from chromacache import ChromaCache
from chromacache.embedding_functions import OpenAIEmbeddingFunction

MODEL_NAME = "text-embedding-3-small"  # or any embedding model name
emb_function = OpenAIEmbeddingFunction(MODEL_NAME)  # or any embedding function available
cc = ChromaCache(emb_function)  # creates a collection in chroma
embeddings = cc.encode(["my sentence", "my other sentence"])
ChromaCache supports extra arguments:
- batch_size: int = 32, the batch size at which sentences are processed. If the model provider's API raises an error because the request is too large, try decreasing this value
- save_embeddings: bool = True, whether the embeddings should be saved
- path_to_chromadb: str = "./Chroma", where the chromaDB data should be stored
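To make the effect of batch_size concrete, here is a small sketch of how a list of sentences can be split into batches before being sent to a provider. The helper `iter_batches` is a hypothetical name used for illustration and is not taken from this package.

```python
from typing import Iterator


def iter_batches(sentences: list[str], batch_size: int = 32) -> Iterator[list[str]]:
    """Yield consecutive slices of at most `batch_size` sentences."""
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(sentences), batch_size):
        yield sentences[start:start + batch_size]


sentences = [f"sentence {i}" for i in range(10)]
batch_lengths = [len(batch) for batch in iter_batches(sentences, batch_size=4)]
print(batch_lengths)  # [4, 4, 2]
```

A smaller batch_size means more requests, each carrying less text, which is why lowering it can help when a provider rejects oversized requests.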
All embedding functions also support the max_token_length argument. It can be used to crop each sentence to the maximum token count supported by the model provider's API.
Example usage:
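from chromacache import ChromaCache
from chromacache.embedding_functions import MistralAIEmbeddingFunction

emb_function = MistralAIEmbeddingFunction("mistral-embed", max_token_length=4000)
cc = ChromaCache(
    emb_function,
    batch_size=4,
    save_embeddings=False,
    path_to_chromadb="./my_favorite_directory",
)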
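The cropping controlled by max_token_length can be sketched as follows. This is only an approximation under a stated assumption: tokens are taken to be whitespace-separated words, whereas a real provider uses its own tokenizer. `crop_to_max_tokens` is a hypothetical helper, not part of this package.

```python
def crop_to_max_tokens(sentence: str, max_token_length: int) -> str:
    """Keep at most `max_token_length` whitespace-separated tokens."""
    # Assumption: splitting on whitespace stands in for the provider's tokenizer.
    tokens = sentence.split()
    return " ".join(tokens[:max_token_length])


cropped = crop_to_max_tokens("thus spake zarathustra to the people", max_token_length=3)
print(cropped)  # thus spake zarathustra
```

Sentences shorter than the limit pass through with their words unchanged, although this sketch collapses runs of whitespace into single spaces.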
Moreover, all capabilities of chromaDB collections can be used directly through the collection attribute of the ChromaCache.
For example, to query the collection for the 5 closest documents to each query:
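from chromacache import ChromaCache
from chromacache.embedding_functions import VoyageAIEmbeddingFunction

cc = ChromaCache(VoyageAIEmbeddingFunction("voyage-code-2"))
relevant_documents = cc.collection.query(
    query_texts=["my query1", "thus spake zarathustra"],  # add more queries as needed
    n_results=5,
)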
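A chromaDB query returns a dict whose "ids", "documents" and "distances" entries each hold one inner list per query text. The sketch below shows one way to read such a result. The sample values are made up for illustration, `best_matches` is a hypothetical helper, and it assumes each inner list is ordered from nearest to farthest.

```python
# Hypothetical result for two queries with n_results=2; the values are made up.
sample_result = {
    "ids": [["id-3", "id-7"], ["id-1", "id-3"]],
    "documents": [["doc three", "doc seven"], ["doc one", "doc three"]],
    "distances": [[0.12, 0.48], [0.05, 0.31]],
}


def best_matches(result: dict) -> list[tuple[str, float]]:
    """Return the closest (document, distance) pair for each query."""
    return [
        (docs[0], dists[0])
        for docs, dists in zip(result["documents"], result["distances"])
        if docs  # skip queries that returned no document
    ]


top = best_matches(sample_result)
print(top)  # [('doc three', 0.12), ('doc one', 0.05)]
```

The same function could be applied to the relevant_documents value returned by cc.collection.query, since it only relies on the "documents" and "distances" keys.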