feat: smoother transition between EmbeddingsBuilder build() and vector store interface #61

Closed
marieaurore123 opened this issue Oct 16, 2024 · 1 comment

@marieaurore123
Contributor

marieaurore123 commented Oct 16, 2024

Feature Request

There should be a more "out of the box" path from the embeddings returned by the EmbeddingsBuilder's build() method to adding them to a vector store.

Motivation

Improve the developer experience of working with embeddings and vector stores. Currently, it looks like this:

// Build (document, embedding) pairs for each document.
let embeddings = EmbeddingsBuilder::new(model.clone())
    .documents(vec![
        FakeDefinition {
            id: "doc0".to_string(),
            word: "flurbo".to_string(),
            definitions: vec![
                "A green alien that lives on cold planets.".to_string(),
            ],
        },
    ])?
    .build()
    .await?;

// Reshape each pair into (id, document, embedding) before adding it to the store.
let index = InMemoryVectorStore::default()
    .add_documents(
        embeddings
            .into_iter()
            .map(|(fake_definition, embedding_vec)| {
                (fake_definition.id.clone(), fake_definition, embedding_vec)
            })
            .collect(),
    )?
    .index(model);

As you can see, the user has to reshape the output of build() by hand before it can be added to the in-memory vector store.

This applies less to the MongoDB and LanceDB vector stores, because we do not have full control over what goes into them, but maybe something can be done there too.

Proposal

  • In-memory store - going from the embeddings builder's build() to add_documents should work out of the box.
  • LanceDB / MongoDB - maybe provide a default mapping between the output of the embeddings builder's build() and the type that the store expects (i.e. Document for MongoDB, RecordBatch for LanceDB).

@cvauclair
Contributor

I think this can be done on a case-by-case basis for the different vector stores (I don't think there is a one-size-fits-all solution here). For instance, the InMemoryVectorStore could have the following helper method:

impl<T> InMemoryVectorStore<T> {
    pub fn with_embeddings(embeddings: Vec<(T, Embedding)>) -> InMemoryVectorStore<T> {...}
}

which would allow the developer to easily populate an InMemoryVectorStore with the result of EmbeddingsBuilder::build.
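
To make the idea concrete, here is a minimal, self-contained sketch of what such a helper might look like. It does not use Rig's real types: `Embedding` and `InMemoryVectorStore` below are simplified stand-ins, and the positional ids ("doc0", "doc1", ...), the `with_embeddings_and_ids` variant, and the `get`/`len` accessors are assumptions made for illustration only.

```rust
use std::collections::HashMap;

// Stand-in for Rig's embedding type; only the vector matters in this sketch.
#[derive(Debug, Clone, PartialEq)]
pub struct Embedding {
    pub vec: Vec<f64>,
}

// Simplified stand-in for the in-memory store, keyed by document id.
#[derive(Debug)]
pub struct InMemoryVectorStore<T> {
    entries: HashMap<String, (T, Embedding)>,
}

impl<T> InMemoryVectorStore<T> {
    // Hypothetical helper: accepts the (document, embedding) pairs unchanged
    // and assigns ids by position ("doc0", "doc1", ...). The id scheme is an
    // assumption of this sketch, not something the thread specifies.
    pub fn with_embeddings(embeddings: Vec<(T, Embedding)>) -> InMemoryVectorStore<T> {
        let entries = embeddings
            .into_iter()
            .enumerate()
            .map(|(i, (doc, embedding))| (format!("doc{}", i), (doc, embedding)))
            .collect();
        InMemoryVectorStore { entries }
    }

    // Hypothetical variant: derive each id from the document itself,
    // e.g. from an `id` field as in the FakeDefinition example.
    pub fn with_embeddings_and_ids(
        embeddings: Vec<(T, Embedding)>,
        id_of: impl Fn(&T) -> String,
    ) -> InMemoryVectorStore<T> {
        let entries = embeddings
            .into_iter()
            .map(|(doc, embedding)| (id_of(&doc), (doc, embedding)))
            .collect();
        InMemoryVectorStore { entries }
    }

    // Look up a stored (document, embedding) pair by id.
    pub fn get(&self, id: &str) -> Option<&(T, Embedding)> {
        self.entries.get(id)
    }

    // Number of stored documents.
    pub fn len(&self) -> usize {
        self.entries.len()
    }
}
```

With something along these lines, the call site from the issue description would shrink to a single call on the result of build(), with no manual `map` over the pairs. Whether ids should be generated or taken from the document is a design choice the thread leaves open.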

But for other vector stores, this will depend heavily on the complexity of T (e.g. is it a flat struct that can easily be converted to a single table, or is it a nested struct?), as well as on how the user is integrating Rig in their wider application (e.g. do they want the embeddings and documents to be in the same collection/table, or do they want to separate them and link them with a foreign key?).
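
The two storage layouts mentioned above can be sketched with plain structs, to show why a single default mapping is hard to pick. Everything here is hypothetical: `CombinedRow`, `DocumentRow`, `EmbeddingRow`, `to_combined` and `to_split` are illustrative names, the embedding is reduced to a bare `Vec<f64>`, and no MongoDB or LanceDB API is involved.

```rust
// Document type borrowed from the example in the issue description.
#[derive(Debug, Clone, PartialEq)]
pub struct FakeDefinition {
    pub id: String,
    pub word: String,
    pub definitions: Vec<String>,
}

// Layout A (hypothetical): document fields and vector live in the same row.
#[derive(Debug, PartialEq)]
pub struct CombinedRow {
    pub id: String,
    pub word: String,
    pub definitions: Vec<String>,
    pub embedding: Vec<f64>,
}

// Layout B (hypothetical): documents and embeddings are stored separately,
// with `document_id` acting as the foreign key.
#[derive(Debug, PartialEq)]
pub struct DocumentRow {
    pub id: String,
    pub word: String,
    pub definitions: Vec<String>,
}

#[derive(Debug, PartialEq)]
pub struct EmbeddingRow {
    pub document_id: String,
    pub embedding: Vec<f64>,
}

// Map (document, vector) pairs to layout A.
pub fn to_combined(pairs: Vec<(FakeDefinition, Vec<f64>)>) -> Vec<CombinedRow> {
    pairs
        .into_iter()
        .map(|(doc, embedding)| CombinedRow {
            id: doc.id,
            word: doc.word,
            definitions: doc.definitions,
            embedding,
        })
        .collect()
}

// Map (document, vector) pairs to layout B.
pub fn to_split(pairs: Vec<(FakeDefinition, Vec<f64>)>) -> (Vec<DocumentRow>, Vec<EmbeddingRow>) {
    let mut documents = Vec::new();
    let mut embeddings = Vec::new();
    for (doc, embedding) in pairs {
        embeddings.push(EmbeddingRow {
            document_id: doc.id.clone(),
            embedding,
        });
        documents.push(DocumentRow {
            id: doc.id,
            word: doc.word,
            definitions: doc.definitions,
        });
    }
    (documents, embeddings)
}
```

Both mappings are trivial for a flat struct like this one, but they produce different schemas, and a nested T would additionally need a flattening or serialization step. That is the part a generic default would have to guess.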

@mateobelanger added this to the v0.5 milestone Nov 12, 2024
@mateobelanger closed this as not planned Nov 12, 2024