Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Embed modes and named vectors #123

Merged
merged 30 commits into from
Jul 9, 2024
Merged

Conversation

pwalski
Copy link
Contributor

@pwalski pwalski commented Jul 3, 2024

Added named vector support to qdrant. A pipeline can now have its embed mode configured, either per field, chunk and metadata combined (default) or both. Vectors need to be configured on the qdrant client side.

See examples/store_multiple_vectors.rs for an example.

Shoutout to @pwalski for the contribution. Closes #62.

@timonv
Copy link
Member

timonv commented Jul 4, 2024

Look good already, nice! Maybe there's a way to clean up the qdrant configuration. If you always need an EmbeddableType and sometimes a Vector config, you could also have the function signature like:

    pub fn with_vector(
        mut self,
        vector: impl Into<VectorConfig>,
    ) 

And then implent Into<VectorConfig> for embedabletype.

One requirement I do have is that if needed, qdrant can be utilized fully. What happens if a vector config is provided, but a custom client is build?

What I really like about your implementation is that it also sets swiftide up to infer the schema from the pipeline at a later stage.

@pwalski pwalski force-pushed the feat/embed_modes branch from b1ce1a2 to 2b54129 Compare July 7, 2024 11:02
@pwalski
Copy link
Contributor Author

pwalski commented Jul 7, 2024

One requirement I do have is that if needed, qdrant can be utilized fully. What happens if a vector config is provided, but a custom client is build?

I do not understand. Qdrant client implementation from this library limits user's ability to configure Vectors by providing only few config options in VectorConfigBuilder (name, size, distance). New Qdrant client implementation would need to come with new implementation of configuring Vectors.
The same goes with other Qdrant config options. QdrantBuilder provides only a subset of qdrant_client::Qdrant config options.

@timonv
Copy link
Member

timonv commented Jul 7, 2024

One requirement I do have is that if needed, qdrant can be utilized fully. What happens if a vector config is provided, but a custom client is build?

I do not understand. Qdrant client implementation from this library limits user's ability to configure Vectors by providing only few config options in VectorConfigBuilder (name, size, distance). New Qdrant client implementation would need to come with new implementation of configuring Vectors. The same goes with other Qdrant config options. QdrantBuilder provides only a subset of qdrant_client::Qdrant config options.

Exactly, I don't see the need to replace / reinvent full apis from other libraries that already work. Most (if not all) integrations support setting whatever it wraps manually, allowing for more fine grained configuration. So we provide some basic stuff for sensible defaults, otherwise it's RTM on the integrations' documentation.

I thought the current setup might go wrong when overriding the client and accidentally also providing vector configuration.

@pwalski pwalski changed the title Named vectors support feat: Embed modes and named vectors Jul 7, 2024
@pwalski pwalski force-pushed the feat/embed_modes branch from 88f184b to eecd5c3 Compare July 7, 2024 22:23
@pwalski pwalski force-pushed the feat/embed_modes branch from eecd5c3 to 3cf13a1 Compare July 7, 2024 22:30
@pwalski pwalski marked this pull request as ready for review July 7, 2024 22:31
@pwalski pwalski marked this pull request as draft July 8, 2024 08:49
@pwalski pwalski force-pushed the feat/embed_modes branch from cc3f765 to a56a6ef Compare July 8, 2024 12:11
@pwalski pwalski marked this pull request as ready for review July 8, 2024 12:24
@pwalski
Copy link
Contributor Author

pwalski commented Jul 8, 2024

@timonv I cannot assign a reviewer.

Copy link
Member

@timonv timonv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Solid work! I have a couple of comments and some suggestions.

I think EmbeddedType could be more apptly named as EmbeddedField or EmbedField, as that seems to be more what it is.

Maybe there's also a way to clean up the api for .with_vector, .with_vector(EmbeddableType::Metadata(metadata_summary::NAME.into())), do you have any ideas?

On a more practical note, this looks very close to merging and I would love to get it in for 0.6! Do you have any social handles I can mention on release? 🎉

@pwalski pwalski requested a review from timonv July 8, 2024 23:16
@pwalski
Copy link
Contributor Author

pwalski commented Jul 8, 2024

Do you have any social handles I can mention on release?

I do not have any actively updated profiles.

Maybe there's also a way to clean up the api for .with_vector, .with_vector(EmbeddableType::Metadata(metadata_summary::NAME.into())), do you have any ideas?

These NAME fields got created because I had no idea how to configure Metadata and share info about metadata/embedding name with vector config.
The fact user needs to specify same Metadata once as a Transformer and once as a VectorConfig name is already a bit awkward.

The quoted line could be cleaned up a bit by providing following impl for EmbeddedField:

impl From<&str> for EmbeddedField {
    fn from(value: &str) -> Self {
        Self::Metadata(value.into())
    }
}

The problem is I am not sure how long such From impl would hold. It could become a problem even during implementation of Sparse vectors #63.
That task will introduce another usecase for named vectors so it could be a good opportunity to make some updates to vectors config.

Making EmbeddedField generic over some T: Into<String> + Display would cause propagation of generic type across codebase.
Making EmbeddedField::Metadata accept &str would require usage of a lifetime, which creates same problem as generic.

@timonv
Copy link
Member

timonv commented Jul 9, 2024

Do you have any social handles I can mention on release?

I do not have any actively updated profiles.

Maybe there's also a way to clean up the api for .with_vector, .with_vector(EmbeddableType::Metadata(metadata_summary::NAME.into())), do you have any ideas?

These NAME fields got created because I had no idea how to configure Metadata and share info about metadata/embedding name with vector config. The fact user needs to specify same Metadata once as a Transformer and once as a VectorConfig name is already a bit awkward.

The quoted line could be cleaned up a bit by providing following impl for EmbeddedField:

impl From<&str> for EmbeddedField {
    fn from(value: &str) -> Self {
        Self::Metadata(value.into())
    }
}

The problem is I am not sure how long such From impl would hold. It could become a problem even during implementation of Sparse vectors #63. That task will introduce another usecase for named vectors so it could be a good opportunity to make some updates to vectors config.

Making EmbeddedField generic over some T: Into<String> + Display would cause propagation of generic type across codebase. Making EmbeddedField::Metadata accept &str would require usage of a lifetime, which creates same problem as generic.

Fully agree, I don't see a direct option here either. I have hunch that when we get further down the line, a nicer model will pop up. I.e. a pipeline could infer a schema.

Feel it's ready to merge? 0.6 will probably be released this week. Need to update and rewrite a bunch of documentation on swiftide.rs for it, it was setup a bit in a hurry.

@pwalski
Copy link
Contributor Author

pwalski commented Jul 9, 2024

Feel it's ready to merge?

Yes.
There are issues like the fact with_embed_mode() needs to be placed before then(impl Transformer), and multiple then_store_with() do not work, but it is not related directly to this PR.

@timonv
Copy link
Member

timonv commented Jul 9, 2024 via email

@timonv timonv merged commit 699cfe4 into bosun-ai:master Jul 9, 2024
14 checks passed
@github-actions github-actions bot mentioned this pull request Jul 9, 2024
@pwalski pwalski deleted the feat/embed_modes branch July 9, 2024 19:11
@pwalski
Copy link
Contributor Author

pwalski commented Jul 10, 2024

That’s a bug, could you open an issue?

The issue I remembered was really only a problem with ignoring some of the EmbeddedFields.
I implemented named vectors with an assumption that we want to store in Qdrant every created embedding, but internal code did not really blocked me from implementing handling skipping some of embeddings in case they were not configured in Qdrant as vectors.
I reported it as a bug and I have opened a PR
#138

@github-actions github-actions bot mentioned this pull request Jul 12, 2024
timonv added a commit that referenced this pull request Jul 12, 2024
## 🤖 New release
* `swiftide`: 0.5.0 -> 0.6.0

<details><summary><i><b>Changelog</b></i></summary><p>

<blockquote>

## [0.6.0](https://github.com/bosun-ai/swiftide/releases/tag/0.6.0) -
2024-07-12

### Features

-
[70ea268](70ea268)
*(prompts)* Add prompts as first class citizens
([#145](#145))

-
[699cfe4](699cfe4)
*(uncategorized)* Embed modes and named vectors
([#123](#123))

### Bug Fixes

-
[9334934](9334934)
*(chunkcode)* Use correct chunksizes
([#122](#122))

-
[7357fea](7357fea)
*(deps)* Update rust crate spider to v1.98.6
([#119](#119))

-
[353cd9e](353cd9e)
*(qdrant)* Upgrade and better defaults
([#118](#118))

-
[b53636c](b53636c)
*(uncategorized)* Inability to store only some of `EmbeddedField`s
([#139](#139))

### Documentation

-
[5691ac9](5691ac9)
*(readme)* Add preproduction warning

-
[4c40e27](4c40e27)
*(readme)* Add back coverage badge

-
[3e447fe](3e447fe)
*(readme)* Link to CONTRIBUTING
([#114](#114))

-
[37af322](37af322)
*(rustdocs)* Rewrite the initial landing page
([#149](#149))

-
[7686c2d](7686c2d)
*(uncategorized)* Templated prompts are now a major feature

### Performance

-
[ea8f823](ea8f823)
*(uncategorized)* Improve local build performance and crate cleanup
([#148](#148))

### Miscellaneous Tasks

-
[364e13d](364e13d)
*(swiftide)* Loosen up dependencies
([#140](#140))

-
[d2a9ea1](d2a9ea1)
*(uncategorized)* Enable clippy pedantic
([#132](#132))

-
[51c114c](51c114c)
*(uncategorized)* Various tooling & community improvements
([#131](#131))

-
[84dd65d](84dd65d)
*(uncategorized)* Rename all mentions of ingest to index
([#130](#130))


**Full Changelog**:
0.1.0...0.6.0


<!-- generated by git-cliff -->
</blockquote>


</p></details>

---
This PR was generated with
[release-plz](https://github.com/MarcoIeni/release-plz/).

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Timon Vonk <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Named / multiple vector support for Qdrant
2 participants