Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(query): Add custom SQL query generation for pgvector search #478

Merged
merged 6 commits into from
Dec 30, 2024

Conversation

shamb0
Copy link
Contributor

@shamb0 shamb0 commented Dec 9, 2024

Hi @timonv,

I’d like to align quickly on this.

This PR addresses our recent discussion, particularly the idea:

"Or we could go lazy (not so lazy, it's not trivial) and add a custom strategy that takes an SQLx query, or a closure that returns an SQLx query with the nodes yielded. That would enable a lot of doors in fully controlling the search process."

As part of this PR, I’ve implemented a Dynamic Search Strategy to enable dynamic SQL query generation for pgvector search.

Key benefits:

  • Combines vector similarity with metadata filtering
  • Allows users to fully customize query construction
  • Provides precise control over filtering and ordering

Looking forward to your feedback on how this can be refined further!

@timonv
Copy link
Member

timonv commented Dec 9, 2024

Hey @shamb0, the concern I have with this, is that it only works with Sql and I suspect it's an uphill battle to implement every conceivable query method for every supported provider. I was thinking of a broader solution such that Swiftide makes several big wins:

  1. More complex solutions might require custom queries
  2. Can't provide defaults for every use case
  3. Still allowing for the underlying library to be used is, I believe, a very good practice.

It also removes the need / pressure to implement every search strategy for every storage, as there's always the manual option. Might be a bit harder, but huge relieve when it works.

For instance, say we have a CustomSearchStrategy<Q>, where Q is the internal query type of the storage provider. Then only that has to be implemented, everything is possible, and we can add it to providers on our own pace.

For instance, for what we have now, Q could be:

sqlx -> QueryBuilder https://docs.rs/sqlx/latest/sqlx/struct.QueryBuilder.html
qdrant -> SearchPointsBuilder https://docs.rs/sqlx/latest/sqlx/struct.QueryBuilder.html
lancedb -> Query https://docs.rs/sqlx/latest/sqlx/struct.QueryBuilder.html

In pseudo code, something like this might work:

struct CustomQuery<Q> {
  query: Q
}

impl<Q: Send + Sync + Clone> CustomQuery<Q> {
    pub fn from_query(
        query: impl Fn(&CustomQuery<Q>, &Query<states::Pending>) -> Q + Send + Sync + 'static,
    ) -> Self {
        CustomQuery {
            query: Arc::new(query),
        }
    }

    pub fn build_query(&self, query_node: &Query<states::Pending>) -> Q {
        (*self.query)(self, query_node)
    }
}

  

impl Retrieve<CustomQuery<QueryPoints>> for PgVector {
    async fn retrieve(
        &self,
        search_strategy: &CustomQuery<sqlx::QueryBuilder>,
        query: Query<states::Pending>,
    ) -> Result<Query<states::Retrieved>> {
       ...
       let builder = search_strategy.build_query(&query);

       ... // anything else and execute
    }
}

// A user can then use it as follows:

let strat = CustomQuery::from_query(|strat, query_node| sqls::QueryBuilder::new ... })

This way, there's no uphill battle in trying to implement every conceivable query method, and users can do whatever they want. Other query strategies are there for sane and quick defaults. We could also let the provider (i.e. PGVector) provide a build_custom_query to make it a bit easier. What do you think?

@timonv
Copy link
Member

timonv commented Dec 10, 2024

Just to be clear, I'd be happy to merge an implementation that only supports SqlxBuilder at the moment, then I or someone else can expand on it in the future 👍

@shamb0
Copy link
Contributor Author

shamb0 commented Dec 11, 2024

Just to be clear, I'd be happy to merge an implementation that only supports SqlxBuilder at the moment, then I or someone else can expand on it in the future 👍

Hi @timonv,

Apologies for the delay :), and thank you for sharing the detailed design guidelines! I'll aim to complete the SqlxBuilder implementation this week.

Next week, I'll start working on the search strategy for the remaining storage. 👍🏾

@shamb0
Copy link
Contributor Author

shamb0 commented Dec 14, 2024

Hi @timonv,

I’ve implemented the custom search strategy for sqlx::QueryBuilder based on our recent discussion. The key updates include:

  • Dynamic SQL query generation for vector similarity search
  • Integration with pgvector
  • Comprehensive integration test coverage

Please review and let me know if further adjustments are needed.

Thank you! :)

Copy link
Member

@timonv timonv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks really nice! Why did you opt for wrapping the CustomQuery with a concrete type? If it's possible to use the CustomQuery instead (let's call it CustomStrategy, to avoid confusion with the Query pipeline), then it removes one layer of abstraction.

@@ -0,0 +1,49 @@
//! Custom search strategy implementation for `PostgreSQL` vector similarity search.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You have two files with the same name, one with a typo

@shamb0
Copy link
Contributor Author

shamb0 commented Dec 17, 2024

Hi @timonv,

Thank you for your review comments :). I have reworked the code and cleaned it up as suggested.

  • Initially, I wrapped CustomQuery<Q> inside a new type PgVecCustomStrategy as I thought it would make the code more concise and Rust-idiomatic. However, as part of this patch, I have removed that additional layer.

    pub struct PgVecCustomStrategy(pub CustomQuery<sqlx::QueryBuilder<'static, sqlx::Postgres>>);  
  • I believe the PR is now ready for intake. Please let me know if any further enhancements are needed. :)

@shamb0 shamb0 force-pushed the feat/dynamic-search-strategy branch from 3b396fb to 39156c6 Compare December 17, 2024 02:20
Copy link
Member

@timonv timonv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent, looks really good. Two more small comments, otherwise this looks very useful and looks ready to merge. Nice work (again) 🚀

/// `CustomQuery` provides a flexible way to generate provider-specific search queries.
///
/// # Type Parameters
/// * `Q` - The provider-specific query type (e.g., `sqlx::QueryBuilder` for `PostgreSQL`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The correct name here is retriever instead of provider

Self {
query: None,
top_k: super::DEFAULT_TOP_K,
vector_field: EmbeddedField::Combined,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like top_k and vector_field aren't used in the retrieval implementation. Since it builds from a closure anyway, and any data can be moved there, how about simplifying by removing the fields, having only query and _marker here?

@@ -38,6 +42,7 @@ struct VectorSearchResult {
/// - Performs a similarity-based vector search on the database and validates the retrieved results.
///
/// Ensures correctness of end-to-end data flow, including table management, vector storage, and query execution.
#[ignore]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two tests here are ignored

shamb0 and others added 6 commits December 20, 2024 21:05
Introduce a flexible, type-safe strategy for custom SQL query generation in pgvector-based similarity search.

Key features:
- Type-safe closures for custom SQL queries
- Dynamic row retrieval with user-defined conditions
- Configurable vector fields and metadata filters
- Enhanced integration with pgvector's similarity search

This update empowers users to combine vector similarity with metadata, customize query construction, and control filtering and ordering.

Component: search_strategies/dynamic_vector_search.rs
Signed-off-by: shamb0 <[email protected]>
* `swiftide`: 0.14.3 -> 0.14.4 (✓ API compatible changes)
* `swiftide-agents`: 0.14.3 -> 0.14.4
* `swiftide-core`: 0.14.3 -> 0.14.4 (✓ API compatible changes)
* `swiftide-macros`: 0.14.3 -> 0.14.4
* `swiftide-integrations`: 0.14.3 -> 0.14.4 (✓ API compatible changes)
* `swiftide-indexing`: 0.14.3 -> 0.14.4 (✓ API compatible changes)
* `swiftide-query`: 0.14.3 -> 0.14.4 (✓ API compatible changes)

<details><summary><i><b>Changelog</b></i></summary><p>

<blockquote>

[0.14.4](bosun-ai/swiftide@v0.14.3...v0.14.4)
- 2024-12-11

-
[7211559](bosun-ai@7211559)
*(agents)* **EXPERIMENTAL** Agents in Swiftide (bosun-ai#463)

````text
Agents are coming to Swiftide! We are still ironing out all the kinks,
  while we make it ready for a proper release. You can already experiment
  with agents, see the rustdocs for documentation, and an example in
  `/examples`, and feel free to contact us via github or discord. Better
  documentation, examples, and tutorials are coming soon.

  Run completions in a loop, define tools with two handy macros, customize
  the agent by hooking in on lifecycle events, and much more.

  Besides documentation, expect a big release for what we build this for
  soon! 🎉
````

-
[3751f49](bosun-ai@3751f49)
*(query)* Add support for single embedding retrieval with PGVector
(bosun-ai#406)

-
[5ce4d21](bosun-ai@5ce4d21)
Clippy and deps fixes for 1.83 (bosun-ai#467)

**Full Changelog**:
bosun-ai/swiftide@0.14.3...0.14.4
</blockquote>

</p></details>

---
This PR was generated with
[release-plz](https://github.com/release-plz/release-plz/).
Implements flexible query generation for PostgreSQL vector similarity search with:
- Introduces PgVecCustomStrategy for custom vector search implementations
- Adds builder pattern support for configuring search parameters and filters
- Enables dynamic SQL generation with metadata filtering and vector similarity

This change allows users to create customized vector similarity searches while
maintaining type safety and query optimization capabilities.

Signed-off-by: shamb0 <[email protected]>
Implements flexible query generation for PostgreSQL vector similarity search with:
- Introduces PgVecCustomStrategy for custom vector search implementations
- Adds builder pattern support for configuring search parameters and filters
- Enables dynamic SQL generation with metadata filtering and vector similarity

This change allows users to create customized vector similarity searches while
maintaining type safety and query optimization capabilities.

Signed-off-by: shamb0 <[email protected]>
* `swiftide`: 0.14.3 -> 0.14.4 (✓ API compatible changes)
* `swiftide-agents`: 0.14.3 -> 0.14.4
* `swiftide-core`: 0.14.3 -> 0.14.4 (✓ API compatible changes)
* `swiftide-macros`: 0.14.3 -> 0.14.4
* `swiftide-integrations`: 0.14.3 -> 0.14.4 (✓ API compatible changes)
* `swiftide-indexing`: 0.14.3 -> 0.14.4 (✓ API compatible changes)
* `swiftide-query`: 0.14.3 -> 0.14.4 (✓ API compatible changes)

<details><summary><i><b>Changelog</b></i></summary><p>

<blockquote>

[0.14.4](bosun-ai/swiftide@v0.14.3...v0.14.4)
- 2024-12-11

-
[7211559](bosun-ai@7211559)
*(agents)* **EXPERIMENTAL** Agents in Swiftide (bosun-ai#463)

````text
Agents are coming to Swiftide! We are still ironing out all the kinks,
  while we make it ready for a proper release. You can already experiment
  with agents, see the rustdocs for documentation, and an example in
  `/examples`, and feel free to contact us via github or discord. Better
  documentation, examples, and tutorials are coming soon.

  Run completions in a loop, define tools with two handy macros, customize
  the agent by hooking in on lifecycle events, and much more.

  Besides documentation, expect a big release for what we build this for
  soon! 🎉
````

-
[3751f49](bosun-ai@3751f49)
*(query)* Add support for single embedding retrieval with PGVector
(bosun-ai#406)

-
[5ce4d21](bosun-ai@5ce4d21)
Clippy and deps fixes for 1.83 (bosun-ai#467)

**Full Changelog**:
bosun-ai/swiftide@0.14.3...0.14.4
</blockquote>

</p></details>

---
This PR was generated with
[release-plz](https://github.com/release-plz/release-plz/).
Implements flexible query generation for PostgreSQL vector similarity search with:
- Introduces PgVecCustomStrategy for custom vector search implementations
- Adds builder pattern support for configuring search parameters and filters
- Enables dynamic SQL generation with metadata filtering and vector similarity

This change allows users to create customized vector similarity searches while
maintaining type safety and query optimization capabilities.

Signed-off-by: shamb0 <[email protected]>
@shamb0 shamb0 force-pushed the feat/dynamic-search-strategy branch from 4c9b793 to 0741eae Compare December 20, 2024 15:35
@shamb0
Copy link
Contributor Author

shamb0 commented Dec 20, 2024

Thank you @timonv for your thorough and constructive feedback :). I appreciate your patience in guiding me through these improvements.

I have addressed both key points from your review:

  1. Updated the terminology from "provider" to "retriever" throughout the documentation to better reflect the architectural intent
  2. Simplified the CustomStrategy implementation by moving configuration into the query generator closure, reducing structural complexity

The documentation has been enhanced with:

  • Clearer examples demonstrating the new configuration pattern
  • Detailed explanation of the query generator's responsibilities
  • Updated implementation notes for users of the custom search strategy API

I believe these changes have made the code more maintainable and the API more intuitive to use. Please let me know if you'd like to see any further improvements or have additional suggestions.

@timonv
Copy link
Member

timonv commented Dec 23, 2024

Looks good! I want to go over the docs one more time before merging, I'll do the changes. I will merge this asap.

@timonv timonv changed the title feat(search): Add dynamic SQL query generation for pgvector search feat(query): Add custom SQL query generation for pgvector search Dec 30, 2024
@timonv timonv merged commit 584695e into bosun-ai:master Dec 30, 2024
10 checks passed
timonv pushed a commit that referenced this pull request Jan 4, 2025
## 🤖 New release
* `swiftide`: 0.15.0 -> 0.16.0 (✓ API compatible changes)
* `swiftide-agents`: 0.15.0 -> 0.16.0 (✓ API compatible changes)
* `swiftide-core`: 0.15.0 -> 0.16.0 (⚠️ API breaking changes)
* `swiftide-macros`: 0.15.0 -> 0.16.0
* `swiftide-integrations`: 0.15.0 -> 0.16.0 (✓ API compatible changes)
* `swiftide-indexing`: 0.15.0 -> 0.16.0 (✓ API compatible changes)
* `swiftide-query`: 0.15.0 -> 0.16.0 (✓ API compatible changes)

### ⚠️ `swiftide-core` breaking changes

```
--- failure enum_missing: pub enum removed or renamed ---

Description:
A publicly-visible enum cannot be imported by its prior path. A `pub use` may have been removed, or the enum itself may have been renamed or removed entirely.
        ref: https://doc.rust-lang.org/cargo/reference/semver.html#item-remove
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.38.0/src/lints/enum_missing.ron

Failed in:
  enum swiftide_core::querying::states::AnsweredBuilderError, previously in file /tmp/.tmpfXNg04/swiftide-core/src/query.rs:202
  enum swiftide_core::querying::states::RetrievedBuilderError, previously in file /tmp/.tmpfXNg04/swiftide-core/src/query.rs:179

--- failure struct_missing: pub struct removed or renamed ---

Description:
A publicly-visible struct cannot be imported by its prior path. A `pub use` may have been removed, or the struct itself may have been renamed or removed entirely.
        ref: https://doc.rust-lang.org/cargo/reference/semver.html#item-remove
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.38.0/src/lints/struct_missing.ron

Failed in:
  struct swiftide_core::querying::states::RetrievedBuilder, previously in file /tmp/.tmpfXNg04/swiftide-core/src/query.rs:179
  struct swiftide_core::querying::states::AnsweredBuilder, previously in file /tmp/.tmpfXNg04/swiftide-core/src/query.rs:202

--- failure trait_added_supertrait: non-sealed trait added new supertraits ---

Description:
A non-sealed trait added one or more supertraits, which breaks downstream implementations of the trait
        ref: https://doc.rust-lang.org/cargo/reference/semver.html#generic-bounds-tighten
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.38.0/src/lints/trait_added_supertrait.ron

Failed in:
  trait swiftide_core::querying::QueryState gained Default in file /tmp/.tmp5vx3MP/swiftide/swiftide-core/src/query.rs:178

--- failure trait_no_longer_object_safe: trait no longer object safe ---

Description:
Trait is no longer object safe, which breaks `dyn Trait` usage.
        ref: https://doc.rust-lang.org/stable/reference/items/traits.html#object-safety
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.38.0/src/lints/trait_no_longer_object_safe.ron

Failed in:
  trait QueryState in file /tmp/.tmp5vx3MP/swiftide/swiftide-core/src/query.rs:178
```

<details><summary><i><b>Changelog</b></i></summary><p>

## `swiftide`
<blockquote>

##
[0.16.0](v0.15.0...v0.16.0)
- 2025-01-02

### New features

-
[52e341e](52e341e)
*(lancedb)* Public method for opening table (#514)

-
[3254bd3](3254bd3)
*(query)* Generic templates with document rendering (#520)

````text
Reworks `PromptTemplate` to a more generic `Template`, such that they
  can also be used elsewhere. This deprecates `PromptTemplate`.

  As an example, an optional `Template` in the `Simple` answer
  transformer, which can be used to customize the output of retrieved
  documents. This has excellent synergy with the metadata changes in #504.
````

-
[235780b](235780b)
*(query)* Documents as first class citizens (#504)

````text
For simple RAG, just adding the content of a retrieved document might be
  enough. However, in more complex use cases, you might want to add
  metadata as well, as is or for conditional formatting.

  For instance, when dealing with large amounts of chunked code, providing
  the path goes a long way. If generated metadata is good enough, could be
  useful as well.

  With this retrieved Documents are treated as first class citizens,
  including any metadata as well. Additionally, this also paves the way
  for multi retrieval (and multi modal).
````

-
[584695e](584695e)
*(query)* Add custom SQL query generation for pgvector search (#478)

````text
Adds support for custom retrieval queries with the sqlx query builder for PGVector. Puts down the fundamentals for custom query building for any retriever.

  ---------
````

-
[b55bf0b](b55bf0b)
*(redb)* Public database and table definition (#510)

-
[176378f](176378f)
Implement traits for all Arc dynamic dispatch (#513)

````text
If you use i.e. a `Persist` or a `NodeCache` outside swiftide as well, and you already have it Arc'ed, now it just works.
````

-
[dc9881e](dc9881e)
Allow opt out of pipeline debug truncation

### Bug fixes

-
[2831101](2831101)
*(lancedb)* Metadata should be nullable in lancedb (#515)

-
[c35df55](c35df55)
*(macros)* Explicit box dyn cast fixing Rust Analyzer troubles (#523)

### Miscellaneous

-
[1bbbb0e](1bbbb0e)
Clippy


**Full Changelog**:
0.15.0...0.16.0
</blockquote>


</p></details>

---
This PR was generated with
[release-plz](https://github.com/release-plz/release-plz/).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants