Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: postgresql store #157

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

milancermak
Copy link

Resolves #5

I've modeled the crate on the SQLite one.

On the DB side, the crate is relying on the pgvector extension. It creates an HSNW index and uses the cosine distane operator to compare embeddings.

Please let me know it this looks ok or if there are any parts of the code that should be changed.

@0xMochan
Copy link
Contributor

0xMochan commented Dec 18, 2024

Thanks for your contribution, nearly done with my review but I'll go ahead and approve and run the workflow. I think you might need to cargo fmt and pass cargo clippy as well.

@b0xtch
Copy link

b0xtch commented Dec 20, 2024

Hey great work, any update on this?

@0xMochan
Copy link
Contributor

Hey great work, any update on this?

I'm still in my review, I got wrapped up in some other responsibilities, will get it out ASAP!

Copy link
Contributor

@0xMochan 0xMochan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies for the delay on this review! Mostly looking good, a couple of changes I've requested here.

One note: I noticed that 'static was used a good bit here which was likely inherited from the sqlite implementation. I think this is due to how the connection requires an closure with an async block which makes borrowing and using such things really odd.

Recently, proper async closures which will get into Rust stable sometime in 2025 which should help us alleviate our usage of 'static here (i think!)

rig-postgres/examples/vector_search_postgres.rs Outdated Show resolved Hide resolved
rig-postgres/src/lib.rs Show resolved Hide resolved
rig-postgres/src/lib.rs Show resolved Hide resolved
rig-postgres/src/lib.rs Show resolved Hide resolved
rig-postgres/README.md Outdated Show resolved Hide resolved
rig-postgres/README.md Show resolved Hide resolved
rig-postgres/src/lib.rs Show resolved Hide resolved
@milancermak
Copy link
Author

Thanks for the review. I'm still in holiday mode until New Year's, will address the comments on Jan 2 or 3.

@milancermak
Copy link
Author

Hey @0xMochan ready for another round. I've addressed the comments you brought up, optimistically resolved the ones I think are done, but feel free to reopen if you think otherwise.

Copy link
Contributor

@0xMochan 0xMochan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, I've tested locally and it works fine for me. Thanks for this PR, will be included in the next release!

CI / CD will be fixed soon, I'll rerun and then we'll merge to main afterwards.


## PostgreSQL usage

The crate utilizes the [pgvector](https://github.com/pgvector/pgvector) extension, which is available for PostgreSQL version 13 and later. Use any of the [official](https://www.postgresql.org/download/) or alternative methods to install psql. The `pgvector` extension will be automatically installed by the crate if it's not present yet.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my testing, I had to go an add the pgvector extension myself. Not sure if I did something wrong, I had installed postgres via brew

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Huh, curious. You mean by brew installing it separately or something like that?

In any case, I guess the confusing term here in the README is "installed" - what I meant is that the crate will run CREATE EXTENSION which is the installation of the extension into the database, but of course there's also the installation of the extension library itself into the computer.

@milancermak
Copy link
Author

Rebased with upstream main, hopefully the CI will go green now 🤞

@caenguidanos
Copy link

Hi,

I would like to be able to use Rig in the future when the integration with Postgres is implemented but there are several points to comment:

  • You're using tokio-postgres crate for handling postgres connection but the current wide used driver in the ecosystem is sqlx that provides many tools like connection builder, connection pooler, query builder, derive macros and so on.
  • pgvector crate has an integration with sqlx that provides you a custom type to represent the vector inside Rust and you will not need the serde_json crazy stuff.
  • The impl of Column::new is redundant. Not custom logic inside the impl. It's better to create the exposed struct Column thoroughly.
  • The impl of PostgresVectorStore<E, t> has a redundant async block inside an async fn. It's better to re-think the error mapping strategy.
  • There is a better way to do this: let id: String = row.get(0); // assuming id is first column. This assumptions breaks programs.
  • In a production environment, the CREATE EXTENSION IF NOT EXISTS vector; must be delegated to the user.
  • In a production environment, the creation of the table or required column types must also delegated to the user (and documented). If not, this will make collisions with existing SQL schema migration flows. Also, it's better to use a tested query builder like sqlx for query building.
  • The HNSW is the recommended general usage index type from pgvector, yes, but is not the unique type of index that the user can select. This cann't be hardcoded inside Rig's impl. The same with the similarity search operator variant: there are more than cosine distance (dot inner product is very recommenden also for RAG usages).
  • As the pgvector Postgres extension says, the index has parameters that the user can use and tune. Here are not represented.
  • Also, try to avoid dynamic dispatching. It's more simple than that.

In short, most of this implementation should be delegated to the user. Let the user choose which existing column the embedding should be saved on, nothing more.

I understand that to make it 'easy' for new users it is done this way, but in real scenarios I would not like to have so little control and assumptions over this library implementation.

@milancermak
Copy link
Author

@caenguidanos All fair points.

As mentioned at the top, I've modeled this according to the sqlite crate. I'm not claiming it's the right abstraction (and you have good counterpoints why it's not), but it does help to get one started, as you indicated.

For someone like you who clearly knows what they're doing, this won't be the right abstraction to use. In fact, I don't think any will and it's better to plug in postgres to Rig in a way you see fit.

@cvauclair
Copy link
Contributor

cvauclair commented Jan 13, 2025

Hey @milancermak @caenguidanos thank you both for your contributions and comments!

This PR has been sitting here for a while and Postgres integration is something we want sooner than later, yet I agree with a lot of what @caenguidanos said. How about this: we keep using tokio-postgres for this integration crate, but we can create a new integration crate rig-sqlx which uses sqlx under the hood.

The thing with integration crates is that we not only want to target the most DBs possible, but also the most clients possible. Having both a rig-postgres (using tokio-postgres) and rig-sqlx (using sqlx) gives users who are using either libs an easy way to integrate with Rig!

That being said, @milancermak if you could address @caenguidanos's comments above (apologies if that has already been done, I have not looked at your most recent changes). Once that's done we can do a final review and merge!

Cheers!

@milancermak
Copy link
Author

@cvauclair There's no easy way to address the comment from @caenguidanos in this PR, it would take a whole different approach and impl. That will have to be done by someone else. I initially built this for myself, but in a somewhat generic way similar to sqlite (as mentioned above). I've since reevaluated my approach and I'm doing stuff differently now and don't have the time to contribute. Feel free to reject this PR and do the integration as you see fit.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

feat: Add support for PostgreSQL vector store
5 participants