
feat(new sink): new postgres sink #22481

Open · wants to merge 13 commits into master
Conversation

Ichmed (Author) commented Feb 20, 2025

Summary

A zero-copy Postgres sink that requires no new dependencies (it enables one additional feature on tokio-postgres).
The sink uses a prepared statement to insert the data as native SQL values instead of serializing the data to JSON and deserializing it in the database.

For now, the sink can only handle Logs and Traces.

Tests are still missing, but it can be E2E tested using this setup:

sources:
  stdin:
    type: stdin

transforms:
  foobar:
    type: remap
    inputs:
      - stdin
    source: |-
      .json_field = del(.)
      .array_field = [true, true, true]
      .id = "some_id"
      .ignored_field = 1324

sinks:
  posti:
    type: postgres
    host: localhost
    port: 5432
    table: jsontest
    inputs:
      - foobar

and this table in the database:

CREATE TABLE IF NOT EXISTS public.jsontest
(
    id character varying(255) COLLATE pg_catalog."default",
    json_field json,
    array_field boolean[]
)
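For illustration, the prepared statement the sink sends could be derived from the discovered column list roughly like this (a hypothetical sketch, not the PR's actual code; build_insert and the column names are made up for the example):

```rust
// Hypothetical sketch: build the SQL text of the prepared INSERT from the
// column names discovered at sink startup. The real sink would pass this
// string to tokio-postgres's prepare() and then bind event fields to $1..$n.
fn build_insert(table: &str, columns: &[&str]) -> String {
    let cols = columns.join(", ");
    let params: Vec<String> = (1..=columns.len()).map(|i| format!("${i}")).collect();
    format!("INSERT INTO {table} ({cols}) VALUES ({})", params.join(", "))
}

fn main() {
    let sql = build_insert("jsontest", &["id", "json_field", "array_field"]);
    // One placeholder per column; the statement is parsed once by Postgres.
    println!("{sql}");
}
```

With the jsontest table above this yields "INSERT INTO jsontest (id, json_field, array_field) VALUES ($1, $2, $3)", which Postgres parses and plans once per connection.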

Change Type

  • Bug fix
  • New feature
  • Non-functional (chore, refactoring, docs)
  • Performance

Is this a breaking change?

  • Yes
  • No

How did you test this PR?

Does this PR include user facing changes?

  • Yes. Please add a changelog fragment based on our guidelines.
  • No. A maintainer will apply the "no-changelog" label to this PR.

Checklist

  • Please read our Vector contributor resources.
    • make check-all is a good command to run locally. This check is
      defined here. Some of these
      checks might not be relevant to your PR. For Rust changes, at the very least you should run:
      • cargo fmt --all
      • cargo clippy --workspace --all-targets -- -D warnings
      • cargo nextest run --workspace (alternatively, you can run cargo test --all)
  • If this PR introduces changes to Vector dependencies (modifies Cargo.lock), please
    run dd-rust-license-tool write to regenerate the license inventory and commit the changes (if any). More details here.

References

@Ichmed Ichmed requested a review from a team as a code owner February 20, 2025 13:18

bits-bot commented Feb 20, 2025

CLA assistant check
All committers have signed the CLA.

github-actions bot added the "domain: sinks" label (Anything related to Vector's sinks) on Feb 20, 2025
pront (Member) commented Feb 20, 2025

Hi @Ichmed, thank you for this PR! There is an existing PR that introduces a postgres sink which is almost there: #21248

isbm commented Feb 20, 2025

@pront We are fully aware of it and analysed it. 😉 And yet we don't think it is the right way to do it. Please take a closer look at the code. Features we can add, no worries. But we have truly enormous amounts of data and we need to get it in bulk into Postgres and TimescaleDB. We specifically need this optimised for cloud usage (memory/CPU matter!).

In the worst case you will have two sinks! 😆 Call it a "lightweight PgSink".

pront (Member) commented Feb 20, 2025

@pront We are fully aware of it and analysed it. 😉 And yet we don't think it is the right way to do it. Please take a closer look at the code. Features we can add, no worries.

Sure will do. It will take some time though so please bear with me.

But we have truly enormous amounts of data and we need to get it in bulk into Postgres and TimescaleDB. We specifically need this optimised for cloud usage (memory/CPU matter!).

Did you compare both implementations against some benchmarks?

In the worst case you will have two sinks! 😆 Call it a "lightweight PgSink".

Having two sinks doing the same thing is probably not what we want. I do like that #21248 has support for all telemetry data, Vector features such as ACKs, and good UX. And most importantly, a lot of testing.

Again, I didn't dive into the differences and I need some time to do so. I wonder, since you looked at the existing PR, can you work on optimizing that after it lands?

isbm commented Feb 20, 2025

Sure will do. It will take some time though so please bear with me.

Thanks!

Having two sinks doing the same thing is probably not what we want. I do like that #21248 has support for all telemetry data, Vector features such as ACKs, and good UX. And most importantly, a lot of testing.

In our defence, our day-one Chapter 1 is not a half-year Chapter 128 😛. We specifically focused on making it a zero-copy, no-dependencies, generic micro-sink. Adding features is not a problem; ACKs are coming, as they are a necessity.

Again, I didn't dive into the differences and I need some time to do so. I wonder, since you looked at the existing PR, can you work on optimizing that after it lands?

We would definitely support and maintain ours; that's for sure, because it will go into production straight away. Alternatively, it could land in a "contrib" section: more options to choose from is always better. We are interested in bringing in more sinks/transforms in the near future.

Ichmed (Author) commented Feb 21, 2025

Hi @pront, with these changes we should have feature parity with the other PR, aside from configuration.

Is there a nice way to do benchmarks? I looked at the benches directory but didn't really understand how to apply that to this use case.
AFAIK this implementation should be faster than the other, since we are simply doing less work, should have zero allocations per event, are using a prepared statement and have no deserialization happening on the DB side, so if we are slower in any use case I would consider that a bug that can be fixed.

jorgehermo9 (Contributor) commented Feb 22, 2025

Hi, I would like to drop my opinion on this.

AFAIK this implementation should be faster than the other, since we are simply doing less work

I'm not really sure about this; claiming performance improvements and optimizations without measuring them is a mistake.

I see that you are not batching events, so every ingested event results in a network round trip. I would be surprised to see that this approach results in a higher throughput than batching them.
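For context, batching here would mean collapsing many events into one multi-row INSERT so that a whole batch costs a single round trip. A rough illustrative sketch (hypothetical, not #21248's actual code) of generating the placeholder list such a statement needs:

```rust
// Illustrative sketch: build the VALUES placeholder groups for a multi-row
// INSERT, so a batch of `rows` events costs one network round trip
// instead of `rows` separate trips.
fn batch_placeholders(rows: usize, cols: usize) -> String {
    (0..rows)
        .map(|r| {
            // Each row gets its own run of numbered parameters: $1..$cols,
            // then $cols+1..2*cols, and so on.
            let group: Vec<String> = (1..=cols).map(|c| format!("${}", r * cols + c)).collect();
            format!("({})", group.join(", "))
        })
        .collect::<Vec<_>>()
        .join(", ")
}

fn main() {
    // 2 events with 3 columns each.
    println!("{}", batch_placeholders(2, 3)); // ($1, $2, $3), ($4, $5, $6)
}
```

A batched sink would then bind all the events' fields flat into that single statement, trading a little per-batch string building for far fewer round trips.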

Moreover, as you are loading the table's columns on the sink's startup:

let columns: Vec<_> = client
    .query(
        "SELECT column_name FROM INFORMATION_SCHEMA.COLUMNS \
         WHERE table_name = $1 AND table_schema = $2",
        &[&table, &schema],
    )
    .await?
    .into_iter()
    .map(|x| x.get(0))
    .collect();

your implementation does not allow altering tables and adding new columns while running (which #21248 does); you have to restart the sink for new columns to be taken into account. Also, deleting columns while running would cause all events to fail until Vector is restarted.

Also, I'm not sure your implementation works for composite types (maybe it does, but I'm currently not sure).

The implementations are not feature-equivalent, so I don't think a performance comparison makes sense in this case anyway (whichever turned out to be the fastest).

should have zero allocations per event

Not allocating does not always imply being faster. It generally is faster not to allocate, but it is not a guarantee.

are using a prepared statement

So does #21248; per https://docs.rs/sqlx/latest/sqlx/fn.query.html:

The connection will transparently prepare and cache the statement, which means it only needs to be parsed once in the connection’s lifetime

more options to choose from is always better

This is also a fallacy. From a new user's perspective, not having a single solution is actually worse, as users would struggle to decide which one to use, for example. Moreover, it is a maintenance overhead for maintainers to have multiple implementations of nearly the same thing.

From my point of view, we should not be talking about what should be faster but actually measuring it; and since I think this is not feature-equivalent to #21248, I don't know if it makes sense to just choose the fastest.

jorgehermo9 (Contributor) commented Feb 22, 2025

Also, you state that

should have zero allocations per event

but since you use a BytesMut to encode every event's field (see the .map(Wrapper) call), you are potentially doing several allocations per event. Claims about zero allocations should come from validating them with tools like valgrind. Stating that no allocations happen based purely on your own written code, without taking your dependencies' code into account (which must also be considered), is wrong.

isbm commented Feb 24, 2025

@pront, @jorgehermo9 Hey, thanks for taking a look at it and adding your thoughts! OK, so I've spent about a day measuring this thoroughly and found that the "native approach" is still faster than @jorgehermo9's; however, it is only paper-thin faster when preparing the statements and using JSONB, and the net win is pretty much negligible. My tests were made on PgSQL 17.3. Meanwhile, adding batching to all this turns out to be a bit difficult because the interface is a bit weird.

So I think it would indeed be a good idea to go with @jorgehermo9's solution!

@pront, since @jorgehermo9 is about to finish that, it would be nice to finally merge it. 😉 We will then take this sink and might also look after it, as we "badly" need a PostgreSQL sink in Vector.

That said, have a nice day.

pront (Member) commented Feb 24, 2025

So I think it would indeed be a good idea to go with @jorgehermo9's solution!

@pront, since @jorgehermo9 is about to finish that, it would be nice to finally merge it. 😉 We will then take this sink and might also look after it, as we "badly" need a PostgreSQL sink in Vector.

That said, have a nice day.

Thank you @isbm, this sounds like a good plan! You are very welcome to optimize the postgres sink after it lands. And thanks for using Vector :)

jorgehermo9 (Contributor) commented Feb 24, 2025

So I think it would indeed be a good idea to go with @jorgehermo9's solution!

Thank you @isbm. I'm sorry that the PR got delayed this much; I'm doing it outside of work hours and it's difficult to find enough time. I think all I have left is to write some documentation about the usage, and I plan to do it through this week!

isbm commented Feb 25, 2025

Thank you @isbm, this sounds like a good plan! You are very welcome to optimize the postgres sink after it lands.

@pront after it lands. 😉 We also tried valgrind, but it doesn't provide the required measurements. I spent a day tuning/finding/narrowing down possible performance issues, but this is not that easy, apparently. For example, earlier versions of Postgres dealt with JSON/B quite poorly, and caching the statement wasn't that helpful. But it seems v17 is much improved, which was an impressive discovery on its own. Since we have quite high cardinality in our telemetry, let's see how it performs in real life, i.e. memory/CPU consumption, throughput, etc.

And thanks for using Vector :)

Did you guys think about a "contrib" repo anyway? Because indeed not all modules should go into the main distribution, and some of them I would rather not have; after all, the standard rule applies: if you don't use it, remove it. While I've managed to get Vector into a "just" 60M binary, it is still huge if we are talking about embedded devices. And while binary size doesn't necessarily affect performance (only loading time), we have cases where our storage is strictly limited to a bare minimum, so we count almost every byte...

isbm commented Feb 25, 2025

Thank you @isbm. I'm sorry that the PR got delayed this much, I'm doing it outside of work hours and it's difficult to find enough time.

@pront, actually another very good point ☝️ from @jorgehermo9 on why the contrib approach would actually be beneficial for the rest of us: as we do all this outside of working hours, we need iterations. It would help tremendously if @jorgehermo9 just wrote a basic HOWTO/README and let it all get released ASAP, and then we could all iterate on improving it, rather than bubbling up one huge PR since September last year, wouldn't it? Downside: there could be many "half-baked" or non-ideal plugins/modules. But you could just label them as "good enough" or "needs improvement here/there", etc. E.g. I don't have much time to make it sales-perfect in one go, but over time it will be...
