Merge debezium outputs #762

mwylde · 2024-10-21T23:35:53Z

This PR reworks several details of Arroyo's updating SQL support to improve the UX and set up for supporting more complex updating queries.

Internally, Arroyo represents updates as a flat event with a "retract" field that may be set to true or false. However, Debezium-formatted data (whether being consumed or produced) has a nested structure with "before", "after", and "op" fields. When we consume Debezium, we first need to "unroll" it into our flattened format, so each update becomes a sequence of deletes and creates.

Similarly, when an updating operator (like an updating aggregate) produces an update for a row, it cannot emit a single update event (because we don't have a native way to represent one), only a sequence of retract and create rows.
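To make the unrolling described above concrete, here is a minimal Python sketch. It is not Arroyo's implementation (which is in Rust); the function name `unroll_debezium` and the choice to store `retract` as an extra key on each row are assumptions made for illustration. The `op` codes (`c`, `r`, `u`, `d`) are the standard Debezium ones.

```python
from typing import Any, Dict, List


def unroll_debezium(event: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Flatten one Debezium-style envelope into rows carrying a retract flag.

    Hypothetical helper for illustration only.
    """
    op = event["op"]
    before = event.get("before")
    after = event.get("after")

    if op in ("c", "r"):
        # Create (or snapshot read): a single non-retract row.
        return [{**after, "retract": False}]
    if op == "d":
        # Delete: a single retract row carrying the old value.
        return [{**before, "retract": True}]
    if op == "u":
        # Update: retract the old value, then create the new one.
        return [
            {**before, "retract": True},
            {**after, "retract": False},
        ]
    raise ValueError(f"unsupported op: {op!r}")
```

A single `u` event therefore becomes two flat rows, which is the non-atomic pair the rest of this description is concerned with.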

Currently, a query like

SELECT count(*) from table;

will produce a Debezium result stream like this

reflecting the underlying representation.

However, this is undesirable: we've turned atomic updates into non-atomic delete/create pairs, which can lead to data loss in the consuming system if a delete is consumed but not the corresponding create. It also doubles the number of events and adds cognitive load for what is truly an update.

With this PR, we now merge the deletes and creates into updates, so we get this:
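The idea of the merge step can be sketched roughly as follows. This is an illustrative Python sketch, not the code in this PR and not its actual output: the function name `merge_updates`, the `key_fields` parameter, and the assumption that merging happens over an in-memory batch of flat rows are all hypothetical.

```python
from typing import Any, Dict, List, Sequence


def merge_updates(
    rows: List[Dict[str, Any]], key_fields: Sequence[str]
) -> List[Dict[str, Any]]:
    """Collapse flat retract/create rows into one Debezium-style event per key.

    Hypothetical helper for illustration only.
    """
    state: Dict[tuple, Dict[str, Any]] = {}  # insertion-ordered by first sight
    for row in rows:
        data = {k: v for k, v in row.items() if k != "retract"}
        key = tuple(data[f] for f in key_fields)
        slot = state.setdefault(key, {"before": None, "after": None, "seen": False})
        if row["retract"]:
            if not slot["seen"]:
                # First event for this key is a retract: it removes the value
                # that existed before the batch, so it is the "before" image.
                slot["before"] = data
            slot["after"] = None
        else:
            slot["after"] = data
        slot["seen"] = True

    events = []
    for slot in state.values():
        before, after = slot["before"], slot["after"]
        if before is None and after is None:
            continue  # created and retracted within the batch: nothing to emit
        op = "c" if before is None else ("d" if after is None else "u")
        events.append({"before": before, "after": after, "op": op})
    return events
```

With this shape, a retract followed by a create for the same key yields a single `u` event, so a consumer never observes the delete without the matching create.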

There is also a small breaking change: to support more efficient updating operations on Debezium sources, we now require that sources be annotated with at least one PRIMARY KEY field:

CREATE TABLE debezium_source (
    id INT PRIMARY KEY,
    customer_id INT,
    price FLOAT,
    order_date TIMESTAMP,
    status TEXT
) WITH (
    connector = 'kafka',
    format = 'debezium_json',
    type = 'source',
    ...
);

This must reflect the primary keys in the underlying tables.

mwylde force-pushed the merge_updates branch from 36a124d to 8840193 Compare October 22, 2024 00:58

mwylde added 12 commits October 21, 2024 17:58

checkpoint

f4004c0

non-primtive id

3ba5dcc

work

68559c8

work

b6080f5

work on tests

dbe07a9

Add primary keys to inputs

8cfbb11

tests passing

942b460

Update deps

c2529c5

cippy

d89a8ec

fmt

bd0a41a

Update to 30-core machine for CI

c2b5111

Restore datafusion-functions override

e587b24

mwylde force-pushed the merge_updates branch 2 times, most recently from f07c33b to 399fbf6 Compare October 22, 2024 01:06

mwylde enabled auto-merge (squash) October 22, 2024 01:07

Don't parallelize debezium source in tests

44c43ec

mwylde force-pushed the merge_updates branch from 399fbf6 to 44c43ec Compare October 22, 2024 01:07

mwylde merged commit 843e327 into master Oct 22, 2024
6 checks passed
