feat: unique sync ID metadata column #1787

pnadolny13 · 2023-06-26T13:48:04Z

Feature scope

Taps (catalog, state, stream maps, tests, etc.)

Description

We have the normal _sdc_* properties and we also have the activate version mechanism but we dont have a reliable id for a particular sync. The common recommendation is to use the extracted and/or batched at timestamps in a group by clause within dbt staging models to deduplicate tables and create tables with only the newest records. The challenge with this is that those timestamps aren't unique to a sync so you have to write a query that selects a small range of those timestamps to represent the whole sync. Sometimes its only a few seconds or minutes different but if the sync is hours long then it becomes tricky to hard code your group by logic to capture all records.

Currently a recommended workaround is to use mappers to populate a sync id but thats more overhead if not using an SDK tap/target. Also this seems common enough where it would be useful for most people, so fixing this work around would be ideal.

Potential features that might solve this:

taps or target metadata: add a new metadata column to taps by default thats a static ID or timestamp for the sync. Something like _sdc_sync_started_at. This would work on either end.
leverage the activate version property by default somehow. It doesnt look like the SDK targets write anything into the version column by default, maybe if that column gets auto populated with an ID even if an activate version message is never received then it can be used for grouping

Relevant slack threads:

The text was updated successfully, but these errors were encountered:

tayloramurphy · 2023-06-26T17:58:37Z

@pnadolny13 do you think this needs to be an SDK feature or should this be part of the runner? I think having a generic solve would be useful for this, but I want something in Meltano that says "this was run by a Meltano process with X metadata". Perhaps we have a couple of optional "runner metadata" columns that can be specified be custom for the runner?

pnadolny13 · 2023-06-26T21:39:34Z

@tayloramurphy I was thinking this would either be populated by the tap or the target like the existing metadata columns. I dont know the details but if Meltano wanted to add properties to streams then wouldn't it need to start read and edit every record between the tap and target. I guess meltano could inject a custom mapper to every sync behind the scenes that adds these properties 🤔 .

Also as far as I know batch wouldnt be able to support this in any case because its not reading each record.

tayloramurphy · 2023-06-26T21:45:57Z

@pnadolny13 I was thinking that for SDK-based connectors the metadata could be generated by the tap or externally provided via the config or environment variables. You're right I wouldn't want us to intercept the records.

pnadolny13 · 2023-07-12T16:07:21Z

This is related to a discussion in #1199. I think if we added a _sdc_sync_started_at that was unique to the sync like at the meltano run level then it could serve this purpose as a unique sync id. The challenge today is that we dont have any attribute thats static across all streams in a whole sync.

edgarrmondragon · 2023-07-27T20:51:07Z

Closed by #1878

pnadolny13 added kind/Feature New feature or request valuestream/SDK labels Jun 26, 2023

edgarrmondragon closed this as completed Jul 27, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: unique sync ID metadata column #1787

feat: unique sync ID metadata column #1787

pnadolny13 commented Jun 26, 2023

tayloramurphy commented Jun 26, 2023

pnadolny13 commented Jun 26, 2023

tayloramurphy commented Jun 26, 2023

pnadolny13 commented Jul 12, 2023 •

edited

Loading

edgarrmondragon commented Jul 27, 2023

feat: unique sync ID metadata column #1787

feat: unique sync ID metadata column #1787

Comments

pnadolny13 commented Jun 26, 2023

Feature scope

Description

tayloramurphy commented Jun 26, 2023

pnadolny13 commented Jun 26, 2023

tayloramurphy commented Jun 26, 2023

pnadolny13 commented Jul 12, 2023 • edited Loading

edgarrmondragon commented Jul 27, 2023

pnadolny13 commented Jul 12, 2023 •

edited

Loading