Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: unique sync ID metadata column #1787

Closed
pnadolny13 opened this issue Jun 26, 2023 · 5 comments
Closed

feat: unique sync ID metadata column #1787

pnadolny13 opened this issue Jun 26, 2023 · 5 comments
Labels
kind/Feature New feature or request valuestream/SDK

Comments

@pnadolny13
Copy link
Contributor

Feature scope

Taps (catalog, state, stream maps, tests, etc.)

Description

We have the normal _sdc_* properties and we also have the activate version mechanism but we dont have a reliable id for a particular sync. The common recommendation is to use the extracted and/or batched at timestamps in a group by clause within dbt staging models to deduplicate tables and create tables with only the newest records. The challenge with this is that those timestamps aren't unique to a sync so you have to write a query that selects a small range of those timestamps to represent the whole sync. Sometimes its only a few seconds or minutes different but if the sync is hours long then it becomes tricky to hard code your group by logic to capture all records.

Currently a recommended workaround is to use mappers to populate a sync id but thats more overhead if not using an SDK tap/target. Also this seems common enough where it would be useful for most people, so fixing this work around would be ideal.

Potential features that might solve this:

  • taps or target metadata: add a new metadata column to taps by default thats a static ID or timestamp for the sync. Something like _sdc_sync_started_at. This would work on either end.
  • leverage the activate version property by default somehow. It doesnt look like the SDK targets write anything into the version column by default, maybe if that column gets auto populated with an ID even if an activate version message is never received then it can be used for grouping

Relevant slack threads:

@pnadolny13 pnadolny13 added kind/Feature New feature or request valuestream/SDK labels Jun 26, 2023
@tayloramurphy
Copy link
Collaborator

@pnadolny13 do you think this needs to be an SDK feature or should this be part of the runner? I think having a generic solve would be useful for this, but I want something in Meltano that says "this was run by a Meltano process with X metadata". Perhaps we have a couple of optional "runner metadata" columns that can be specified be custom for the runner?

@pnadolny13
Copy link
Contributor Author

@tayloramurphy I was thinking this would either be populated by the tap or the target like the existing metadata columns. I dont know the details but if Meltano wanted to add properties to streams then wouldn't it need to start read and edit every record between the tap and target. I guess meltano could inject a custom mapper to every sync behind the scenes that adds these properties 🤔 .

Also as far as I know batch wouldnt be able to support this in any case because its not reading each record.

@tayloramurphy
Copy link
Collaborator

@pnadolny13 I was thinking that for SDK-based connectors the metadata could be generated by the tap or externally provided via the config or environment variables. You're right I wouldn't want us to intercept the records.

@pnadolny13
Copy link
Contributor Author

pnadolny13 commented Jul 12, 2023

This is related to a discussion in #1199. I think if we added a _sdc_sync_started_at that was unique to the sync like at the meltano run level then it could serve this purpose as a unique sync id. The challenge today is that we dont have any attribute thats static across all streams in a whole sync.

@edgarrmondragon
Copy link
Collaborator

Closed by #1878

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/Feature New feature or request valuestream/SDK
Projects
None yet
Development

No branches or pull requests

3 participants