feat: unique sync ID metadata column #1787
Comments
@pnadolny13 do you think this needs to be an SDK feature, or should this be part of the runner? I think having a generic solution would be useful for this, but I want something in Meltano that says "this was run by a Meltano process with X metadata". Perhaps we could have a couple of optional "runner metadata" columns that can be specified or customized for the runner?
@tayloramurphy I was thinking this would be populated by either the tap or the target, like the existing metadata columns. I don't know the details, but if Meltano wanted to add properties to streams, wouldn't it need to read and edit every record between the tap and target? I guess Meltano could inject a custom mapper into every sync behind the scenes that adds these properties 🤔. Also, as far as I know, batch wouldn't be able to support this in any case because it's not reading each record.
@pnadolny13 I was thinking that for SDK-based connectors the metadata could be generated by the tap or provided externally via the config or environment variables. You're right, I wouldn't want us to intercept the records.
This is related to a discussion in #1199. I think if we added a
Closed by #1878
Feature scope
Taps (catalog, state, stream maps, tests, etc.)
Description
We have the normal _sdc_* properties, and we also have the activate version mechanism, but we don't have a reliable ID for a particular sync. The common recommendation is to use the extracted-at and/or batched-at timestamps in a GROUP BY clause within dbt staging models to deduplicate tables and create tables with only the newest records. The challenge is that those timestamps aren't unique to a sync, so you have to write a query that selects a small range of those timestamps to represent the whole sync. Sometimes the timestamps differ by only a few seconds or minutes, but if the sync is hours long, it becomes tricky to hard-code your GROUP BY logic to capture all records.
Currently a recommended workaround is to use mappers to populate a sync ID, but that's more overhead if not using an SDK tap/target. Also, this seems common enough that it would be useful for most people, so replacing this workaround with a built-in solution would be ideal.
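To make the problem concrete, here is a minimal Python sketch, not SDK code. It assumes a hypothetical sync-level column named _sdc_sync_started_at that is stamped once per sync onto every record; the helper names (stamp_sync, latest_sync_rows, fake_sync) and the sample data are invented for illustration. It shows that the per-record _sdc_extracted_at values differ within a single sync, while one shared sync-level value is enough to select exactly the newest sync without guessing a timestamp range.

```python
from datetime import datetime, timedelta, timezone


def stamp_sync(records, sync_started_at):
    """Attach one sync-level timestamp (hypothetical column) to every record."""
    return [{**r, "_sdc_sync_started_at": sync_started_at} for r in records]


def latest_sync_rows(rows):
    """Keep only the rows that belong to the most recent sync."""
    if not rows:
        return []
    newest = max(r["_sdc_sync_started_at"] for r in rows)
    return [r for r in rows if r["_sdc_sync_started_at"] == newest]


def fake_sync(started_at, ids, seconds_per_record):
    """Simulate a sync whose per-record extracted-at timestamps drift over time."""
    records = [
        {
            "id": record_id,
            "_sdc_extracted_at": started_at + timedelta(seconds=n * seconds_per_record),
        }
        for n, record_id in enumerate(ids)
    ]
    return stamp_sync(records, started_at)


t0 = datetime(2023, 1, 1, tzinfo=timezone.utc)
# Two hours-long syncs appended to the same target table.
first = fake_sync(t0, [1, 2, 3], seconds_per_record=3600)
second = fake_sync(t0 + timedelta(days=1), [1, 2], seconds_per_record=3600)
table = first + second

# Rows of the newest sync share one sync-level value even though their
# extracted-at timestamps are an hour apart.
current = latest_sync_rows(table)
```

In a dbt staging model the same idea would reduce to filtering on the maximum value of the sync-level column rather than on a hand-tuned window of extracted-at timestamps.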
Potential features that might solve this:
_sdc_sync_started_at. This would work on either end.
Relevant Slack threads: