[CT-2280] Tracking changes in the non-chronological insertions of data in dbt snapshots (check strategy) #7138
Comments
Thanks for opening this @HansalShah007! And thanks for a fantastic write-up 🤩
You are very correct that non-chronological snapshots are not currently supported and will lead to undesirable output 😢
We'd be open to considering an update to snapshots that could support any order. The ideal implementation would be backwards-compatible, work across (most) databases, and have a similar (or better) execution time to the current implementation.
Questions
How are you feeding the data to the dbt snapshots in your example? Do you have one table per day similar to the following (with a date or timestamp at the end of the table name)?
If so, presumably you'd have some way of parsing the applicable datetime and passing it to the
Or is there some other way you are feeding the data in?
An alternative
This is the 2nd problem listed in the discussion in #7018 -- one alternative would be to detect if a snapshot is out-of-order and raise an exception rather than inserting. (Would love your feedback on any of the unrelated problems/solutions too!)
The upside of raising an error is that no snapshots are inserted out-of-order. But the downside is that adding data that is older isn't possible (which would preclude the situation you described where an organization gets access to older data).
Something related
You discussed the
Hey @dbeatty10! Thanks for responding back.
Answers
Views on the alternative
Updates made in solution since the issue was opened
Snapshot Table
Now, if the data for unique_id 1 was missing from the batch of data belonging to the 2022-07-16T03:58:00.000+0000 timestamp, then the snapshot table should now look something like this.
I am not sure how this logic would work for the timestamp strategy.
This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.
Any updates on this issue?
No updates yet from the dbt team. However, I have implemented the changes described in my replies above, and everything seems to work fine. I have been running the modified macro code for quite a while now and have not come across a logical error.
Sorry for the radio silence here! We'd definitely be interested in supporting non-chronological snapshots.
The considerations are complicated enough that we're likely to take this work on ourselves rather than accept a community-submitted PR, so I'm not going to label this as
See below for some of the most important acceptance criteria for this feature.
🏆 @HansalShah007 has done some pioneering work described here that can inform the implementation 🏆
Acceptance criteria
Is this your first time submitting a feature request?
Describe the feature
Current Behaviour
The check strategy of snapshots stores changes in the data correctly only if they arrive in chronological order; it cannot handle non-chronological insertion of data. For instance, assume the snapshot table has an entry for a unique_id as shown in the table below.
Current Snapshot Table
When a new row with the same unique_key is introduced with a timestamp of 2022-07-15T03:58:00.000+0000, and its value for one of the check_cols differs from that of the currently valid row, the snapshot table is changed as follows.
New row in the source table
Updated snapshot table
As is apparent from the changes made to the snapshot table, the check strategy is not able to handle non-chronological changes to the data.
Expected Behaviour
For the above example, the expected output for the snapshot table should have been as shown in the table below.
Expected snapshot table
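The order-independence described above can be illustrated with a minimal Python sketch. It is not the dbt implementation and not the macro change proposed in this issue: it assumes every raw observation for a unique_id is available at once, whereas a real snapshot runs incrementally. The function name `build_history`, the single `value` column, and the sample timestamps and values are all hypothetical.

```python
from itertools import permutations


def build_history(observations):
    """Build SCD2-style rows for one unique_id from (observed_at, value) pairs.

    Hypothetical helper: the input may arrive in any order because it is
    sorted by timestamp before any rows are produced.
    """
    rows = []
    for observed_at, value in sorted(observations):
        if rows and rows[-1]["value"] == value:
            continue  # no change relative to the previous version
        if rows:
            rows[-1]["dbt_valid_to"] = observed_at  # close the previous version
        rows.append(
            {"value": value, "dbt_valid_from": observed_at, "dbt_valid_to": None}
        )
    return rows


# Hypothetical observations for a single unique_id (ISO-8601 strings sort
# chronologically when they share one format).
observations = [
    ("2022-07-14T03:58:00", "pending"),
    ("2022-07-15T03:58:00", "shipped"),
    ("2022-07-16T03:58:00", "delivered"),
]

expected = build_history(observations)
# Every arrival order yields the same history.
assert all(build_history(p) == expected for p in permutations(observations))
```

Feeding the 2022-07-15 observation last, after the 2022-07-16 one, still produces a row that is valid from 2022-07-15 to 2022-07-16, which is the shape of result this feature request asks for.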
Describe alternatives you've considered
Updates in the source code
To solve this problem, I have made some changes to the source code of the snapshots, specifically in the following macros and the snapshot materialization strategy:
I have introduced changes in the way inserts and updates are identified in the default__snapshot_staging_table macro, for tracking the changes in the non-chronologically incoming data.
The new code introduces the following dbt_change_types.
Types of Inserts
Insertion Type 1
When the latest record arrives for an already existing unique_id with some changes, or when a previously unseen unique_id is encountered.
Scenario 1
Data already existing in snapshot table for unique_id 1
New Data (with some changes)
Final Snapshot (Insertion Type 1)
Scenario 2
Data already existing in snapshot table
New Data (with some changes)
Final Snapshot (Insertion Type 1)
Insertion Type 2
When an older record of the same unique_id arrives that is different from its nearest future version.
Data already existing in snapshot table for unique_id 1
New Data (with some changes)
Final Snapshot (Insertion Type 2)
Types of Updates
Update Type 1
When a later record of the same unique_id arrives that is different from its nearest past version and has a dbt_valid_from < dbt_valid_to (of the snapshot table record).
Update Type 2
When an older record of the same unique_id arrives that is not different from its future version.
Data already existing in snapshot table for unique_id 1
New Data (with some changes)
Final Snapshot (Update Type 1 and 2 performed simultaneously)
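One possible reading of the four change types above can be sketched in Python. This is an interpretation, not the macro code from this issue: the function `apply_record`, the labels it returns, the single `value` column standing in for the check_cols, and the sample data are all assumptions. It also assumes a single unique_id and that no two observations share a timestamp.

```python
def apply_record(history, observed_at, value):
    """Apply one observation to a history sorted by dbt_valid_from.

    Hypothetical sketch: mutates `history` in place and returns the
    change-type labels it applied (an interpretation of the issue text).
    """
    # Position of the nearest future version, if any.
    pos = next(
        (i for i, r in enumerate(history) if r["dbt_valid_from"] > observed_at),
        len(history),
    )
    prev = history[pos - 1] if pos > 0 else None
    nxt = history[pos] if pos < len(history) else None
    applied = []

    if prev is not None and prev["value"] == value:
        return applied  # already covered by the nearest past version

    if nxt is not None and nxt["value"] == value:
        # Update Type 2: the nearest future version simply started earlier.
        nxt["dbt_valid_from"] = observed_at
        applied.append("update_2")
    else:
        # Insertion Type 1 (latest or unseen) or Type 2 (older record that
        # differs from its nearest future version).
        history.insert(
            pos,
            {
                "value": value,
                "dbt_valid_from": observed_at,
                "dbt_valid_to": nxt["dbt_valid_from"] if nxt else None,
            },
        )
        applied.append("insert_2" if nxt else "insert_1")

    if prev is not None:
        # Update Type 1 (as read here): the nearest past version now ends
        # where the new record begins.
        prev["dbt_valid_to"] = observed_at
        applied.append("update_1")
    return applied


# Hypothetical snapshot rows for one unique_id.
history = [
    {"value": "A", "dbt_valid_from": "2022-07-14T03:58:00", "dbt_valid_to": "2022-07-16T03:58:00"},
    {"value": "B", "dbt_valid_from": "2022-07-16T03:58:00", "dbt_valid_to": None},
]
# An older record equal to its future version triggers both update types.
assert apply_record(history, "2022-07-15T03:58:00", "B") == ["update_2", "update_1"]
```

Because a check-strategy snapshot keeps only the rows where a value changed, a sketch like this sees earlier versions, not every earlier observation, so it cannot always reproduce what a full chronological replay would have stored.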
Source Code Changes
Snapshot Materialization
Snapshot Staging Table
Snapshot Merge SQL
Who will this benefit?
This feature will help users snapshot historical data in any order; it will not matter in what order they feed the data to dbt snapshots. There is always a possibility that an organization gets access to data that is older than the data with which it started generating snapshots. This feature would allow it to feed historical data in any order and still generate snapshots as if the changes had been captured in chronological order.
Are you interested in contributing this feature?
I have already made some changes in the source code, as described above.
Anything else?
No response