[CT-3252] [Feature] Add a way to refresh Snapshots in the database #8885
Comments
What I don't understand about this request is: how would dbt know what the source data was on 11/1/22, 11/2/22, 11/3/22, and so on, so that it could re-compute the snapshot from a start timestamp (e.g., 11/1/22) through the current timestamp? Ignorable thoughts:
|
@alison985 Not sure what you mean by:
This request is not about re-computing the snapshot from a start timestamp. It's about rebuilding the snapshot while disregarding the previously snapshotted data, basically by dropping the table and re-creating it, just as when you run
As I mentioned above, users of this feature should be aware that it can potentially delete important historic data if not used correctly, just as is currently the case with incremental models. However, if your snapshot definition was flawed from the start, then it makes sense to want to start from a clean slate. About the points you suggested:
|
I apologize, I missed reading that part. As long as that's okay in a scenario, then this makes sense. (I always touch accounting data and it would be a no, no, no go there. At least, not without a full archive and extra audit work.)
Yes, a snapshot doesn't ever remove columns, which is what you'd want in a snapshot since its purpose is to not lose data. I admit, I did think a column type change would also make a new column (with a number appended?) if it was anything other than a change to a varchar that it could cast, so thank you for helping teach me! That makes a huge difference, and the maintenance overhead to handle it makes me cringe. Off the top of my head, you'd have to turn off running the snapshot, change the name of the existing snapshot table, give the new snapshot table a second new name, turn on snapshotting again starting from that, then have a model with the original name of the snapshot to union the old and new snapshot tables together, then manually update the original snapshot table's timestamps so there's no lost time between the old snapshot and the new snapshot. Wow! I feel like filing a feature request about handling this.
I now understand your request @Josersanvil. Thank you for helping clarify it for me. Definitely valid use cases to consider this feature request for. |
Thanks for opening this @Josersanvil! And thanks for the additional thoughts and insights @alison985. "Could you defend your data in court?" was an amazing talk 🤩
Not planned
After discussing this with @graciegoheen, @jtcohen6, and @dataders, we're going to close this as "not planned". We recognize that there are situations where one might need to start afresh; however, we actually want to keep some barriers here to ensure that it's not done by mistake.
Recommended approaches
We'd encourage you to choose one of these options (listed in no particular order):
|
While preventing dropped data is a desirable barrier, I have had scenarios where snapshot data was over-snapshotted (values not really changing, but a new row snapshotted anyway), where the columns involved in a check strategy had to change to adapt to new business logic, or where the dbt timestamps fell out of alignment. Could there be an approach that instead runs a lag/lead query across the existing snapshot's data, updating only the dbt valid from/to values and condensing any repeated filler rows between them based on the new check strategy? A bad check strategy on a highly dynamic table could inadvertently spawn millions of essentially duplicated rows. Storage is cheap, but .... Such a feature could also allow seamless backfilling into a snapshot table. Say you dig up some old Excel files of business data and transform them to match the schema of the snapshot table. You could insert these rows into the snapshot and still have linear dbt valid from/to values. I have a working approach for introducing backfills into snapshot tables; it's not exactly pretty, but I'm still happy to share. |
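To make the condensing idea above concrete, here is a minimal Python sketch of the lag-style comparison, operating on rows already loaded into memory. It is not how dbt works and not the commenter's approach; the function name `condense_snapshot`, the `key` and `check_cols` parameters, and the sample data are all hypothetical. It assumes each row carries the standard `dbt_valid_from` and `dbt_valid_to` snapshot columns, that timestamps sort correctly as given, and that validity windows for a key are contiguous.

```python
from itertools import groupby


def condense_snapshot(rows, key, check_cols):
    """Merge consecutive snapshot rows whose check columns did not change.

    Hypothetical helper: `rows` is a list of dicts, `key` is the unique-key
    column name, and `check_cols` lists the columns of the new check strategy.
    """
    condensed = []
    # Order by key, then by validity start, so each row can be compared
    # with the one immediately before it (the "lag" row).
    ordered = sorted(rows, key=lambda r: (r[key], r["dbt_valid_from"]))
    for _, versions in groupby(ordered, key=lambda r: r[key]):
        previous = None
        for row in versions:
            unchanged = previous is not None and all(
                row[c] == previous[c] for c in check_cols
            )
            if unchanged:
                # Extend the earlier row's window instead of keeping a duplicate.
                previous["dbt_valid_to"] = row["dbt_valid_to"]
            else:
                previous = dict(row)  # copy so the input rows are not mutated
                condensed.append(previous)
    return condensed


# Illustrative data only: two redundant rows followed by a real change.
rows = [
    {"id": 1, "status": "a", "dbt_valid_from": "2022-11-01", "dbt_valid_to": "2022-11-02"},
    {"id": 1, "status": "a", "dbt_valid_from": "2022-11-02", "dbt_valid_to": "2022-11-03"},
    {"id": 1, "status": "b", "dbt_valid_from": "2022-11-03", "dbt_valid_to": None},
]
result = condense_snapshot(rows, key="id", check_cols=["status"])
```

In a warehouse the same comparison would more likely be written in SQL with window functions; the sketch only shows the merging rule, and any real rewrite of a snapshot table should be done against a backup.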
Sounds interesting @walker-philips -- could you open up a new feature request for the scenario you are describing? |
Is this your first time submitting a feature request?
Describe the feature
Introduce a --full-refresh option specifically for snapshot models; it can be analogous to the existing --full-refresh feature for incremental models. This would enable the rebuilding of snapshots from scratch by dropping the existing target table in the database and re-executing the snapshot query.
Use cases:
Just like with incremental models, using full refresh could mean that all historical data in the snapshot table would be permanently deleted. However, there are use cases where such a refresh is not only acceptable but necessary for maintaining data integrity.
Describe alternatives you've considered
The only way would be to drop the tables outside dbt. However, this is not very efficient if you want to refresh multiple snapshots, since you have to run a separate command against the database for each table. With this option one could do something like:
dbt snapshot --select my-folder --target my-target --full-refresh
and dbt would handle rebuilding all the snapshot tables.
Who will this benefit?
No response
Are you interested in contributing this feature?
Happy to contribute if I'm given some pointers on where to look.
Anything else?
No response