-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Include hard-deletes when making snapshot #2749
Include hard-deletes when making snapshot #2749
Conversation
That error message usually means that the command failed when dbt expected it to succeed. A good way to run just that test is to use the
That will run just the failing test and spit the full output to stdout instead of capturing it, which will probably be more informative given the error. |
281f5f5
to
760dd89
Compare
Thanks, in the log it stated that the disk space was full :) Anyhow, now that I'm able to run the tests, I'm having difficulties writing a test for it. I thought this could be easily (read: hackish) done by deleting one of the seed rows in the The following issue is that the test asserts by checking the actual rows / values. This PR however, marks the Maybe the easiest is to write a complete new test, but I'm not sure how to do that in the dbt project (+ no experience with tox). Any pointers? |
There's a library called You might just have to write a new test and assert that the time is between two times you collect (before + after the |
Okay, I wrote a test for checking the deleted records. I just used the same seed data and deleted the last records, and the test validates that their The tests of
For now, I just changed the data type to reflect an actual timestamp. This got me thinking though, my solution will only work if the source What is your opinion on that? (Edit: and if you disagree, any suggestions how to mitigate the problem of mixing types?) -- Additionally, I wanted to add a config option to make this feature opt-in since it changes the current behavior (and may clash with existing expectations. Was also mentioned on the earlier PR.) |
489d918
to
1646002
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Additionally, I wanted to add a config option to make this feature opt-in since it changes the current behavior
Strongly agree! Users are already quite accustomed to configuring snapshots, so one more config (remove_hard_deletes
?) makes perfect sense.
Okay made it opt-in, hopefully this also fixes the failing redshift test 🤞 I think the PR is ready for code review 😇 . EDIT: what should be the base branch before merging? The new default branch or the dev/0.18.1? |
It sets dbt_valid_to to the current snapshot time.
86fd0bc
to
2581e98
Compare
Okay the redshift tests still seem to fail. However it fails at Can someone check to see what the problem might be? |
Don't worry about those redshift failures right now. Amazon pushed an update out that automatically adds dist and sort keys to our tables, which breaks the docs tests that check for dist/sort keys and expect them to not exist. We'll probably just merge without that test passing, I have a fix for it in another open PR. |
fff25a4
to
b71b7e2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great! I've made a small suggestion for a couple changes but I think this is generally good to go.
Should this work for other adapters? Even if this is postgres/redshift only, I'm totally fine with merging it, but if we expect it to work on snowflake/bigquery we should test it.
core/dbt/include/global_project/macros/materializations/snapshot/strategies.sql
Outdated
Show resolved
Hide resolved
core/dbt/include/global_project/macros/materializations/snapshot/strategies.sql
Outdated
Show resolved
Hide resolved
Done ✅
Yea it should be database agnostic, but to proof it... I added tests for the other databases (bigquery, snowflake and redshift) 🥳. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like the new redshift test is failing! I wonder if it's because redshift uses the postgres code you changed in snapshot_merge.sql
, and redshift doesn't support text
as a real type? I don't see any actual errors in the output though...
_________ TestSnapshotHardDelete.test__redshift__snapshot_hard_delete __________
[gw0] linux -- Python 3.6.9 /home/dbt_test_user/project/.tox/integration-redshift-py36/bin/python
self = <test_simple_snapshot.TestSnapshotHardDelete testMethod=test__redshift__snapshot_hard_delete>
@use_profile('redshift')
def test__redshift__snapshot_hard_delete(self):
self._seed_and_snapshot()
self._delete_records()
> self._invalidate_and_assert_records()
test/integration/004_simple_snapshot_test/test_simple_snapshot.py:856:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
test/integration/004_simple_snapshot_test/test_simple_snapshot.py:831: in _invalidate_and_assert_records
self.assertIsInstance(result[-1], datetime)
E AssertionError: None is not an instance of <class 'datetime.datetime'>
Oef, good catch! Hm that is weird error though.. I don't think it's due the changes in I need to dig into this, maybe open up a AWS trial to replicate it locally 🤓 |
Ok! You can also feel free to set it up so that Redshift raises an error when this config is enabled, and a future contributor who wants Redshift support or a team member can add support later (as long as the existing behavior continues to work ok - that could involve copy+pasting the postgres macro into redshift, but without the deleted clause). There's plenty of precedent for user-contributed features to only apply to the databases they personally care about. It's not reasonable on our part to require you to create a redshift account for this. |
Okay great, we kinda need this feature soon.. 😬 |
Okay, fixed 😄 Test failed because the results were not sorted as I expected during assertions. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great to me! Thank you for contributing to dbt, @joelluijmes!
Noice! This config probably needs to be included in documentation, is that something I should do or do you (fishtown) generally update the documentation? I'm fine with doing it myself, but my English isn't perfect 😇 |
@joelluijmes If you're up to give a first go at docs for this, I'd be more than happy to review + help out! I think an additional configuration + FAQ on this page would be a good place to start. |
|
Chiming in here! Wowee, I closed that original PR and never thought that someone else would pick it up! Nice work @joelluijmes 👍 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice one @joelluijmes! This has been a long time coming, and I know several members of the community are thrilled to see it. Thanks for writing some excellent docs as well.
You can pull in changes from dev/kiyoshi-kuromiya
to get fixes for failing Redshift tests, or @beckjake can merge this in as-is as an admin. I'm happy with either.
…pshot-hard-deletes-joell
Just merged as suggeested, but uhoh now it fails at |
I hit that merge issue as well, and I don't know why! Weird git things. I think the fix is to add |
Is there something I can do to fix the build? Or will you be taking the PR over from here? |
I think this is fine, that test failure was just some ephemeral Windows thing. I'll merge this. Thanks again for contributing to dbt @joelluijmes! |
Hi @joelluijmes, this does not reinstate deleted records when they reappear in the source data. Any suggestions? |
That is true, I kinda hoped it would be added back as an
@beckjake @jtcohen6 what do you think? I think, in theory we could remove the |
Interesting problem. I'm not sure of how common it is for hard-deleted records to return from oblivion, and I'd also be willing to document this as a known limitation of That said, if we're interested in handling this case: my instinct here would be to treat these "returning" records as That way, the previously-hard-deleted record is not included for comparison and its new record is treated like any other new record with a net-new primary key. The time between the previous |
There's a couple of use cases for us:
I'd love to see a |
I've implemented this on top of @joelluijmes's solution and it works.
|
If it's removed, it will insert 'update' records based on historical records because the filter is to select current snapshot records. In the following case, it will insert a record (id: 2) from the source table to the target. source table
target table
|
@joellabes Ah, that's helpful context. It makes sense that there's quite a bit of fluidity in many-to-many mapping tables. @seunghanhong You're right on about why we can't remove I may not fully grasp all the logic in the approach you outline further up. At the moment, I don't see why all those additional CTEs is necessary; I think it'd be sufficient to treat "returns" like any other snapshotted_data as (
select *,
{{ strategy.unique_key }} as dbt_unique_key
from {{ target_relation }}
where dbt_valid_to is null
), In any case, I'd welcome a new issue on the subject of handling hard-delete "returns" so we can move the discussion there and determine which is the most effective and concise approach. |
Continues the work of the closed PR #2355, which resolves #249.
Description
The idea is to select ids from the snapshotted table which no longer exist in the source table, and update these to set
dbt_valid_to
to the current timestamp. Marking them as hard deleted.As of now, I didn't write any integration tests yet as I'm having difficulties running them locally. Any pointers on how these work would be nice. Running integration test (postgres) locally currently fails, but I'm unsure why.
Checklist
CHANGELOG.md
and added information about my change to the "dbt next" section.