Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CT-1362] [Bug] Toggling values when snapshotting a table with a key violation #6089

Closed
2 tasks done
kjstultz opened this issue Oct 17, 2022 · 10 comments
Closed
2 tasks done
Labels
bug Something isn't working Refinement Maintainer input needed snapshots Issues related to dbt's snapshot functionality stale Issues that have gone stale Team:Adapters Issues designated for the adapter area of the code

Comments

@kjstultz
Copy link

kjstultz commented Oct 17, 2022

Is this a new bug in dbt-core?

  • I believe this is a new bug in dbt-core
  • I have searched the existing issues, and I could not find an existing issue for this bug

Current Behavior

If I have two values for the same primary key in two different rows in a given table that is snapshotted, it toggles between the two whenever I run the snapshot. (Snowflake adapter)

Expected Behavior

When I have a primary key error in a table I am expecting to snapshot, I would expect at least a warning (if not a full failure) if there is a primary key violation in the base table.

Steps To Reproduce

DBT Core 1.0.1

  1. Load test data:
    create or replace table sandbox.public.table2(pk int,city string); insert into sandbox.public.table2 values(1,'New York');

  2. Run Snapshotting code:

{% snapshot DBT_TEST_SNAP_ALT%}

{{
    config(
      unique_key='PK',
      strategy='check',
      check_cols = 'all',
      invalidate_hard_deletes=True
    )
}}

select * from SANDBOX.PUBLIC.TABLE2

{% endsnapshot %}

Run:
dbt snapshot --select DBT_TEST_SNAP_ALT

  1. Insert a key violating record into the table:
    insert into sandbox.public.table2 values(1,'NOLA');

  2. Re-run snapshot

  3. Re-run snapshot again

  4. Re-run snapshot again
    Resulting table
    image

Relevant log output

No response

Environment

- OS: Windows 10 Pro 19043.2130
- Python:3.8
- dbt:1.0.1

Which database adapter are you using with dbt?

snowflake

Additional Context

No response

@kjstultz kjstultz added bug Something isn't working triage labels Oct 17, 2022
@github-actions github-actions bot changed the title [Bug] Toggling values when snapshotting a table with a key violation [CT-1362] [Bug] Toggling values when snapshotting a table with a key violation Oct 17, 2022
@jtcohen6
Copy link
Contributor

Hey @kjstultz - I would also expect Snowflake to raise an error on the merge statement, saying that the key is not unique!

That's the default behavior, but if you've set the ERROR_ON_NONDETERMINISTIC_MERGE parameter to False instead, the merge will proceed, yielding nondeterministic results instead: https://docs.snowflake.com/en/sql-reference/parameters.html#label-error-on-nondeterministic-merge

By any chance, do you know if you have that parameter set for your account / user?

@jtcohen6 jtcohen6 added snapshots Issues related to dbt's snapshot functionality Team:Adapters Issues designated for the adapter area of the code awaiting_response and removed triage labels Oct 18, 2022
@kjstultz
Copy link
Author

Interesting, good point @jtcohen6. Currently we have that value always set to "True". After combing through the code, I think the root of it is actually because dbt snapshot doesn't use the base table as the "merging" table. It creates a staged table on top that it uses, and from examining the code submitted, the PK violation in the table breaks the creation of that table.

As I look through the code more, I do really think the solution would be checking for a PK violation before the merge statement is initiated.

@jtcohen6
Copy link
Contributor

@kjstultz Ah - so just to confirm, the snapshot query itself is returning duplicate values of the unique_key?

As I look through the code more, I do really think the solution would be checking for a PK violation before the merge statement is initiated.

This is a fair thought. Today, we say it's the full responsibility of the end user to guarantee the uniqueness of the unique_key in the snapshot query.

One way to guarantee this is by creating another model (e.g. ephemeral model), put a unique test on it, a snapshot after it, and use dbt build (which will skip the snapshot if the unique test fails).

Or: Should the snapshot materialization just execute its own "unique test," implicitly, after creating the staging table and before running the merge? That does seem more ergonomic than the pattern I described above. Is there any reason we should make this an optional configuration, rather than turning it on by default for everyone (one more query than strictly needed)?

@github-actions
Copy link
Contributor

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

@github-actions github-actions bot added the stale Issues that have gone stale label Jan 29, 2023
@kjstultz
Copy link
Author

Forgot about this one, b

@kjstultz Ah - so just to confirm, the snapshot query itself is returning duplicate values of the unique_key?

As I look through the code more, I do really think the solution would be checking for a PK violation before the merge statement is initiated.

This is a fair thought. Today, we say it's the full responsibility of the end user to guarantee the uniqueness of the unique_key in the snapshot query.

One way to guarantee this is by creating another model (e.g. ephemeral model), put a unique test on it, a snapshot after it, and use dbt build (which will skip the snapshot if the unique test fails).

Or: Should the snapshot materialization just execute its own "unique test," implicitly, after creating the staging table and before running the merge? That does seem more ergonomic than the pattern I described above. Is there any reason we should make this an optional configuration, rather than turning it on by default for everyone (one more query than strictly needed)?

Definitely the second one. I'm not a big software development person, but I'd argue adding a parameter to the snapshotting config that defaults to "Check PK violation" to "FALSE" would be fine. A lot of use cases won't need this, but I've worked at enough organizations that have enough exceptions that having this parameter would be really really helpful.

@dbeatty10
Copy link
Contributor

@kjstultz thanks again for creating this issue! Appreciate that you invested the time at Coalesce in NOLA to figure out how to reproduce this tricky situation.

Reading the discussion between you and @jtcohen6, it sounds like the feature discussed thus far would look like this:

  • Add a new snapshot config parameter validate_uniqueness (default to false)
  • When validate_uniqueness is true, execute a uniqueness test after creating the staging table and before running the merge

Although this proposed feature wouldn't fix a snapshot table that already has duplicates, it would prevent duplicates from being added in the first place! Feels like a win to me, especially with it being configurable for those that (somehow) know that their "unique" key is truly unique.

@github-actions github-actions bot removed the stale Issues that have gone stale label Jan 31, 2023
@kjstultz
Copy link
Author

Perfect! I think that'd be a great solution.

@dbeatty10
Copy link
Contributor

What do you think of the overall feature proposal here @jtcohen6 ?

What do you think about the name validate_uniqueness in particular? Got any other suggestion for this config variable name?

@github-actions
Copy link
Contributor

github-actions bot commented Aug 1, 2023

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

@github-actions github-actions bot added the stale Issues that have gone stale label Aug 1, 2023
@github-actions
Copy link
Contributor

github-actions bot commented Aug 9, 2023

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working Refinement Maintainer input needed snapshots Issues related to dbt's snapshot functionality stale Issues that have gone stale Team:Adapters Issues designated for the adapter area of the code
Projects
None yet
Development

No branches or pull requests

3 participants