
Fix/snapshot staging is not atomic #2390

Merged · 6 commits · May 4, 2020

Conversation

@drewbanin (Contributor) commented on May 1, 2020

resolves #1884

Description

Previously, snapshots built a staging table with two discrete queries (one a create table as, the other an insert). This meant that if the source data changed between the two statements (i.e. if new data was loaded), dbt would generate an asymmetric set of inserts and updates.

This manifested as records in the snapshot table being invalidated (their dbt_valid_to column set to a non-null value) with no new record inserted for the corresponding change.

This PR combines the two queries into one (which is how this logic originally worked) to ensure that both the inserts and the updates are generated from a consistent view of the source table.
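To illustrate the shape of the change, here is a minimal sketch with hypothetical table and column names, simplified relative to dbt's actual generated SQL. Both change sets now come from a single create table as select, so they share one consistent read of the source:

```sql
-- Simplified sketch (hypothetical names, not dbt's actual generated SQL).
-- One create table as select builds both change sets from a single,
-- consistent read of the source table.
create table snapshot_staging as
with source_data as (
    select id, payload, updated_at
    from source_table
),
insertions as (
    -- brand-new rows, plus new versions of changed rows
    select src.*, 'insert' as dbt_change_type
    from source_data src
    left join snapshot_tbl snap
      on snap.id = src.id
     and snap.dbt_valid_to is null
    where snap.id is null
       or src.updated_at > snap.updated_at
),
updates as (
    -- current versions to invalidate (their dbt_valid_to gets set)
    select src.*, 'update' as dbt_change_type
    from source_data src
    join snapshot_tbl snap
      on snap.id = src.id
     and snap.dbt_valid_to is null
    where src.updated_at > snap.updated_at
)
select * from insertions
union all
select * from updates;
```

Because insertions and updates are derived from the same source_data, any row that gets invalidated always has a matching new version inserted alongside it.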

Some notes:

  • This was only an issue on Snowflake and BigQuery. Pg/Redshift support transactions, and we run this code inside a transaction, so both queries see an identical view of the source tables for the duration of the transaction. Snowflake's auto-committing of DDL terminates the transaction, and BigQuery does not support transactions at all, so on those databases each query could see a different view of the source data (see the sketch after this list).
  • This is a race condition / temporal issue, so the automated test here simulates the real-world scenario in a repeatable, non-brittle way.
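As a sketch of the failure mode (hypothetical table names; Snowflake's documented behavior is that DDL commits any open transaction):

```sql
begin;

-- On Postgres/Redshift, both statements run inside this transaction and
-- see the same snapshot of source_table.
create table snapshot_staging as
select * from source_table;

-- On Snowflake, the DDL above implicitly commits and ends the open
-- transaction, so this insert runs on its own and may see rows loaded
-- into source_table after the create table as completed. BigQuery has
-- no transaction to begin with.
insert into snapshot_staging
select * from source_table;

commit;
```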

Checklist

  • I have signed the CLA
  • I have run this code in development and it appears to resolve the stated issue
  • This PR includes tests, or tests are not required/relevant for this PR
  • I have updated the CHANGELOG.md and added information about my change to the "dbt next" section.

@cla-bot added the cla:yes label on May 1, 2020
@drewbanin requested a review from @beckjake on May 1, 2020 19:59
@beckjake (Contributor) left a comment:


I'm pretty confident I get what's going on here - just moving the selection of updates and inserts into a single statement, creating the table in one create table as select..., so "now" is the same time for both.

The test seems pretty cool too, slick use of pg_sleep.

Once you've got the changelog and all that done, this looks good to me! 🚢

@drewbanin force-pushed the fix/snapshot-staging-is-not-atomic branch from 146844e to 8681dd8 on May 2, 2020 13:28
@drewbanin requested a review from @jtcohen6 on May 2, 2020 15:56
@jtcohen6 (Contributor) left a comment:


This feels really good. The DDL/DML is much more straightforward and linear. Selfishly, it's going to make the spark implementation (dbt-labs/dbt-spark#76) a lot easier, too.

I did some local testing on the core four adapters, and everything looks good. Nice work on the pg_sleep test. Since this issue wasn't actually applicable to pg/redshift, is there any sense in trying to run models-slow on Snowflake/BigQuery? It might require a JS UDF.
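(For reference, a Snowflake "sleep" would likely need to busy-wait, since JavaScript UDFs there expose no timer API - an untested sketch with a hypothetical function name:)

```sql
-- Untested sketch: busy-wait "sleep" for Snowflake. The function name
-- js_sleep is hypothetical; JavaScript UDF arguments are referenced in
-- uppercase (MS) inside the function body.
create or replace function js_sleep(ms float)
returns float
language javascript
as $$
    var end = Date.now() + MS;
    while (Date.now() < end) {
        // burn time: no sleep/setTimeout available in this sandbox
    }
    return MS;
$$;

-- usage: select js_sleep(5000);  -- waits roughly five seconds
```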

While I was combing through the logs, I noticed that the snapshot materialization calls macros (expand_target_column_types, get_missing_columns, and get_columns_in_relation) which repeatedly execute the same information_schema or describe query. That's out of scope for the current PR, and I'm not sure it's even worth opening an issue about; now that we run the more performant describe query on Snowflake, the additional runtime is hardly noticeable.

@drewbanin (Contributor, Author) commented:

Good call @jtcohen6 - just opened a new issue to dig into this: #2392

It's definitely worth reducing the number of queries like this that dbt runs. This will be a performance benefit, and it will make the logs way easier to consume too :)

Merging this one - thanks for the help everyone!

@johnny-lt commented on Feb 16, 2024

@drewbanin We are running a dbt snapshot on top of Snowflake. We often encounter a "Duplicate row detected during DML action" error when running the snapshot, even though the source table contains no duplicate rows. I'm wondering if it's caused by the issue described in this PR.

The source for the snapshot is a Snowflake table that is synced from a production SQL table using Fivetran. I'm not super-familiar with the inner workings of snapshots, but one hypothesis I have is that Fivetran is syncing the row in question while the snapshot is running. The image below shows the snapshot data for the row in question - can you tell from the data whether this hypothesis might be correct?

[Screenshot (2024-02-16): snapshot table rows for the record in question]

The next question is: how could my hypothesis be correct, given that this PR should make snapshots atomic? Well, I learned through painful experience that CTEs are not guaranteed to be evaluated only once during a given query execution. So I wonder whether snapshot_query could be evaluated twice in certain scenarios: once by insertions_source_data and once by updates_source_data.
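For example (simplified, with a hypothetical source table name):

```sql
-- Simplified illustration: nothing guarantees that snapshot_query is
-- materialized once. If the optimizer inlines it, each reference below
-- becomes an independent read of the source, and a row synced by Fivetran
-- in between could appear in one change set but not the other.
with snapshot_query as (
    select * from fivetran_synced_table
),
insertions_source_data as (
    select * from snapshot_query   -- possible read #1
),
updates_source_data as (
    select * from snapshot_query   -- possible read #2
)
select * from insertions_source_data
union all
select * from updates_source_data;
```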

Thanks for your time :)

Successfully merging this pull request may close: Make snapshot data staging atomic