
Fix/snapshot staging is not atomic #2390

Merged · 6 commits · May 4, 2020

Conversation

@drewbanin (Contributor) commented on May 1, 2020

resolves #1884

Description

Previously, snapshots built a staging table with two discrete queries (one a create table as, the other an insert). This meant that if the source data changed between the two statements (i.e. if new data was loaded), dbt would generate an asymmetric set of inserts and updates.

This manifested as records in the snapshot table being invalidated (their dbt_valid_to column set to a non-null value) with no new record inserted for the corresponding change.

This PR combines the two queries into one (which is how this logic originally worked) to ensure that both the inserts and the updates are generated from a consistent view of the source table.
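To illustrate the shape of the change, here is a minimal sketch with hypothetical table and column names, simplified relative to dbt's actual generated SQL. Both change sets now come from a single create table as select, so they share one consistent read of the source:

```sql
-- Simplified sketch (hypothetical names, not dbt's actual generated SQL).
-- One create table as select builds both change sets from a single,
-- consistent read of the source table.
create table snapshot_staging as
with source_data as (
    select id, payload, updated_at
    from source_table
),
insertions as (
    -- brand-new rows, plus new versions of changed rows
    select src.*, 'insert' as dbt_change_type
    from source_data src
    left join snapshot_tbl snap
      on snap.id = src.id
     and snap.dbt_valid_to is null
    where snap.id is null
       or src.updated_at > snap.updated_at
),
updates as (
    -- current versions to invalidate (their dbt_valid_to gets set)
    select src.*, 'update' as dbt_change_type
    from source_data src
    join snapshot_tbl snap
      on snap.id = src.id
     and snap.dbt_valid_to is null
    where src.updated_at > snap.updated_at
)
select * from insertions
union all
select * from updates;
```

Because insertions and updates are derived from the same source_data, any row that gets invalidated always has a matching new version inserted alongside it.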

Some notes:

  • This was only an issue on Snowflake and BigQuery. Pg/Redshift support transactions, and we run this code inside a transaction, so both queries see an identical view of the source tables for the duration of the transaction. Snowflake's auto-committing of DDL terminates the transaction, and BigQuery does not support transactions at all, so on those databases each query could see a different view of the source data (see the sketch after this list).
  • This is a race condition / temporal issue, so the automated test here simulates the real-world scenario in a repeatable, non-brittle way.
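As a sketch of the failure mode (hypothetical table names; Snowflake's documented behavior is that DDL commits any open transaction):

```sql
begin;

-- On Postgres/Redshift, both statements run inside this transaction and
-- see the same snapshot of source_table.
create table snapshot_staging as
select * from source_table;

-- On Snowflake, the DDL above implicitly commits and ends the open
-- transaction, so this insert runs on its own and may see rows loaded
-- into source_table after the create table as completed. BigQuery has
-- no transaction to begin with.
insert into snapshot_staging
select * from source_table;

commit;
```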

Checklist

  • I have signed the CLA
  • I have run this code in development and it appears to resolve the stated issue
  • This PR includes tests, or tests are not required/relevant for this PR
  • I have updated the CHANGELOG.md and added information about my change to the "dbt next" section.

@cla-bot added the cla:yes label on May 1, 2020
@drewbanin requested a review from @beckjake on May 1, 2020 19:59
@beckjake (Contributor) left a comment:


I'm pretty confident I get what's going on here - just moving the selection of updates and inserts into a single statement, creating the table in one create table as select..., so "now" is the same time for both.

The test seems pretty cool too, slick use of pg_sleep.

Once you've got the changelog and all that done, this looks good to me! 🚢

@drewbanin force-pushed the fix/snapshot-staging-is-not-atomic branch from 146844e to 8681dd8 on May 2, 2020 13:28
@drewbanin requested a review from @jtcohen6 on May 2, 2020 15:56
@jtcohen6 (Contributor) left a comment:


This feels really good. The DDL/DML is much more straightforward and linear. Selfishly, it's going to make the spark implementation (dbt-labs/dbt-spark#76) a lot easier, too.

I did some local testing on the core four adapters, and everything looks good. Nice work on the pg_sleep test. Since this issue wasn't actually applicable to pg/redshift, is there any sense in trying to run models-slow on Snowflake/BigQuery? It might require a JS UDF.
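(For reference, a Snowflake "sleep" would likely need to busy-wait, since JavaScript UDFs there expose no timer API - an untested sketch with a hypothetical function name:)

```sql
-- Untested sketch: busy-wait "sleep" for Snowflake. The function name
-- js_sleep is hypothetical; JavaScript UDF arguments are referenced in
-- uppercase (MS) inside the function body.
create or replace function js_sleep(ms float)
returns float
language javascript
as $$
    var end = Date.now() + MS;
    while (Date.now() < end) {
        // burn time: no sleep/setTimeout available in this sandbox
    }
    return MS;
$$;

-- usage: select js_sleep(5000);  -- waits roughly five seconds
```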

While I was combing through the logs, I noticed that the snapshot materialization calls macros (expand_target_column_types, get_missing_columns, and get_columns_in_relation) which repeatedly execute the same information_schema or describe query. That's out of scope for the current PR, and I'm not sure it's even worth opening an issue about; now that we run the more performant describe query on Snowflake, the additional runtime is hardly noticeable.

@drewbanin (Contributor, Author) commented:

Good call @jtcohen6 - just opened a new issue to dig into this: #2392

It's definitely worth reducing the number of queries like this that dbt runs. This will be a performance benefit, and it will make the logs way easier to consume too :)

Merging this one - thanks for the help everyone!

@johnny-lt commented on Feb 16, 2024

@drewbanin We are running a dbt snapshot on top of Snowflake. We often encounter a "Duplicate row detected during DML action" error when running the snapshot, even though the source table contains no duplicate rows. I'm wondering if it's caused by the issue described in this PR.

The source for the snapshot is a Snowflake table that is synced from a production SQL table using Fivetran. I'm not super-familiar with the inner workings of snapshots, but one hypothesis I have is that Fivetran is syncing the row in question while the snapshot is running. The image below shows the snapshot data for the row in question - can you tell from the data whether this hypothesis might be correct?

[Screenshot (2024-02-16): snapshot table rows for the record in question]

The next question is: how could my hypothesis be correct, given that this PR should make snapshots atomic? Well, I learned through painful experience that CTEs are not guaranteed to be evaluated only once during a given query execution. So I wonder whether snapshot_query could be evaluated twice in certain scenarios: once by insertions_source_data and once by updates_source_data.
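For example (simplified, with a hypothetical source table name):

```sql
-- Simplified illustration: nothing guarantees that snapshot_query is
-- materialized once. If the optimizer inlines it, each reference below
-- becomes an independent read of the source, and a row synced by Fivetran
-- in between could appear in one change set but not the other.
with snapshot_query as (
    select * from fivetran_synced_table
),
insertions_source_data as (
    select * from snapshot_query   -- possible read #1
),
updates_source_data as (
    select * from snapshot_query   -- possible read #2
)
select * from insertions_source_data
union all
select * from updates_source_data;
```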

Thanks for your time :)

Successfully merging this pull request may close: Make snapshot data staging atomic