The current implementation of archival (0.13.0) effectively implements a merge using `update` and `insert` statements. Instead, we should leverage a `merge` abstraction (also used by incremental models) to help normalize the implementation of archival.
There are a few benefits of using a merge here:
- it is an atomic way of performing archival on databases like Snowflake [1] and BigQuery
- the database can presumably do less work if the inserts and updates are specified in the same query
- it should greatly simplify the archival materialization SQL
Reasons not to do this:
- if it ain't broke.....
- `merge` is not implemented for all adapters. We'll need to build out a `merge` abstraction for redshift/postgres/et al, which could be complicated (see the sketch just after this list)
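To make that second point concrete, here is a minimal sketch of what such an abstraction might have to emit on an adapter without a native `merge` (e.g. Postgres or Redshift): the same logical operation expressed as an explicit transaction around an update and an insert. The table and column names (`orders`, `orders_archive`, `id`, `status`, `updated_at`, `valid_from`, `valid_to`) are illustrative, not dbt's generated SQL.

```sql
begin;

-- close out the currently-valid archive rows that have a newer source version
update orders_archive
set valid_to = current_timestamp
from orders as src
where orders_archive.id = src.id
  and orders_archive.valid_to is null
  and src.updated_at > orders_archive.updated_at;

-- insert brand-new records plus the new versions of changed records
insert into orders_archive (id, status, updated_at, valid_from, valid_to)
select src.id, src.status, src.updated_at, current_timestamp, null
from orders as src
left join orders_archive as arch
  on arch.id = src.id and arch.valid_to is null
where arch.id is null;

commit;
```

The transaction recovers some of the atomicity a real `merge` provides, but the concurrency guarantees still depend on each adapter's isolation level, which is part of why a shared abstraction could get complicated.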
[1] Original issue description (Snowflake-specific):
If two archive jobs run simultaneously on Snowflake, duplicate records can be inserted into the archive destination table. This problem can be circumvented with a merge.
The problem here is that archival is currently implemented as (roughly sketched below):
1. create temp table
2. insert
3. update
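Concretely, that flow looks something like the sketch below. The names (`orders` as the source, `orders_archive` as the destination, and the `id`/`status`/`updated_at`/`valid_from`/`valid_to` columns) are illustrative and not the SQL dbt actually generates:

```sql
-- step 1: stage the new and changed records in a temp table
create temporary table archive_updates_tmp as
select src.id, src.status, src.updated_at
from orders as src
left join orders_archive as arch
  on arch.id = src.id and arch.valid_to is null
where arch.id is null
   or src.updated_at > arch.updated_at;

-- step 2: insert the staged records as new, currently-valid versions
insert into orders_archive (id, status, updated_at, valid_from, valid_to)
select id, status, updated_at, current_timestamp, null
from archive_updates_tmp;

-- step 3: close out the versions that were just superseded
update orders_archive
set valid_to = current_timestamp
from archive_updates_tmp as tmp
where orders_archive.id = tmp.id
  and orders_archive.valid_to is null
  and orders_archive.updated_at < tmp.updated_at;
```

Each statement commits independently, which is what opens the window for two concurrent runs to stage and insert the same rows.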
If two jobs run at the same time, they will both create identical temp tables (these don't conflict with each other, presumably because temporary tables are session-scoped, so each job builds its own copy). When the jobs proceed to insert/update data, they will both do the same work in the insert + update steps, resulting in duplicate records being inserted into the destination table.
Because a proper Snowflake merge would happen as a single atomic operation, two merges that are serialized would still result in the intended behavior. In this approach, dbt wouldn’t use a temp table. Instead, the merge would be responsible for finding new records to merge, inserting, and updating all at once. The second serialized merge would find no changes to merge, and would exit without modifying the destination table.
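As a rough sketch of what that could look like on Snowflake (again with illustrative names rather than dbt's actual implementation), a single `merge` can stage the changed and new records in its `using` clause and handle both the close-out update and the insert in one statement:

```sql
merge into orders_archive as dest
using (
    -- changed records: match the currently-valid archive row so it can be closed out
    select src.id as join_id, src.id, src.status, src.updated_at, 'update' as change_type
    from orders as src
    join orders_archive as arch
      on arch.id = src.id and arch.valid_to is null
    where src.updated_at > arch.updated_at

    union all

    -- new records and new versions of changed records: a null join_id guarantees
    -- they never match, so they fall through to the insert clause
    select null as join_id, src.id, src.status, src.updated_at, 'insert' as change_type
    from orders as src
    left join orders_archive as arch
      on arch.id = src.id and arch.valid_to is null
    where arch.id is null
       or src.updated_at > arch.updated_at
) as staged
  on dest.id = staged.join_id and dest.valid_to is null
when matched and staged.change_type = 'update' then
    update set valid_to = current_timestamp
when not matched and staged.change_type = 'insert' then
    insert (id, status, updated_at, valid_from, valid_to)
    values (staged.id, staged.status, staged.updated_at, current_timestamp, null);
```

If two such merges end up serialized against the same table, the second one's staged subquery should find no new or changed records and leave the destination untouched, which is exactly the behavior described above.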
I don't think testing for archived tables exists yet, though that will certainly be doable once they act like more proper dbt resources! Check out this issue in particular. I just added a note to that issue to investigate testing for archives
drewbanin changed the title from *Use a merge statement for archival on Snowflake* to *Implement archival with a "merge" statement* on Mar 23, 2019