
Implement archival with a "merge" statement #1339

Closed
drewbanin opened this issue Mar 6, 2019 · 2 comments · Fixed by #1478
Labels
enhancement New feature or request snapshots Issues related to dbt's snapshot functionality

Comments


drewbanin commented Mar 6, 2019

Feature

Feature description

The current implementation of archival (0.13.0) effectively implements a merge using update and insert statements. Instead, we should leverage a merge abstraction (also used by incremental models) to help normalize the implementation of archival.

There are a few benefits of using a merge here:

  1. It is an atomic way of performing archival on databases like Snowflake[1] and BigQuery
  2. The database can presumably do less work if the inserts and updates are specified in a single query
  3. It should greatly simplify the archival materialization SQL

Reasons not to do this:

  1. if it ain't broke.....
  2. merge is not implemented for all adapters. We'll need to build out a merge abstraction for Redshift, Postgres, et al., which could be complicated
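
A merge abstraction for adapters without a native merge could emulate the same semantics with multiple statements in one transaction. This is a hypothetical sketch only, assuming illustrative table and column names (`archive_dest`, `staging_changes`, `unique_key`, `valid_to`, `updated_at`) rather than dbt's actual generated SQL:

```sql
-- Hypothetical sketch: emulating merge semantics on Postgres/Redshift,
-- which lack a native MERGE statement. All names are illustrative.
begin;

-- close out existing current records that have a newer version in staging
update archive_dest
set valid_to = src.updated_at
from staging_changes as src
where archive_dest.unique_key = src.unique_key
  and archive_dest.valid_to is null
  and src.updated_at > archive_dest.updated_at;

-- insert the new and changed records as the current versions
insert into archive_dest (unique_key, payload, updated_at, valid_to)
select src.unique_key, src.payload, src.updated_at, null
from staging_changes as src;

commit;

```

Wrapping both statements in one transaction keeps the two steps from interleaving with a concurrent run on databases with transactional DDL/DML, which is the property a single merge gives for free.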

[1] Original issue description, snowflake specific

If two archive jobs run simultaneously on Snowflake, duplicate records can be inserted into the archive destination table. This problem can be circumvented with a merge.

The problem here is that archival is currently implemented as:

  1. create temp table
  2. insert
  3. update

If two jobs run at the same time, they will both create identical temp tables (how do these not conflict?). When the jobs proceed to insert/update data, they will both duplicate work in the insert + update steps, resulting in duplicated data being inserted into the destination table.

Because a proper Snowflake merge would happen as a single atomic operation, two merges that are serialized would still result in the intended behavior. In this approach, dbt wouldn’t use a temp table. Instead, the merge would be responsible for finding new records to merge, inserting, and updating all at once. The second serialized merge would find no changes to merge, and would exit without modifying the destination table.
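
The single-statement approach described above might look roughly like the following. This is a simplified, hypothetical sketch (not dbt's actual generated SQL); the table and column names are illustrative:

```sql
-- Hypothetical sketch of atomic archival via MERGE on Snowflake.
-- All names are illustrative.
merge into archive_dest as dest
using staging_changes as src
  on dest.unique_key = src.unique_key
 and dest.valid_to is null
-- a newer version exists: close out the current record
when matched and src.updated_at > dest.updated_at then
  update set dest.valid_to = src.updated_at
-- no current record exists: insert the new version
when not matched then
  insert (unique_key, payload, updated_at, valid_to)
  values (src.unique_key, src.payload, src.updated_at, null);
```

Because the whole statement executes atomically, a second concurrent or serialized run finds no rows left to match or insert and exits without modifying the destination table.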

@drewbanin drewbanin added enhancement New feature or request estimate: 8 snapshots Issues related to dbt's snapshot functionality labels Mar 6, 2019
@drewbanin drewbanin added this to the Wilt Chamberlain milestone Mar 6, 2019
@drewbanin drewbanin self-assigned this Mar 6, 2019
@tayloramurphy

Merge docs for reference https://docs.snowflake.net/manuals/sql-reference/sql/merge.html

Does dbt archive have tests for the archived data as well? It would be nice to either have built-in ones or be able to specify custom ones.

@drewbanin
Contributor Author

We have a whole slew of archive-related issues coming down in 0.13.1: https://github.com/fishtown-analytics/dbt/issues?q=is%3Aopen+is%3Aissue+label%3Aarchive

I don't think testing for archived tables exists yet, though that will certainly be doable once they act more like proper dbt resources! Check out this issue in particular. I just added a note to that issue to investigate testing for archives.

@drewbanin drewbanin changed the title Use a merge statement for archival on Snowflake Implement archival with a "merge" statement Mar 23, 2019