Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding net-new column to explicit check_cols breaks snapshot #3146

Closed
1 of 5 tasks
aspfohl opened this issue Mar 5, 2021 · 4 comments · Fixed by #4893
Closed
1 of 5 tasks

Adding net-new column to explicit check_cols breaks snapshot #3146

aspfohl opened this issue Mar 5, 2021 · 4 comments · Fixed by #4893
Labels
bug Something isn't working good_first_issue Straightforward + self-contained changes, good for new contributors! snapshots Issues related to dbt's snapshot functionality

Comments

@aspfohl
Copy link

aspfohl commented Mar 5, 2021

Describe the bug

Snapshots using the "check" strategy with hard deletes turned on and explicit check_cols break when adding additional columns

Steps To Reproduce

Initially create & run sql:

{% snapshot example %}
{{config(strategy='check', unique_key='id', check_cols=['name'])}}
select 1 as id, 'foo' as name;
{% endsnapshot %}
dbt snapshot -s example

Update the sql to check an additional column:

{% snapshot example %}
{{config(strategy='check', unique_key='id', check_cols=['name', 'other'])}}
select 1 as id, 'foo' as name, 'bar' as other;
{% endsnapshot %}
dbt snapshot -s example

Completed with 1 error and 0 warnings:

Database Error in snapshot example (snapshots/**/example.sql)
  000904 (42000): SQL compilation error: error line 80 at position 13
  invalid identifier 'SNAPSHOTTED_DATA.OTHER'

Expected behavior

https://docs.getdbt.com/docs/building-a-dbt-project/snapshots

Creating new columns from the source query in the destination table

Screenshots and log output

see example

System information

Which database are you using dbt with?

  • postgres
  • redshift
  • bigquery
  • snowflake
  • other (specify: ____________)

The output of dbt --version:

installed version: 0.19.0
   latest version: 0.19.0

Up to date!

Plugins:
  - bigquery: 0.19.0
  - redshift: 0.19.0
  - snowflake: 0.19.0
  - postgres: 0.19.0

The operating system you're using:
macos

The output of python --version:
Python 3.7.10

Additional context

Tangentially related resolved issue: #1797

@aspfohl aspfohl added bug Something isn't working triage labels Mar 5, 2021
@jtcohen6 jtcohen6 added good_first_issue Straightforward + self-contained changes, good for new contributors! snapshots Issues related to dbt's snapshot functionality and removed triage labels Mar 5, 2021
@jtcohen6
Copy link
Contributor

jtcohen6 commented Mar 5, 2021

Thanks for the detailed writeup @annasmith370!

I don't believe this has to do with invalidate_hard_deletes; I was able to reproduce the same error without turning on that config! The issue here is that, if you're explicitly listing a set of check_cols, dbt assumes that those columns exist in both the snapshot query (new data) and the existing snapshot table (old data).

This works if you split the second step into two:

Step 1: run initial snapshot
{% snapshot example %}
{{config(strategy='check', unique_key='id', check_cols=['name'])}}
select 1 as id, 'foo' as name
{% endsnapshot %}
$ dbt snapshot
id name dbt_scd_id dbt_updated_at dbt_valid_from dbt_valid_to
1 foo 1e2e0f8d847887872cad7a208465e459 2021-03-05 06:42:39 2021-03-05 06:42:39
Step 2: add new column to query, snapshot
{% snapshot example %}
{{config(strategy='check', unique_key='id', check_cols=['name'])}}
select 1 as id, 'foo' as name, 'bar' as other
{% endsnapshot %}
$ dbt snapshot
id name dbt_scd_id dbt_updated_at dbt_valid_from dbt_valid_to other
1 foo 1e2e0f8d847887872cad7a208465e459 2021-03-05 06:42:39 2021-03-05 06:42:39
Step 3: add new column to check_cols, snapshot
{% snapshot example %}
{{config(strategy='check', unique_key='id', check_cols=['name', 'other'])}}
select 1 as id, 'foo' as name, 'bar' as other
{% endsnapshot %}
id name dbt_scd_id dbt_updated_at dbt_valid_from dbt_valid_to other
1 foo 1e2e0f8d847887872cad7a208465e459 2021-03-05 06:42:39 2021-03-05 06:42:39 2021-03-05 06:43:43
1 foo 39801f89a6518598e4b1a1532fb871cb 2021-03-05 06:43:43 2021-03-05 06:43:43   bar

Good call on finding #1797: When check_cols = 'all' and a snapshot query is just select * from a table, dbt needs to be smarter and determine whether a column has been added since the last time dbt snapshot ran, because the column is appearing in the query (new data) but not in the target table (old data). If there is any new column, then all currently valid records in the snapshot will be marked invalid, the column will be added (null for historical records), and all new records will be added. Effectively, it accomplishes the same as steps 2 + 3 above, in one combined step.

So the question is: When check_cols is not 'all', but an explicit list of columns, should dbt check to see if any of those columns is new (i.e. does not exist in the target table), and if so, mark all rows for invalidation and update? I think it would make sense. It would require similar logic to what's performed in the 'all' case, but using the check_cols_config instead of query_columns, and with some thought around casing/quoting:

https://github.com/fishtown-analytics/dbt/blob/b70fb543f5825744b7705009aef815b8071ab516/core/dbt/include/global_project/macros/materializations/snapshot/strategies.sql#L117-L129

Is this code you'd be interested in contributing? :)

@jtcohen6 jtcohen6 changed the title Adding new columns breaks hard delete snapshots Adding net-new column to explicit check_cols breaks snapshot Mar 5, 2021
@kochalex
Copy link

kochalex commented Jul 8, 2021

With redshift & dbt 0.19.1 I was getting the following error: (To help others else find this issue)

Database Error in snapshot snapshot_name_snapshot (snapshots/snapshot_name.sql)
  column snapshotted_data.new_column_name does not exist

@github-actions
Copy link
Contributor

github-actions bot commented Jan 5, 2022

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.

@github-actions github-actions bot added the stale Issues that have gone stale label Jan 5, 2022
@dbeatty10
Copy link
Contributor

This issue was closed for being stale. Re-opening since there is related activity in #4893

@dbeatty10 dbeatty10 reopened this Apr 21, 2022
@github-actions github-actions bot removed the stale Issues that have gone stale label Apr 22, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working good_first_issue Straightforward + self-contained changes, good for new contributors! snapshots Issues related to dbt's snapshot functionality
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants