-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix for re-used check cols in snapshots #1614
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great
|
||
|
||
{% macro default__snapshot_hash_arguments(args) %} | ||
{% macro default__snapshot_hash_arguments(args) -%} | ||
md5({% for arg in args %} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you want to make this {%- for arg in args -%}...{%- endfor %}
, or does that shove all of this onto one line?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah - i was trying to make the whitespace a little cleaner here - it was pretty hard to find what I was looking for in the logs while it was all on one line (especially with the new subquery).
This output looks something like:
source_data as (
select *,
md5(
coalesce(cast(number as varchar ), '')
|| '|' ||
coalesce(cast((
select count(*) from snapshotted_data
where snapshotted_data.dbt_unique_key = number
) as varchar ), '')
|| '|' ||
coalesce(cast(name as varchar ), '')
) as dbt_scd_id,
Which isn't the most beautiful SQL I've ever seen, but should be ok
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding the whitespace controls makes it look like this:
source_data as (
select *,
md5(coalesce(cast(number as varchar ), '')
|| '|' || coalesce(cast((
select count(*) from snapshotted_data
where snapshotted_data.dbt_unique_key = number
) as varchar ), '')
|| '|' || coalesce(cast(name as varchar ), '')
) as dbt_scd_id,
),
I will update!
{% if target_exists %} | ||
{% set row_version -%} | ||
( | ||
select count(*) from {{ snapshotted_rel }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens if there are two rows with the same primary key in one snapshot operation? Obviously user error, but do we do anything like error about it, or just give bad results?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that will result in "undefined behavior" -- the exact outcome depends on:
- if the specified
check_cols
values are duplicated as well, or if they differ across the duplicated primary key - the exact state of the database
Whichever way you slice it, dbt is going to do something undesirable if the specified primary key is duplicated in the snapshot operation though. On pg/redshift, you'll end up with dupes in your snapshot table. On snowflake/bq, you should see a nondeterministic merge
error.
We could:
- run a uniqueness check on the staging temp table before loading into the destination table
- use a
row_number()
(or similar) to dedupe records when building the staging table
- this would solve some problems but create others!
- leave it to the user to verify that their snapshot
unique_keys
are unique
- updating documentation
- making it possible to specify a
schema.yml
spec for snapshots?
@@ -202,15 +202,31 @@ def raw_execute(self, sql, fetch=False): | |||
|
|||
def execute(self, sql, auto_begin=False, fetch=None): | |||
# auto_begin is ignored on bigquery, and only included for consistency | |||
_, iterator = self.raw_execute(sql, fetch=fetch) | |||
query_job, iterator = self.raw_execute(sql, fetch=fetch) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice!
to_hex(md5(concat({% for arg in args %}coalesce(cast({{ arg }} as string), ''){% if not loop.last %}, '|',{% endif %}{% endfor %}))) | ||
{% endmacro %} | ||
{% macro bigquery__snapshot_hash_arguments(args) -%} | ||
to_hex(md5(concat({% for arg in args %} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have similar for-loop whitespace formatting questions as I did on snowflake, here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really good points @beckjake - thanks for the feedback! Let me know if you have any thoughts about the right path forward
|
||
|
||
{% macro default__snapshot_hash_arguments(args) %} | ||
{% macro default__snapshot_hash_arguments(args) -%} | ||
md5({% for arg in args %} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah - i was trying to make the whitespace a little cleaner here - it was pretty hard to find what I was looking for in the logs while it was all on one line (especially with the new subquery).
This output looks something like:
source_data as (
select *,
md5(
coalesce(cast(number as varchar ), '')
|| '|' ||
coalesce(cast((
select count(*) from snapshotted_data
where snapshotted_data.dbt_unique_key = number
) as varchar ), '')
|| '|' ||
coalesce(cast(name as varchar ), '')
) as dbt_scd_id,
Which isn't the most beautiful SQL I've ever seen, but should be ok
|
||
|
||
{% macro default__snapshot_hash_arguments(args) %} | ||
{% macro default__snapshot_hash_arguments(args) -%} | ||
md5({% for arg in args %} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding the whitespace controls makes it look like this:
source_data as (
select *,
md5(coalesce(cast(number as varchar ), '')
|| '|' || coalesce(cast((
select count(*) from snapshotted_data
where snapshotted_data.dbt_unique_key = number
) as varchar ), '')
|| '|' || coalesce(cast(name as varchar ), '')
) as dbt_scd_id,
),
I will update!
{% if target_exists %} | ||
{% set row_version -%} | ||
( | ||
select count(*) from {{ snapshotted_rel }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that will result in "undefined behavior" -- the exact outcome depends on:
- if the specified
check_cols
values are duplicated as well, or if they differ across the duplicated primary key - the exact state of the database
Whichever way you slice it, dbt is going to do something undesirable if the specified primary key is duplicated in the snapshot operation though. On pg/redshift, you'll end up with dupes in your snapshot table. On snowflake/bq, you should see a nondeterministic merge
error.
We could:
- run a uniqueness check on the staging temp table before loading into the destination table
- use a
row_number()
(or similar) to dedupe records when building the staging table
- this would solve some problems but create others!
- leave it to the user to verify that their snapshot
unique_keys
are unique
- updating documentation
- making it possible to specify a
schema.yml
spec for snapshots?
@drewbanin Given those failure modes, I think it's fine to just document this pattern and leave it to the user. It sounds like there won't be any loss of data: on BigQuery/Snowflake you'll get an error and on Postgres/Redshift you'll get a pretty dumb database state, but that's fine. If users want to run a snapshot and ensure uniqueness before running, they can always set the table up as a source or model and provide appropriate schema tests on the source, right? |
I am late in the game here. But why can't you just use {% set scd_id_expr = snapshot_hash_arguments([primary_key, updated_at]) %} in macro snapshot_check_strategy ? Then:
Or am I missing something here? |
Hey @mikaelene - the |
Hi @drewbanin . I totally got that. But in the checkstrategy:
If you merge themn on You will end up with the right inserts for not matched and the right updates for matched and you would not need to:
|
What's value are you thinking we'd use for the |
If you read the logs of a check_cols run you see that the first SQL executed is to identify changed rows. The
The 'insert'-rows will have an scd_id_expr of hash( The I'm on a business trip and don't have my dev-pc for running the tests. But if you checkout 0.14.0 and change row 110 to Sorry for bothering you on this old fix! I stumbled over it on a link somewhere and couldn't help to notice that there might be an easier solution. Even if there are a possibillity that I am missing something vital :-) |
Hey @mikaelene - thanks for this! I think the key thing here is that dbt performs this merge in two different ways:
You can check out the source code for the postgres + redshift approach as well as the bigquery + snowflake approach in the dbt source code. In the BigQuery + Snowflake approach (the default!) the We pursued the approach presented in this PR in order to keep the core snapshot logic consistent across adapters while still supporting each of postgres, redshift, snowflake, and bigquery. I do think you're right: the logic here isn't strictly necessary on redshift and postgres. It may turn out that in the future, it will make sense to let the implementation for snapshots vary more across databases. For now though, there's a ton of merit to keeping things consistent. Thanks for taking the time to write this up! Happy to answer any further questions if you have them. |
Hi @drewbanin ,Thanks for getting back to me, event if i am annoying and stubborn :-). I am using the default (merge) approach in my sqlserver-adapter and it has been working for us since 0.14.0RC. I don't have access to Snowflake or BigQuery so I can't test on them. I made that change in my adapter after noticing this on 0.14.0 before you had started working on this PR. The Consider this table to be snappshotted:
The first run it will only have an insert-record in the source-table for the merge with a dbt_scd_id of
If the table change to
The second run it will have below in the source-merge-table:
If it change back to
The third run it will have below in the source-merge-table:
The resulting snappshotted table will have something like:
This approach will be database agnostic as well. |
Hey @mikaelene - thanks for going really deep here! I think you're totally right! Sorry for being obtuse earlier - the design you're advocating for is both simpler and faster than the approach implemented in this PR! I needed to play around with this a lot to convince myself that you are right... but it definitely appears to me that you are :) We're deep in the RC period for the 0.14.1 release, but I'm going to see if we can slip this update in. The simpler we can keep the snapshot code, the better IMO. Thanks so much for taking the time! |
This was the key thing I didn't internalize when reading your comments. I had it in my head that the |
Great to hear @drewbanin! I saw you already merged the changes. I'll give it a spin shortly :-) |
Related to #1590
An issue existed in Snapshots on 0.14.0 for BigQuery and Snowflake which caused
dbt snapshot
to work incorrectly when thecheck
strategy was used in a specific scenario. If the check cols cycled between values and ended up in a previously known state, then dbt would set thedbt_valid_to
date to be non-null for the "current" record, but no corresponding "new" record would be inserted.This is a function of how dbt merges snapshot changes into the destination table (pseudocode):
The problem here is that
when not matched
wouldn't be evaluated for these records, because there was a match on thedbt_scd_id
field. This PR mitigates that problem by including a "row version" in thedbt_scd_id
hash when thecheck_cols
strategy is specified. For incremental snapshot builds, thecheck
strategy will incorporate the number of instances of the specifieddbt_unique_key
into thedbt_scd_id
hash. So, a set of reused values for thecheck_cols
will result in a different hash on a subsequent invocation ofdbt snapshot
.This PR also includes improvements to the
status
shown for BigQuery create/insert/merge log lines, as it was helpful in debugging.Edit: See also #1709 for an improved approach to mitigating this problem (thanks to @mikaelene for the suggestion and sample code!)