
[CT-3493] [Bug] unique_key list incremental model has performance issues on the delete phase #150

Open

nfarinha opened this issue Dec 13, 2023 · 7 comments · May be fixed by ataft/dbt-core#1 or #151
Labels
bug Something isn't working incremental Incremental modeling with dbt performance tracking_pr

Comments


nfarinha commented Dec 13, 2023

Is this a new bug in dbt-core?

  • I believe this is a new bug in dbt-core
  • I have searched the existing issues, and I could not find an existing issue for this bug

Current Behavior

The scenario is as follows:
• A model that is loading 20k rows
• It is set up as incremental with the delete+insert strategy
• The unique_key is an array of 2 columns
• The 2 unique_key columns define a single “partition” containing all of the rows (200k)
• The tmp table creation executes fast, as expected
• The delete ran for hours until we cancelled it
• The execution plan shows that the delete statement is combining 20k * 20k = 200 million rows

The configuration is:

```
{{
    config(
        materialized='incremental',
        incremental_strategy='delete+insert',
        unique_key=['version_code', 'bu_agg']
    )
}}
```

The generated delete statement assumes that the key is truly unique. That is probably not the case if you are using "delete+insert".

```
delete from big_table
    using big_table__dbt_tmp
    where (
        big_table__dbt_tmp.version_code = big_table.version_code
        and big_table__dbt_tmp.bu_agg = big_table.bu_agg
    )
```
[Screenshot: query execution plan for the delete statement]

Expected Behavior

The delete statement should not assume that the key is truly unique. My suggestion is to use:

```
delete from big_table
where (version_code, bu_agg) in (
    select version_code, bu_agg from big_table__dbt_tmp
)
```

Steps To Reproduce

• Set up the model as incremental with the delete+insert strategy (a minimal model sketch is shown below)
• Use a unique_key that is a list of 2 columns
• Build the model
• Check the execution plan
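
For reference, a minimal incremental model along these lines might look like the sketch below (the file name, upstream ref, and non-key columns such as amount and loaded_at are hypothetical, not taken from the report):

```
-- models/big_table.sql (hypothetical file name)
{{
    config(
        materialized='incremental',
        incremental_strategy='delete+insert',
        unique_key=['version_code', 'bu_agg']
    )
}}

select
    version_code,
    bu_agg,
    amount,
    loaded_at
from {{ ref('stg_source') }}  -- hypothetical upstream model

{% if is_incremental() %}
-- on incremental runs, only pick up rows newer than what is already in the target
where loaded_at > (select max(loaded_at) from {{ this }})
{% endif %}
```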

Relevant log output

No response

Environment

- dbt:Cloud 1.7

Which database adapter are you using with dbt?

No response

Additional Context

No response

@nfarinha nfarinha added bug Something isn't working triage labels Dec 13, 2023
@github-actions github-actions bot changed the title [Bug] unique_key list incremental model has performance issues on the delete phase [CT-3493] [Bug] unique_key list incremental model has performance issues on the delete phase Dec 13, 2023
@dbeatty10 dbeatty10 self-assigned this Dec 13, 2023
@dbeatty10 (Contributor)

Thanks for reporting this @nfarinha !

Based on your screenshots of the query plan, it looks like you are using dbt-snowflake. Let's dive into the details.

Non-unique unique_key

The delete statement is assuming that the key is really unique. That probably is not the case if you are using "delete+insert".

The folks who have contributed to dbt-labs/docs.getdbt.com#4355 are saying something similar about use cases for the delete+insert incremental strategy.

Our documentation currently states that unique_key is expected to be truly unique here, but then it also gives guidance on using delete+insert within dbt-snowflake when it isn't truly unique.

For the sake of this discussion, let's assume that we want to support the use case of non-unique keys with delete+insert specifically, but expecting truly unique keys with the other incremental strategies.

The delete+insert strategy in dbt-snowflake

The delete+insert strategy in dbt-snowflake defers to the implementation of default__get_delete_insert_merge_sql within dbt-core.

The delete+insert strategy in dbt-core

It only does a delete portion if unique_key is defined; otherwise, it only does an insert.

When it does a delete, it has two different code paths introduced in dbt-labs/dbt-core#4858 depending on if unique_key is a list or not.

  1. If unique_key is a list:

https://github.com/dbt-labs/dbt-core/blob/c2bc2f009bbeeb46b3c69d082ab4d485597898af/core/dbt/include/global_project/macros/materializations/models/incremental/merge.sql#L65-L77

  2. If unique_key is a single column:

https://github.com/dbt-labs/dbt-core/blob/c2bc2f009bbeeb46b3c69d082ab4d485597898af/core/dbt/include/global_project/macros/materializations/models/incremental/merge.sql#L79-L89
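
Paraphrasing those two code paths (an approximation of the rendered SQL using the table and column names from this issue, not the exact macro output), the generated deletes look roughly like this:

```
-- 1. unique_key as a list: join the target to the tmp relation with one
--    equality predicate per key column (the statement reported above)
delete from big_table
    using big_table__dbt_tmp
    where (
        big_table__dbt_tmp.version_code = big_table.version_code
        and big_table__dbt_tmp.bu_agg = big_table.bu_agg
    );

-- 2. unique_key as a single column: a subquery membership test
delete from big_table
    where (version_code) in (
        select (version_code)
        from big_table__dbt_tmp
    );
```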

Your suggestion

Your suggestion is to use:

```
delete from big_table
where (version_code, bu_agg) in (
    select version_code, bu_agg from big_table__dbt_tmp
)
```

Your suggestion looks really close to that 2nd code path!

So it would essentially involve eliminating the 1st code path in favor of the 2nd (with some small tweaks, of course).

Summary

As best I can tell, the current logic gives correct results on tiny data sets, but it quickly runs into severe performance issues with normal-sized data sets.

We'd need to try out your suggestion with a similar-sized data set (20K pre-existing x 20K updates) to see if it executes in a reasonable amount of time.
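
One way to stage a comparable test case in Snowflake could be something like the sketch below (the single shared key pair is an assumption chosen to reproduce the worst case, and the payload_id column is hypothetical):

```
-- ~20K pre-existing rows that all share one (version_code, bu_agg) pair
create or replace table big_table as
select
    'v1'   as version_code,
    'bu_1' as bu_agg,
    seq4() as payload_id
from table(generator(rowcount => 20000));

-- ~20K "updates", staged the way dbt stages the tmp relation
create or replace table big_table__dbt_tmp as
select * from big_table;
```

Running both the current list-based delete and the suggested IN-based delete against these two tables should make the difference in the query plans easy to compare.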

@dbeatty10 dbeatty10 removed the triage label Dec 13, 2023
@dbeatty10 dbeatty10 removed their assignment Dec 13, 2023

nfarinha commented Dec 13, 2023 via email


ataft commented Jan 25, 2024

I can absolutely confirm that the current multi-column delete+insert query strategy is beyond inefficient. My tests never finish, even with small amounts of data (~100K rows). The proposed strategy appears to be the best approach:

```
delete from table1
where (col1, col2) in (
    select distinct col1, col2 from table1_tmp
)
```

Some initial/quick testing suggests it is faster with distinct in there.

Any ideas on when this fix might be applied?


ataft commented Jan 25, 2024

I had to rewrite this macro for an urgent use-case, so here's the improved code (faster and cleaner):

```
{% macro default__get_delete_insert_merge_sql(target, source, unique_key, dest_columns, incremental_predicates) -%}

    {%- set dest_cols_csv = get_quoted_csv(dest_columns | map(attribute="name")) -%}

    {% if unique_key %}
        {% if unique_key is string %}
            {% set unique_key = [unique_key] %}
        {% endif %}

        {%- set unique_key_str = unique_key | join(', ') -%}

        delete from {{ target }}
        where ({{ unique_key_str }}) in (
            select distinct {{ unique_key_str }}
            from {{ source }}
        )
        {%- if incremental_predicates %}
            {% for predicate in incremental_predicates %}
                and {{ predicate }}
            {% endfor %}
        {%- endif -%};

    {% endif %}

    insert into {{ target }} ({{ dest_cols_csv }})
    (
        select {{ dest_cols_csv }}
        from {{ source }}
    )

{%- endmacro %}
```
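
For the model in the original report, this macro would render to roughly the following (quoting omitted, and the amount column is a hypothetical stand-in for the model's non-key columns):

```
delete from big_table
where (version_code, bu_agg) in (
    select distinct version_code, bu_agg
    from big_table__dbt_tmp
);

insert into big_table (version_code, bu_agg, amount)
(
    select version_code, bu_agg, amount
    from big_table__dbt_tmp
)
```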

@dataders dataders transferred this issue from dbt-labs/dbt-core Apr 10, 2024
ataft added a commit to ataft/dbt-adapters that referenced this issue Apr 10, 2024
resolves dbt-labs#150

Problem
The delete query for the 'delete+insert' incremental_strategy with 2+ unique_key columns is VERY inefficient. In many cases, it will hang and never return even when deleting small amounts of data (<100K rows).

Solution
Improve the query by switching to a much more efficient delete strategy:

```
delete from table1
where (col1, col2) in (
    select distinct col1, col2 from table1_tmp
)
```
@ataft ataft linked a pull request Apr 10, 2024 that will close this issue
@dbeatty10 dbeatty10 added the incremental Incremental modeling with dbt label May 30, 2024

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

@github-actions github-actions bot added the Stale label Nov 27, 2024
@b-per b-per removed the Stale label Nov 27, 2024

b-per commented Nov 27, 2024

Removing the stale label as this is still relevant


tsrp25 commented Nov 28, 2024

The same issue persists in dbt Cloud, but dbt support says that they are not planning to change the delete+insert approach anytime in the future.
