Is this a new bug in dbt-core?
I have searched the existing issues, and I could not find an existing issue for this bug
Current Behavior
Cartesian-join-based deletion causes data to spill to disk, which heavily degrades performance.
The delete statement looks like:
delete from analytics_dev.dbt_aescay.my_model DBT_INTERNAL_TARGET
using analytics_dev.dbt_aescay.my_model__dbt_tmp DBT_TMP_TARGET
where (
DBT_INTERNAL_TARGET.event_at >= TIMESTAMP '2024-10-14 00:00:00+00:00'
and DBT_INTERNAL_TARGET.event_at < TIMESTAMP '2024-10-15 00:00:00+00:00'
);
But nothing in the where clause references my_model__dbt_tmp, so the using clause contributes nothing except a Cartesian join between the target table and the temp table.
We can simplify this logic and improve performance by instead doing:
delete from <existing> where <date range>;
insert into <existing> from <new data for same date range>;
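For the example above, that would look something like this (a sketch reusing the table names and batch boundaries from the statement above; the select * and the assumption that the temp table holds only this batch's rows are illustrative, not what dbt generates today):
-- delete the batch's time range from the existing table
delete from analytics_dev.dbt_aescay.my_model
where event_at >= TIMESTAMP '2024-10-14 00:00:00+00:00'
  and event_at < TIMESTAMP '2024-10-15 00:00:00+00:00';
-- re-insert the batch's rows from the temp table
insert into analytics_dev.dbt_aescay.my_model
select * from analytics_dev.dbt_aescay.my_model__dbt_tmp;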
One advantage of microbatch is that we know in advance the exact boundaries of every batch (time range, cf. "static" insert_overwrite).
In a world where we support "microbatch merge" models (i.e. updating batches by upserting on unique_key rather than replacing each batch wholesale), we would want to join (using) on a unique_key match, like so:
delete from analytics_dev.dbt_aescay.my_model DBT_INTERNAL_TARGET
using analytics_dev.dbt_aescay.my_model__dbt_tmp DBT_TMP_TARGET
where DBT_INTERNAL_TARGET.event_id = DBT_TMP_TARGET.event_id
and (
DBT_INTERNAL_TARGET.event_at >= TIMESTAMP '2024-10-14 00:00:00+00:00'
and DBT_INTERNAL_TARGET.event_at < TIMESTAMP '2024-10-15 00:00:00+00:00'
);
But this shouldn't be the default assumption.
Expected Behavior
We should delete this line (the using <temp table> clause in the generated delete statement).
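In other words, the generated statement would presumably become the same delete as above, minus the using line (a sketch, not confirmed against dbt's macros):
delete from analytics_dev.dbt_aescay.my_model DBT_INTERNAL_TARGET
where (
    DBT_INTERNAL_TARGET.event_at >= TIMESTAMP '2024-10-14 00:00:00+00:00'
    and DBT_INTERNAL_TARGET.event_at < TIMESTAMP '2024-10-15 00:00:00+00:00'
);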
Steps To Reproduce
See here.
Relevant log output
No response
Environment
Which database adapter are you using with dbt?
snowflake
Additional Context
No response