Is this a new bug in dbt-core?
I have searched the existing issues, and I could not find an existing issue for this bug
Current Behavior
Cartesian-join-based deletion causes data to spill to disk, which heavily degrades performance.
The delete statement looks like:
delete from analytics_dev.dbt_aescay.my_model DBT_INTERNAL_TARGET
using analytics_dev.dbt_aescay.my_model__dbt_tmp DBT_TMP_TARGET
where (
DBT_INTERNAL_TARGET.event_at >= TIMESTAMP '2024-10-14 00:00:00+00:00'
and DBT_INTERNAL_TARGET.event_at < TIMESTAMP '2024-10-15 00:00:00+00:00'
);
But nothing in the where clause references my_model__dbt_tmp, so the using clause contributes nothing except a Cartesian join between the target table and the temp table.
We can simplify this logic and improve performance by instead doing:
delete from <existing> where <date range>;
insert into <existing> from <new data for same date range>;
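For the example above, that would look something like this (a sketch reusing the table names and batch boundaries from the statement above; the select * and the assumption that the temp table holds only this batch's rows are illustrative, not what dbt generates today):
-- delete the batch's time range from the existing table
delete from analytics_dev.dbt_aescay.my_model
where event_at >= TIMESTAMP '2024-10-14 00:00:00+00:00'
  and event_at < TIMESTAMP '2024-10-15 00:00:00+00:00';
-- re-insert the batch's rows from the temp table
insert into analytics_dev.dbt_aescay.my_model
select * from analytics_dev.dbt_aescay.my_model__dbt_tmp;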
One advantage of microbatch is that we know in advance the exact boundaries of every batch (time range, cf. "static" insert_overwrite).
In a world where we support "microbatch merge" models (i.e. updating batches by upserting on unique_key rather than replacing each batch wholesale), we would want to join (using) on a unique_key match, like so:
delete from analytics_dev.dbt_aescay.my_model DBT_INTERNAL_TARGET
using analytics_dev.dbt_aescay.my_model__dbt_tmp DBT_TMP_TARGET
where DBT_INTERNAL_TARGET.event_id = DBT_TMP_TARGET.event_id
and (
DBT_INTERNAL_TARGET.event_at >= TIMESTAMP '2024-10-14 00:00:00+00:00'
and DBT_INTERNAL_TARGET.event_at < TIMESTAMP '2024-10-15 00:00:00+00:00'
);
But this shouldn't be the default assumption.
Expected Behavior
We should delete this line (the using <temp table> clause in the generated delete statement).
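In other words, the generated statement would presumably become the same delete as above, minus the using line (a sketch, not confirmed against dbt's macros):
delete from analytics_dev.dbt_aescay.my_model DBT_INTERNAL_TARGET
where (
    DBT_INTERNAL_TARGET.event_at >= TIMESTAMP '2024-10-14 00:00:00+00:00'
    and DBT_INTERNAL_TARGET.event_at < TIMESTAMP '2024-10-15 00:00:00+00:00'
);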
Steps To Reproduce
See here.
Relevant log output
No response
Environment
Which database adapter are you using with dbt?
snowflake
Additional Context
No response