[Bug] Initial models builds with microbatch incremental strategy resulting in row duplication #10924

Closed
2 tasks done
bthomson22 opened this issue Oct 25, 2024 · 1 comment
Labels
  • bug: Something isn't working
  • incremental: Incremental modeling with dbt
  • microbatch: Issues related to the microbatch incremental strategy
  • pre-release: Bug not yet in a stable release
  • wontfix: Not a bug or out of scope for dbt-core

Comments

bthomson22 commented Oct 25, 2024

Is this a new bug in dbt-core?

  • I believe this is a new bug in dbt-core
  • I have searched the existing issues, and I could not find an existing issue for this bug

Current Behavior

When creating a new model using the microbatch incremental strategy, the initial tmp table is a full copy of the source data and is re-inserted, in full, for every batch. The delete queries remove data from the appropriate batch range, but the inserts are full-table loads.

The impact of this is heavily duplicated data on initial model builds, and increasing batch query times.
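The duplication described above can be reproduced with a toy model. This is a minimal sketch in plain Python, not dbt's implementation: the function name `run_unfiltered_microbatch`, the `event_date` column, and the three-row source are all hypothetical, and the daily batch size is an assumption.

```python
from datetime import date, timedelta

# Hypothetical source: one row per day for three days.
source = [
    {"id": i, "event_date": date(2024, 10, 1) + timedelta(days=i)}
    for i in range(3)
]


def run_unfiltered_microbatch(source_rows, batch_starts):
    """Simulate daily batches whose insert step has no event_time filter."""
    target = []
    for start in batch_starts:
        end = start + timedelta(days=1)
        # Delete step: remove only the rows inside the current batch window.
        target = [r for r in target if not (start <= r["event_date"] < end)]
        # Insert step: the tmp table holds the whole source, so all of it is inserted.
        target.extend(dict(r) for r in source_rows)
    return target


batch_starts = [row["event_date"] for row in source]
result = run_unfiltered_microbatch(source, batch_starts)
print(len(source), "source rows ->", len(result), "target rows")
```

Each batch only clears its own window, so rows outside that window pile up with every full insert. In this sketch the three source rows end up as six target rows, and the gap widens as the number of batches grows.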

Expected Behavior

There are generally two ways I'd expect this to work:

  1. The initial table creation is a one-time full_refresh, similar to existing incremental models (perhaps not ideal for very large tables).
  2. The tmp table creation mirrors the event_time window of each batch and only inserts values for the same period that was deleted (probably ideal). Currently there is no event_time filter on these tmp tables.

Steps To Reproduce

  1. Create a new model with the microbatch incremental strategy. Ensure a table doesn't already exist with the same name.
  2. Build the model without any event-time conditions, such as dbt build -s model_name
  3. Review the logs and see the total table size re-inserted for each batch.

Relevant log output

No response

Environment

- OS: macOS Sonoma 14.6.1
- Python: 3.8.10
- dbt: CLI 0.38.18 (versionless)

Which database adapter are you using with dbt?

snowflake

Additional Context

No response

bthomson22 added the bug and triage labels on Oct 25, 2024
dbeatty10 added the incremental, pre-release, and microbatch labels on Oct 25, 2024
@graciegoheen
Contributor

Looks like the issue was that there was no event_time configured for the direct parent(s) of the model.

Closing in favor of this issue #10926
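
To illustrate why a missing event_time on the parent leads to the behavior reported above, here is a rough sketch in plain Python. It is a toy model under stated assumptions, not dbt's code: `batch_input`, `run_microbatch`, and the `event_date` column are hypothetical names, and batches are assumed to be daily. The idea is that a batch can only filter a parent on a column it has been told about.

```python
from datetime import date, timedelta


def batch_input(parent_rows, parent_event_time, start, end):
    """Return the parent rows one batch reads.

    parent_event_time is the column configured as event_time on the parent,
    or None when nothing is configured (the situation in this issue).
    """
    if parent_event_time is None:
        # Nothing to filter on, so the batch reads the whole parent.
        return [dict(r) for r in parent_rows]
    return [dict(r) for r in parent_rows if start <= r[parent_event_time] < end]


def run_microbatch(parent_rows, parent_event_time, batch_starts):
    """Delete each batch window from the target, then insert that batch's input."""
    target = []
    for start in batch_starts:
        end = start + timedelta(days=1)
        target = [r for r in target if not (start <= r["event_date"] < end)]
        target.extend(batch_input(parent_rows, parent_event_time, start, end))
    return target


# Hypothetical parent: one row per day for three days.
parent = [
    {"id": i, "event_date": date(2024, 10, 1) + timedelta(days=i)}
    for i in range(3)
]
days = [r["event_date"] for r in parent]

unconfigured = run_microbatch(parent, None, days)
configured = run_microbatch(parent, "event_date", days)
print("no event_time on parent:", len(unconfigured), "rows")
print("event_time on parent:", len(configured), "rows")
```

With the parent's event_time set, each batch inserts only the rows for the window it just deleted, so the target matches the parent row for row. Without it, every batch re-inserts the full parent.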

graciegoheen closed this as not planned on Oct 28, 2024
dbeatty10 added the wontfix label and removed the triage label on Oct 29, 2024