[CT-646] [Proposal] Streamline Incremental Strategies #5245

nathaniel-may · 2022-05-13T20:13:23Z

When defining an incremental model, there are several incremental strategies that could be used to execute it (namely "append", "delete+insert", "merge", and "insert_overwrite"), but not all of them are supported by every warehouse. Working with these incremental strategies as an adapter maintainer or as an advanced user is difficult. This ticket aims to improve the experience for both groups.

Functional Requirements

Adapter maintainers need to be able to specify which incremental strategies are supported by the warehouse and be able to supply the correct sql statement to execute if the default is not sufficient.
Advanced users that wish to specify a custom incremental strategy must be able to do so.

Behavior Today

To specify which incremental strategies are supported by the warehouse, the adapter maintainer must write a macro that returns a list of strings that name the supported strategies, and include the Jinja macro adapter__get_merge_sql, which takes a fixed set of parameters and returns the materialization SQL. This macro is often copy-pasted from core, and editing it is error-prone.
Advanced dbt users who wish to specify a custom incremental strategy must override the same boilerplate Jinja macro by copy-pasting it into their dbt project.
When an unsupported incremental strategy is specified, the user is notified via an exception during the first dbt run.

Desired Behavior

To specify which incremental strategies are supported by the warehouse, the adapter maintainer must write a macro that returns a list of strings that name the supported strategies. If the warehouse only supports a subset of the default incremental strategies and the default implementation in the global project produces sql that works on the target warehouse, no additional work needs to be done. If any of the default behavior needs to be overridden, it can be done by writing a macro that conforms to the naming convention "get_incremental_NAME_sql" where "NAME" is the string used to configure the incremental strategy (e.g. "get_incremental_append_sql"). If the warehouse has an additional incremental strategy, defining a new macro using the same "get_incremental_NAME_sql" convention will make it available to users of the adapter.
Advanced dbt users who wish to specify a custom incremental strategy will only need to create a macro that conforms to the naming convention "get_incremental_NAME_sql" that produces the correct sql for the target warehouse.
When an unsupported incremental strategy is designated, raise an error to users before runtime.
Existing projects will experience no breaking changes when they upgrade to the version of dbt that includes this new behavior.
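
As a rough illustration of the naming convention described above, here is a minimal Python sketch of how a strategy name could be mapped to a macro and checked before anything runs. It is not dbt's implementation: `MACROS`, `macro_name_for`, `resolve_strategy_macro`, and the `my_custom` strategy are all hypothetical, and plain Python callables stand in for Jinja macros.

```python
# Hypothetical stand-in for the macros visible to a project. In dbt these
# would be Jinja macros; callables are used here only for illustration.
MACROS = {
    "get_incremental_append_sql": lambda arg_dict: "-- append sql",
    "get_incremental_merge_sql": lambda arg_dict: "-- merge sql",
}


def macro_name_for(strategy: str) -> str:
    """Build the conventional macro name for a configured strategy."""
    return f"get_incremental_{strategy}_sql"


def resolve_strategy_macro(strategy: str, macros: dict):
    """Return the macro for a strategy, or raise a descriptive error."""
    name = macro_name_for(strategy)
    if name not in macros:
        raise ValueError(
            f"incremental_strategy '{strategy}' is not available: "
            f"no macro named '{name}' was found"
        )
    return macros[name]


# A custom strategy becomes available just by defining a conforming macro.
MACROS["get_incremental_my_custom_sql"] = lambda arg_dict: "-- custom sql"
```

Under this sketch, the only thing a user adds for a new strategy is one conforming macro; nothing in the lookup itself needs to change.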

Implementation

Docs
This will require a significant change to the documentation on incremental materializations.


jtcohen6 · 2022-05-16T09:39:10Z

Update: everything in this comment has been resolved, and incorporated into the main issue

@nathaniel-may This is really clearly stated, and I think you captured the many nuances well!

TODO name the macro

We were discussing validate_incremental_strategy, and that this may want to be an adapter (Python) method rather than a macro. It would work similar to "snapshot strategy dispatch", but we agreed the Jinja here is pretty jank.

TODO should this actually be a compile check?

I think this may get tricky with partial parsing: We'd need to recalculate any time a new incremental_strategy is defined, and any time a macro named get_incremental_X_sql is created/deleted. While it's definitely preferable to raise this error before the model's actual runtime, it would be worse to cause a buggy experience.

It's also possible that users completely override the incremental materialization—a thing we want them to need to do less of, but still—in which case, the validate_incremental_strategy macro/method may not be used at all.

TODO CONTINUE THIS LIST

Other things I recall from our conversation:

The get_incremental_X_sql macros should all have the same macro (function) signature, but they may need different sorts of arguments to be passed in. Rather than developing macro signatures of 10+ arguments, and requiring all macros to evolve their signature when any one of them does, we might use a single dictionary argument instead. The materialization would pass this dictionary into the macro, and the macro would return a string (the templated SQL to be run).
If a user defines their own custom strategy, that requires their own custom config, they can just pull that custom config into get_incremental_X_sql via context-global config.get('...'). That's not a good habit for us, as maintainers, but it's a valuable escape hatch for users, and avoids the need to copy-paste-edit the whole incremental materialization in their projects.
There are several other complexities with the incremental materialization (full refresh, on_schema_change, whether to create a temp table or not), and this change won't solve for all of them. It should enable us to delete a bunch of boilerplate code, though, and clear the way for examining those other discrepancies that exist today.
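
The single-dictionary signature from the first point above might look roughly like the following Python sketch. The key names (`target_relation`, `temp_relation`, `unique_key`) and the function names are assumptions chosen for illustration, not a confirmed contract.

```python
def append_macro(arg_dict: dict) -> str:
    # Needs only the two relations; ignores every other key in the dict.
    return (
        f"insert into {arg_dict['target_relation']} "
        f"select * from {arg_dict['temp_relation']}"
    )


def delete_insert_macro(arg_dict: dict) -> str:
    # Needs one extra value (unique_key), read from the same dictionary.
    target = arg_dict["target_relation"]
    temp = arg_dict["temp_relation"]
    key = arg_dict["unique_key"]
    return (
        f"delete from {target} where {key} in (select {key} from {temp}); "
        f"insert into {target} select * from {temp}"
    )


def run_strategy(strategy_macro, arg_dict: dict) -> str:
    # The materialization calls every strategy the same way: one dict in,
    # one SQL string out.
    return strategy_macro(arg_dict)
```

Because both functions accept the same argument, adding a new key for one strategy does not force the others to change their signature.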

dataders · 2022-05-17T21:59:17Z

one quick thought as a requirement. will come back to this later

there should be a "functional" test for at least each incremental strategy in core/tests (arguably each possible config of each strategy), from which each adapter can import the tests it needs (as well as make their own as needed).

jtcohen6 · 2022-05-18T08:48:00Z

TODO what is the default implementation?

The lowest-level default implementation should be append (as defined in dbt-core). That should also be the default when no unique_key is specified. For backwards compatibility with current behavior, each adapter should be able to specify its default strategy when a unique_key is supplied, namely delete+insert on Postgres/Redshift and merge on Snowflake/BigQuery/Databricks.
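
The defaulting rule described here can be sketched in Python as below. The mapping mirrors the adapters named in the comment; the function name `choose_strategy` and the idea that an explicitly configured strategy always wins are assumptions made for the sketch.

```python
# Defaults when a unique_key is supplied, per the comment above.
DEFAULT_WITH_UNIQUE_KEY = {
    "postgres": "delete+insert",
    "redshift": "delete+insert",
    "snowflake": "merge",
    "bigquery": "merge",
    "databricks": "merge",
}


def choose_strategy(adapter: str, configured=None, unique_key=None) -> str:
    """Pick the strategy to run for a model (hypothetical helper)."""
    if configured is not None:
        # Assumption: an explicit incremental_strategy config takes priority.
        return configured
    if unique_key is None:
        # No unique_key: fall back to the lowest-level default.
        return "append"
    # unique_key supplied: use the adapter's own default, else append.
    return DEFAULT_WITH_UNIQUE_KEY.get(adapter, "append")
```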

nathaniel-may · 2022-06-01T16:29:02Z

top-level ticket description reflects the previous conversation.

gshank · 2022-06-02T18:17:46Z

A couple of nits: I think the name "get_incremental_delete+insert_sql(dict)" would need to use an underscore instead of a plus. I believe that the adapter tests should go in the adapter zone (tests/adapter) and not in core/tests. I think that's what was intended from the wording.

gshank · 2022-06-02T19:46:09Z

I'm not finding get_insert_into_sql or get_incremental_append_sql.

nathaniel-may · 2022-06-02T20:03:49Z

Yeah the + isn't going to work. I'll change that to an _.
Yeah adapter zone sounds like the right place for these tests I'll change that as well.
get_incremental_append_sql is one of the macros that needs to be created. It just gets wrapped by another new "default" function in core.
get_insert_into_sql def isn't right though. I bet it was supposed to be get_merge_sql where predicates=None from merge.sql. Does that sound right to you?
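
A small Python sketch of the `+` to `_` change discussed above: the strategy is still configured as "delete+insert", but the macro name derived from it uses an underscore. The helper name `strategy_macro_name` is hypothetical.

```python
def strategy_macro_name(strategy: str) -> str:
    """Derive a valid macro name from a configured strategy name."""
    # '+' is not valid in a macro identifier, so swap it for '_'.
    safe = strategy.replace("+", "_")
    return f"get_incremental_{safe}_sql"
```

With this, "delete+insert" maps to get_incremental_delete_insert_sql while names without a plus sign are unaffected.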

jtcohen6 · 2022-06-03T12:25:34Z

get_insert_into_sql is the cleanest way to represent true "append" behavior, when a unique_key isn't specified. This is an attempt to clean up a few different things that exist today:

get_insert_into_sql, defined in dbt-spark only. This is explicit "append" behavior on Spark today.
get_delete_insert_merge_sql when no unique_key is specified. This is implicit "append" behavior on Postgres + Redshift + Snowflake (!!) today
get_merge_sql, when no unique_key (or other custom predicates) are specified, as it ends up using ON FALSE, a.k.a. constant false predicate. Turns out, that achieves the same result (= insert only, no update/delete). This is implicit "append" behavior on BigQuery + Databricks (Delta) today.

Why (!!) for Snowflake? Even though Snowflake uses the merge strategy by default, its MERGE statement does NOT support a constant false predicate, so we've hard-coded insert into here!

Since all of these databases support good-old-fashioned insert into, I'd like to make that the explicit behavior of all of them when the append strategy is specified, i.e. the standard behavior of get_incremental_append_sql. Doesn't mean that append should be the default, and we definitely shouldn't break the current behavior (implicit append) when folks specify delete+insert or merge without a unique_key specified, but it's the direction we should move in.
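
To make the "explicit append" idea concrete, here is a Python sketch of what a get_incremental_append_sql macro could render: a plain insert into with an explicit column list and no update or delete. The dictionary keys (`target_relation`, `temp_relation`, `dest_columns`) are assumptions for illustration, and real adapters may need different quoting or syntax.

```python
def get_incremental_append_sql(arg_dict: dict) -> str:
    """Render a plain INSERT INTO ... SELECT for the append strategy."""
    cols = ", ".join(arg_dict["dest_columns"])
    return (
        f"insert into {arg_dict['target_relation']} ({cols})\n"
        f"select {cols}\n"
        f"from {arg_dict['temp_relation']}"
    )


sql = get_incremental_append_sql(
    {
        "target_relation": "analytics.events",
        "temp_relation": "analytics.events__dbt_tmp",
        "dest_columns": ["id", "updated_at"],
    }
)
print(sql)
```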

Happy to spend more time talking about this, since it's pretty tricky stuff!

nathaniel-may · 2022-06-03T19:41:36Z

updated description to add the new macro get_insert_into_sql.

gshank · 2022-06-08T01:08:17Z

Which incremental strategies are we supporting in postgres/redshift? If we're implementing this in dbt-core, does that mean that we support them all in postgres?

Currently:
Snowflake: merge, delete+insert
BigQuery: merge, insert_overwrite
Spark: append, insert_overwrite, merge
Redshift: append
Postgres: append

I'm unclear on what the point is of returning a list of the supported strategies, when an additional strategy can be implemented simply by creating a macro with the right name. The description above states behavior that would throw an exception if there's a 'get_incremental_silly_sql' and 'silly' is not returned by 'validate_incremental_strategy'. And since that's described as a python function, it's not like people could replace it when they add a new strategy.

jtcohen6 · 2022-06-08T06:15:47Z

@gshank Fair question. We want to do two things:

Allow each adapter maintainer to declare which strategies are supported out-of-the-box for that adapter, by returning the list of strings reflecting those strategy names
Establish a common pattern that makes it easy for adapter maintainers / end users to register a new strategy (e.g. silly), and a macro to use for that strategy (get_incremental_silly_sql)

Postgres/Redshift are only able to support append + delete+insert today — but we should still aim to define the reusable logic for the other strategies in dbt-core's default global_project, to the extent possible.

Today:

Snowflake: append (implicit), merge (default), delete+insert
BigQuery: append (implicit), merge (default), insert_overwrite
Spark: append (explicit, default), insert_overwrite, merge
Redshift: append (implicit), delete+insert (default)
Postgres: append (implicit), delete+insert (default)
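
The per-adapter list above could back a validation step along these lines. This is a Python sketch only: the name validate_incremental_strategy comes from the earlier discussion, but its signature and the `SUPPORTED` table are assumptions.

```python
# Strategies supported out-of-the-box, taken from the list above.
SUPPORTED = {
    "snowflake": ["append", "merge", "delete+insert"],
    "bigquery": ["append", "merge", "insert_overwrite"],
    "spark": ["append", "insert_overwrite", "merge"],
    "redshift": ["append", "delete+insert"],
    "postgres": ["append", "delete+insert"],
}


def validate_incremental_strategy(adapter: str, strategy: str) -> str:
    """Return the strategy if the adapter declares it, else raise."""
    supported = SUPPORTED.get(adapter, [])
    if strategy not in supported:
        raise ValueError(
            f"'{strategy}' is not a supported incremental strategy for "
            f"{adapter}; expected one of: {', '.join(supported)}"
        )
    return strategy
```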

gshank · 2022-06-08T14:22:05Z

Regarding this statement: "When an unsupported incremental strategy is designated, raise an error to users before runtime." if an end user has added a new incremental strategy, how would it pass the error checking? What would "register a new strategy" look like?

Regarding this statement: "If any of the default behavior needs to be overridden, it can be done by writing a macro that conforms to the naming convention "get_incremental_NAME_sql" where "NAME" is the string used to configure the incremental strategy (e.g. "get_incremental_append_sql")." I'm wondering how this interacts with adapter.dispatch. There are three macros that do adapter.dispatch now: get_merge_sql, get_delete_insert_merge_sql, get_insert_overwrite_merge_sql. The only non-default implementation of these macros is spark__get_merge_sql, but I presume that other external adapters might have implemented non-default versions. So we're leaving these adapter.dispatch calls in for compatibility, but the new method of overriding is to actually re-implement the get_incremental_NAME_sql and rely on macro project precedence to resolve to the right one (via the get_incremental_strategy_macro method)?

Are we going to be rewriting our in-house adapters to use the new arrangements?

gshank · 2022-06-08T15:43:51Z

The special "materialization" macros are mostly copied from the base incremental materialization. They don't inherit. We do need to support the existing ones. What are you thinking we will do to reduce copy-and-pasting? Split up the base materialization into sub-macros?

gshank · 2022-06-08T17:28:01Z

In order to create tests in the adapter zone for all of the incremental strategies, we will either need to be able to run the tests in Postgres or mark them to skip.

gshank · 2022-06-09T23:45:23Z

@jtcohen6 What is the purpose of the "predicates" parameter on the insert_overwrite_merge_sql? I can't see that it's ever set anywhere. So it can't be passed in to the macro.

Also the 'include_sql_header', which is only set in one place for bigquery and is set in a keyword parameter, which is not something we can preserve. What was the purpose of that?

jtcohen6 · 2022-06-10T14:03:16Z

@gshank Great questions!

"When an unsupported incremental strategy is designated, raise an error to users before runtime." if an end user has added a new incremental strategy, how would it pass the error checking? What would "register a new strategy" look like?

I'm not strongly opinionated about the exact implementation details here.

It could look like us raising a helpful error if the user has defined a model with incremental_strategy: silly, and there is no get_incremental_silly_sql macro defined. If the macro does exist, it's assumed to be supported.

We need to reconcile that with the need for each adapter plugin to be able to declare which incremental strategies it does and doesn't support out-of-the-box. That could be an argument in favor of making that a macro (user-space code), rather than an adapter method (Python). Then, end users who add a new silly strategy could override the macro and return a new set of strings:

{% macro validate_incremental_strategies() %}
  {# return() takes a single value, so the strategy names are passed as one list #}
  {{ return(['append', 'delete+insert', 'silly']) }}
{% endmacro %}

So we're leaving these adapter.dispatch calls in for compatibility, but the new method of overriding is to actually re-implement the get_incremental_NAME_sql and rely on macro project precedence to resolve to the right one (via the get_incremental_strategy_macro method)?

I think each of these get_incremental_NAME_sql macros should be dispatched, and have a default__ version defined in dbt-core. That allows adapters to implement just a macro named adapter__get_incremental_NAME_sql, in cases where their SQL differs. E.g. spark__get_incremental_merge_sql.
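
The dispatch pattern described here (an adapter-prefixed macro if one exists, otherwise the default__ version) can be sketched in Python as follows. This is a simplification for illustration: the `dispatch` helper and `MACROS` table are hypothetical, and dbt's actual dispatch also considers package search order and parent adapters.

```python
# Hypothetical macro table: one default implementation from dbt-core and
# one adapter-specific override.
MACROS = {
    "default__get_incremental_merge_sql": lambda arg_dict: "-- default merge",
    "default__get_incremental_append_sql": lambda arg_dict: "-- default append",
    "spark__get_incremental_merge_sql": lambda arg_dict: "-- spark merge",
}


def dispatch(macro_name: str, adapter: str, macros: dict):
    """Prefer '<adapter>__<name>', then fall back to 'default__<name>'."""
    for prefix in (adapter, "default"):
        candidate = f"{prefix}__{macro_name}"
        if candidate in macros:
            return macros[candidate]
    raise KeyError(f"no implementation found for {macro_name}")
```

An adapter whose SQL matches the default defines nothing; one that differs supplies only the prefixed macro for the strategy that needs it.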

Are we going to be rewriting our in-house adapters to use the new arrangements?

Yes! The existing adapter code should all still run successfully regardless of the changes we make in dbt-core — this will be a good sanity check of backwards compatibility. But the goal here is to delete / consolidate lots of copy-pasted code, so we can only actually capture that value with follow-up refactoring work in the adapters.

The special "materialization" macros are mostly copied from the base incremental materialization. They don't inherit. We do need to support the existing ones. What are you thinking we will do to reduce copy-and-pasting? Split up the base materialization into sub-macros?

They're 90% copy-pasted code. Materializations end up being quite redundant (a) across adapters, and (b) even between materialization types. We should always be striving to identify where they frequently differ, and move those pieces of logic into separate standalone macros. Then, each adapter can reimplement just the piece that actually needs to be different.

In order to create tests in the adapter zone for all of the incremental strategies, we will either need to be able to run the tests in Postgres or mark them to skip.

I think marking them to skip is the right approach. We could also just define the Base test classes without inheriting them into a Test-named class.

What is the purpose of the "predicates" parameter on the insert_overwrite_merge_sql? I can't see that it's ever set anywhere. So it can't be passed in to the macro.

There's some context for this, and an initial implementation, in #4546. That gives the big idea pretty well. (Even more context in #3293, and recently in dbt-labs/dbt-spark#369 as well.)

The big idea: By default, incremental models find rows to update if they match the same unique_key value as rows in the new data. That unique_key match is the only "predicate" we really support today. There are legitimate cases (usually for performance reasons) where users want the ability to configure additional custom "predicates", usually to filter scans of the existing (very large) table, e.g. only look to update rows from the past 7 days.

Practically, this looks like:

We support predicates as a config that's pulled in by the incremental materialization, and passed into all of these get_incremental_X_sql macros (as one attribute of the dict argument)
We template predicates (if defined) into the SQL/DML returned by each get_incremental_X_sql macro
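
A Python sketch of how predicates could be templated into a merge condition, under stated assumptions: the `src`/`dest` aliases, the `build_merge_condition` helper, and the dictionary keys are hypothetical, and the constant-false fallback mirrors the ON FALSE behavior mentioned earlier in the thread.

```python
def build_merge_condition(arg_dict: dict) -> str:
    """Combine the unique_key match with any user-supplied predicates."""
    conditions = []
    unique_key = arg_dict.get("unique_key")
    if unique_key:
        # The one "predicate" supported today: match on the unique key.
        conditions.append(f"src.{unique_key} = dest.{unique_key}")
    # Extra predicates, e.g. to limit the scan of a large target table.
    conditions.extend(arg_dict.get("predicates") or [])
    if not conditions:
        # Nothing to match on: a constant false predicate gives insert-only.
        return "FALSE"
    return " and ".join(f"({c})" for c in conditions)
```

For example, passing unique_key "id" and one predicate limiting dest.updated_at to the past 7 days would produce a condition that joins both with "and".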

[Preview](https://docs-getdbt-com-git-dbeatty-custom-incremental-d92d96-dbt-labs.vercel.app/docs/build/incremental-models#custom-strategies)

## What are you changing in this pull request and why?

This addresses the "**For end users**" portion of #1761.

The feature request in dbt-labs/dbt-core#5245 describes the value proposition as well as the previous and new behavior:

#### Functional Requirement
- Advanced users that wish to specify a custom incremental strategy must be able to do so.

#### Previous behavior
- Advanced dbt users who wished to specify a custom incremental strategy must override the same boilerplate Jinja macro by copy pasting it into their dbt project.

#### New behavior
- Advanced dbt users who wish to specify a custom incremental strategy will only need to create a macro that conforms to the naming convention `get_incremental_NAME_sql` that produces the correct SQL for the target warehouse.

## Also

To address the questions raised in dbt-labs/dbt-core#8769, we also want to document how to utilize custom incremental macros that come from a package.

For example, to use the `merge_null_safe` custom incremental strategy from the `example` package, first [install the package](/build/packages#how-do-i-add-a-package-to-my-project), then add this macro to your project:

```sql
{% macro get_incremental_merge_null_safe_sql(arg_dict) %}
    {% do return(example.get_incremental_merge_null_safe_sql(arg_dict)) %}
{% endmacro %}
```

nathaniel-may added the enhancement New feature or request label May 13, 2022

github-actions bot changed the title ~~[Draft] Streamline Incremental Strategies~~ [CT-646] [Draft] Streamline Incremental Strategies May 13, 2022

jtcohen6 added the incremental Incremental modeling with dbt label May 16, 2022

jtcohen6 mentioned this issue May 20, 2022

Support for ingestion time partition table on BigQuery as incremental materialization dbt-labs/dbt-bigquery#136

Merged

4 tasks

jtcohen6 added this to the v1.2 milestone May 20, 2022

nathaniel-may changed the title ~~[CT-646] [Draft] Streamline Incremental Strategies~~ [CT-646] [Proposal] Streamline Incremental Strategies Jun 1, 2022

gshank self-assigned this Jun 10, 2022

gshank mentioned this issue Jun 10, 2022

Initial refactoring of incremental materialization #5359

Merged

6 tasks

jtcohen6 removed this from the v1.2 milestone Jun 30, 2022

gshank closed this as completed in #5359 Jul 21, 2022

jtcohen6 added this to the v1.3 milestone Aug 22, 2022

jlarue26 mentioned this issue Oct 24, 2022

dbt-core 1.3 upgrade: Incremental mats: more standard and more error-proof dremio/dbt-dremio#44

Closed

dbeatty10 mentioned this issue Feb 27, 2023

[CT-2190] [Feature] Incremental delete and merge update #7057

Closed

3 tasks

dbeatty10 mentioned this issue Dec 14, 2023

[CT-3501] [Epic] Streamline Incremental Strategies #9290

Open

This was referenced Jan 6, 2024

[CT-3462] [Feature] Additional configurability of incremental merge strategy #9223

Open

User-defined custom incremental strategies dbt-labs/docs.getdbt.com#4716

Merged
