[CT-646] [Proposal] Streamline Incremental Strategies #5245
Comments
Update: everything in this comment has been resolved, and incorporated into the main issue
@nathaniel-may This is really clearly stated, and I think you captured the many nuances well!
We were discussing
I think this may get tricky with partial parsing: We'd need to recalculate any time a new It's also possible that users completely override the incremental materialization—a thing we want them to need to do less of, but still—in which case, the
Other things I recall from our conversation:
one quick thought as a requirement. will come back to this later
The lowest-level default implementation should be
top-level ticket description reflects the previous conversation.
A couple of nits: I think the name "get_incremental_delete+insert_sql(dict)" would need to use an underscore instead of a plus. I believe that the adapter tests should go in the adapter zone (tests/adapter) and not in core/tests. I think that's what was intended from the wording.
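The naming nit above can be sketched in Python. This is only an illustration, not dbt-core code: the helper name `incremental_macro_name` is hypothetical, and replacing "+" with "_" is an assumption based on the suggestion that the macro name needs an underscore instead of a plus.

```python
def incremental_macro_name(strategy: str) -> str:
    """Build a macro name following the get_incremental_NAME_sql convention.

    Hypothetical helper for illustration only; it is not part of dbt-core.
    """
    # Assumption: "+" (as in "delete+insert") is mapped to "_" because "+"
    # cannot appear in a macro identifier.
    normalized = strategy.replace("+", "_")
    if not normalized.isidentifier():
        raise ValueError(f"{strategy!r} cannot be turned into a macro name")
    return f"get_incremental_{normalized}_sql"


print(incremental_macro_name("delete+insert"))
```

With this mapping, the configured strategy can keep its user-facing spelling ("delete+insert") while the macro that implements it has a valid identifier.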
I'm not finding get_insert_into_sql or get_incremental_append_sql.
Why (!!) for Snowflake? Even though Snowflake uses the Since all of these databases support good-old-fashioned Happy to spend more time talking about this, since it's pretty tricky stuff!
updated description to add the new macro
Which incremental strategies are we supporting in postgres/redshift? If we're implementing this in dbt-core, does that mean that we support them all in postgres? Currently: I'm unclear on what the point is of returning a list of the supported strategies, when an additional strategy can be implemented simply by creating a macro with the right name. The description above states behavior that would throw an exception if there's a 'get_incremental_silly_sql' and 'silly' is not returned by 'validate_incremental_strategy'. And since that's described as a python function, it's not like people could replace it when they add a new strategy.
@gshank Fair question. We want to do two things:
Postgres/Redshift are only able to support Today:
Regarding this statement: "When an unsupported incremental strategy is designated, raise an error to users before runtime." If an end user has added a new incremental strategy, how would it pass the error checking? What would "register a new strategy" look like?
Regarding this statement: "If any of the default behavior needs to be overridden, it can be done by writing a macro that conforms to the naming convention "get_incremental_NAME_sql" where "NAME" is the string used to configure the incremental strategy (e.g. "get_incremental_append_sql")." I'm wondering how this interacts with adapter.dispatch. There are three macros that do adapter.dispatch now: get_merge_sql, get_delete_insert_merge_sql, get_insert_overwrite_merge_sql. The only non-default implementation of these macros is spark__get_merge_sql, but I presume that other external adapters might have implemented non-default versions.
So we're leaving these adapter.dispatch calls in for compatibility, but the new method of overriding is to actually re-implement the get_incremental_NAME_sql and rely on macro project precedence to resolve to the right one (via the get_incremental_strategy_macro method)? Are we going to be rewriting our in-house adapters to use the new arrangements?
The special "materialization" macros are mostly copied from the base incremental materialization. They don't inherit. We do need to support the existing ones. What are you thinking we will do to reduce copy-and-pasting? Split up the base materialization into sub-macros?
In order to create tests in the adapter zone for all of the incremental strategies, we will either need to be able to run the tests in Postgres or mark them to skip.
@jtcohen6 What is the purpose of the "predicates" parameter on the insert_overwrite_merge_sql? I can't see that it's ever set anywhere. So it can't be passed in to the macro. Also the 'include_sql_header', which is only set in one place for bigquery and is set in a keyword parameter, which is not something we can preserve. What was the purpose of that?
@gshank Great questions!
I'm not strongly opinionated about the exact implementation details here. It could look like us raising a helpful error if the user has defined a model with We need to reconcile that with the need for each adapter plugin to be able to declare which incremental strategies it does and doesn't support out-of-the-box. That could be an argument in favor of making that a macro (user-space code), rather than an adapter method (Python). Then, end users who add a new
{% macro validate_incremental_strategies() %}
{{ return(['append', 'delete+insert', 'silly']) }}
{% endmacro %}
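To make the trade-off concrete, here is a rough Python model of the macro-based variant, where the first definition found in precedence order wins. It is a sketch under stated assumptions: the `Namespace` type and `supported_strategies` function are hypothetical, the dictionaries stand in for macro namespaces, and real dbt macro resolution is more involved than this. The adapter's list of strategies is made up for illustration.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in: a namespace maps macro names to zero-argument callables.
Namespace = Dict[str, Callable[[], List[str]]]

MACRO_NAME = "validate_incremental_strategies"


def supported_strategies(namespaces: List[Namespace]) -> List[str]:
    """Return the strategy list from the first namespace that defines the macro.

    `namespaces` is ordered from highest precedence (the user's project)
    to lowest (the adapter plugin).
    """
    for namespace in namespaces:
        macro = namespace.get(MACRO_NAME)
        if macro is not None:
            return macro()
    raise LookupError(f"no namespace defines {MACRO_NAME}")


# Illustrative data only: the adapter declares two strategies, and the
# user's project redefines the macro to register a custom "silly" strategy.
adapter_macros: Namespace = {MACRO_NAME: lambda: ["append", "delete+insert"]}
project_macros: Namespace = {MACRO_NAME: lambda: ["append", "delete+insert", "silly"]}

print(supported_strategies([project_macros, adapter_macros]))
```

Because the project's definition shadows the adapter's, a user who adds a strategy can also extend the supported list, which a fixed Python adapter method would not allow.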
I think each of these
Yes! The existing adapter code should all still run successfully regardless of the changes we make in
They're 90% copy-pasted code. Materializations end up being quite redundant (a) across adapters, and (b) even between materialization types. We should always be striving to identify where they frequently differ, and move those pieces of logic into separate standalone macros. Then, each adapter can reimplement just the piece that actually needs to be different.
I think marking them to skip is the right approach. We could also just define the
There's some context for this, and an initial implementation, in #4546. That gives the big idea pretty well. (Even more context in #3293, and recently in dbt-labs/dbt-spark#369 as well.) The big idea: By default, incremental models find rows to update if they match the same Practically, this looks like:
[Preview](https://docs-getdbt-com-git-dbeatty-custom-incremental-d92d96-dbt-labs.vercel.app/docs/build/incremental-models#custom-strategies)

## What are you changing in this pull request and why?

This addresses the "**For end users**" portion of #1761.

The feature request in dbt-labs/dbt-core#5245 describes the value proposition as well as the previous and new behavior:

#### Functional Requirement
- Advanced users that wish to specify a custom incremental strategy must be able to do so.

#### Previous behavior
- Advanced dbt users who wished to specify a custom incremental strategy must override the same boilerplate Jinja macro by copy pasting it into their dbt project.

#### New behavior
- Advanced dbt users who wish to specify a custom incremental strategy will only need to create a macro that conforms to the naming convention `get_incremental_NAME_sql` that produces the correct SQL for the target warehouse.

## Also

To address the questions raised in dbt-labs/dbt-core#8769, we also want to document how to utilize custom incremental macros that come from a package.

For example, to use the `merge_null_safe` custom incremental strategy from the `example` package, first [install the package](/build/packages#how-do-i-add-a-package-to-my-project), then add this macro to your project:

```sql
{% macro get_incremental_merge_null_safe_sql(arg_dict) %}
    {% do return(example.get_incremental_merge_null_safe_sql(arg_dict)) %}
{% endmacro %}
```

## 🎩

<img width="503" alt="image" src="https://github.com/dbt-labs/docs.getdbt.com/assets/44704949/51c3266e-e3fb-49bd-9428-7c43920a5412">

## Checklist
- [x] Review the [Content style guide](https://github.com/dbt-labs/docs.getdbt.com/blob/current/contributing/content-style-guide.md) so my content adheres to these guidelines.
- [x] For [docs versioning](https://github.com/dbt-labs/docs.getdbt.com/blob/current/contributing/single-sourcing-content.md#about-versioning), review how to [version a whole page](https://github.com/dbt-labs/docs.getdbt.com/blob/current/contributing/single-sourcing-content.md#adding-a-new-version) and [version a block of content](https://github.com/dbt-labs/docs.getdbt.com/blob/current/contributing/single-sourcing-content.md#versioning-blocks-of-content).
When defining an incremental model, several incremental strategies can be used to execute it (namely "append", "delete+insert", "merge", and "insert_overwrite"), but not every warehouse supports all of them. Working with these strategies is difficult both for adapter maintainers and for advanced users. This ticket aims to improve the experience for both groups.
Functional Requirements
Behavior Today
- A macro `adapter__get_merge_sql` that takes a fixed set of parameters and returns the materialization SQL. This macro is often copy-pasted from core, and is error-prone to edit.
- `dbt run`.
Desired Behavior
Implementation
- New macros in `core/dbt/include/global_project/macros/materializations/models/incremental` that follow the naming convention for the existing incremental strategies and pass through to the existing macros. The new macros will take a dictionary and pass the key-value pairs explicitly to the wrapped macro call to return the SQL string that executes the materialization. Concretely:
  - `get_insert_into_sql() -> str` that returns `"insert into"`
  - `get_incremental_append_sql(dict)` that wraps `get_insert_into_sql`
  - `get_incremental_delete_insert_sql(dict)` that wraps `get_delete_insert_merge_sql`
  - `get_incremental_merge_sql(dict)` that wraps `get_merge_sql`
  - `get_incremental_insert_overwrite_sql(dict)` that wraps `get_insert_overwrite_sql`
  - `get_incremental_default_sql(dict)` that wraps `get_incremental_append_sql`. This is where adapters can override the default incremental strategy when append isn't appropriate.
- `config.get('...')`.
- Add `incremental_strategy` to `NodeConfig` with a default value of `"default"`.
- A function in `core/dbt/adapters/sql/impl.py::AdapterConfig` called `validate_incremental_strategy` that takes no parameters and returns a list of strings. Adapter maintainers are expected to implement this so that the returned strings name the supported incremental strategies. This could have been a macro, but it's much easier for core to call this as a Python function to validate incremental strategies in projects. (TODO: what if an adapter supports a different set of incremental strategies for SQL and Python models? Where would they indicate that?)
- A function `get_incremental_strategy_macro` that takes in a string and returns the macro from the context (e.g. "insert_overwrite" -> callable `get_incremental_insert_overwrite_sql`). This function will call `validate_incremental_strategy` and throw an exception if the requested strategy is not returned from `validate_incremental_strategy` or if it does not exist in the context. This function will need to be called in the materialization macro for models.
- Tests in `tests/adapter`, including tests for several configs for each strategy. This will allow each adapter to import the tests it needs as well as make their own as needed. (Thanks for pointing this requirement out, @dataders)
Docs
This will require a significant change to the documentation on incremental materializations.
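The lookup described under Implementation could be sketched roughly as follows. This is a minimal Python model, not the dbt-core implementation: here `get_incremental_strategy_macro` takes the supported list and the context as explicit arguments (an assumption made so the example is self-contained), the context is a plain dictionary, and the `target`/`source` keys and the SQL string are placeholders.

```python
from typing import Any, Callable, Dict, List

MacroFn = Callable[[Dict[str, Any]], str]


def get_insert_overwrite_sql(target: str, source: str) -> str:
    # Stand-in for the existing macro; parameters and SQL are placeholders.
    return f"insert overwrite {target} select * from {source}"


def get_incremental_insert_overwrite_sql(arg_dict: Dict[str, Any]) -> str:
    # The wrapper takes a dictionary and passes each value on explicitly.
    return get_insert_overwrite_sql(
        target=arg_dict["target"], source=arg_dict["source"]
    )


def get_incremental_strategy_macro(
    strategy: str, supported: List[str], context: Dict[str, MacroFn]
) -> MacroFn:
    """Resolve a strategy name to its macro, or raise if it is unusable."""
    # `supported` models the list returned by validate_incremental_strategy.
    if strategy not in supported:
        raise ValueError(f"incremental strategy {strategy!r} is not supported")
    # Assumption: "+" in a strategy name maps to "_" in the macro name.
    macro_name = f"get_incremental_{strategy.replace('+', '_')}_sql"
    if macro_name not in context:
        raise LookupError(f"macro {macro_name} was not found in the context")
    return context[macro_name]


context = {"get_incremental_insert_overwrite_sql": get_incremental_insert_overwrite_sql}
macro = get_incremental_strategy_macro("insert_overwrite", ["insert_overwrite"], context)
print(macro({"target": "analytics.orders", "source": "orders__tmp"}))
```

Both failure modes surface before any SQL is generated, which matches the stated goal of raising an error for an unsupported strategy before runtime.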