BQ incremental merge statements respect dest date partitioning #1034
---

My approach to this at the time: override the incremental materialization and the `get_merge_sql` macro. Everything in the materialization was exactly the same, except I added a `dest_where` config that gets passed through to the merge statement. The first and simpler version expected the config argument to be a raw SQL predicate supplied directly by the user:

```sql
-- in custom materialization
{%- set dest_where = config.get('dest_where') -%}
{%- if not dest_where -%}
{{ exceptions.raise_compiler_error("Must supply a dest_where clause") }}
{%- endif -%}
```

```sql
-- updated macro
{% macro get_merge_sql(target, source, unique_key, dest_columns, dest_where) -%}
-- standard stuff
{%- set dest_cols_csv = dest_columns | map(attribute="name") | join(', ') -%}
merge into {{ target }} as DBT_INTERNAL_DEST
using {{ source }} as DBT_INTERNAL_SOURCE
{% if unique_key %}
on DBT_INTERNAL_SOURCE.{{ unique_key }} = DBT_INTERNAL_DEST.{{ unique_key }}
{% else %}
on FALSE
{% endif %}
-- my only addition
and {{dest_where}}
-- rest of macro with match conditions
{% endmacro %}
```

The fancier version filled in a default value derived from the model's `partition_by` config when `dest_where` was not supplied:

```sql
{%- set dest_where = config.get('dest_where') -%}
{%- if not dest_where -%}
{%- set dest_where = formatting_macro(config.get('partition_by')) -%}
{%- endif -%}
{% if not dest_where %}
{{ exceptions.raise_compiler_error("Must supply a dest_where clause") }}
{% endif %}
```

Where `formatting_macro` returns something like `date(my_timestamp_column) between '2019-03-16' and current_date`.

I think the fancier version makes a fair assumption: in order to take advantage of cost limiting by adding a column filter to the merge statement, we need to already be partitioning by the same column. This approach still feels very much like a manual override. Over the past few months, I have tried to think about approaches that feel cleaner and more appropriate to include in dbt's default BQ behavior. So far, I don't have any great ideas.
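The comment leaves `formatting_macro` undefined; below is a minimal sketch of what such a helper could look like. Only the macro name and the `date(...) between ... and current_date` shape come from the comment itself — the lookback argument and everything else are assumptions:

```sql
{# Hypothetical helper: build a date-range predicate from the model's partition_by config #}
{% macro formatting_macro(partition_by, lookback_days=7) -%}
    {%- if not partition_by -%}
        {{ return(none) }}
    {%- endif -%}
    {%- set start_date = (modules.datetime.date.today()
                          - modules.datetime.timedelta(days=lookback_days)).strftime('%Y-%m-%d') -%}
    {{ return("date(" ~ partition_by ~ ") between '" ~ start_date ~ "' and current_date") }}
{%- endmacro %}
```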
---

Hi, here is how I overrode the merge macros:

```sql
{% macro get_merge_sql(target, source, unique_key, dest_columns) -%}
{{ adapter_macro('get_merge_sql', target, source, unique_key, dest_columns) }}
{%- endmacro %}
{% macro get_delete_insert_merge_sql(target, source, unique_key, dest_columns) -%}
{{ adapter_macro('get_delete_insert_merge_sql', target, source, unique_key, dest_columns) }}
{%- endmacro %}
{% macro common_get_merge_sql(target, source, unique_key, dest_columns) -%}
{%- set dest_cols_csv = dest_columns | map(attribute="name") | join(', ') -%}
########### MY ADDITION HERE ###################
{%- set today = modules.datetime.date.today() -%}
{%- set one_day = modules.datetime.timedelta(days=2) -%}
{%- set yesterday = (today - one_day) -%}
{%- set yesterday_yyyy_mm_dd = yesterday.strftime("%Y-%m-%d") -%}
########### MY ADDITION HERE ###################
merge into {{ target }} as DBT_INTERNAL_DEST
using {{ source }} as DBT_INTERNAL_SOURCE
########### MY ADDITION HERE ###################
{% if unique_key and unique_key == 'event_id' %}
on DBT_INTERNAL_SOURCE.{{ unique_key }} = DBT_INTERNAL_DEST.{{ unique_key }}
and DBT_INTERNAL_DEST.date_partition >= '{{ yesterday_yyyy_mm_dd }}'
{% elif unique_key %}
on DBT_INTERNAL_SOURCE.{{ unique_key }} = DBT_INTERNAL_DEST.{{ unique_key }}
########### MY ADDITION HERE ###################
{% else %}
on FALSE
{% endif %}
(...) -- rest of the standard macro
```

It works for me, but I had to hardcode the partition column name!!!
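If it helps, one small way to avoid baking the window into the macro (my own sketch, not from the thread) is to read it from a project variable; `merge_lookback_days` is a hypothetical var name:

```sql
{# Sketch: compute the cutoff from a project var instead of hardcoding the two-day delta #}
{%- set lookback_days = var('merge_lookback_days', 2) | int -%}
{%- set cutoff = (modules.datetime.date.today()
                  - modules.datetime.timedelta(days=lookback_days)).strftime("%Y-%m-%d") -%}

{# ...then, in the on clause of the merge: #}
and DBT_INTERNAL_DEST.date_partition >= '{{ cutoff }}'
```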
---

Hey @jarlainnix - the problem you've identified is sort of the core challenge we need to solve! There are two things to sort out here.
I can see that you're filtering for only the last day of data. If this approach works for you, then that's great! I don't, however, think that this is a good general solution to the problem -- some projects may need to look back 2 days, or 7 days, or 30 days! If you want to make the filtered column configurable, you can supply it as a model config. Try something like this:

```sql
-- models/my_incremental_model.sql
{{
config(
materialized='incremental',
... other configs here ...
filter_field="date_partition"
)
}}
select ...
```

Then in the merge macro you can pull that value back out with `config.get('filter_field')`. You can then use it to filter `DBT_INTERNAL_DEST` in the merge `on` clause instead of a hardcoded column name.

I'd love to get a fix out for this - let me spend some time revisiting this over the next week or two :)
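To make that concrete, here is a rough sketch of how the overridden `common_get_merge_sql` macro could consume the `filter_field` config; the one-day lookback and where the filter is applied are assumptions, not something dbt provided out of the box at the time:

```sql
{% macro common_get_merge_sql(target, source, unique_key, dest_columns) -%}
    {%- set dest_cols_csv = dest_columns | map(attribute="name") | join(', ') -%}
    {# read the model-level config instead of hardcoding a column name #}
    {%- set filter_field = config.get('filter_field') -%}
    {%- set cutoff = (modules.datetime.date.today()
                      - modules.datetime.timedelta(days=1)).strftime("%Y-%m-%d") -%}

    merge into {{ target }} as DBT_INTERNAL_DEST
    using {{ source }} as DBT_INTERNAL_SOURCE
    {% if unique_key %}
    on DBT_INTERNAL_SOURCE.{{ unique_key }} = DBT_INTERNAL_DEST.{{ unique_key }}
        {% if filter_field %}
        and DBT_INTERNAL_DEST.{{ filter_field }} >= '{{ cutoff }}'
        {% endif %}
    {% else %}
    on FALSE
    {% endif %}
    -- rest of the standard macro (matched / not matched clauses using dest_cols_csv)
{%- endmacro %}
```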
---

I understand that there can be various use cases to solve, and we can only solve one at a time.
For Q2 I have a strange suggestion to generalise the solution. Assume we add these lines to the merge macro:

```sql
-- macros/common_get_merge_sql.sql
(...)
{% if unique_key %}
on DBT_INTERNAL_SOURCE.{{ unique_key }} = DBT_INTERNAL_DEST.{{ unique_key }}
{# these lines #}
and DBT_INTERNAL_DEST.PARTITION_COLUMN BETWEEN
{{ DBT_INTERNAL_SOURCE_MIN_VALUE }}
and {{ DBT_INTERNAL_SOURCE_MAX_VALUE }}
{# these lines #}
{% else %}
on FALSE
{% endif %}
```

We could then set up a mid-process subquery (like we do with `call statement` blocks in our models) that would look up the limit values for the partitions. Take a model like this one:

```sql
-- models/stg_formula_evaluated_partitioned.sql
{{ config(
materialized='incremental',
unique_key='event_id',
partition_by='date_partition'
)
}}
with deduped_events as (
SELECT *
FROM (
SELECT
*,
ROW_NUMBER()
OVER (PARTITION BY attributes.event_id) as row_number
FROM `{{ source('user_events', 'formula_evaluated') }}_*`
{% if is_incremental() %}
where _TABLE_SUFFIX >= '{{ var('date_suffix', default=yesterday()) }}'
{% endif %}
)
WHERE row_number = 1
),
cte_final as (
select
attributes.event_id,
attributes.event_time,
attributes.publish_time,
attributes.user_id,
attributes.app_id,
attributes.view_id,
attributes.request_id,
attributes.producer,
attributes.type,
attributes.session_id,
formula_evaluation_id,
cell_id,
column,
row,
execution_trigger,
EXTRACT(DATE FROM attributes.event_time) AS date_partition
from deduped_events
)
SELECT
*
FROM
cte_final
```

(That is my specific use case.) The mid-process subquery would check:

```sql
SELECT
min(date_partition) as DBT_INTERNAL_SOURCE_MIN_VALUE
,max(date_partition) as DBT_INTERNAL_SOURCE_MAX_VALUE
from {{ stg_formula_evaluated_partitioned.sql model generated SQL }}
```

Here the resulting min and max values would be injected into the merge statement above as `DBT_INTERNAL_SOURCE_MIN_VALUE` and `DBT_INTERNAL_SOURCE_MAX_VALUE`. (Sorry for the messy explanation.)
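One possible way to wire that lookup into the macro is with dbt's `run_query` helper. This is only a sketch: it assumes the source relation already exists as a queryable table when the merge SQL is rendered (which is not how the stock materialization orders things), and it reuses the `date_partition` column from the example model:

```sql
{# Sketch: look up the partition bounds of the source, then prune the destination #}
{%- set bounds_query -%}
    select
        min(date_partition) as min_value,
        max(date_partition) as max_value
    from {{ source }}
{%- endset -%}
{%- set bounds = run_query(bounds_query) -%}
{%- set min_value = bounds.columns['min_value'].values()[0] -%}
{%- set max_value = bounds.columns['max_value'].values()[0] -%}

merge into {{ target }} as DBT_INTERNAL_DEST
using {{ source }} as DBT_INTERNAL_SOURCE
on DBT_INTERNAL_SOURCE.{{ unique_key }} = DBT_INTERNAL_DEST.{{ unique_key }}
    and DBT_INTERNAL_DEST.date_partition between '{{ min_value }}' and '{{ max_value }}'
-- rest of the standard merge (matched / not matched clauses)
```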
---

Thanks @jarlainnix - this is really great!
---

Hi @jarlainnix, I agree with your first point, the `{{ filter_field }}` config approach.

For the 2nd point about getting min and max values, could I ask whether you actually implemented that part, or is it a thought for discussion? I could see some challenges and inefficiency in implementing it with the current dbt design. To get the min and max of DBT_INTERNAL_SOURCE in the mid-process step, we would first have to run the model query (stg_formula_evaluated_partitioned) as a CTE of that step, and only then compile and run the merge statement as the final step. The whole process runs the DBT_INTERNAL_SOURCE query (stg_formula_evaluated_partitioned) twice, which is very inefficient.

I think we could use BigQuery scripting and temporary tables to solve this. The approach is for dbt to compile a BigQuery script that performs the whole incremental update as one BigQuery transaction: materialize the source query as a temporary table, compute the partition bounds from it, and then run the merge filtered to that range.
I don't know how easy it would be to modify the get_merge_sql() macro to run a BigQuery script instead of a single query, but if we could do that, this might be a good approach to solving the problem.
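For illustration, a rough sketch of the kind of BigQuery script this could compile to; the table and column names are taken from the example model above, and the exact statements are an assumption rather than anything dbt generated at the time:

```sql
-- Sketch: one scripted BigQuery job for an incremental run
DECLARE min_partition DATE;
DECLARE max_partition DATE;

-- 1. Run the compiled model SELECT once, into a temp table
CREATE TEMP TABLE dbt_internal_source AS
SELECT * FROM `my_project.my_dataset.stg_formula_evaluated_source`;  -- stands in for the compiled model SQL

-- 2. Compute the partition bounds from the temp table
SET (min_partition, max_partition) = (
  SELECT AS STRUCT MIN(date_partition), MAX(date_partition)
  FROM dbt_internal_source
);

-- 3. Merge, pruning the destination scan to only the affected partitions
MERGE `my_project.my_dataset.stg_formula_evaluated_partitioned` AS DBT_INTERNAL_DEST
USING dbt_internal_source AS DBT_INTERNAL_SOURCE
ON DBT_INTERNAL_SOURCE.event_id = DBT_INTERNAL_DEST.event_id
   AND DBT_INTERNAL_DEST.date_partition BETWEEN min_partition AND max_partition
WHEN MATCHED THEN
  UPDATE SET event_time = DBT_INTERNAL_SOURCE.event_time  -- remaining columns omitted for brevity
WHEN NOT MATCHED THEN
  INSERT ROW;
```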
---

@hui-zheng I agree with the approach you've outlined! Check out the work in #1971, which we're planning to ship in an upcoming release.
---

@hui-zheng (I was away and just got back today)
---

Hi guys, I have the same specific use case: partition pruning for the dbt merge against BQ.
---

Hi @andreic-ub, we ended up implementing this as a partition-aware incremental strategy called `insert_overwrite`.
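For anyone landing here later, a model config using that strategy looks roughly like the following; the exact `partition_by` syntax has changed across dbt versions, so treat this as a sketch and check the docs for your version:

```sql
-- models/stg_formula_evaluated_partitioned.sql
{{
    config(
        materialized='incremental',
        incremental_strategy='insert_overwrite',
        partition_by={'field': 'date_partition', 'data_type': 'date'}
    )
}}

select ...  -- model SQL as before
```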
---

Issue (feature?)
Description
When dbt generates merge statements for BigQuery incremental models, we can take advantage of date-partitioning for `DBT_INTERNAL_SOURCE` with advanced incremental model usage: the templated code that queries the "source" data properly limits the data BQ scans. But the merge statement does not limit `DBT_INTERNAL_DEST`, i.e. the existing table to be incrementally updated. By adding one (optional) line of code to the merge statement's `on` clause, we can limit the total data scanned (and thereby cost). Here '2018-09-28' is a user-supplied/config value fed into the `sql_where` or the "advanced incremental" CTE, and here also the merge condition.
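A sketch of that one added line in the merge `on` clause, where `partition_date` stands in for the model's partition column and '2018-09-28' is the user-supplied value referenced above:

```sql
merge into {{ target }} as DBT_INTERNAL_DEST
using {{ source }} as DBT_INTERNAL_SOURCE
on DBT_INTERNAL_SOURCE.{{ unique_key }} = DBT_INTERNAL_DEST.{{ unique_key }}
    -- the one (optional) additional predicate, pruning the destination scan
    and DBT_INTERNAL_DEST.partition_date >= '2018-09-28'
```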
Results

In my local example, the BQ incremental merge (status quo) scans 4.32 GB, and adding the additional line decreases the query scan to 1.87 GB, since the full destination table (`{{ this }}`) is itself 3.18 GB.

System information

The output of `dbt --version`: