Update BQ deduplicate macro to support partition pruning downstream #929
Conversation
@@ -93,12 +93,19 @@
 -- clause in BigQuery:
 -- https://github.com/dbt-labs/dbt-utils/issues/335#issuecomment-788157572
 #}
-{%- macro bigquery__deduplicate(relation, partition_by, order_by) -%}
+{%- macro bigquery__deduplicate(relation, partition_by, order_by, partition_by_pass_thru=false) -%}
Thanks for raising this PR @austinclooney!
I'd prefer not to introduce a new parameter if we can avoid it.
BigQuery supports the qualify clause. Is there any reason we shouldn't / couldn't just use qualify in the same way as Snowflake and Databricks?
The original implementation for BigQuery was chosen for performance reasons, but I'm not sure whether those performance reasons still exist or whether working for views outweighs those considerations.
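(For context, a qualify-based BigQuery implementation would presumably mirror the Snowflake and Databricks macros. The sketch below is only for comparison and is not code from this PR.)

```sql
{%- macro bigquery__deduplicate(relation, partition_by, order_by) -%}

    select *
    from {{ relation }}
    -- BigQuery can reject qualify when there is no where / group by /
    -- having clause, so a no-op filter is included as a precaution
    where true
    qualify
        row_number() over (
            partition by {{ partition_by }}
            order by {{ order_by }}
        ) = 1

{%- endmacro -%}
```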
To my knowledge the memory issue still exists, yes. Deduplicating with array_agg uses fewer resources than with qualify, which allows it to be done on larger datasets before hitting memory limits. The reason I added a parameter here is that explicitly selecting the partition_by columns by default would change the column order in the output, so I think it would technically be a breaking change?
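(To make the column-order concern concrete, here is a rough before/after sketch of the generated SQL for a hypothetical source partitioned on event_date and ordered by loaded_at; it illustrates the idea rather than reproducing the exact SQL from this PR.)

```sql
-- current shape: every column is returned from inside the aggregated
-- struct, so the original column order is preserved
select unique.*
from (
    select
        array_agg(
            original
            order by loaded_at desc
            limit 1
        )[offset(0)] unique
    from {{ relation }} original
    group by event_date
)

-- pass-through shape (one possible sketch): the partition key is also
-- projected directly from the grouped subquery, which lets BigQuery
-- prune partitions downstream but moves event_date to the end of the
-- column list
select
    unique.* except (event_date),
    event_date
from (
    select
        event_date,
        array_agg(
            original
            order by loaded_at desc
            limit 1
        )[offset(0)] unique
    from {{ relation }} original
    group by event_date
)
```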
There are several different scenarios to consider for the relation being deduplicated:
- table
- view
- materialized view
- CTE

What would be the consequences if we did not add a new parameter named partition_by_pass_thru and just treated it as always being true instead?
I believe the only difference between the current method and the changes added in this PR is that this PR would move the partition_by columns to the end of the output (or the beginning, if preferred), whereas the current output retains the original column order.
Since we're only explicitly selecting the partitioning keys, it won't affect the actual deduplication, as those are the groups we're deduplicating within.
[Edit: I just re-read the prior message and updated the question below]
So the net consequence would be the following?
- partition pruning, allowing faster queries at lower cost
- the output relation has the same columns, but possibly in a different order than before

And would these consequences apply equally to tables, views, materialized views, and CTEs?
Yes
Can you think of any way to do the following?
- output relation with the same columns in the same order as before
If so, that would allow us to ship these performance improvements without any breaking changes or new parameters.
I've thought about this for a while but I don't believe it's possible without running another query since the input to the macro can be a CTE.
Something like this would work:
{%- macro bigquery__deduplicate(relation, partition_by, order_by) -%}
    {#- Run an extra query to look up the relation's column names so the
        outer select can re-project them in their original order -#}
    {%- set columns_query -%}
        select * from {{ relation }} limit 1
    {%- endset -%}
    {%- set columns_result = run_query(columns_query) -%}
    {%- set columns = columns_result.column_names -%}

    select
        {#- pull each column back out of the aggregated struct -#}
        {%- for column in columns %}
        unique.{{ column }}{%- if not loop.last %},{% endif %}
        {%- endfor %}
    from (
        select
            {{ partition_by }},
            array_agg(
                original
                order by {{ order_by }}
                limit 1
            )[offset(0)] unique
        from {{ relation }} original
        group by {{ partition_by }}
    )
{%- endmacro -%}
But I don't think we'd want to run an external query to get the macro to work.
resolves #928
Problem
The current BQ deduplicate macro will not work for a view as it doesn't allow partition pruning downstream of the macro.
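(A hypothetical illustration, assuming a source table my_dataset.events that is partitioned on event_date and a dedup view built from the current macro's output:)

```sql
-- view over the current macro's output (conceptually)
create or replace view my_dataset.events_deduped as
select unique.*
from (
    select
        array_agg(
            original
            order by loaded_at desc
            limit 1
        )[offset(0)] unique
    from my_dataset.events original  -- partitioned by event_date
    group by event_date
);

-- downstream query: the filter applies to a field extracted from the
-- array_agg struct, so BigQuery cannot push it down to the partitioned
-- source table and ends up scanning all partitions of my_dataset.events
select *
from my_dataset.events_deduped
where event_date = date '2024-06-01';
```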
Solution
Add an optional parameter that explicitly selects the partition_by columns outside of the array_agg.
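(If the dispatching dbt_utils.deduplicate macro were also updated to forward the new flag, opting in from a model might look roughly like the following; stg_events, event_date, and loaded_at are made-up names for illustration.)

```sql
{{ dbt_utils.deduplicate(
    relation=ref('stg_events'),
    partition_by='event_date',
    order_by='loaded_at desc',
    partition_by_pass_thru=true
) }}
```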
Checklist