
[Feature] Optimize incremental 'insert_overwrite' strategy #1409

Open
AxelThevenot opened this issue Nov 21, 2024 · 13 comments · May be fixed by #1410 or #1415
Labels: enhancement, incremental

Comments

@AxelThevenot

Is this your first time submitting a feature request?

  • I have read the expectations for open source contributors
  • I have searched the existing issues, and I could not find an existing issue for this feature
  • I am requesting a straightforward extension of existing dbt-bigquery functionality, rather than a Big Idea better suited to a discussion

Describe the feature

The MERGE statement is sub-optimal in BigQuery.
This is of course fine for unique keys, since row-level matching is exactly what we need there.

But for the insert_overwrite strategy, where we replace rows at the partition level, there is a better option, and here is why:

  • A DELETE or INSERT statement is cheaper than a MERGE statement.
  • Incremental tables are the most expensive tables in real-world projects.
  • A DELETE statement in BigQuery is free at the partition level.

This has been tested at Carrefour, the company I work for.

  • Replacing the MERGE statement alone reduces the cost by 50.4% and the elapsed time by 35.2%.
  • Over the whole procedure it reduces the cost by 26.1% and the elapsed time by 23.1%.

The delete+insert is wrapped in a transaction so that no rows are lost if an error occurs (see the sketch below).
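
For illustration, here is a minimal sketch of what the replacement could look like for a day-partitioned table (the project, dataset, table, and column names are hypothetical; the real macro works against dbt's temporary relation):

BEGIN TRANSACTION;

-- Remove the partitions that are about to be replaced.
-- A DELETE whose predicate targets whole partitions can be free in BigQuery.
DELETE FROM `my_project.my_dataset.my_table`
WHERE DATE(partition_column) IN (
  SELECT DISTINCT DATE(partition_column)
  FROM `my_project.my_dataset.my_table__dbt_tmp`
);

-- Re-insert the freshly computed rows for those partitions.
INSERT INTO `my_project.my_dataset.my_table`
SELECT * FROM `my_project.my_dataset.my_table__dbt_tmp`;

COMMIT TRANSACTION;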

Describe alternatives you've considered

I considered creating an additional delete+insert incremental strategy as an alternative.
But overriding the existing insert_overwrite is more convenient, as this is exactly what it is meant to do and the feature set stays the same.

Who will this benefit?

Everyone, of course, without any changes on their side.

Are you interested in contributing this feature?

Yes

Anything else?

No response

AxelThevenot added the enhancement and triage labels on Nov 21, 2024
AxelThevenot pushed a commit to AxelThevenot/dbt-bigquery that referenced this issue Nov 21, 2024
@philBatardiere

Hey @AxelThevenot, did you try copying partitions? https://docs.getdbt.com/reference/resource-configs/bigquery-configs#copying-partitions
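
(For reference, enabling that option looks roughly like this; the field name and granularity are illustrative:)

{{
    config(
        materialized="incremental",
        incremental_strategy="insert_overwrite",
        partition_by={
            "field": "date_time",
            "data_type": "datetime",
            "granularity": "day",
            "copy_partitions": true
        }
    )
}}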

@AxelThevenot
Author

Hi,

Yes, this is also something I've tried, but:

  • It is sequential rather than parallel, so it is really slow compared to this method.
  • I prefer to see everything that is executed in my BigQuery procedure, instead of having the temp table creation on one side and a BigQuery copy job on the other :)

I also thought about the static partition definition, but our use cases require a dynamic partition overwrite :)
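
(For context, the static-partition variant mentioned above looks roughly like this; the field name and the list of partitions are illustrative, and the list has to be known up front, which is what rules it out here:)

{% set partitions_to_replace = [
    "timestamp(current_date)",
    "timestamp(date_sub(current_date, interval 1 day))"
] %}

{{
    config(
        materialized="incremental",
        incremental_strategy="insert_overwrite",
        partition_by={"field": "session_start", "data_type": "timestamp"},
        partitions=partitions_to_replace
    )
}}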

@philBatardiere

Yes, agreed: if your incremental refreshes many partitions back, it can take time (I usually refresh around 3 partitions back, so the process isn't that long).

For debugging purposes, I have seen some benefit in getting the tmp table back by restoring it with the time travel feature. But it's clearly worse for auditing purposes; it would be nice to have labels related to the dbt run id in the copy config, btw 😀.

I'll keep your feature in mind in case of future need.

Thanks!

@AxelThevenot
Author

For debugging purposes, I have seen some benefit in getting the tmp table back by restoring it with the time travel feature. But it's clearly worse for auditing purposes; it would be nice to have labels related to the dbt run id in the copy config, btw 😀.

Oh yes, it would be nice to have those labels.

So, from what I understand, you won't validate my pull request for now?

In any case, feel free to reach out if you want some help or anything else:
https://www.linkedin.com/in/axel-thevenot/

@philBatardiere

Unfortunately I don't yet have the authority to do it :) But I hope someone will.

@AxelThevenot
Author

AxelThevenot commented Nov 25, 2024

Ok, let's pray that someone with that authority goes through this pull request/issue.

I can create an additional delete+insert strategy instead of overriding insert_overwrite, even if it makes less sense to me 👍

@amychen1776

@AxelThevenot, I don't think I quite understand your justification for not creating a new delete+insert materialization strategy. Changing a commonly used incremental strategy can cause unintended results, especially since it seems like your intent is to remove the MERGE statement from the insert_overwrite strategy (or am I misunderstanding?)

@borjavb
Contributor

borjavb commented Nov 27, 2024

This is great stuff! But I think we should create a delete+insert operation instead of changing the already widely used insert_overwrite, as it could have unexpected consequences for all users (+1 on @amychen1776).

For example, the DELETE operation is free only under certain conditions, as stated in the documentation, and unfortunately you might not always get the slot-free operation. I've seen cases where the cost of the DELETE + INSERT ends up being higher in aggregate than the whole MERGE. It's rare, but it can happen, and rolling out such a change to a fundamental macro can backfire. At the end of the day it's all about testing the strategies that best suit the use case.

For example, tables with granular partitioning do not benefit from free DELETEs 😢:
image

Using delete+insert will definitely be a performance boost in most cases (especially with daily partitions), but users have to be aware of how to get the best out of it. Better to be clear through documentation and new methods rather than changing the current one.

@AxelThevenot
Author

AxelThevenot commented Nov 27, 2024

@amychen1776 and @borjavb I get your points; I will go for a delete+insert strategy if that is more convenient for you and avoids edge cases for specific users.

I will make the change as soon as possible and tell you when it's done :)

But delete+insert works at the unique-key level in other adapters, and here I want it at the partition level.
Should I reuse that name even though it doesn't do the same thing, or give it a new one? (and if so, what name?)

@borjavb
Contributor

borjavb commented Nov 27, 2024

Been thinking about this and... how do we feel about an operator/function configuration within insert_overwrite?

The way I've hijacked the insert_overwrite strategy in the past to implement this was through the config block and it always worked really well. Check this draft PR with the gist behind this idea.

Technically we have an incremental strategy that can work with different functions but applies the same semantics of replacing a whole partition. So basically we can define an incremental_strategy plus the operator/function we want to use: either merge or delete+insert. That way we get the best of both worlds:

  1. If it's not defined, we default to merge, so we keep backwards compatibility
  2. If the user selects the new delete+insert underlying function, we swap the macro that works under the hood

We already have custom configurations for merge like merge_update_columns and merge_exclude_columns, so this doesn't break with how we do things with other incremental strategies.

So the config block would look something like this:

{{
    config(
        materialized="incremental",
        incremental_strategy="insert_overwrite",
        insert_overwrite_function='delete+insert',
        cluster_by="id",
        partition_by={
            "field": "date_time",
            "data_type": "datetime",
            "granularity": "day"
        },
        partitions=partitions_to_replace,
    )
}}

Although I'm torn on whether we should expose incremental_strategy as a dict... given that we haven't done that for the other merge configurations, it feels weird to add it as a dictionary now? Not sure. Something like:

incremental_strategy={
    "strategy": "insert_overwrite",
    "operation": "delete+insert"
}

But this would just be syntactic sugar for how we expose the config. Either way, I think this lets the user decide which operator is most performant for them, while keeping full backwards compatibility.

What do you think, @amychen1776 and @AxelThevenot?

@AxelThevenot
Author

@borjavb this is a great idea! I will go for it 😄

@jtcohen6
Contributor

@AxelThevenot @borjavb Very cool to see the level of interest here :)

But delete+insert works at the unique-key level in other adapters, and here I want it at the partition level.
Should I reuse that name even though it doesn't do the same thing, or give it a new one? (and if so, what name?)

This is a totally fair point!

FWIW... The unique_key specified for other adapters does not actually have to be unique, and many users have a configuration like this to achieve the outcome you're after here:

{{ config(
   incremental_strategy = "insert_overwrite",
   unique_key = "my_partition_by_column_name",  -- not actually unique!
   ...
) }}

I know that nomenclature is misleading (Grace called it out a few months ago), but it would be consistent with how users have gotten this working on other DWHs/adapters. (I say that, at the same time acknowledging that non-unique keys can lead to poor query performance; there's a suggested performance optimization to switch from delete ... using to delete ... where: dbt-labs/dbt-adapters#150 (comment), dbt-labs/dbt-adapters#151.)
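
(Roughly, the difference referred to there, with illustrative table and column names; this is the syntax used on adapters that support DELETE ... USING, not BigQuery:)

-- delete ... using: join-based delete against the staged rows
delete from my_table
using my_table__dbt_tmp as tmp
where my_table.my_partition_by_column_name = tmp.my_partition_by_column_name;

-- delete ... where: subquery-based delete, the suggested optimization
delete from my_table
where my_partition_by_column_name in (
    select distinct my_partition_by_column_name from my_table__dbt_tmp
);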


To offer one other perspective:

  • We're rolling out a brand-new microbatch incremental strategy in dbt Core v1.9 (https://docs.getdbt.com/docs/build/incremental-microbatch)
  • This strategy has parallelism built in, with the capability within dbt for per-batch concurrency + retry
  • Our stated goal: "To perform this operation, dbt will use the most efficient atomic mechanism for "full batch" replacement that is available on each data platform." Right now, that's using merge with a constant-false predicate, the same as insert_overwrite (here), but with the option to use the copy_partitions API instead (as Borja noted).
  • Given that microbatch is brand-new, should we make it use a delete; insert transaction right from the start? Or should we add this as another opt-in config for the insert_overwrite strategy (one that would also be implicitly available to the microbatch strategy)? It's probably too late in the game to be asking, because dbt Core v1.9 is planned to go live in early December, but I figured I'd ask anyway :)
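
(For anyone following along, a microbatch model in the dbt Core v1.9 betas is configured roughly like this; the field names and values are illustrative:)

{{
    config(
        materialized="incremental",
        incremental_strategy="microbatch",
        event_time="date_time",
        begin="2024-01-01",
        batch_size="day",
        lookback=1,
        partition_by={
            "field": "date_time",
            "data_type": "datetime",
            "granularity": "day"
        }
    )
}}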

@borjavb
Contributor

borjavb commented Nov 29, 2024

Hello @jtcohen6 ❤️ !
Oh, the microbatch logic is a great point and maybe I’ll diverge a bit, but I think it’s all related:

Given that microbatch is brand-new, should we make it use a delete; insert transaction right from the start? Or should we add this as another opt-in config for the insert_overwrite strategy (one that would also be implicitly available to the microbatch strategy)? It's probably too late in the game to be asking, because dbt Core v1.9 is planned to go live in early December, but I figured I'd ask anyway :)

Unfortunately, I don't think we can. With the future requirement of microbatch potentially running in parallel, this delete+insert would lock us into a sequential run: we are wrapping the operations in a transaction, and transactions in BigQuery that mutate the same table cannot run concurrently :_(. We could potentially remove the transaction block, but then we would lose the atomicity of the delete+insert operation, which sounds scary and could open a can of worms.

image

On the other hand, thinking about how microbatch would work for that future requirement: MERGE offers limited parallelism too. BigQuery only allows 2 concurrent MERGE operations against the same table; everything else goes into a queue of PENDING jobs (up to 20, after which they start failing).

So if we really want to offer pure parallelism, we probably need insert_overwrite to rely on the copy_partitions strategy.

And to add even more complexity to everything..., the delete;insert is not fully optimised (yet?) for hourly partitions (the delete is not free). So it feels like there's no silver bullet for all cases 🤦 😢


Given all of the above, and considering this issue, I would go with the latter option you suggested: add the delete+insert logic within insert_overwrite as an opt-in operation (even for microbatch), so users have full control over the options and trade-offs: more parallelism, or less parallelism but potentially cheaper queries.
Plus, this being opt-in, we don't break past behaviours (who knows what's being built out there assuming the merge doesn't lock the table...).

This feels like a proper v2 of your old post, @jtcohen6!
