Microbatch parallelism #10958

Merged
merged 44 commits into from
Nov 21, 2024

Contributor

@MichelleArk MichelleArk commented Oct 31, 2024

Resolves #10853
Resolves #10855

Problem

Part of the benefit of microbatch is updating a model in smaller (micro) batches. Before this PR, however, those batches always ran sequentially. Sequential execution kept the other benefits of splitting into batches (partial instead of complete failures, better data guarantees, easier backfills, etc.), but it gave no speed gains. We wanted the ability to run batches in parallel, and this PR adds it.

Solution

First, we pulled all the batch execution logic into a new MicrobatchModelRunner, which is called from the ModelRunner when a microbatch model is encountered. Setting aside the complexities of creating a new Runner class, the new class decides whether batches can be run in parallel as follows.

If either of the following is true, the batches are run sequentially:

  1. The adapter doesn't support concurrent batches
  2. The relation in the data warehouse for the model doesn't exist

If neither (1) nor (2) applies, we check whether the `concurrent_batches` config is set for the model. If its value is True, we run the batches in parallel; if it is False, we run them sequentially.

If, however, `concurrent_batches` is None (i.e. not set), we check whether the model's Jinja contains a reference to `this`. If it does, we run the batches sequentially; otherwise, we run them in parallel.
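
The decision order described above can be summarized in a small sketch. This is not the code from this PR: the function name and its four parameters are hypothetical stand-ins for whatever MicrobatchModelRunner actually inspects, and only the ordering of the checks is taken from the description.

```python
from typing import Optional


def should_run_in_parallel(
    adapter_supports_concurrent_batches: bool,  # hypothetical input
    relation_exists: bool,                      # hypothetical input
    concurrent_batches: Optional[bool],         # the model config; None when unset
    model_references_this: bool,                # hypothetical input
) -> bool:
    """Return True if the batches may run in parallel (illustrative only)."""
    # Checks (1) and (2): either one forces sequential execution,
    # regardless of the config value.
    if not adapter_supports_concurrent_batches or not relation_exists:
        return False

    # An explicitly set config takes precedence over the `this` heuristic.
    if concurrent_batches is not None:
        return concurrent_batches

    # Config unset: assume a `this` reference makes parallel runs unsafe.
    return not model_references_this
```

Note that in this sketch an explicit `concurrent_batches=True` still cannot override checks (1) and (2), which matches the order in which the description lists them.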

Checklist

  • I have read the contributing guide and understand what's expected of me.
  • I have run this code in development, and it appears to resolve the stated issue.
  • This PR includes tests, or tests are not required or relevant for this PR.
  • This PR has no interface changes (e.g., macros, CLI, logs, JSON artifacts, config files, adapter interface, etc.) or this PR has already received feedback and approval from Product or DX.
  • This PR includes type annotations for new and modified functions.

@cla-bot cla-bot bot added the cla:yes label Oct 31, 2024
Contributor

Thank you for your pull request! We could not find a changelog entry for this change. For details on how to document a change, see the contributing guide.


codecov bot commented Oct 31, 2024

Codecov Report

Attention: Patch coverage is 94.92386% with 10 lines in your changes missing coverage. Please review.

Project coverage is 89.10%. Comparing base (f080346) to head (b78f251).
Report is 2 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #10958      +/-   ##
==========================================
- Coverage   89.13%   89.10%   -0.04%     
==========================================
  Files         183      183              
  Lines       23646    23760     +114     
==========================================
+ Hits        21078    21171      +93     
- Misses       2568     2589      +21     
Flag Coverage Δ
integration 86.40% <87.81%> (-0.14%) ⬇️
unit 62.12% <31.97%> (-0.65%) ⬇️


Components Coverage Δ
Unit Tests 62.12% <31.97%> (-0.65%) ⬇️
Integration Tests 86.40% <87.81%> (-0.14%) ⬇️

@QMalcolm QMalcolm added the artifact_minor_upgrade To bypass the CI check by confirming that the change is not breaking label Nov 18, 2024
…batch execution

We had a unit test. However that unit test broke upon our work to parallelize
things. The parallelization work made it _possible_ for this test to be an
integration test, which is actually what we always wanted for this functionality.
So we removed the broken unit test, and added a new integration test :)
…this` to determine batch parallelism

Microbatch batches can be run in parallel, but won't _always_ be run in parallel.
There are 3 general cases
1. The batch's model relation doesn't exist -> It can't be run in parallel
2. The relation exists and the model has a `this` invocation -> Assume it can't be run in parallel
3. The relation exists and the model _doesn't_ have a `this` invocation -> Assume it can be run in parallel

However there are some exceptions:
a. the `this` reference is actually safe (maybe the needed `this` data is guaranteed to exist)
b. the batches are small enough (data wise) that it's actually faster to run sequentially

Because of (a) and (b) we needed a way to escape out of (2) and (3). Thus we
added the `parallel_batches` config. This config defaults to `None`. If it is
set though, its value takes precedence over the presence of `this`.
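
As a rough illustration of what "has a `this` invocation" could mean, here is a minimal sketch that scans a model's raw SQL for `this` inside Jinja blocks. It is an assumption for illustration only: the helper name `references_this` is hypothetical, and the actual detection in dbt may work differently (for example by inspecting the parsed Jinja rather than using regular expressions).

```python
import re

# Contents of Jinja expression ({{ ... }}) and statement ({% ... %}) blocks.
_JINJA_BLOCK = re.compile(r"\{\{(.*?)\}\}|\{%(.*?)%\}", re.DOTALL)
# `this` as a standalone name: not `this_week`, and not an attribute like `x.this`.
_THIS_NAME = re.compile(r"(?<![\w.])this\b")


def references_this(raw_sql: str) -> bool:
    """Heuristically report whether any Jinja block mentions `this`."""
    for match in _JINJA_BLOCK.finditer(raw_sql):
        body = match.group(1) or match.group(2) or ""
        if _THIS_NAME.search(body):
            return True
    return False
```

This heuristic is deliberately simple: it ignores a bare `this` in plain SQL, but it would also flag a string literal such as `'this'` inside a Jinja block, or a reference inside a Jinja comment.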
@QMalcolm QMalcolm requested review from a team as code owners November 20, 2024 04:16
@QMalcolm QMalcolm requested review from PaulVPham and removed request for a team November 20, 2024 04:16
@QMalcolm QMalcolm changed the title [WIP] Microbatch parallelism Microbatch parallelism Nov 20, 2024
@QMalcolm QMalcolm removed the cla:yes label Nov 21, 2024
@QMalcolm
Contributor

@cla-bot check

@QMalcolm
Contributor

The cla-bot is refusing to show up 🤔

@QMalcolm
Contributor

@cla-bot check

@MichelleArk MichelleArk added the proto update update proto definitions in CI label Nov 21, 2024
@QMalcolm
Contributor

@cla-bot check

@MichelleArk
Contributor Author

Merging and bypassing branch protection for verification/cla-signed, as the check seems broken across other branches as well.

@MichelleArk MichelleArk merged commit fd6ec71 into main Nov 21, 2024
52 of 55 checks passed
@MichelleArk MichelleArk deleted the microbatch-parallelism branch November 21, 2024 05:31