Improve the Performance Characteristics of add_test_edges() #11092

Merged
merged 4 commits into main from paw/add_better_edges
Dec 5, 2024

Conversation

peterallenwebb
Contributor

@peterallenwebb peterallenwebb commented Dec 3, 2024

Resolves #10950

Problem

The existing add_test_edges() function, executed on every dbt build, has poor performance characteristics. In the average case, it increases the number of edges in the graph by a factor of five. In extreme cases the factor can be 100 or more. This causes slow run times and high memory usage for a small but important subset of users. In the extreme case that kicked off this investigation, over 8,000,000 edges are added, which takes several minutes and consumes over a gigabyte of memory.

Solution

Create a new version of the add_test_edges() function with the same overall behavior (as defined and explained in the code comments), but using a faster approach that also adds fewer edges.

For now, this new behavior is behind the --use-fast-test-edges flag, also accessible via the DBT_USE_FAST_TEST_EDGES=True env var.

The before and after results across a set of >7K real-world graph structures are recorded in this spreadsheet.

For each graph in the test set, the old algorithm and the new algorithm were run separately to produce two result graphs. The transitive closures of the two result graphs were calculated and confirmed to be equal, meaning they impose exactly the same restrictions on execution order.
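
To illustrate why equal transitive closures imply identical ordering constraints, here is a minimal sketch in plain Python. It is not the code used in this PR: the node names, the two edge lists, and the `transitive_closure` helper are all hypothetical, chosen only to show that a graph with fewer edges can still encode the same reachability.

```python
from collections import defaultdict


def transitive_closure(edges):
    """Return the set of (u, v) pairs such that v is reachable from u."""
    successors = defaultdict(set)
    for u, v in edges:
        successors[u].add(v)

    closure = set()
    for start in list(successors):
        # Iterative depth-first search from each node that has outgoing edges.
        stack = list(successors[start])
        seen = set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            closure.add((start, node))
            stack.extend(successors.get(node, ()))
    return closure


# Hypothetical toy graph: test_a tests model_a and must run before
# anything downstream of model_a.
base_edges = [
    ("model_a", "test_a"),
    ("model_a", "model_b"),
    ("model_b", "model_c"),
]

# "Dense" variant: an explicit edge from the test to every downstream node.
dense_edges = base_edges + [("test_a", "model_b"), ("test_a", "model_c")]

# "Sparse" variant: only the edge to the nearest downstream node; the
# constraint on model_c follows transitively through model_b.
sparse_edges = base_edges + [("test_a", "model_b")]

same_ordering = transitive_closure(dense_edges) == transitive_closure(sparse_edges)
print(same_ordering, len(dense_edges), len(sparse_edges))
```

In this toy case the sparse graph drops one edge while keeping every "must run before" relationship, which is the property the comparison above checks at scale.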

Across the test set, the new algorithm saved a median of 134 edges and an average of 8,227 edges; the gap between the two reflects the outsized role played by the worst-case graphs. The new algorithm was strictly faster on graphs of appreciable size (>20 nodes). The average speedup was 17x, with a median of 9x.

In the most extreme case, the new algorithm added ~96,000 edges instead of ~8,000,000, and it completed in 0.27s instead of 140s.
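
For a rough sense of scale, the ratios implied by those figures can be worked out directly. The inputs below are the approximate numbers quoted above, so the results are approximate too.

```python
# Approximate figures quoted for the most extreme graph in the test set.
old_edges, new_edges = 8_000_000, 96_000
old_seconds, new_seconds = 140.0, 0.27

edge_reduction = old_edges / new_edges
speedup = old_seconds / new_seconds

print(f"{edge_reduction:.0f}x fewer edges, {speedup:.0f}x faster")
```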

Checklist

  • I have read the contributing guide and understand what's expected of me.
  • I have run this code in development, and it appears to resolve the stated issue.
  • This PR includes tests, or tests are not required or relevant for this PR.
  • This PR has no interface changes (e.g., macros, CLI, logs, JSON artifacts, config files, adapter interface, etc.) or this PR has already received feedback and approval from Product or DX.
  • This PR includes type annotations for new and modified functions.

@cla-bot cla-bot bot added the cla:yes label Dec 3, 2024
Contributor

github-actions bot commented Dec 3, 2024

Thank you for your pull request! We could not find a changelog entry for this change. For details on how to document a change, see the contributing guide.


codecov bot commented Dec 3, 2024

Codecov Report

Attention: Patch coverage is 22.89157% with 64 lines in your changes missing coverage. Please review.

Project coverage is 88.89%. Comparing base (1b7d9b5) to head (234956d).
Report is 5 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #11092      +/-   ##
==========================================
- Coverage   89.18%   88.89%   -0.29%     
==========================================
  Files         183      183              
  Lines       23783    23864      +81     
==========================================
+ Hits        21211    21215       +4     
- Misses       2572     2649      +77     
Flag Coverage Δ
integration 86.22% <22.89%> (-0.35%) ⬇️
unit 62.02% <20.48%> (-0.15%) ⬇️


Components Coverage Δ
Unit Tests 62.02% <20.48%> (-0.15%) ⬇️
Integration Tests 86.22% <22.89%> (-0.35%) ⬇️

@peterallenwebb peterallenwebb marked this pull request as ready for review December 4, 2024 20:32
@peterallenwebb peterallenwebb requested a review from a team as a code owner December 4, 2024 20:32
Contributor

@gshank gshank left a comment

Yay! Looks great. Good comments. Nice optimizations.

@peterallenwebb peterallenwebb merged commit afe25a9 into main Dec 5, 2024
52 of 56 checks passed
@peterallenwebb peterallenwebb deleted the paw/add_better_edges branch December 5, 2024 21:33
Successfully merging this pull request may close these issues.

[SPIKE+] Improve the Performance Characteristics of add_test_edges()
2 participants