Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Regression] 1.8.2 slower to build than 1.5.9 when tag+ includes many nodes #10434

Closed
2 tasks done
cajubelt opened this issue Jul 12, 2024 · 20 comments · Fixed by #10526 or #10567
Closed
2 tasks done

[Regression] 1.8.2 slower to build than 1.5.9 when tag+ includes many nodes #10434

cajubelt opened this issue Jul 12, 2024 · 20 comments · Fixed by #10526 or #10567

Comments

@cajubelt
Copy link

cajubelt commented Jul 12, 2024

Is this a regression in a recent version of dbt-core?

  • I believe this is a regression in dbt-core functionality
  • I have searched the existing issues, and I could not find an existing issue for this regression

Current Behavior

dbt build -s tag:my_tag+ takes about 20 minutes longer to start on dbt 1.8.2 than it does on 1.5.9 with the same tag. The tag used has a lot of downstream nodes in our project, about 11k. Generally we’re seeing better performance on 1.8 so we were surprised to see this big regression in performance.

Expected/Previous Behavior

Previously building everything downstream of a tag with lots of nodes would take a couple minutes of startup time and then begin running queries against our db. Now it takes 20+ minutes.

Steps To Reproduce

  1. Set up dbt project with a tag that has about 11k downstream nodes
  2. Install dbt 1.8.2
  3. Run dbt build -s tag:my_tag+

Relevant log output

No response

Environment

- OS: MacOS 14.5 and Ubuntu 22.04
- Python: 3.9.12
- dbt (working version): 1.5.9
- dbt (regression version): 1.8.2

Which database adapter are you using with dbt?

bigquery

Additional Context

The reason we need this is because we have a selector used in CI/CD that excludes everything downstream of a tag that is upstream of many nodes. The example given is a simpler version of the original issue we found with that selector. (We tested the simpler version and found to also have the same performance issue.) The selector was something like

- name: my_selector
  definition: 
    union:
      - state:modified+
      - exclude:
        - tag:my_tag+
@cajubelt cajubelt added bug Something isn't working regression triage labels Jul 12, 2024
@jtcohen6
Copy link
Contributor

Thanks @cajubelt! Is this only for the build command (i.e. not run)?

I suspect this might be due to the large number of tests in your project (per our conversation), and the additional time that dbt spends "linking" the DAG (adding edges between test on upstream model -> downstream models, so that they skip on test failure). While it's not immediately clear to me what change we made between v1.5 -> v1.8 to that logic, if you only see this slowdown on build, it would be a strong indication that therein lies the problem.

@cajubelt
Copy link
Author

@jtcohen6 yes I just confirmed that dbt run doesn't have any noticeable slowdown on 1.8 with the same selector.

@jtcohen6
Copy link
Contributor

@cajubelt Okay! So I think the hypothesis is:

  • A project with a lot of tests (~5k models and ~15k data tests) is meaningfully slower at startup for dbt build in v1.8 compared to v1.5
  • My suspicion is that this slowdown is happening within (or closely adjacent to) the add_test_edges method

@jtcohen6
Copy link
Contributor

jtcohen6 commented Jul 12, 2024

If you're up for it, there are two ways to try confirming that hypothesis by profiling dbt's performance:

pip install py-spy
sudo py-spy record -s -f speedscope -- dbt build

@gshank
Copy link
Contributor

gshank commented Jul 12, 2024

We have the ability to exclude tests from the build resource types, but that doesn't affect whether or not we pass "add_test_edges" to compile. It would be possible to check the resource types and if tests are excluded, pass "add_test_edges" = False.

Of course if you do want tests to run, that doesn't help.

@cajubelt
Copy link
Author

We do want to run tests if possible.

@jtcohen6 here's a screenshot of a search through py-spy's output for the longest segments of the flame graph. Looks like generic_bfs_edges is adding 12:41, not sure if that's particularly revealing however

Screenshot 2024-07-14 at 8 43 37 PM

matthewshaver pushed a commit to dbt-labs/docs.getdbt.com that referenced this issue Jul 20, 2024
[Preview](https://docs-getdbt-com-git-dbeatty10-patch-2-dbt-labs.vercel.app/reference/global-configs/record-timing-info)

## What are you changing in this pull request and why?
Officially documenting the performance profiling approach mentioned in
dbt-labs/dbt-core#10434 (comment).

## Checklist
- [x] Review the [Content style
guide](https://github.com/dbt-labs/docs.getdbt.com/blob/current/contributing/content-style-guide.md)
so my content adheres to these guidelines.
@stefankeidel
Copy link

@jtcohen6 Not super sure if related to this issue but we're seeing an extreme slowdown when building a large chunk of a fairly big graph with --indirect-selection buildable in CI/CD. The project has 2573 models, 13 snapshots, 7 analyses, 2254 data tests, 20 seeds, 1 operation, 359 sources, 44 exposures, 1408 macros, 1 group, 13 unit tests

cautious and eager end up being fine, but with buildable, the process "hangs" for up to 20 minutes after it prints the project stats "Found 2573 models ..." etc. and before the "Concurrency: 4 threads (target='test')" message. No other output, even on debug log level.

I tried to figure out what's going on using the trace module but it was a bit too much spam :). Will try add some trace output when I find a minute

@stefankeidel
Copy link

here's me waiting for about 5minutes for the generic_bfs_edges traversal

image

In a "good" run with eager selection this takes about two seconds with the exact same selector (the rest is dbt actually starting to execute, which I didn't wait for in the other run)

image

@ttusing
Copy link
Contributor

ttusing commented Aug 3, 2024

What percentage of your graph (including tests) is selected by the tag, and what % is selected by the selector?

I had someone on my team try using the @ decorator on a selector (which I don't normally do) and we both had DBT hang. I kind of pushed it out of my head, will be thinking on this now.

I don't really understand the py-spy output (worked with SnakeViz).... if I had to guess what might be happening:

  • for some reason, generic_bfs_edges is either really slow or being called a lot
  • collect_specified_neighbors in selector.py calls self.graph.select_children when passed certain selector patterns
  • self.graph.select_children applies the self.descendants method for each node, which is perhaps a lot of times, or maybe it uses up a lot of memory.
  • the slow part based on your output is filter_edges_by_type, which is calling get_edge_data in networkx,

If you can share the snakeviz would help out, also maybe check your memory usage? Last bug I worked on for this graph optimization stuff was having performance issues from running out of memory when the DAG got too big.

If my stab in the dark here is in the right direction, maybe there is a pruning that could be done to reduce calling descendants so much. Like, if you have a ton of nodes in your selector for which many are already descendents of each other, you don't need to re-tread and evaluate each node. could first get a list of all descendents for each node with a depth of 1, then take the difference of that and the list of nodes in the selector to restrict to searching for descendents of unsearched nodes, find descendents of depth 1 from that list, and iterate.

For your graph has 20k nodes, if 1k have the tag, 10k are descendents, and each descendent is on average the descendent of 70% of the tagged nodes, this takes the number of times each of the descendents has to be evaluated by ~700. 700*10,000 is a lot of calls.

After writing this out this seems actually pretty likely XD. Think an update to select_children with this logic would be the place to implement this. This seems similar to #7194

Also if the slow part is filtered_graph = self.exclude_edge_type("parent_test"), maybe could cache that or pass it as a parameter into descendants?

will try tests on my own project with ~500 models and ~5000 tests next week.

@ttusing
Copy link
Contributor

ttusing commented Aug 6, 2024

I added a PR! #10526

I believe I was able to recreate the issue in my project, which has: Found 442 models, 43 snapshots, 5112 data tests, 32 seeds, 2 operations, 134 sources, 23 exposures, 1075 macros

Running dbt build --select tag:silver+ on my DAG on 1.8.4, it takes 132 seconds to start building. dbt build just takes 11 seconds.

My PR applies the pruning strategy I described. I saw the runtime improve for dbt build --select tag:silver+ to 66 seconds. Using SnakeViz, I identified that the number of calls to get_edge_data reduced from 14,332,569 to 383,916.

I am unsure how much DBT 1.9/main branch contains performance improvements... the total runtime improvement was 66 seconds, but select_children only improved from 80 seconds to 45 seconds (35 seconds improvement). Going to try to run main branch for performance check.

Feedback welcome, especially if you can test/review the change!

@peterallenwebb
Copy link
Contributor

I've reviewed and suggested some changes to @ttusing's PR but we should be close to a fix on this.

@peterallenwebb
Copy link
Contributor

This is now resolved in 1.9+, but I'm re-opening this issue so we can consider backports.

@ttusing
Copy link
Contributor

ttusing commented Aug 9, 2024

I'm also interested in seeing performance reports from @stefankeidel and @cajubelt. This should help, but checking the edge type seems to be inherently slow. Wonder if the runtime will improve from 20 minutes to like 15, 10, or 5 minutes and if there are further improvements that can be made in the edge type lookup speed.

I could image avoiding the edge type lookup altogether somehow with the knowledge that tests cannot depend on tests. So, don't check the edge type for searching children if the node is a test (or if parent is a test for ancestors) or something like that. Or, building the subgraph with all non-tests nodes first.

@cajubelt
Copy link
Author

cajubelt commented Aug 9, 2024

Just saw the great activity on this after it fell off my radar for a bit, thanks all. Will follow up on Monday with a perf report, I have personal commitments this weekend.

We'd love to see a backport to 1.8.

@stefankeidel
Copy link

stefankeidel commented Aug 10, 2024 via email

@ttusing
Copy link
Contributor

ttusing commented Aug 12, 2024

So I'm looking at this section:

def add_test_edges(self, manifest: Manifest) -> None:
"""This method adds additional edges to the DAG. For a given non-test
executable node, add an edge from an upstream test to the given node if
the set of nodes the test depends on is a subset of the upstream nodes
for the given node."""
# Given a graph:
# model1 --> model2 --> model3
# | |
# | \/
# \/ test 2
# test1
#
# Produce the following graph:
# model1 --> model2 --> model3
# | /\ | /\ /\
# | | \/ | |
# \/ | test2 ----| |
# test1 ----|---------------|

This adds a huge amount of edges. For probably most but pathological DAGs, this is an edge between every test and all nodes downstream of what its testing. For the example of 5k models and 15k data tests, this might be added approximately (5k*15k/2)= 37.5 million edges.

Was added in #7276, I don't really understand the purpose of this code. @gshank can you explain why it is necessary to add all of these edges? It seems like a huge factor limiting DBT scaling. If the goal is to get dbt build to run tests on parents before children, isn't enough to just do this with a depth of 1?

I'm guessing the get edge type function is slow because of the millions of edges. There are way less nodes (thousands), so getting node type should probably be way faster, and I am nearly certain nothing can directly depend on tests (only parent_edge type), so I think checking the node types is another potential huge optimization.

Somewhat separately, I wonder about a feature enhancement to print simple DAG stats with like --dag-debug for helping with these. Might show the total number of edges, highest degree node (or distribution stats on degree), highest number of tests on a single model (or distribution), etc.

@cajubelt
Copy link
Author

Ok I did some profiling last night on dbt build --select tag:large_tag+. Looks like get_edge_data accounted for about 36 minutes and was called 710,951,628 times. Total time to build was 81 minutes. (I used --empty and it errored out immediately when writing to tables so none of this was time waiting for the db writes.) I used dbt 1.8.2. Will try again tonight on 1.9 if that's helpful

@cajubelt
Copy link
Author

Ok, ran it again with 1.9.0-a1 and the improvement was dramatic. The build time was just 10 minutes. get_edge_data was called 11,778,492, a 98% reduction, and accounted for just 21 seconds.

@cajubelt
Copy link
Author

Some stats about my project: 5799 models, 20 snapshots, 91 analyses, 18763 data tests, 186 seeds, 1 operation, 1086 sources, 69 exposures, 1899 macros

@dbeatty10
Copy link
Contributor

Resolved for dbt-core 1.9+ in #10526 and backported to dbt-core 1.8 in #10567.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants