[Performance] 1.8 slower partial parsing than 1.7 #10127

d-cole · 2024-05-10T14:49:47Z

Is this your first time submitting a feature request?

I have read the expectations for open source contributors
I have searched the existing issues, and I could not find an existing issue for this feature
I am requesting a straightforward extension of existing dbt functionality, rather than a Big Idea better suited to a discussion

Performance issue

In a large dbt project we noticed a 26% increase in partial parse time updating from 1.7.3 - 1.8.0. This was replicated in a smaller reproducible project. Measurements below are the median time over 10 runs of the same command:

Parse time: 24% reduction, 68.46sec -> 52.05sec (-16.41sec). This is great!
Partial parse time: 26% increase, 6.00sec -> 7.58sec (+1.58sec). While this isn't huge, this is not ideal as our users encounter partial parsing much more frequently than a full parse.

Steps to reproduce

The results can be replicated in a reproducible way using https://github.com/dbt-labs/jaffle_shop_duckdb.

Follow setup steps here: https://github.com/dbt-labs/jaffle_shop_duckdb
dbt run -r 1.7_parse_timing.txt parse
dbt run -r 1.7_partial_parse_timing.txt parse
Remove target/ and install 1.8. There may be an issue here with dbt-adapters install so I've included all steps below. I can move this to a separate discussion if needed.

python3 -m pip install dbt-core==1.8.0 #version conflict with dbt-duckdb so install that next and reinstall
pip install dbt-duckdb==1.8.0  
python3 -m pip install dbt-core==1.8.0
# dbt debug results in "ModuleNotFoundError: No module named 'dbt.adapters.factory"
pip uninstall dbt-adapters
pip install dbt-adapters==1.1.1

dbt run -r 1.8_parse_timing.txt parse
dbt run -r 1.8_partial_parse_timing.txt parse

Results from partial parses are attached below.

1.7_partial_parse_timing.txt
1.8_partial_parse_timing.txt

The results shown in the record timing for dbt-duckdb match what we see in our large dbt project. That is, the parse_manifest time is significantly greater in 1.8 (5.6x greater in duckdb example 0.0568s -> 0.320s). This increase can be broken down into:

16x increase in write_manifest (0.00685 -> 0.110). This matches what we see in our real dbt project.
4.4x increase in get_full_manifest (0.0464 -> 0.206)

Who will this benefit?

Users of dbt 1.8+

The text was updated successfully, but these errors were encountered:

peterallenwebb · 2024-05-10T17:14:58Z

@d-cole Thank you so much for the exceptional repro steps and analysis! I am looking in to this.

peterallenwebb · 2024-05-10T18:49:50Z

My local evaluation (on an M1 MacBook) shows that partial parsing of jaffle_shop_duckdb is slightly faster on 1.8 than it was on 1.7.3:

dbt 1.7.3
  dbt parse 1.08s user 0.10s system 87% cpu 1.351 total


dbt 1.8.0
  dbt parse  0.86s user 0.09s system 83% cpu 1.143 total

I think the profiles you attached are correct, but I don't think they give the full picture for this small project.

Python's built-in profiler ignores time spent loading Python modules, which is a significant fraction of dbt's runtime on short runs. In addition, we improved startup time in 1.8 by moving a significant amount of work from module load time to post-load run time. So I think the amount of work stayed about the same, but now the built-in profiler actually notices it.

I believe your claim that partial parsing slowed down for the larger project you described, but I think we'll need more information to get to the bottom of it. If you're able to supply profiles for the larger project, I might be able to find a solution. Ideally, I would recommend you use py-spy to collect the profiles in speedscope format, which would look like this:

pip install py-spy
sudo py-spy record -s -f speedscope -- dbt parse

I'm attaching the profiles I gathered during my investigation for reference. These can be viewed at https://www.speedscope.app/

dbt-1_7_3-parse-no-parrtial-parse.json
dbt-1_7_3-parse-yes-partial-parse.json
dbt-1_8_0-parse-no-partial-parse.json
dbt-1_8_0-parse-yes-partial-parse.json

d-cole · 2024-05-15T15:53:00Z

@peterallenwebb Thank you!

Python's built-in profiler ignores time spent loading Python modules, which is a significant fraction of dbt's runtime on short runs. In addition, we improved startup time in 1.8 by moving a significant amount of work from module load time to post-load run time. So I think the amount of work stayed about the same, but now the built-in profiler actually notices it.

Makes sense!

On the real dbt project I initially gathered times using a script to wrap a parse call and took the median over 10 runs, I believe this won't have the same issues as the profiler.

start_time = time.time()
subprocess.run(command, shell=True, check=True)
elapsed_time = time.time() - start_time

Using py-spy we see similar results where the write_manifest time is longer in 1.8 (see screenshot below). I've attached the partial parse profiles below. Thanks again for taking a look!

dbt_18_partial_parse.json
dbt_17_partial_parse.json

github-actions · 2024-11-13T01:59:33Z

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

github-actions · 2024-11-20T02:02:41Z

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.

d-cole added enhancement New feature or request triage labels May 10, 2024

d-cole changed the title ~~[Feature] 1.8 slower partial parsing than 1.7~~ [Performance] 1.8 slower partial parsing than 1.7 May 10, 2024

dbeatty10 added performance and removed triage labels May 10, 2024

dbeatty10 added the awaiting_response label May 13, 2024

github-actions bot added triage and removed awaiting_response labels May 15, 2024

dbeatty10 removed the triage label May 16, 2024

jtcohen6 mentioned this issue Jul 12, 2024

[Regression] 1.8.2 slower to build than 1.5.9 when tag+ includes many nodes #10434

Closed

2 tasks

github-actions bot added the stale Issues that have gone stale label Nov 13, 2024

github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Nov 20, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Performance] 1.8 slower partial parsing than 1.7 #10127

[Performance] 1.8 slower partial parsing than 1.7 #10127

d-cole commented May 10, 2024 •

edited

Loading

peterallenwebb commented May 10, 2024

peterallenwebb commented May 10, 2024

d-cole commented May 15, 2024

github-actions bot commented Nov 13, 2024

github-actions bot commented Nov 20, 2024

[Performance] 1.8 slower partial parsing than 1.7 #10127

[Performance] 1.8 slower partial parsing than 1.7 #10127

Comments

d-cole commented May 10, 2024 • edited Loading

Is this your first time submitting a feature request?

Performance issue

Steps to reproduce

Who will this benefit?

peterallenwebb commented May 10, 2024

peterallenwebb commented May 10, 2024

d-cole commented May 15, 2024

github-actions bot commented Nov 13, 2024

github-actions bot commented Nov 20, 2024

d-cole commented May 10, 2024 •

edited

Loading