Performance issues with a nontrivial number of sources #2474

Closed
drewbanin opened this issue May 20, 2020 · 3 comments · Fixed by #2478
Labels
bug Something isn't working performance

Comments

@drewbanin
Contributor

drewbanin commented May 20, 2020

Describe the bug

In dbt v0.17.0-rc1, it appears that performance degrades greatly when a nontrivial number of sources are added to a project. This is a regression: I could not reproduce this performance failure mode in dbt v0.16.x.

I tested this out by repeatedly running dbt ls and recording runtimes. The data looks like:

sources   runtime (s)    time per node (s)
25        4.049509287    0.1619803715
50        7.128911018    0.1425782204
75        12.29483604    0.1639311473
100       18.71185374    0.1871185374
125       29.37486005    0.2349988804
150       38.06821609    0.2537881072
175       50.92429304    0.2909959602
200       67.39295316    0.3369647658
225       79.97533703    0.3554459423

This data indicates that a dbt ls command with 225 sources takes 1m20s to run. A corresponding dbt ls on 0.16.1 runs in 4s!
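For anyone who wants to reproduce these measurements, here is a minimal timing sketch. The helper name `time_command` and the repeat count are assumptions made for illustration and are not part of dbt; the only dbt-specific piece is the `dbt ls` invocation shown in the trailing comment.

```python
import subprocess
import time
from typing import List, Sequence


def time_command(cmd: Sequence[str], repeats: int = 3) -> List[float]:
    """Run `cmd` `repeats` times and return each wall-clock duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        # Discard stdout so printing the node list does not dominate the timing.
        subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
        durations.append(time.perf_counter() - start)
    return durations


# Example (assumes dbt is on PATH and the working directory is a dbt project):
# runs = time_command(["dbt", "ls"])
# print(f"best of {len(runs)}: {min(runs):.2f}s")
```

Running this once per project size (25 sources, 50 sources, and so on) would yield one row of the table per run.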

Time per node grows from about 0.16s at 25 sources to about 0.36s at 225, so total runtime scales worse than linearly and there may be some algorithmic complexity issues to look into here. Additionally, the fixed cost for parsing a single source is very high. Most of this latency appears to come from the serialization and deserialization of data in the source patching part of the codebase.
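One rough way to check the complexity concern is to fit a power law, runtime ≈ c · sources^k, to the measurements above with a least-squares line in log-log space. This is only a sketch: nine data points from one machine are not much to go on, and the function name `loglog_slope` is made up for this example. An exponent k above 1 would point to superlinear growth.

```python
import math

# (number of sources, runtime in seconds), copied from the table in this issue.
measurements = [
    (25, 4.049509287),
    (50, 7.128911018),
    (75, 12.29483604),
    (100, 18.71185374),
    (125, 29.37486005),
    (150, 38.06821609),
    (175, 50.92429304),
    (200, 67.39295316),
    (225, 79.97533703),
]


def loglog_slope(points):
    """Least-squares slope of log(runtime) against log(sources)."""
    xs = [math.log(n) for n, _ in points]
    ys = [math.log(t) for _, t in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


exponent = loglog_slope(measurements)
print(f"estimated scaling exponent: {exponent:.2f}")
```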

The patch_source method accounts for the majority of the runtime of this dbt ls command, but notably, there are no sources to patch in my example project!

The relevant part of the codebase is around here:

https://github.com/fishtown-analytics/dbt/blob/75dbb0bc19376b2905d5bbb66284b9be3bf3c93c/core/dbt/parser/sources.py#L44-L68

Possible resolutions

  • Can we skip the source patching code if the source is not patched?
  • The slowest parts of this execution are around serialization and deserialization in hologram (I think). Is there an easy way to make this serialization/deserialization significantly faster?
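The first idea could look roughly like the sketch below. It is not dbt's actual code: `Source`, `SourcePatch`, and this `patch_source` function are hypothetical stand-ins, and the dict round trip via `dataclasses.asdict` only imitates the serialize/deserialize cost described above. The point is the early return, which skips that work entirely when no patch targets the source.

```python
from dataclasses import asdict, dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Source:
    # Hypothetical, simplified stand-in for a parsed source.
    package: str
    name: str
    description: str = ""


@dataclass(frozen=True)
class SourcePatch:
    # Hypothetical patch; None means "leave this field unchanged".
    description: Optional[str] = None


PatchKey = Tuple[str, str]  # (package, source name)


def patch_source(source: Source, patches: Dict[PatchKey, SourcePatch]) -> Source:
    patch = patches.get((source.package, source.name))
    if patch is None:
        # Fast path: nothing to patch, so skip the serialization round trip.
        return source
    # Slow path: serialize, apply the overrides, and rebuild the object.
    data = asdict(source)
    data.update({k: v for k, v in asdict(patch).items() if v is not None})
    return Source(**data)
```

With an empty `patches` mapping, as in the example project here, every call returns the original object untouched.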

The output of dbt --version:

dbt v0.17.0-rc1

The operating system you're using: macOS

The output of python --version: 3.7.7

@drewbanin drewbanin added bug Something isn't working performance labels May 20, 2020
@drewbanin drewbanin added this to the Octavius Catto milestone May 20, 2020
@beckjake
Contributor

I assure you, there's no way dbt 0.17.0rc1 is running with python 2.7.7. 😄
Maybe we should have people run dbt debug instead, so we can capture homebrew/virtualenv installs?

@drewbanin
Contributor Author

oops - i meant 3.7.7 - that was a typo

@drewbanin
Contributor Author

fixed by #2478
