lots of time is spent in allocation in puffin-dev resolve-many for the top PyPI packages #396
> Personal opinion is that I'd love to improve this, and these ideas strike me as being in the right direction.

> I'd love to try smallvec for […]

> We could also use an enum for […]

> That might be strictly more work. I recall […]

> Perfwise we shouldn't see the difference, but […]
This copies the allocator configuration used in the Ruff project. In particular, this gives us an instant 10% win when resolving the top 1K PyPI packages:

```
$ hyperfine \
    "./target/profiling/puffin-dev-main resolve-many --cache-dir cache-docker-no-build --no-build pypi_top_8k_flat.txt --limit 1000 2> /dev/null" \
    "./target/profiling/puffin-dev resolve-many --cache-dir cache-docker-no-build --no-build pypi_top_8k_flat.txt --limit 1000 2> /dev/null"
Benchmark 1: ./target/profiling/puffin-dev-main resolve-many --cache-dir cache-docker-no-build --no-build pypi_top_8k_flat.txt --limit 1000 2> /dev/null
  Time (mean ± σ):     974.2 ms ±  26.4 ms    [User: 17503.3 ms, System: 2205.3 ms]
  Range (min … max):   943.5 ms … 1015.9 ms    10 runs

Benchmark 2: ./target/profiling/puffin-dev resolve-many --cache-dir cache-docker-no-build --no-build pypi_top_8k_flat.txt --limit 1000 2> /dev/null
  Time (mean ± σ):     883.1 ms ±  23.3 ms    [User: 14626.1 ms, System: 2542.2 ms]
  Range (min … max):   849.5 ms …  916.9 ms    10 runs

Summary
  './target/profiling/puffin-dev resolve-many --cache-dir cache-docker-no-build --no-build pypi_top_8k_flat.txt --limit 1000 2> /dev/null' ran
    1.10 ± 0.04 times faster than './target/profiling/puffin-dev-main resolve-many --cache-dir cache-docker-no-build --no-build pypi_top_8k_flat.txt --limit 1000 2> /dev/null'
```

I was moved to do this because I noticed `malloc`/`free` taking up a fairly sizeable percentage of time during light profiling. As is becoming a pattern, it will be easier to review this commit-by-commit.

Ref #396 (wouldn't call this issue fixed)

---

I did also try adding a `smallvec` optimization to the `Version::release` field, but it didn't bear any fruit. I still think there is more to explore, since the results I observed don't quite line up with what I expect. (So probably either my mental model is off or my measurement process is flawed.) You can see that attempt, with a little more explanation, here: f9528b4

In the course of adding the `smallvec` optimization, I also shrunk the `Version` fields from a `usize` to a `u32`. They should at least be a fixed-size integer, since version numbers aren't used to index memory, and I shrunk them to `u32` since it seems reasonable to assume that all version numbers will be smaller than `2^32`.
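For context, swapping the global allocator in Rust is a small, localized change: a type implementing `GlobalAlloc` is registered with the `#[global_allocator]` attribute. The sketch below is illustrative, not the PR's actual configuration: it uses a stdlib-only counting wrapper around the system allocator (rather than the external `mimalloc` crate) to show the registration mechanism and one way to confirm that a workload is dominated by many tiny allocations. The name `CountingAlloc` and the counting itself are assumptions for this example.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

// Wraps the system allocator and counts alloc/dealloc calls.
// `CountingAlloc` is a hypothetical name for illustration only.
struct CountingAlloc;

static ALLOCS: AtomicUsize = AtomicUsize::new(0);
static FREES: AtomicUsize = AtomicUsize::new(0);

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOCS.fetch_add(1, Ordering::Relaxed);
        unsafe { System.alloc(layout) }
    }
    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        FREES.fetch_add(1, Ordering::Relaxed);
        unsafe { System.dealloc(ptr, layout) }
    }
}

// Registering the allocator is the whole "configuration"; with a
// drop-in replacement like mimalloc, this one item is the change.
#[global_allocator]
static GLOBAL: CountingAlloc = CountingAlloc;

fn main() {
    let before = ALLOCS.load(Ordering::Relaxed);
    // Simulate lots of tiny objects, as in `Version`/`WheelFilename` parsing.
    let versions: Vec<Vec<u32>> = (0..1000u32).map(|i| vec![1, i, 0]).collect();
    let after = ALLOCS.load(Ordering::Relaxed);
    // At least one heap allocation per tiny inner Vec.
    assert!(after - before >= 1000);
    println!("tiny-object allocations observed: {}", after - before);
    drop(versions);
}
```

Profiling with a wrapper like this (or with `perf`, as in the issue) is what surfaces the `malloc`/`free` overhead that a faster allocator then shrinks.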
#399 gets a ~10% perf win by switching the global allocator. I did attempt a smallvec optimization experiment (as mentioned above) but couldn't observe any measurable win. I'm not 100% convinced that smallvec has no role to play, and I'm sure there is more that can still be done here.
@BurntSushi - Should we close this when merging #789, or is there more that's worth tracking here?
Ah yeah, that's a good point. I don't think this specific issue still exists once #789 gets merged. (Or even before it, to be honest. I think gratuitous allocation has been removed from the happy path for the most part already.)
One thing I'm seeing is that we are spending a lot of time in allocation: around 16% of total runtime in `malloc` and 11.5% in `free` for the top 1K PyPI packages. That jumps to about 18% for `malloc` and 13% for `free` in the top 4K.

My hypothesis for this is that we are allocating oodles of tiny little objects everywhere. For example, `Version` has two `Vec`s in it, and `WheelFilename` adds many more allocs with more indirection. There are probably other places with lots of tiny little objects.

There are a variety of different approaches one can use to attack this problem (if we consider it a problem at all):

- Try `smallvec`. This is also known as a "small string optimization" (SSO). This might be a good first thing to try since I'm guessing it's quite relevant. I imagine many of the `Vec`s and `String`s in `WheelFilename` and `Version`, for example, are quite small.
- Try an alternative allocator such as `mimalloc`. My system is using glibc's allocator, which is generally pretty fast, so I wouldn't expect an enormous gain here.
- Try arena allocation with something like `bumpalo`. This will likely require some refactoring of the code.
- `Version` could probably be represented as a `Vec<u8>` internally. This makes its representation more complex and usually will incur some kind of perf penalty when you try to use a `Version`. But it could speed up object creation enough to make it a worthwhile trade.

Anyway, this is just what I have off the top of my head. I'm actually not even certain that we want to solve this problem at all. But it is worth pointing out that some of the above remedies are much easier to try than others (`smallvec` should be quite easy to slot into the existing code and could significantly reduce allocs).