
lots of time is spent in allocation in puffin-dev resolve-many for the top PyPI packages #396

Closed
BurntSushi opened this issue Nov 10, 2023 · 8 comments · Fixed by #789
Labels: performance (Potential performance improvement)

@BurntSushi
Member

One thing I'm seeing is that we are spending a lot of time in allocation for the top 1K PyPI packages: around 16% of total runtime in malloc and 11.5% in free:

[profiling screenshot: time spent in malloc]

That jumps to about 18% for malloc and 13% for free in the top 4K.

My hypothesis for this is that we are allocating oodles of tiny little objects everywhere. For example, Version has two Vecs in it and WheelFilename adds many more allocs with more indirection. There are probably other places with lots of tiny little objects.

There are a variety of approaches one can use to attack this problem (if we consider it a problem at all):

  • Try container types that use inline storage for small elements and fall back to the heap for bigger ones. I'm thinking of things like smallvec. (For strings, this is known as the "small string optimization," or SSO.) This might be a good first thing to try, since I'm guessing it's quite relevant: many of the Vecs and Strings in WheelFilename and Version, for example, are probably quite small. A sketch follows at the end of this list.
  • We could try different allocators. It looks like ruff uses mimalloc. My system is using glibc's allocator which is generally pretty fast. So I wouldn't expect an enormous gain here.
  • We could try different allocation styles. For example, a bump allocator like bumpalo. This will likely require some refactoring of the code.
  • We could try representational changes to our data types to make fewer allocations. For example, a Version could probably be represented as a Vec<u8> internally. This makes its representation more complex and will usually incur some perf penalty when you use a Version, but it could speed up object creation enough to make it a worthwhile trade.
  • We could use some kind of string interning to exploit cross-package redundancy. It is very likely, for example, that the number of unique version strings is far smaller than the total number of version strings, since some packages have a lot of releases. The same goes for other things such as package names. A sketch follows at the end of this comment.
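
As a concrete illustration of the first idea, here is roughly what an inline-storage release could look like. This is a hypothetical sketch using the smallvec crate, not actual code from this repo:

    use smallvec::SmallVec;

    // Hypothetical: up to four release segments are stored inline on the
    // stack; longer versions fall back to a single heap allocation.
    type Release = SmallVec<[u32; 4]>;

    fn parse_release(s: &str) -> Release {
        // "1.2.3" has three segments, so collecting it allocates nothing.
        s.split('.').filter_map(|part| part.parse::<u32>().ok()).collect()
    }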

Anyway, this is just what I have off the top of my head. I'm actually not even certain that we want to solve this problem at all. But it is worth pointing out that some of the above remedies are much easier to try than others (smallvec should be quite easy to slot into the existing code and could significantly reduce allocs).
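
For the interning idea specifically, here is a minimal sketch of the shape it could take. This is hypothetical (a real version would need to think about thread safety and when to free the table):

    use std::collections::HashMap;
    use std::sync::Arc;

    // Hypothetical interner: identical strings across packages share a
    // single allocation instead of each owning its own copy.
    #[derive(Default)]
    struct Interner {
        strings: HashMap<Arc<str>, ()>,
    }

    impl Interner {
        fn intern(&mut self, s: &str) -> Arc<str> {
            if let Some((existing, _)) = self.strings.get_key_value(s) {
                return Arc::clone(existing);
            }
            let interned: Arc<str> = Arc::from(s);
            self.strings.insert(Arc::clone(&interned), ());
            interned
        }
    }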

BurntSushi added the performance (Potential performance improvement) label Nov 10, 2023
@charliermarsh
Member

Personal opinion is that I'd love to improve this, and these ideas strike me as being in the right direction.

@konstin
Member

konstin commented Nov 10, 2023

I'd love to try `smallvec` for `release`. We could also use `u32` instead of `usize` in there.

@charliermarsh
Member

We could also use an enum for release that captures the overwhelmingly common cases of 1, 2, or 3 segments.
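
Something like this hypothetical shape (illustration only, not proposed code):

    // Hypothetical: the overwhelmingly common one-, two-, and
    // three-segment releases carry no heap allocation at all; only the
    // rare long versions hit the heap.
    enum Release {
        One(u32),
        Two(u32, u32),
        Three(u32, u32, u32),
        Many(Vec<u32>),
    }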

@charliermarsh
Member

That might be strictly more work. I recall smallvec introduces some overhead, since you need to check whether the elements are on the stack or the heap, but I guess an enum would have the same problem :)

@konstin
Member

konstin commented Nov 10, 2023

Perf-wise we shouldn't see a difference, but `smallvec` would be more ergonomic with less effort.

BurntSushi added a commit that referenced this issue Nov 10, 2023
This copies the allocator configuration used in the Ruff project. In
particular, this gives us an instant 10% win when resolving the top 1K
PyPI packages:

    $ hyperfine \
        "./target/profiling/puffin-dev-main resolve-many --cache-dir cache-docker-no-build --no-build pypi_top_8k_flat.txt --limit 1000 2> /dev/null" \
        "./target/profiling/puffin-dev resolve-many --cache-dir cache-docker-no-build --no-build pypi_top_8k_flat.txt --limit 1000 2> /dev/null"
    Benchmark 1: ./target/profiling/puffin-dev-main resolve-many --cache-dir cache-docker-no-build --no-build pypi_top_8k_flat.txt --limit 1000 2> /dev/null
      Time (mean ± σ):     974.2 ms ±  26.4 ms    [User: 17503.3 ms, System: 2205.3 ms]
      Range (min … max):   943.5 ms … 1015.9 ms    10 runs

    Benchmark 2: ./target/profiling/puffin-dev resolve-many --cache-dir cache-docker-no-build --no-build pypi_top_8k_flat.txt --limit 1000 2> /dev/null
      Time (mean ± σ):     883.1 ms ±  23.3 ms    [User: 14626.1 ms, System: 2542.2 ms]
      Range (min … max):   849.5 ms … 916.9 ms    10 runs

    Summary
      './target/profiling/puffin-dev resolve-many --cache-dir cache-docker-no-build --no-build pypi_top_8k_flat.txt --limit 1000 2> /dev/null' ran
        1.10 ± 0.04 times faster than './target/profiling/puffin-dev-main resolve-many --cache-dir cache-docker-no-build --no-build pypi_top_8k_flat.txt --limit 1000 2> /dev/null'

I was moved to do this because I noticed `malloc`/`free` taking up a
fairly sizeable percentage of time during light profiling.

Ref #396
BurntSushi added a commit that referenced this issue Nov 10, 2023
This copies the allocator configuration used in the Ruff project. In
particular, this gives us an instant 10% win when resolving the top 1K
PyPI packages:

    $ hyperfine \
        "./target/profiling/puffin-dev-main resolve-many --cache-dir cache-docker-no-build --no-build pypi_top_8k_flat.txt --limit 1000 2> /dev/null" \
        "./target/profiling/puffin-dev resolve-many --cache-dir cache-docker-no-build --no-build pypi_top_8k_flat.txt --limit 1000 2> /dev/null"
    Benchmark 1: ./target/profiling/puffin-dev-main resolve-many --cache-dir cache-docker-no-build --no-build pypi_top_8k_flat.txt --limit 1000 2> /dev/null
      Time (mean ± σ):     974.2 ms ±  26.4 ms    [User: 17503.3 ms, System: 2205.3 ms]
      Range (min … max):   943.5 ms … 1015.9 ms    10 runs

    Benchmark 2: ./target/profiling/puffin-dev resolve-many --cache-dir cache-docker-no-build --no-build pypi_top_8k_flat.txt --limit 1000 2> /dev/null
      Time (mean ± σ):     883.1 ms ±  23.3 ms    [User: 14626.1 ms, System: 2542.2 ms]
      Range (min … max):   849.5 ms … 916.9 ms    10 runs

    Summary
      './target/profiling/puffin-dev resolve-many --cache-dir cache-docker-no-build --no-build pypi_top_8k_flat.txt --limit 1000 2> /dev/null' ran
        1.10 ± 0.04 times faster than './target/profiling/puffin-dev-main resolve-many --cache-dir cache-docker-no-build --no-build pypi_top_8k_flat.txt --limit 1000 2> /dev/null'

I was moved to do this because I noticed `malloc`/`free` taking up a
fairly sizeable percentage of time during light profiling.

As is becoming a pattern, it will be easier to review this
commit-by-commit.

Ref #396 (wouldn't call this issue fixed)

-----

I did also try adding a `smallvec` optimization to the
`Version::release` field, but it didn't bear any fruit. I still think
there is more to explore since the results I observed don't quite line
up with what I expect. (So probably either my mental model is off or my
measurement process is flawed.) You can see that attempt with a little
more explanation here:
f9528b4

In the course of adding the `smallvec` optimization, I also shrunk the
`Version` fields from `usize` to `u32`. They should at least be
fixed-size integers, since version numbers aren't used to index memory,
and I shrunk them to `u32` since it seems reasonable to assume that all
version numbers will be smaller than 2^32.
@BurntSushi
Member Author

#399 gets a ~10% perf win by switching the global allocator. I did attempt a smallvec optimization experiment (as mentioned above) but couldn't observe any measurable win. I'm not 100% convinced that smallvec has no role to play and I'm sure there is more that can still be done here.
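
For reference, the allocator swap itself is only a few lines. A minimal sketch, assuming the `mimalloc` crate (the actual change copied Ruff's configuration, which may differ in detail):

    use mimalloc::MiMalloc;

    // Route every allocation in the binary through mimalloc instead of
    // the system allocator.
    #[global_allocator]
    static GLOBAL: MiMalloc = MiMalloc;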

@charliermarsh
Member

@BurntSushi - Should we close this when merging #789 or is there more that's worth tracking here?

@BurntSushi
Member Author

Ah yeah that's a good point. Yeah, I don't think this specific issue still exists once #789 gets merged. (Or even before it to be honest. I think gratuitous allocation has been removed from the happy path for the most part already.)
