
lots of time is spent in allocation in puffin-dev resolve-many for the top PyPI packages #396

Closed
BurntSushi opened this issue Nov 10, 2023 · 8 comments · Fixed by #789
Labels: performance (Potential performance improvement)

@BurntSushi
Member

One thing I'm seeing is that we are spending a lot of time in allocation for the top 1K PyPI packages: around 16% of total runtime in malloc and 11.5% in free:

[profiling screenshot: time spent in malloc]

That jumps to about 18% for malloc and 13% for free in the top 4K.

My hypothesis for this is that we are allocating oodles of tiny little objects everywhere. For example, Version has two Vecs in it and WheelFilename adds many more allocs with more indirection. There are probably other places with lots of tiny little objects.

There are a variety of approaches one can use to attack this problem (if we consider it a problem at all):

  • Try container types that use inline storage for small elements and fall back to the heap for bigger ones. I'm thinking of things like smallvec. (For strings, this is known as the "small string optimization," or SSO.) This might be a good first thing to try, since I'm guessing it's quite relevant: many of the Vecs and Strings in WheelFilename and Version, for example, are probably quite small. A sketch follows at the end of this list.
  • We could try different allocators. It looks like ruff uses mimalloc. My system is using glibc's allocator which is generally pretty fast. So I wouldn't expect an enormous gain here.
  • We could try different allocation styles. For example, a bump allocator like bumpalo. This will likely require some refactoring of the code.
  • We could try representational changes to our data types to make fewer allocations. For example, a Version could probably be represented as a Vec<u8> internally. This makes its representation more complex and will usually incur some perf penalty when you use a Version, but it could speed up object creation enough to make it a worthwhile trade.
  • We could use some kind of string interning to exploit cross-package redundancy. It is very likely, for example, that the number of unique version strings is far smaller than the total number of version strings, since some packages have a lot of releases. The same goes for other things such as package names. A sketch follows at the end of this comment.
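
As a concrete illustration of the first idea, here is roughly what an inline-storage release could look like. This is a hypothetical sketch using the smallvec crate, not actual code from this repo:

    use smallvec::SmallVec;

    // Hypothetical: up to four release segments are stored inline on the
    // stack; longer versions fall back to a single heap allocation.
    type Release = SmallVec<[u32; 4]>;

    fn parse_release(s: &str) -> Release {
        // "1.2.3" has three segments, so collecting it allocates nothing.
        s.split('.').filter_map(|part| part.parse::<u32>().ok()).collect()
    }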

Anyway, this is just what I have off the top of my head. I'm actually not even certain that we want to solve this problem at all. But it is worth pointing out that some of the above remedies are much easier to try than others (smallvec should be quite easy to slot into the existing code and could significantly reduce allocs).
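
For the interning idea specifically, here is a minimal sketch of the shape it could take. This is hypothetical (a real version would need to think about thread safety and when to free the table):

    use std::collections::HashMap;
    use std::sync::Arc;

    // Hypothetical interner: identical strings across packages share a
    // single allocation instead of each owning its own copy.
    #[derive(Default)]
    struct Interner {
        strings: HashMap<Arc<str>, ()>,
    }

    impl Interner {
        fn intern(&mut self, s: &str) -> Arc<str> {
            if let Some((existing, _)) = self.strings.get_key_value(s) {
                return Arc::clone(existing);
            }
            let interned: Arc<str> = Arc::from(s);
            self.strings.insert(Arc::clone(&interned), ());
            interned
        }
    }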

BurntSushi added the performance (Potential performance improvement) label Nov 10, 2023
@charliermarsh
Member

Personal opinion is that I'd love to improve this, and these ideas strike me as being in the right direction.

@konstin
Member

konstin commented Nov 10, 2023

I'd love to try `smallvec` for `release`. We could also use `u32` instead of `usize` in there.

@charliermarsh
Member

We could also use an enum for release that captures the overwhelmingly common cases of 1, 2, or 3 segments.
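
Something like this hypothetical shape (illustration only, not proposed code):

    // Hypothetical: the overwhelmingly common one-, two-, and
    // three-segment releases carry no heap allocation at all; only the
    // rare long versions hit the heap.
    enum Release {
        One(u32),
        Two(u32, u32),
        Three(u32, u32, u32),
        Many(Vec<u32>),
    }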

@charliermarsh
Member

That might be strictly more work. I recall smallvec introduces some overhead, since you need to check whether the elements are on the stack or the heap, but I guess an enum would have the same problem :)

@konstin
Member

konstin commented Nov 10, 2023

Perf-wise we shouldn't see a difference, but `smallvec` would be more ergonomic with less effort.

BurntSushi added a commit that referenced this issue Nov 10, 2023
This copies the allocator configuration used in the Ruff project. In
particular, this gives us an instant 10% win when resolving the top 1K
PyPI packages:

    $ hyperfine \
        "./target/profiling/puffin-dev-main resolve-many --cache-dir cache-docker-no-build --no-build pypi_top_8k_flat.txt --limit 1000 2> /dev/null" \
        "./target/profiling/puffin-dev resolve-many --cache-dir cache-docker-no-build --no-build pypi_top_8k_flat.txt --limit 1000 2> /dev/null"
    Benchmark 1: ./target/profiling/puffin-dev-main resolve-many --cache-dir cache-docker-no-build --no-build pypi_top_8k_flat.txt --limit 1000 2> /dev/null
      Time (mean ± σ):     974.2 ms ±  26.4 ms    [User: 17503.3 ms, System: 2205.3 ms]
      Range (min … max):   943.5 ms … 1015.9 ms    10 runs

    Benchmark 2: ./target/profiling/puffin-dev resolve-many --cache-dir cache-docker-no-build --no-build pypi_top_8k_flat.txt --limit 1000 2> /dev/null
      Time (mean ± σ):     883.1 ms ±  23.3 ms    [User: 14626.1 ms, System: 2542.2 ms]
      Range (min … max):   849.5 ms … 916.9 ms    10 runs

    Summary
      './target/profiling/puffin-dev resolve-many --cache-dir cache-docker-no-build --no-build pypi_top_8k_flat.txt --limit 1000 2> /dev/null' ran
        1.10 ± 0.04 times faster than './target/profiling/puffin-dev-main resolve-many --cache-dir cache-docker-no-build --no-build pypi_top_8k_flat.txt --limit 1000 2> /dev/null'

I was moved to do this because I noticed `malloc`/`free` taking up a
fairly sizeable percentage of time during light profiling.

Ref #396
BurntSushi added a commit that referenced this issue Nov 10, 2023
This copies the allocator configuration used in the Ruff project. In
particular, this gives us an instant 10% win when resolving the top 1K
PyPI packages:

    $ hyperfine \
        "./target/profiling/puffin-dev-main resolve-many --cache-dir cache-docker-no-build --no-build pypi_top_8k_flat.txt --limit 1000 2> /dev/null" \
        "./target/profiling/puffin-dev resolve-many --cache-dir cache-docker-no-build --no-build pypi_top_8k_flat.txt --limit 1000 2> /dev/null"
    Benchmark 1: ./target/profiling/puffin-dev-main resolve-many --cache-dir cache-docker-no-build --no-build pypi_top_8k_flat.txt --limit 1000 2> /dev/null
      Time (mean ± σ):     974.2 ms ±  26.4 ms    [User: 17503.3 ms, System: 2205.3 ms]
      Range (min … max):   943.5 ms … 1015.9 ms    10 runs

    Benchmark 2: ./target/profiling/puffin-dev resolve-many --cache-dir cache-docker-no-build --no-build pypi_top_8k_flat.txt --limit 1000 2> /dev/null
      Time (mean ± σ):     883.1 ms ±  23.3 ms    [User: 14626.1 ms, System: 2542.2 ms]
      Range (min … max):   849.5 ms … 916.9 ms    10 runs

    Summary
      './target/profiling/puffin-dev resolve-many --cache-dir cache-docker-no-build --no-build pypi_top_8k_flat.txt --limit 1000 2> /dev/null' ran
        1.10 ± 0.04 times faster than './target/profiling/puffin-dev-main resolve-many --cache-dir cache-docker-no-build --no-build pypi_top_8k_flat.txt --limit 1000 2> /dev/null'

I was moved to do this because I noticed `malloc`/`free` taking up a
fairly sizeable percentage of time during light profiling.

As is becoming a pattern, it will be easier to review this
commit-by-commit.

Ref #396 (wouldn't call this issue fixed)

-----

I did also try adding a `smallvec` optimization to the
`Version::release` field, but it didn't bear any fruit. I still think
there is more to explore since the results I observed don't quite line
up with what I expect. (So probably either my mental model is off or my
measurement process is flawed.) You can see that attempt with a little
more explanation here:
f9528b4

In the course of adding the `smallvec` optimization, I also shrunk the
`Version` fields from `usize` to `u32`. They should at least be
fixed-size integers, since version numbers aren't used to index memory,
and I shrunk them to `u32` since it seems reasonable to assume that all
version numbers will be smaller than 2^32.
@BurntSushi
Member Author

#399 gets a ~10% perf win by switching the global allocator. I did attempt a smallvec optimization experiment (as mentioned above) but couldn't observe any measurable win. I'm not 100% convinced that smallvec has no role to play and I'm sure there is more that can still be done here.
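
For reference, the allocator swap itself is only a few lines. A minimal sketch, assuming the `mimalloc` crate (the actual change copied Ruff's configuration, which may differ in detail):

    use mimalloc::MiMalloc;

    // Route every allocation in the binary through mimalloc instead of
    // the system allocator.
    #[global_allocator]
    static GLOBAL: MiMalloc = MiMalloc;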

@charliermarsh
Member

@BurntSushi - Should we close this when merging #789 or is there more that's worth tracking here?

@BurntSushi
Member Author

Ah yeah that's a good point. Yeah, I don't think this specific issue still exists once #789 gets merged. (Or even before it to be honest. I think gratuitous allocation has been removed from the happy path for the most part already.)
