Populate find_all_candidates cache from threadpool #10480

Open
jbylund wants to merge 5 commits into main

jbylund
Contributor

@jbylund jbylund commented Sep 16, 2021

Fetching pages from PyPI to determine which versions are available is the rate-limiting step of package collection. There's a bit of a tradeoff here in that by pre-populating the find_all_candidates cache in full before doing conflict resolution there's a chance that more work is done since all pages will be fetched even if there is a conflict between the first two packages. I think this may still make sense, though, as the wall-clock time of collecting packages decreases significantly. It's also nice that the order in which packages are processed is unchanged and that part still effectively takes place in series.

Time spent on package collection decreases from ~40s to ~10s on the sample case from #10467.
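A minimal sketch of the idea, not the code in this PR. `fetch_candidates` is a hypothetical stand-in for the cached find_all_candidates, the sleep simulates fetching an index page, and the standard library's `ThreadPoolExecutor` is used purely for illustration. Warming the cache concurrently lets the later serial pass keep its order while only hitting the cache.

```python
import functools
import time
from concurrent.futures import ThreadPoolExecutor

FETCH_LOG = []  # records every simulated network fetch


@functools.lru_cache(maxsize=None)
def fetch_candidates(project_name: str) -> tuple:
    """Hypothetical stand-in for the cached find_all_candidates."""
    time.sleep(0.05)  # simulated index page round trip
    FETCH_LOG.append(project_name)
    return (f"{project_name}-1.0", f"{project_name}-2.0")


def warm_cache(project_names, max_workers=8):
    """Fetch every project's candidates concurrently to fill the cache."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list() consumes the results so that any exception raised in a
        # worker surfaces here; the values themselves are discarded.
        list(pool.map(fetch_candidates, project_names))


projects = ["alpha", "beta", "gamma", "delta"]
warm_cache(projects)

# The serial pass keeps its original order and now only hits the cache.
resolved = [fetch_candidates(name) for name in projects]
```

The tradeoff from the description is visible here: `warm_cache` fetches all four projects even if the serial pass would have stopped after the first two.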

@jbylund jbylund force-pushed the joe/warm_cache_in_threadpool branch from 2d57052 to ea48612 Compare September 16, 2021 17:39
@uranusjr
Member

uranusjr commented Sep 20, 2021

The general idea looks good to me, but iirc we can’t use multiprocessing for some reason (some platforms don’t support threads? I don’t remember). Summoning @McSinyx who recently dealt with parallelisation bug reports.

@McSinyx
Contributor

McSinyx commented Sep 21, 2021

@McSinyx emerges from the ground.

Yup it's not portable because some exotic platform does not have proper semaphore support. There's utils.parallel wrapping imap_unordered though and I think it should be safe to use that.

@jbylund jbylund force-pushed the joe/warm_cache_in_threadpool branch from 98d4cbf to 24f76da Compare September 21, 2021 11:32
Member

@uranusjr uranusjr left a comment

While this looks like it should work, the implementation has some code smells that I feel should be improved. This includes the (somewhat weird) pass blocks, accessing private attribute on factory, and relying on find_all_candidates being LRU-cached. Refactoring is needed.

news/10480.feature.rst (outdated, resolved)
@jbylund
Contributor Author

jbylund commented Sep 21, 2021

Put the finder into a public attribute to avoid accessing a private attribute.

What do you see as the options for not relying on the caching behavior of find_all_candidates?

Re: the pass blocks, I think we need to consume the imap iterable since it's lazily generated? But there's nothing to be done with the result of the function.
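On consuming the iterable: the built-in `map` is certainly lazy, so nothing runs until it is iterated. One way to drain it without an empty loop body is `collections.deque(..., maxlen=0)`, the `consume` recipe from the itertools documentation. `touch` below is a hypothetical placeholder, and whether the wrapper in utils.parallel is equally lazy is an assumption here.

```python
from collections import deque

calls = []


def touch(name: str) -> None:
    """Hypothetical side-effect-only function (e.g. one that fills a cache)."""
    calls.append(name)


lazy = map(touch, ["alpha", "beta"])
calls_before_draining = list(calls)  # still empty: map() has not run anything

# maxlen=0 makes deque pull every item and keep none of them.
deque(lazy, maxlen=0)
calls_after_draining = list(calls)
```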

@uranusjr
Member

One simple solution would be to implement a cache layer on the factory (e.g. a Factory.find_all_candidates() wrapper). If parallelisation is available, the wrapper would populate the cache with imap on the first invocation; if not, it’d simply pass the call on to the finder. This would also resolve the private attribute issue.
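One possible reading of this suggestion, as a sketch under stated assumptions. `CandidateCache`, `prime`, and `fake_finder` are hypothetical names, not pip's API. The explicit `prime` step is an assumption too, since populating the cache in parallel needs the list of project names up front.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, List, Optional


class CandidateCache:
    """Hypothetical resolver-side cache in front of a finder callable."""

    def __init__(
        self,
        find_all: Callable[[str], List[str]],
        max_workers: Optional[int] = 4,
    ) -> None:
        self._find_all = find_all
        self._max_workers = max_workers  # None means "no parallelism available"
        self._cache: Dict[str, List[str]] = {}

    def prime(self, project_names: Iterable[str]) -> None:
        """Populate the cache up front, concurrently when possible."""
        # dict.fromkeys() removes duplicates while preserving order.
        missing = [n for n in dict.fromkeys(project_names) if n not in self._cache]
        if self._max_workers is None:
            results = map(self._find_all, missing)
        else:
            with ThreadPoolExecutor(max_workers=self._max_workers) as pool:
                results = list(pool.map(self._find_all, missing))
        self._cache.update(zip(missing, results))

    def find_all_candidates(self, project_name: str) -> List[str]:
        """Return cached candidates, falling through to the finder on a miss."""
        if project_name not in self._cache:
            self._cache[project_name] = self._find_all(project_name)
        return self._cache[project_name]


fetched = []


def fake_finder(name: str) -> List[str]:
    """Hypothetical finder call that records each real lookup."""
    fetched.append(name)
    return [f"{name}-1.0"]


cache = CandidateCache(fake_finder)
cache.prime(["alpha", "beta", "alpha"])
cache.find_all_candidates("alpha")  # served from the cache
cache.find_all_candidates("gamma")  # miss: passed through to the finder
```

With this shape the resolver never needs to know whether the finder caches anything itself.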

@jbylund jbylund marked this pull request as draft September 22, 2021 13:36
@jbylund jbylund force-pushed the joe/warm_cache_in_threadpool branch from 3b5ac13 to 8ee1589 Compare September 22, 2021 18:28
@jbylund
Contributor Author

jbylund commented Sep 29, 2021

One simple solution would be to implement a cache layer on the factory (e.g. a Factory.find_all_candidates() wrapper). If parallelisation is available, the wrapper would populate the cache with imap on the first invocation; if not, it’d simply pass the call on to the finder. This would also resolve the private attribute issue.

I'm sorry, but I still don't think I understand what you're targeting. I think in order for this to be of use we need to parallelize when we have the list of projects available rather than at the point when find_all_candidates is called on a per project basis.

This approach is only possible because we're exploiting the fact that find_all_candidates is:

  1. the bottleneck
  2. cached
  3. embarrassingly parallelizable (well close enough)

I figured that since pip owns the implementation of find_all_candidates it would be ok to rely on find_all_candidates being cached?

I'd appreciate it if you could give this another quick look-over and let me know in which direction you'd like to see it go. Thanks.

@jbylund jbylund marked this pull request as ready for review October 4, 2021 15:36
@jbylund
Contributor Author

jbylund commented Oct 4, 2021

Not ready for merge, but I'm not sure whether draft PRs are hidden from the review queue.

@jbylund jbylund requested a review from uranusjr October 7, 2021 11:20
@uranusjr
Member

uranusjr commented Oct 8, 2021

What I'm trying to say is we should implement a separate cache layer in the resolver, instead of relying on the cache layer in the finder. Like how we're doing a separate caching layer for Requirement objects instead of relying on packaging's caching (which it does not have, but that's the point—pip doesn't need to know whether packaging has a caching layer; the resolver does not need to know about the finder's cache layer either).

@github-actions github-actions bot added the needs rebase or merge PR has conflicts with current master label Oct 9, 2021
@jbylund
Contributor Author

jbylund commented Oct 11, 2021

I think the way in which this is different is that the resolver never calls find_all_candidates except via find_best_candidate. So unless the package finder's find_best_candidate is implemented using find_all_candidates, there's no benefit to be had here?

@uranusjr
Member

Hmm, good point. Alright, let's do this then. We'll first need to resolve the conflicts, and could you investigate how difficult it would be to add a test for the caching behaviour? (e.g. mocking out some internals of find_all_candidates and ensure they are called only once)
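A hedged sketch of what such a test could look like. `FakeFinder` and `_fetch_index_page` are hypothetical stand-ins, not pip's real finder or its internals; the point is only the pattern of wrapping an internal with `unittest.mock` and asserting that it is called once across repeated lookups.

```python
import functools
from unittest import mock


class FakeFinder:
    """Hypothetical finder with a cached public method and an uncached internal."""

    def _fetch_index_page(self, project_name: str) -> list:
        return [f"{project_name}-1.0"]

    @functools.lru_cache(maxsize=None)
    def find_all_candidates(self, project_name: str) -> list:
        return self._fetch_index_page(project_name)


def check_fetches_once() -> int:
    finder = FakeFinder()
    # wraps= keeps the real behaviour while recording every call.
    with mock.patch.object(
        finder, "_fetch_index_page", wraps=finder._fetch_index_page
    ) as fetch:
        finder.find_all_candidates("alpha")
        finder.find_all_candidates("alpha")  # second call should hit the cache
        fetch.assert_called_once_with("alpha")
        return fetch.call_count


call_count = check_fetches_once()
```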

@jbylund jbylund force-pushed the joe/warm_cache_in_threadpool branch from 8ee1589 to f5a70cc Compare October 12, 2021 20:54
@pypa-bot pypa-bot removed the needs rebase or merge PR has conflicts with current master label Oct 12, 2021
@jbylund
Contributor Author

jbylund commented Oct 12, 2021

How about adding a test to the finder which demonstrates that calling find_best_candidate after calling find_all_candidates results in a cache hit?

@uranusjr
Member

uranusjr commented Oct 12, 2021

How about adding a test to the finder which demonstrates that calling find_best_candidate after calling find_all_candidates results in a cache hit?

How difficult would it be to initiate the call from the resolver? Because there's no real guarantee the resolver will always call find_best_candidate (we might miss it in a refactoring or something), and what we really want is for the resolver to fetch each package list exactly once in its lifetime, not the finder.

@jbylund
Contributor Author

jbylund commented Oct 12, 2021

How about adding a test to the finder which demonstrates that calling find_best_candidate after calling find_all_candidates results in a cache hit?

How difficult would it be to initiate the call from the resolver? Because there's no real guarantee the resolver will always call find_best_candidate (we might miss it in a refactoring or something), and what we really want is for the resolver to fetch each package list exactly once in its lifetime, not the finder.

Do you mean for the tests, or do you want the prime_cache method to move?

@uranusjr
Member

Sorry, I meant the test, responding to your comment before that.

Comment on lines 254 to 255
def test_resolver_cache_population(resolver: Resolver) -> None:
    resolver._finder.find_all_candidates.cache_clear()
Member

I feel we should add this to the resolver fixture to keep tests deterministic. Also the cleanup should probably happen after the test to not leave things behind.

@jbylund jbylund force-pushed the joe/warm_cache_in_threadpool branch from 366b5d1 to b34933d Compare October 21, 2021 19:02
@jbylund
Contributor Author

jbylund commented Oct 21, 2021

Looks like this is the failure:


2021-10-21T19:03:24.7123305Z nox > Running session docs
2021-10-21T19:03:24.7469283Z nox > Creating virtual environment (virtualenv) using python in .nox/docs
2021-10-21T19:03:25.8502672Z nox > python -m pip install -e .
2021-10-21T19:03:36.2667130Z nox > python -m pip install -r docs/requirements.txt
2021-10-21T19:03:55.7121876Z nox > sphinx-build -W -c docs/html -d docs/build/doctrees/html -b html docs/html docs/build/html
2021-10-21T19:03:56.1883325Z Traceback (most recent call last):
2021-10-21T19:03:56.1886244Z   File "/home/runner/work/pip/pip/.nox/docs/bin/sphinx-build", line 5, in <module>
2021-10-21T19:03:56.1887811Z     from sphinx.cmd.build import main
2021-10-21T19:03:56.1889421Z   File "/home/runner/work/pip/pip/.nox/docs/lib/python3.10/site-packages/sphinx/cmd/build.py", line 25, in <module>
2021-10-21T19:03:56.1890572Z     from sphinx.application import Sphinx
2021-10-21T19:03:56.1892042Z   File "/home/runner/work/pip/pip/.nox/docs/lib/python3.10/site-packages/sphinx/application.py", line 32, in <module>
2021-10-21T19:03:56.1893419Z     from sphinx.config import Config
2021-10-21T19:03:56.1896294Z   File "/home/runner/work/pip/pip/.nox/docs/lib/python3.10/site-packages/sphinx/config.py", line 21, in <module>
2021-10-21T19:03:56.1897565Z     from sphinx.util import logging
2021-10-21T19:03:56.1898933Z   File "/home/runner/work/pip/pip/.nox/docs/lib/python3.10/site-packages/sphinx/util/__init__.py", line 41, in <module>
2021-10-21T19:03:56.1900080Z     from sphinx.util.typing import PathMatcher
2021-10-21T19:03:56.1901526Z   File "/home/runner/work/pip/pip/.nox/docs/lib/python3.10/site-packages/sphinx/util/typing.py", line 37, in <module>
2021-10-21T19:03:56.1902575Z     from types import Union as types_Union
2021-10-21T19:03:56.1903878Z ImportError: cannot import name 'Union' from 'types' (/opt/hostedtoolcache/Python/3.10.0/x64/lib/python3.10/types.py)
2021-10-21T19:03:56.2096739Z nox > Command sphinx-build -W -c docs/html -d docs/build/doctrees/html -b html docs/html docs/build/html failed with exit code 1
2021-10-21T19:03:56.2098058Z nox > Session docs failed.

@uranusjr
Member

Looks like a Sphinx bug, let's not worry about that here. sphinx-doc/sphinx#9512

@bluenote10

bluenote10 commented Jan 20, 2022

Is there any chance of this getting merged / included in a release soon? Would be highly appreciated 😉 (while I was just waiting another 3 minutes of pip just re-collecting wheels that were all already in the cache...)

Member

@pradyunsg pradyunsg left a comment

We probably want to double check that our networking stack is thread safe.

@sethmlarson
Contributor

@pradyunsg urllib3.PoolManager is only unsafe when the number of distinct origins is more than the number of urllib3.ConnectionPools allowed within the PoolManager. Setting num_pools (or pool_connections on requests.Session) to be greater than the number of origins would avoid the problems being referenced.

Going to ping @nateprewitt to confirm that the thread-safety properties of requests.Session are the same as urllib3.PoolManager.

We have an in-progress PR which defers the closing of connections at the HTTPConnectionPool level until the connection pool is no longer referenced instead of during PoolManager eviction which would solve this issue.
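To illustrate why having more origins than pools is the problem case, here is a single-threaded toy model. `ToyPool` and `ToyPoolManager` are hypothetical and are not urllib3's implementation. The sketch assumes, as described above, that a pool evicted from the manager is closed immediately, even if a caller still holds a reference to it.

```python
from collections import OrderedDict


class ToyPool:
    """Toy connection pool: only tracks whether it has been closed."""

    def __init__(self, origin: str) -> None:
        self.origin = origin
        self.closed = False

    def close(self) -> None:
        self.closed = True


class ToyPoolManager:
    """Toy manager keeping at most num_pools pools, evicting the least recent."""

    def __init__(self, num_pools: int) -> None:
        self.num_pools = num_pools
        self._pools: "OrderedDict[str, ToyPool]" = OrderedDict()

    def pool_for(self, origin: str) -> ToyPool:
        if origin in self._pools:
            self._pools.move_to_end(origin)  # mark as most recently used
            return self._pools[origin]
        pool = self._pools[origin] = ToyPool(origin)
        if len(self._pools) > self.num_pools:
            _, evicted = self._pools.popitem(last=False)
            evicted.close()  # closed on eviction, even if a caller still holds it
        return pool


small = ToyPoolManager(num_pools=2)
held = small.pool_for("https://a.example")  # imagine another thread is using this
small.pool_for("https://b.example")
small.pool_for("https://c.example")  # third origin evicts and closes the first
closed_while_held = held.closed

roomy = ToyPoolManager(num_pools=4)  # more pools than origins: nothing is evicted
kept = roomy.pool_for("https://a.example")
roomy.pool_for("https://b.example")
roomy.pool_for("https://c.example")
still_open = not kept.closed
```

In the small manager the held pool is closed underneath its user; in the roomy one it never is, which is the effect of raising num_pools above the number of origins.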

@jbylund
Contributor Author

jbylund commented Jul 15, 2022

We have an in-progress PR which defers the closing of connections at the HTTPConnectionPool level until the connection pool is no longer referenced instead of during PoolManager eviction which would solve this issue.

Amazing, thank you for the update/explanation.

@nateprewitt
Member

Going to ping @nateprewitt to confirm that the thread-safety properties of requests.Session are the same as urllib3.PoolManager.

@sethmlarson yeah, the pool manager should be Requests' only contention point for send in a threaded context. Obviously mutating Session-level settings like adapters, proxies, etc. won't be thread safe, but I don't believe pip is doing any of that.

@uranusjr
Member

So is this good to go now? Ping @pradyunsg in case there are still concerns.

@uranusjr uranusjr requested a review from pradyunsg November 10, 2022 23:09
@jbylund
Contributor Author

jbylund commented Nov 10, 2022

I think the options are:

  1. wait until urllib3 releases 2.0.0 (https://github.com/urllib3/urllib3/milestone/6), then update the vendored urllib3, at which point this should be good to go.
  2. increase the connection pool size so that we feel the race condition is unlikely to be hit

I think 2 is more invasive, less safe, and will eventually be unnecessary. So I am in favor of waiting it out (as an outsider it seems pretty close). But if you feel strongly that we should go with 2 let me know.

@uranusjr
Member

From the look of things it seems a 2.0.0 release is relatively close (please correct me otherwise), so waiting for that sounds like a better choice.

@sethmlarson
Contributor

urllib3 2.0.1 is available! https://github.com/urllib3/urllib3/releases/tag/2.0.1

@jbylund jbylund force-pushed the joe/warm_cache_in_threadpool branch 2 times, most recently from 10ca6a9 to f1922cc Compare May 3, 2023 22:56
@uranusjr
Member

Requests released 2.30.0 with urllib3 2.x support last week. I’ll do the vendor update to unblock this.

@jbylund
Contributor Author

jbylund commented Jul 10, 2023

I think that now that the vendored urllib3 has been updated to 1.26.16, which contains a backport of a fix for a thread-safety issue, this is unblocked?

@pfmoore
Member

pfmoore commented Jul 10, 2023

Just to note, this will have to wait until after 23.2 is released - I don't want something this significant added to 23.2 at the last minute.

I'm assuming that someone still needs to verify that the urllib3 fix does actually fix the problem that affected this PR before it can be merged, anyway?

@jbylund
Contributor Author

jbylund commented Jul 10, 2023

Just to note, this will have to wait until after 23.2 is released - I don't want something this significant added to 23.2 at the last minute.

Absolutely, just wanted to try to push this out of limbo state.

I'm assuming that someone still needs to verify that the urllib3 fix does actually fix the problem that affected this PR before it can be merged, anyway?

I don't think there was ever an observed issue with this PR. We could try to produce one with a urllib3 older than 1.26.16 and then verify that it doesn't happen with the updated version. The difficulty would be that:

  1. we only expect the possibility of some sort of thread unsafety if there are many different origins - which I don't think comes up in the use of pip very frequently
  2. it could still be very difficult to reproduce since it would depend on the network behavior

I realize that's a pretty unsatisfactory answer, so if anyone has any good ideas for other testing they'd like to see done (or even better another test that could be added) let me know.

@pfmoore
Member

pfmoore commented Jul 10, 2023

No worries, I just didn't want my comment on the release schedule to imply I had much of a clue about the status of this PR :-) I'm happy to leave any further review to @uranusjr and/or @pradyunsg, who have been following this more closely than I have.

@jbylund jbylund force-pushed the joe/warm_cache_in_threadpool branch from 4e31045 to 6ddecdf Compare July 25, 2023 13:51
@jbylund
Contributor Author

jbylund commented Sep 6, 2023

Could this go into 23.3? Are there any other updates/checks needed?

@edmorley
Contributor

edmorley commented Sep 6, 2023

In the PR description it mentions there's a potential trade-off, albeit at the time this was still worth doing:

There's a bit of a tradeoff here in that by pre-populating the find_all_candidates cache in full before doing conflict resolution there's a chance that more work is done since all pages will be fetched even if there is a conflict between the first two packages.

Is it possible that the performance win will be smaller now that (a) .metadata files are supported as of pip 23.2, and (b) there are about to be further optimisations to the index page/metadata handling (e.g. #12256 and #12257)?

If so, would it be worth waiting until #12256 and #12257 land, and then re-benchmarking with those + --use-feature=metadata-cache enabled?

@cosmicexplorer
Contributor

I am picking back up #12256 and #12257 (don't forget #12258, we can go deeper), so please feel free to review any of those if you're interested in how they affect pip's performance characteristics. Thanks!

@ichard26 ichard26 added the state: blocked Can not be done until something else is done label Apr 20, 2024
Labels
bot:chronographer:provided state: blocked Can not be done until something else is done type: enhancement Improvements to functionality