Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: improve proxy handling #5893

Closed
wants to merge 1 commit into from

Conversation

omermizr
Copy link

  • do not rebuild proxies if they are configured
  • do not remove Proxy-Authorization on initial request
  • internal refactoring

This is my attempt to resolve #5891 as well as #5888.

@omermizr
Copy link
Author

Note: there is even more worthwhile performance tuning to be done here, most notably - getproxies_environment is called multiple times in this flow. Caching it might also be a good idea, but I don't know this repo too well, so please let me know if that's something we want or not.
In any case, I didn't want to expand the scope of this PR too much so I did minimal changes :)

@@ -251,13 +251,25 @@ def resolve_redirects(self, resp, req, stream=False, timeout=None,
url = self.get_redirect_target(resp)
yield resp

def rebuild_auth(self, prepared_request, response):
def rebuild_auth(self, prepared_request, response, proxies):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a functionally public API. We can't change the signature like this

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh my bad, I thought this was internal. I'll look into for an alternative approach

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can just extract this change to a different (private?) function and call it before rebuild_auth. WDYT?

@@ -633,7 +634,8 @@ def send(self, request, **kwargs):
kwargs.setdefault('stream', self.stream)
kwargs.setdefault('verify', self.verify)
kwargs.setdefault('cert', self.cert)
kwargs.setdefault('proxies', self.rebuild_proxies(request, self.proxies))
if 'proxies' not in kwargs:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know this is correct. You've added no tests. Why do you feel this change is necessary?

Copy link
Contributor

@dbaxa dbaxa Jul 29, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sigmavirus24 this change is needed because without it we are calling to re-build proxies from the environment even when proxies have been set/provided.

In one use test case I have setup - the time to "retrieve" a cached connection (using the cache control adapter) 10,000 times goes from 9.741 seconds and 25212240 function calls to 9902240 function calls & 6.169 seconds after applying this change.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sigmavirus24 the change in this line is twofold:

  1. Previously, if proxies was already set on kwargs, the rebuild_proxies method was called, but the kwargs were not updated (due to setdefault). By adding this check we're avoiding unnecessary calls.
  2. rebuild_proxies was removing the Proxy-Authorization header, and this behavior was files as a regression here: Manually set Proxy-Authorization header is always removed in 2.26 #5888. It no longer does that.

I intend to add tests, but as per the contribution guidelines I thought I'd get feedback as soon as I can to avoid unnecessary work.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for starting on this @omermizr. I think the logic on point 1 is sound, we shouldn't have passed rebuild_proxies() as an argument in setdefault to begin with, regardless of #5888.

As for point 2, I'm hesitant to remove the stripping behavior directly from rebuild_proxies. Given how often people subclass Session, someone's developed something this breaks. So, as an alternative I think we should think about breaking this into two separate functions. One that performs our proxy building and one that handles our proxy auth.

At a very high level, rebuild_proxies and rebuild_auth need to functionally stay the same. I'm thinking the logic for rebuild_proxies without auth changes can get moved to a new build_proxies method. The Proxy-Authorization logic is already decoupled from the proxy workflow, so that can get moved into a rebuild_proxy_auth method. For this first pass, we should leave the logic untouched unless there's something mandatory.

At that point we'd rewire things to effectively be:

def rebuild_proxies(self, prepared_request, proxies):
    new_proxies = self.build_proxies(prepared_request, proxies)
    self.rebuild_proxy_auth(prepared_request, new_proxies)
    return new_proxies

...

def send(self, request, **kwargs):
    ...
    if 'proxies' not in kwargs:
        kwargs['proxies'] = self.build_proxies(request, self.proxies)

@sigmavirus24 please disagree with me here if you think I'm on the wrong track. We may also want to consider making these utility functions, or private on the Session. I don't know how I feel about adding new interface surface. If this is a non-starter, we may want to consider simply reverting #5681 for now and going back to the drawing board.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In order to address the performance regression, it'd be enough to change the setdefault. I didn't want to do that because it would've introduced inconsistency with how the proxy auth header is handled (stripping the header would depend on whether 'proxies' is passed to the function or not).

The way I see it we have a few options:

  1. Only change the setdefault and document the inconsistency or create a new issue to address it later.
  2. Break the code up into separate functions like @nateprewitt suggested here.
  3. Add a boolean parameter with a default argument to the rebuild_proxies function. That way we'll have better control over the proxy auth header handling, and we won't break the API.

WDYT?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't want to do that because it would've introduced inconsistency with how the proxy auth header is handled (stripping the header would depend on whether 'proxies' is passed to the function or not).

Yep, agreed, I don't think we want to have this inconsistency in the API as it's going to surprise most users, and is likely the wrong behavior when proxies aren't supplied. I think that eliminates option 1. The use of rebuild_proxies where it is now is frankly just wrong to begin with.

For option 2, I've thought some more, and am leaning towards having the functionality potentially moved out to utils, so we're not adding new methods to Session. We may only need to move the "build" logic out and can leave the auth portion inline for rebuild_proxies. If we don't do that we would need to effectively duplicate the first part of the function elsewhere which would lead to further drift as each function evolves. Session already has a number of issues with not behaving like the standard requests flow, so whatever change we make should minimize that as much as possible.

I don't know if option 3 works because we're changing the signature on a public API which is what we wanted to avoid. Anyone who's already extended this function, and is not using keyword arguments, will likely see unexpected breakage from that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've posted #5924 as a talking point for what a potential solution for #5888 may look like. Please let me know if you have any thoughts.

As for #5891, the behavior of checking the environment for proxy settings before resolving for send was the primary purpose of #5681. That's a fairly expensive process though when there's a significant number of variables, which was an unintended side effect. The new behavior is correcting what I'd deem a bug but at the cost of performance. I'm not sure I see a straight forward way to mitigate that and maintain backwards compatibility yet.

For the time being, setting proxies on the Session or disabling trust_env are the two escape hatches for the performance impact. This is something we would likely want to consider revisiting a breakage in a new major version. I'm happy to review the performance idea here but we'll need significantly more testing to ensure we're not breaking functionality.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reviewed your PR, basically LGTM.
I think you actually handled the most important performance regression (at least for my flow).

Now, while the escape hatches are ok, I don't think that's really a good way to move forward considering the default behavior is now less performant. Given that requests is such a ubiquitous library, making the defaults less performant will surely cause a lot of degradations for a lot of python libraries and a lot of services.
Having said that, I think the simple fix where we check if proxies is already set should be good enough for most use cases, and you already did that in your PR 😃

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that's really a good way to move forward considering the default behavior is now less performant. Given that requests is such a ubiquitous library, making the defaults less performant will surely cause a lot of degradations for a lot of python libraries and a lot of services.

Thanks for taking a look! Completely agree, Requests has a pretty broad and diverse user base. It's always a difficult balancing act of usability and correctness that works for everyone. Hopefully we've arrived at one of the better outcomes for this case 😄

@sigmavirus24
Copy link
Contributor

I don't believe this refactoring is what nate or I had in mind and it breaks a public API without adding additional testing

@dbaxa
Copy link
Contributor

dbaxa commented Jul 29, 2021

Note: there is even more worthwhile performance tuning to be done here, most notably - getproxies_environment is called multiple times in this flow. Caching it might also be a good idea, but I don't know this repo too well, so please let me know if that's something we want or not.
In any case, I didn't want to expand the scope of this PR too much so I did minimal changes :)

As a user of the library, I found similar & also noticed that setting trust_env to false can also help improve the situation (caching other trust_env related things would also likely be of use). (#5496)

dbaxa added a commit to dbaxa/requests that referenced this pull request Jul 29, 2021
…building proxies if proxies have been supplied.

Signed-off-by: David Black <[email protected]>
@omermizr
Copy link
Author

I don't believe this refactoring is what nate or I had in mind and it breaks a public API without adding additional testing

Like I said, I'll try to avoid breaking the public API (I obviously don't want to introduce breaking changes).
If we decide to proceed with this, I'll add tests (although I didn't add any functionality, so tests not breaking is the best indication that it's working as intended).
Is there any other approach you'd recommend before I start changing the PR? Or do you want me to abandon it so you guys can take it up internally?
I thought this could be released quickly because it's a major regression for a lot of use-cases, but I completely understand if you think otherwise.


proxy = environ_proxies.get(scheme, environ_proxies.get('all'))
environ_proxies = get_environ_proxies(url, no_proxy=no_proxy)
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes here were done mainly to avoid calling should_bypass_proxies if not necessary because it is more of a bottleneck than get_environ_proxies

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO this should be inside / under the self.trust_env code path.

dbaxa added a commit to dbaxa/requests that referenced this pull request Jul 29, 2021
…y not rebuilding proxies if proxies have been supplied."

This reverts commit 3c32ae8.
@junqfisica
Copy link

I would like to point out a solution for the proxy problem. I have been applying this solution every time I need to build an venv behind proxy.

I propose the following change at line:

proxies = proxies or {}

Replace it by:

proxies = proxies or self.proxies

This change will make this hierarchy kwargs -> Session args -> environment to be respected with no need to change anywhere else. Solving problems also with pip as discussed here

Also discussed here: #5735 (comment)

@nateprewitt
Copy link
Member

Resolving this in favor of the merged #5924. Thanks again for working through this with us @omermizr!

@omermizr
Copy link
Author

@nateprewitt Sure thing! Thanks for addressing this 😃

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Feb 26, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Session requests are ~70% slower in version 2.26.0
5 participants