fix: improve proxy handling #5893

Closed
wants to merge 1 commit into from
48 changes: 25 additions & 23 deletions requests/sessions.py
@@ -213,7 +213,7 @@ def resolve_redirects(self, resp, req, stream=False, timeout=None,

             # Rebuild auth and proxy information.
             proxies = self.rebuild_proxies(prepared_request, proxies)
-            self.rebuild_auth(prepared_request, resp)
+            self.rebuild_auth(prepared_request, resp, proxies)

             # A failed tell() sets `_body_position` to `object()`. This non-None
             # value ensures `rewindable` will be True, allowing us to raise an
@@ -251,13 +251,25 @@ def resolve_redirects(self, resp, req, stream=False, timeout=None,
             url = self.get_redirect_target(resp)
             yield resp

-    def rebuild_auth(self, prepared_request, response):
+    def rebuild_auth(self, prepared_request, response, proxies):
Contributor

This is a functionally public API. We can't change the signature like this

Author

Oh my bad, I thought this was internal. I'll look into an alternative approach.

Author

I can just extract this change into a separate (private?) function and call it before rebuild_auth. WDYT?
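
A hypothetical sketch of that idea (the helper name is mine, not something from this PR). It reuses exactly the logic this PR adds to rebuild_auth, assumes the helpers already imported in requests/sessions.py (urlparse, get_auth_from_url, _basic_auth_str), and would be called from resolve_redirects just before rebuild_auth:

def _rebuild_proxy_auth(self, prepared_request, proxies):
    # Re-derive Proxy-Authorization from the proxy URL that matches the
    # redirected request's scheme.
    headers = prepared_request.headers
    scheme = urlparse(prepared_request.url).scheme

    if 'Proxy-Authorization' in headers:
        del headers['Proxy-Authorization']

    try:
        username, password = get_auth_from_url(proxies[scheme])
    except KeyError:
        username, password = None, None

    if username and password:
        headers['Proxy-Authorization'] = _basic_auth_str(username, password)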

"""When being redirected we may want to strip authentication from the
request to avoid leaking credentials. This method intelligently removes
and reapplies authentication where possible to avoid credential loss.
"""
headers = prepared_request.headers
url = prepared_request.url
scheme = urlparse(url).scheme

if 'Proxy-Authorization' in headers:
del headers['Proxy-Authorization']

try:
username, password = get_auth_from_url(proxies[scheme])
except KeyError:
username, password = None, None

if username and password:
headers['Proxy-Authorization'] = _basic_auth_str(username, password)

if 'Authorization' in headers and self.should_strip_auth(response.request.url, url):
# If we get redirected to a new host, we should strip out any
@@ -283,32 +295,21 @@ def rebuild_proxies(self, prepared_request, proxies):
         :rtype: dict
         """
         proxies = proxies if proxies is not None else {}
-        headers = prepared_request.headers
+        new_proxies = proxies.copy()
         url = prepared_request.url
         scheme = urlparse(url).scheme
-        new_proxies = proxies.copy()
-        no_proxy = proxies.get('no_proxy')
+        if scheme in proxies:
+            return new_proxies

-        bypass_proxy = should_bypass_proxies(url, no_proxy=no_proxy)
-        if self.trust_env and not bypass_proxy:
-            environ_proxies = get_environ_proxies(url, no_proxy=no_proxy)
+        no_proxy = proxies.get('no_proxy')

-            proxy = environ_proxies.get(scheme, environ_proxies.get('all'))
+        environ_proxies = get_environ_proxies(url, no_proxy=no_proxy)
Author

The changes here were made mainly to avoid calling should_bypass_proxies when it isn't necessary, because it is more of a bottleneck than get_environ_proxies.
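
For reference, one quick way to sanity-check that relative-cost claim (illustrative only, not part of this PR) is to time the two helpers from requests.utils directly; absolute numbers will vary with the platform and with how many proxy-related environment variables are set:

import timeit

setup = (
    "from requests.utils import get_environ_proxies, should_bypass_proxies\n"
    "url = 'http://example.com/'"
)

# Time 10,000 calls of each helper in isolation.
print(timeit.timeit("should_bypass_proxies(url, no_proxy=None)", setup=setup, number=10000))
print(timeit.timeit("get_environ_proxies(url, no_proxy=None)", setup=setup, number=10000))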

Contributor

IMHO this should be inside / under the self.trust_env code path.

+        proxy = environ_proxies.get(scheme, environ_proxies.get('all'))
+        if proxy:
+            bypass_proxy = should_bypass_proxies(url, no_proxy=no_proxy)

-            if proxy:
+            if self.trust_env and not bypass_proxy:
                 new_proxies.setdefault(scheme, proxy)
-
-        if 'Proxy-Authorization' in headers:
-            del headers['Proxy-Authorization']
-
-        try:
-            username, password = get_auth_from_url(new_proxies[scheme])
-        except KeyError:
-            username, password = None, None
-
-        if username and password:
-            headers['Proxy-Authorization'] = _basic_auth_str(username, password)
-
         return new_proxies

     def rebuild_method(self, prepared_request, response):
@@ -633,7 +634,8 @@ def send(self, request, **kwargs):
         kwargs.setdefault('stream', self.stream)
         kwargs.setdefault('verify', self.verify)
         kwargs.setdefault('cert', self.cert)
-        kwargs.setdefault('proxies', self.rebuild_proxies(request, self.proxies))
+        if 'proxies' not in kwargs:
Contributor

I don't know that this is correct. You've added no tests. Why do you feel this change is necessary?

Contributor

@dbaxa dbaxa Jul 29, 2021

@sigmavirus24 this change is needed because without it we re-build proxies from the environment even when proxies have been set/provided.

In one test case I have set up, the time to "retrieve" a cached connection (using the cache control adapter) 10,000 times goes from 9.741 seconds and 25,212,240 function calls to 6.169 seconds and 9,902,240 function calls after applying this change.
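
A rough sketch of how that kind of measurement can be reproduced with cProfile (illustrative only: the URL is a placeholder and the loop exercises just the proxy-resolution path rather than the full cache-control setup described above):

import cProfile

import requests

session = requests.Session()
prepared = requests.Request('GET', 'http://example.com/').prepare()

def resolve_proxies_repeatedly(n=10000):
    # The per-request proxy resolution that send() triggers.
    for _ in range(n):
        session.rebuild_proxies(prepared, session.proxies)

# Prints total time and the number of function calls, the figures quoted above.
cProfile.run('resolve_proxies_repeatedly()', sort='cumulative')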

Author

@sigmavirus24 the change in this line is twofold:

  1. Previously, if proxies was already set in kwargs, rebuild_proxies was still called, but kwargs was not updated (due to setdefault). By adding this check we avoid unnecessary calls (toy example below).
  2. rebuild_proxies was removing the Proxy-Authorization header, and this behavior was filed as a regression here: Manually set Proxy-Authorization header is always removed in 2.26 #5888. It no longer does that.

I intend to add tests, but as per the contribution guidelines I thought I'd get feedback as soon as I can to avoid unnecessary work.
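
To make point 1 concrete (a toy example, not code from the PR): dict.setdefault always evaluates its second argument before the lookup, so the old call paid for rebuild_proxies() even when 'proxies' was already supplied, and then discarded the result.

def expensive_rebuild():
    # Stand-in for self.rebuild_proxies(request, self.proxies).
    print('rebuild_proxies called')
    return {'http': 'http://proxy.invalid:3128'}

kwargs = {'proxies': {}}  # caller already provided proxies

kwargs.setdefault('proxies', expensive_rebuild())  # still prints: the call happens anyway
if 'proxies' not in kwargs:                        # the guarded form skips the call entirely
    kwargs['proxies'] = expensive_rebuild()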

Member

Thanks for starting on this @omermizr. I think the logic on point 1 is sound; we shouldn't have passed rebuild_proxies() as an argument to setdefault to begin with, regardless of #5888.

As for point 2, I'm hesitant to remove the stripping behavior directly from rebuild_proxies. Given how often people subclass Session, someone has likely developed something this breaks. So, as an alternative, I think we should consider breaking this into two separate functions: one that performs our proxy building and one that handles our proxy auth.

At a very high level, rebuild_proxies and rebuild_auth need to functionally stay the same. I'm thinking the logic for rebuild_proxies without auth changes can get moved to a new build_proxies method. The Proxy-Authorization logic is already decoupled from the proxy workflow, so that can get moved into a rebuild_proxy_auth method. For this first pass, we should leave the logic untouched unless there's something mandatory.

At that point we'd rewire things to effectively be:

def rebuild_proxies(self, prepared_request, proxies):
    new_proxies = self.build_proxies(prepared_request, proxies)
    self.rebuild_proxy_auth(prepared_request, new_proxies)
    return new_proxies

...

def send(self, request, **kwargs):
    ...
    if 'proxies' not in kwargs:
        kwargs['proxies'] = self.build_proxies(request, self.proxies)

@sigmavirus24 please disagree with me here if you think I'm on the wrong track. We may also want to consider making these utility functions, or private methods on the Session. I don't know how I feel about adding new interface surface. If this is a non-starter, we may want to consider simply reverting #5681 for now and going back to the drawing board.

Author

In order to address the performance regression, it'd be enough to change the setdefault. I didn't want to do that because it would've introduced inconsistency with how the proxy auth header is handled (stripping the header would depend on whether 'proxies' is passed to the function or not).

The way I see it we have a few options:

  1. Only change the setdefault, and either document the inconsistency or create a new issue to address it later.
  2. Break the code up into separate functions like @nateprewitt suggested here.
  3. Add a boolean parameter with a default value to the rebuild_proxies function (rough sketch below). That way we'll have better control over how the proxy auth header is handled, and we won't break the API.

WDYT?
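
A hypothetical sketch of option 3 (the flag and helper names are illustrative, not part of this PR): the new parameter defaults to today's behavior, so existing callers and overrides keep working, while internal call sites could opt out of the auth handling.

def rebuild_proxies(self, prepared_request, proxies, reset_proxy_auth=True):
    # Hypothetical split: _build_proxies holds the existing proxy-building
    # logic, _reset_proxy_auth the existing Proxy-Authorization handling.
    new_proxies = self._build_proxies(prepared_request, proxies)
    if reset_proxy_auth:
        self._reset_proxy_auth(prepared_request, new_proxies)
    return new_proxies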

Member

> I didn't want to do that because it would've introduced inconsistency with how the proxy auth header is handled (stripping the header would depend on whether 'proxies' is passed to the function or not).

Yep, agreed. I don't think we want this inconsistency in the API, as it's going to surprise most users and is likely the wrong behavior when proxies aren't supplied. I think that eliminates option 1. The use of rebuild_proxies where it is now is frankly just wrong to begin with.

For option 2, I've thought some more, and am leaning towards having the functionality potentially moved out to utils, so we're not adding new methods to Session. We may only need to move the "build" logic out and can leave the auth portion inline for rebuild_proxies. If we don't do that we would need to effectively duplicate the first part of the function elsewhere which would lead to further drift as each function evolves. Session already has a number of issues with not behaving like the standard requests flow, so whatever change we make should minimize that as much as possible.

I don't know if option 3 works because we're changing the signature on a public API which is what we wanted to avoid. Anyone who's already extended this function, and is not using keyword arguments, will likely see unexpected breakage from that.

Member

I've posted #5924 as a talking point for what a potential solution for #5888 may look like. Please let me know if you have any thoughts.

As for #5891, checking the environment for proxy settings before resolving them for send was the primary purpose of #5681. That's a fairly expensive process, though, when there's a significant number of environment variables, which was an unintended side effect. The new behavior corrects what I'd deem a bug, but at the cost of performance. I'm not sure I see a straightforward way to mitigate that and maintain backwards compatibility yet.

For the time being, setting proxies on the Session or disabling trust_env are the two escape hatches for the performance impact. This is something we would likely want to revisit as a breaking change in a new major version. I'm happy to review the performance idea here, but we'll need significantly more testing to ensure we're not breaking functionality.
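
A minimal sketch of those two escape hatches (the proxy URL is a placeholder; which one is appropriate depends on whether you actually need a proxy):

import requests

session = requests.Session()

# Escape hatch 1: set proxies explicitly on the Session.
session.proxies = {'http': 'http://proxy.invalid:3128',
                   'https': 'http://proxy.invalid:3128'}

# Escape hatch 2: stop consulting environment variables altogether.
session.trust_env = False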

Author

I reviewed your PR; basically LGTM.
I think you actually handled the most important performance regression (at least for my flow).

Now, while the escape hatches are OK, I don't think that's really a good way to move forward, considering the default behavior is now less performant. Given that requests is such a ubiquitous library, making the defaults less performant will surely cause degradation for a lot of Python libraries and services.
Having said that, I think the simple fix where we check whether proxies is already set should be good enough for most use cases, and you already did that in your PR 😃

Member

> I don't think that's really a good way to move forward, considering the default behavior is now less performant. Given that requests is such a ubiquitous library, making the defaults less performant will surely cause degradation for a lot of Python libraries and services.

Thanks for taking a look! Completely agree; Requests has a pretty broad and diverse user base, and it's always a difficult balancing act to find the mix of usability and correctness that works for everyone. Hopefully we've arrived at one of the better outcomes for this case 😄

+            kwargs['proxies'] = self.rebuild_proxies(request, self.proxies)

         # It's possible that users might accidentally send a Request object.
         # Guard against that specific failure case.