Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incorrect behavior with schemeless-dotless host:port URLs #6455

Open
itamaro opened this issue May 13, 2023 · 11 comments
Open

Incorrect behavior with schemeless-dotless host:port URLs #6455

itamaro opened this issue May 13, 2023 · 11 comments

Comments

@itamaro
Copy link

itamaro commented May 13, 2023

URLs of the form hostname:8080 (with no scheme, with "hostname" not containing any dots) can be used to refer to the netloc "hostname:8080"

requests.utils.prepend_scheme_if_needed should correctly prepend the new_scheme when provided with such a URL.

Expected Result

the prepended-scheme URL should be "http://hostname:8080"

Actual Result

the prepended-scheme URL is "hostname:///8080" (e.g. treating the "hostname" part as the scheme, no host, no port, and "8080" as the path)

I extended the test_prepend_scheme_if_needed to demonstrate this behavior (see main...itamaro:requests:schemeless-hostname-anad-port-bug)

Reproduction Steps

from requests.utils import prepend_scheme_if_needed
print(prepend_scheme_if_needed("hostname:8080", "http"))

System Information

$ python -m requests.help
{
  "chardet": {
    "version": null
  },
  "charset_normalizer": {
    "version": "3.1.0"
  },
  "cryptography": {
    "version": ""
  },
  "idna": {
    "version": "3.4"
  },
  "implementation": {
    "name": "CPython",
    "version": "3.10.4"
  },
  "platform": {
    "release": "22.4.0",
    "system": "Darwin"
  },
  "pyOpenSSL": {
    "openssl_version": "",
    "version": null
  },
  "requests": {
    "version": "2.30.0"
  },
  "system_ssl": {
    "version": "101010ef"
  },
  "urllib3": {
    "version": "2.0.2"
  },
  "using_charset_normalizer": true,
  "using_pyopenssl": false
}
@sigmavirus24
Copy link
Contributor

How does this realistically affect users who aren't using utils which aren't part of the public API?

@itamaro
Copy link
Author

itamaro commented May 14, 2023

How does this realistically affect users who aren't using utils which aren't part of the public API?

ah sorry, I went down a rabbit hole tracking down the issue to this that I forgot to mention the user-facing scenario!

one scenario is when using such a URL as a proxy:

response = requests.get(
    "http://www.example.com",
    proxies={"http": "myproxy:8080"},
    ....
)

this used to work (with requests 2.25.1 and python 3.8), but with requests 2.27.1 it fails with

...
requests.exceptions.InvalidProxyURL: Please check proxy URL. It is malformed and could be missing the host.

another scenario (less critical) is difference in exceptions (MissingSchema vs InvalidSchema):

response = requests.get("www.example.com")
...
MissingSchema: Invalid URL 'www.example.com': No scheme supplied. Perhaps you meant http://www.example.com?

vs

response = requests.get("hostname:8080")
...
requests.exceptions.InvalidSchema: No connection adapters were found for 'hostname:8080'

@sigmavirus24
Copy link
Contributor

So the last two aren't ever supposed to work. Would it be nice if they raised the same exception? Sure. But never should we be guessing scheme. We never document that schemeless URLs are supported in that way.

The proxy case may take investigation. But I suspect we documented a breaking change there. I vaguely remember other people complaining

@itamaro
Copy link
Author

itamaro commented May 15, 2023

The proxy case may take investigation. But I suspect we documented a breaking change there. I vaguely remember other people complaining

Yes, it's definitely the proxy case that brought me here and is causing me issues. The other one is just a nit.

@nateprewitt
Copy link
Member

Hi @itamaro, this was an intentional change in 2.27.0 due to this bug in CPython. The behavior of urlparse fundamentally changed from Python 3.9 onwards which makes supporting schemeless URIs in this case fairly difficult. The decision was made to move to urllib3's url_parse function, which is the behavior you're seeing now to limit blast radius with the the standard library updates.

This is why you're seeing an error now that wasn't occurring previously.

Python 3.10

Python 3.10.5 (main, Jul  1 2022, 17:28:53) [Clang 13.0.0 (clang-1300.0.27.3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from urllib.parse import urlparse
>>> urlparse('hostname:8080')
ParseResult(scheme='hostname', netloc='', path='8080', params='', query='', fragment='')

Python 3.7

Python 3.7.9 (default, Aug 11 2022, 16:47:29) 
[Clang 13.0.0 (clang-1300.0.27.3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from urllib.parse import urlparse
>>> urlparse('hostname:8080')
ParseResult(scheme='', netloc='', path='hostname:8080', params='', query='', fragment='')
>>> 

The only reason this happened to work previously is we did this (arguably bad) shuffle of replacing the netloc with the path for this case. That honestly should have never been done but was there to work around the oddities of url parsing in the standard library.

So the short of the story is all version of Requests will be affected by this as soon as you upgrade beyond Python 3.8. We have no reliable way to control this from Requests side anymore. I would recommend updating your usage to ensure proxies are passed with a scheme to avoid any surprises later.

Requests 2.25.1 behavior on Python 3.10

>>> response = requests.get(
...     "http://www.example.com",
...     proxies={"http": "myproxy:8080"},
...)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/nateprewitt/.pyenv/versions/3.10.5/lib/python3.10/site-packages/requests/api.py", line 76, in get
    return request('get', url, params=params, **kwargs)
  File "/Users/nateprewitt/.pyenv/versions/3.10.5/lib/python3.10/site-packages/requests/api.py", line 61, in request
    return session.request(method=method, url=url, **kwargs)
  File "/Users/nateprewitt/.pyenv/versions/3.10.5/lib/python3.10/site-packages/requests/sessions.py", line 542, in request
    resp = self.send(prep, **send_kwargs)
  File "/Users/nateprewitt/.pyenv/versions/3.10.5/lib/python3.10/site-packages/requests/sessions.py", line 655, in send
    r = adapter.send(request, **kwargs)
  File "/Users/nateprewitt/.pyenv/versions/3.10.5/lib/python3.10/site-packages/requests/adapters.py", line 412, in send
    conn = self.get_connection(request.url, proxies)
  File "/Users/nateprewitt/.pyenv/versions/3.10.5/lib/python3.10/site-packages/requests/adapters.py", line 309, in get_connection
    proxy_manager = self.proxy_manager_for(proxy)
  File "/Users/nateprewitt/.pyenv/versions/3.10.5/lib/python3.10/site-packages/requests/adapters.py", line 193, in proxy_manager_for
    manager = self.proxy_manager[proxy] = proxy_from_url(
  File "/Users/nateprewitt/.pyenv/versions/3.10.5/lib/python3.10/site-packages/urllib3/poolmanager.py", line 492, in proxy_from_url
    return ProxyManager(proxy_url=url, **kw)
  File "/Users/nateprewitt/.pyenv/versions/3.10.5/lib/python3.10/site-packages/urllib3/poolmanager.py", line 429, in __init__
    raise ProxySchemeUnknown(proxy.scheme)
urllib3.exceptions.ProxySchemeUnknown: Not supported proxy scheme myproxy

@itamaro
Copy link
Author

itamaro commented May 16, 2023

thanks for the background @nateprewitt !

I would recommend updating your usage to ensure proxies are passed with a scheme to avoid any surprises later.

totally agree this is the preferred solution, but it's not going to be easy doing that in our monorepo with millions of lines of code... (e.g. the entire Meta Python codebase 😬)

@turingnixstyx
Copy link

Guys should I try solving this or is this redundant?

@nateprewitt
Copy link
Member

nateprewitt commented May 17, 2023

@turingnixstyx I don't think there's anything to be solved. Requests cannot support schemeless proxies in Python 3.9+. It's an unfortunately tedious change for end-users, but we can't do much about the behavior in the standard library. What we're doing now is providing a consistent behavior across all Python versions.

@turingnixstyx

This comment was marked as off-topic.

@nateprewitt

This comment was marked as off-topic.

@turingnixstyx

This comment was marked as off-topic.

potiuk added a commit to potiuk/airflow that referenced this issue Aug 10, 2023
The urlparse stdlib call behaves differently on Python 3.9+ for
schemaless-URLs. It parses string hostname into a schema instead
of netloc. The issue is still open and discussed

* psf/requests#6455

and apparently it will not be solved, so we need to workaround it
if we still want to support legacy, schemaless URLs to be accepted
by Elasticsearch handler.
potiuk added a commit to apache/airflow that referenced this issue Aug 10, 2023
The urlparse stdlib call behaves differently on Python 3.9+ for
schemaless-URLs. It parses string hostname into a schema instead
of netloc. The issue is still open and discussed

* psf/requests#6455

and apparently it will not be solved, so we need to workaround it
if we still want to support legacy, schemaless URLs to be accepted
by Elasticsearch handler.
ferruzzi pushed a commit to aws-mwaa/upstream-to-airflow that referenced this issue Aug 17, 2023
The urlparse stdlib call behaves differently on Python 3.9+ for
schemaless-URLs. It parses string hostname into a schema instead
of netloc. The issue is still open and discussed

* psf/requests#6455

and apparently it will not be solved, so we need to workaround it
if we still want to support legacy, schemaless URLs to be accepted
by Elasticsearch handler.
kosteev pushed a commit to GoogleCloudPlatform/composer-airflow that referenced this issue Jul 18, 2024
The urlparse stdlib call behaves differently on Python 3.9+ for
schemaless-URLs. It parses string hostname into a schema instead
of netloc. The issue is still open and discussed

* psf/requests#6455

and apparently it will not be solved, so we need to workaround it
if we still want to support legacy, schemaless URLs to be accepted
by Elasticsearch handler.

GitOrigin-RevId: dd73a0bffa6c4de93a2dd8dc4460b64aedc51255
kosteev pushed a commit to GoogleCloudPlatform/composer-airflow that referenced this issue Sep 19, 2024
The urlparse stdlib call behaves differently on Python 3.9+ for
schemaless-URLs. It parses string hostname into a schema instead
of netloc. The issue is still open and discussed

* psf/requests#6455

and apparently it will not be solved, so we need to workaround it
if we still want to support legacy, schemaless URLs to be accepted
by Elasticsearch handler.

GitOrigin-RevId: dd73a0bffa6c4de93a2dd8dc4460b64aedc51255
kosteev pushed a commit to GoogleCloudPlatform/composer-airflow that referenced this issue Nov 8, 2024
The urlparse stdlib call behaves differently on Python 3.9+ for
schemaless-URLs. It parses string hostname into a schema instead
of netloc. The issue is still open and discussed

* psf/requests#6455

and apparently it will not be solved, so we need to workaround it
if we still want to support legacy, schemaless URLs to be accepted
by Elasticsearch handler.

GitOrigin-RevId: dd73a0bffa6c4de93a2dd8dc4460b64aedc51255
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants