gh-82150: Make urllib.parse.urlsplit and urllib.parse.urlunsplit preserve the '?' and '#' delimiters of empty query and fragment components #15642

geryogam · 2019-09-02T12:15:06Z

This PR provides the following changes:

update the urlsplit and urlunsplit functions of the urllib.parse submodule for making them preserve the delimiters of the query and fragment components when applied in sequence on a URI with an empty query component (i.e. '?') or empty fragment component (i.e. '#')—not to be confused with an undefined query component (i.e. '') nor an undefined fragment component (i.e. ''):
```
  >>> from urllib.parse import urlsplit, urlunsplit
  >>> urlunsplit(urlsplit('http://example.com/?#'))
  'http://example.com/?#'
```
This is required by RFC 3986:

Normalization should not remove delimiters when their associated component is empty unless licensed to do so by the scheme specification. For example, the URI http://example.com/? cannot be assumed to be equivalent to any of the examples above. Likewise, the presence or absence of delimiters within a userinfo subcomponent is usually significant to its interpretation. The fragment component is not subject to any scheme-based normalization; thus, two URIs that differ only by the suffix # are considered different regardless of the scheme.

Currently delimiters are dropped:
```
  >>> from urllib.parse import urlsplit, urlunsplit
  >>> urlunsplit(urlsplit('http://example.com/?#'))
  'http://example.com/'
```
To do so:
- the urlsplit function now decodes an undefined query component as None and an undefined fragment component as None (e.g. urlsplit('http://example.com/') == ('http', 'example.com', '/', None, None)), and still decodes an empty query component as '' and an empty fragment component as '' (e.g. urlsplit('http://example.com/?#') == ('http', 'example.com', '/', '', ''));
- the urlunsplit function now encodes a None query component as an undefined query component and a None fragment component as an undefined fragment component (e.g. urlunsplit(('http', 'example.com', '/', None, None)) == 'http://example.com/'), and now encodes a '' query component as an empty query component and a '' fragment component as an empty fragment component (e.g. urlunsplit(('http', 'example.com', '/', '', '')) == 'http://example.com/?#');
add and update the corresponding unit tests in the test.test_urlparse module;
update a unit test in the test.test_urllib2 module;
update the urllib.parse documentation accordingly.

Issue: Correct urllib.parse functions dropping the delimiters of empty URI components #82150

nicktimko · 2019-09-03T18:39:32Z

It's maybe a bit surprising to have some of the tuple fields sometimes be None (typing.Tuple[typing.Optional[str]] instead of typing.Tuple[str]), but I'm not sure of a more obvious solution.

The other alternative I thought about was to just explicitly dump in the delimiter if it's empty (e.g. 'http://example.com/?#' → 'http', 'example.com', '/', '?', '#'), but that's probably more surprising, rebuilding the URL is more complex, and what then if there's a URL like http://example.com/??.

I think you need to also describe the breaking change very clearly (haven't done it before, but I think that's what bedevere/news is for, i.e. these things), and leave hints in the actual documentation about the change ("changed in 3.9")

Housekeeping: I'd squash all the commits.

geryogam · 2019-09-03T20:19:51Z

Thank you for reviewing this @nicktimko! Yes the None solution for undefined query/fragment seemed the most straightforward and natural to me.

I have updated the PR description to detail the exact changes. Nice suggestion, I will make the news entry, documentation version note and commit squash. But before I would like to fix an issue: the documentation tests in Travis CI failed for an obscure reason (see below). Do you have any idea why?

nicktimko · 2019-09-03T21:09:37Z

I don't know, but the docs build looks like it's installing blurb, which might be related to the news, so maybe adding a news item would fix it? Does it run locally? Just guessing though.

geryogam · 2019-09-11T05:59:07Z

Thanks @nicktimko, I have added a news entry, but documentation tests still fail in Travis-CI.

orsenthil · 2019-09-11T21:12:57Z

I don't think the documentation failure is related to the code in this PR. Perhaps this PR needs to be rebased?

orsenthil

This is going to be a breaking change and will affect a plenty of downstream libraries and frameworks that had been relying upon the previous behavior.

I don't have any code comments, and the code changes look good to me.
I find the rational ok
I will request reviews from more active core developers and want to hear their opinion on this change too.

geryogam · 2019-09-28T21:15:57Z

Thanks for reviewing this @orsenthil.

geryogam · 2019-09-29T13:06:49Z

@nicktimko @orsenthil I have identified why the documentation tests and code coverage fail in Travis CI when running pip install.

This is because the _remove_md5_fragment function of the __init__ module of the pkg_resources package vendored with pip calls the startswith method on the fragment component of its parsed location parameter. But since with this PR it is no more '' but None, this fails and subsequently raises the AttributeError: 'PathMetadata' object has no attribute 'hashcmp' error.

The pip/_vendor/pkg_resources/__init__.py file should be updated like this to be compliant with this PR:

  def _remove_md5_fragment(location):
      if not location:
          return ''
      parsed = urllib.parse.urlparse(location)
-     if parsed[-1].startswith('md5='):
-         return urllib.parse.urlunparse(parsed[:-1] + ('',))
+     if parsed[-1] is not None and parsed[-1].startswith('md5='):
+         return urllib.parse.urlunparse(parsed[:-1] + (None,))

So as you said @orsenthil, this PR is a breaking change.

orsenthil · 2020-10-23T23:08:11Z

Adding a note again, this breaking change can be very costly

the urlsplit function now decodes an absent '' query component as None and an absent '' fragment component as None

Since a lot of packages depend upon urlparse for parsing and they suddenly geting a None value instead of expected '' empty string will break significant number of packages and applications.

I am still evaluating of this is bringing any benefit or is there a backwards compatible way. If there is none, I will be -1 on this patch.

geryogam · 2021-03-10T10:12:11Z

@orsenthil Yes since this PR requires pip to be patched (cf. my previous comment) and potentially other third-party packages relying on the current urllib.parse behaviour, this PR is very dangerous and should not be merged, unless we have an update strategy for pip or we find a backward compatible way.

What is the result of your analysis? Do you think we should move this PR forward or we should close it because its benefits (RFC 3986 compliance) do not outweigh its costs?

erlend-aasland · 2022-11-27T21:28:26Z

Based on Senthil's last comment, I suggest closing this PR.

nicktimko · 2022-11-27T23:07:09Z

Can always engage in some fence-sitting and support both compatible and compliance options. Pass/set a flag, use a different function/method. How about a new urllib2 😉?

Re: returning None and how it would break things, for compatiblity, some ✨Magic™️ objects (or even just a uniquely referenced sentinel string?) could be used and otherwise work like an empty string but is some sentinel object for checking.

erlend-aasland · 2022-11-28T08:09:30Z

@nicktimko Yep, discussing possible solutions is always welcome, but please do so on the issue. Discussions on the PR should be focused on implementation specific issues. (Note that I only suggested to close this PR, not the linked issue.)

geryogam · 2022-12-03T03:09:34Z

Like @erlend-aasland suggested, I am closing this PR because it is backward incompatible. I am also closing the associated issue #82150 which was too restricted (the problem also affects the authority component of URIs, as well as the urlparse() and urldefrag() functions in the urllib.parse module), poorly explained, and could not be edited since I had originally opened it on Round Up, before the migration to GitHub Issue.

I have just opened a new issue describing the problem more generally and clearly with a candidate solution based on @nicktimko’s last comment (feedback is welcome): #99962

Thank you @orsenthil and @nicktimko for reviewing this PR.

Update parse.py

838e934

the-knights-who-say-ni added the CLA signed label Sep 2, 2019

bedevere-bot added the awaiting review label Sep 2, 2019

geryogam added 6 commits September 2, 2019 14:16

Update parse.py

8bffcf3

Update urllib.parse.rst

e7e1f7a

Update parse.py

050de62

Update parse.py

4df550d

Update test_urllib2.py

2dfecf2

Update test_urlparse.py

98d25e7

geryogam changed the title ~~bpo-37969: Update parse.py~~ bpo-37969: Correct urllib.parse functions reporting false equivalent URIs Sep 2, 2019

geryogam added 7 commits September 2, 2019 19:15

Update test_urlparse.py

a56112a

Update test_urlparse.py

7dc5a12

Update parse.py

ad3994a

Update test_urlparse.py

6d70b49

Update test_urlparse.py

58947a5

Update test_urlparse.py

db39e96

Update test_urlparse.py

67465d2

geryogam marked this pull request as ready for review September 2, 2019 22:42

ned-deily requested a review from orsenthil September 4, 2019 06:13

Create 2019-09-10-17-01-35.bpo-37969.5Dz8e7.rst

d6bd27f

geryogam changed the title ~~bpo-37969: Correct urllib.parse functions reporting false equivalent URIs~~ bpo-37969: Correct urllib.parse functions dropping the delimiters of empty URI components Sep 11, 2019

orsenthil reviewed Sep 11, 2019

View reviewed changes

orsenthil requested review from vstinner and serhiy-storchaka September 11, 2019 21:23

orsenthil self-assigned this Sep 28, 2019

geryogam requested a review from orsenthil January 14, 2020 21:26

Merge branch 'master' into patch-8

8158fc2

ezio-melotti removed the CLA signed label Jul 13, 2022

erlend-aasland added the pending The issue will be closed if no feedback is provided label Nov 27, 2022

geryogam changed the title ~~bpo-37969: Correct urllib.parse functions dropping the delimiters of empty URI components~~ gh-82150: Correct urllib.parse functions dropping the delimiters of empty URI components Dec 1, 2022

geryogam changed the title ~~gh-82150: Correct urllib.parse functions dropping the delimiters of empty URI components~~ gh-82150: Make urllib.parse.urlsplit and urllib.parse.urlunsplit preserve the '?' and '#' delimiters of empty query and fragment components Dec 1, 2022

geryogam added 3 commits December 1, 2022 18:48

Merge branch 'main' into patch-8

f6a8466

Use RFC 3986 terminology (‘undefined’)

2c8a286

Update doc tests

8183cf7

geryogam mentioned this pull request Dec 3, 2022

Parsing functions in urllib.parse remove the delimiters of empty URI components in violation of RFC 3986 #99962

Open

geryogam closed this Dec 3, 2022

erlend-aasland mentioned this pull request Dec 3, 2022

Correct urllib.parse functions dropping the delimiters of empty URI components #82150

Closed

commonism mentioned this pull request Jul 25, 2024

RFC3986 compatible URL.join honoring empty segments aio-libs/yarl#1039

Merged

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

gh-82150: Make urllib.parse.urlsplit and urllib.parse.urlunsplit preserve the '?' and '#' delimiters of empty query and fragment components #15642

gh-82150: Make urllib.parse.urlsplit and urllib.parse.urlunsplit preserve the '?' and '#' delimiters of empty query and fragment components #15642

geryogam commented Sep 2, 2019 •

edited

Loading

nicktimko commented Sep 3, 2019 •

edited

Loading

geryogam commented Sep 3, 2019 •

edited

Loading

nicktimko commented Sep 3, 2019

geryogam commented Sep 11, 2019 •

edited

Loading

orsenthil commented Sep 11, 2019

orsenthil left a comment

geryogam commented Sep 28, 2019

geryogam commented Sep 29, 2019 •

edited

Loading

orsenthil commented Oct 23, 2020

geryogam commented Mar 10, 2021

erlend-aasland commented Nov 27, 2022

nicktimko commented Nov 27, 2022

erlend-aasland commented Nov 28, 2022

geryogam commented Dec 3, 2022

gh-82150: Make urllib.parse.urlsplit and urllib.parse.urlunsplit preserve the '?' and '#' delimiters of empty query and fragment components #15642

gh-82150: Make urllib.parse.urlsplit and urllib.parse.urlunsplit preserve the '?' and '#' delimiters of empty query and fragment components #15642

Conversation

geryogam commented Sep 2, 2019 • edited Loading

nicktimko commented Sep 3, 2019 • edited Loading

geryogam commented Sep 3, 2019 • edited Loading

nicktimko commented Sep 3, 2019

geryogam commented Sep 11, 2019 • edited Loading

orsenthil commented Sep 11, 2019

orsenthil left a comment

Choose a reason for hiding this comment

geryogam commented Sep 28, 2019

geryogam commented Sep 29, 2019 • edited Loading

orsenthil commented Oct 23, 2020

geryogam commented Mar 10, 2021

erlend-aasland commented Nov 27, 2022

nicktimko commented Nov 27, 2022

erlend-aasland commented Nov 28, 2022

geryogam commented Dec 3, 2022

geryogam commented Sep 2, 2019 •

edited

Loading

nicktimko commented Sep 3, 2019 •

edited

Loading

geryogam commented Sep 3, 2019 •

edited

Loading

geryogam commented Sep 11, 2019 •

edited

Loading

geryogam commented Sep 29, 2019 •

edited

Loading