Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gh-82150: Make urllib.parse.urlsplit and urllib.parse.urlunsplit preserve the '?' and '#' delimiters of empty query and fragment components #15642

Closed
wants to merge 19 commits into from

Conversation

geryogam
Copy link
Contributor

@geryogam geryogam commented Sep 2, 2019

This PR provides the following changes:

  • update the urlsplit and urlunsplit functions of the urllib.parse submodule for making them preserve the delimiters of the query and fragment components when applied in sequence on a URI with an empty query component (i.e. '?') or empty fragment component (i.e. '#')—not to be confused with an undefined query component (i.e. '') nor an undefined fragment component (i.e. ''):

      >>> from urllib.parse import urlsplit, urlunsplit
      >>> urlunsplit(urlsplit('http://example.com/?#'))
      'http://example.com/?#'
    

    This is required by RFC 3986:

    Normalization should not remove delimiters when their associated component is empty unless licensed to do so by the scheme specification. For example, the URI http://example.com/? cannot be assumed to be equivalent to any of the examples above. Likewise, the presence or absence of delimiters within a userinfo subcomponent is usually significant to its interpretation. The fragment component is not subject to any scheme-based normalization; thus, two URIs that differ only by the suffix # are considered different regardless of the scheme.

    Currently delimiters are dropped:

      >>> from urllib.parse import urlsplit, urlunsplit
      >>> urlunsplit(urlsplit('http://example.com/?#'))
      'http://example.com/'
    

    To do so:

    • the urlsplit function now decodes an undefined query component as None and an undefined fragment component as None (e.g. urlsplit('http://example.com/') == ('http', 'example.com', '/', None, None)), and still decodes an empty query component as '' and an empty fragment component as '' (e.g. urlsplit('http://example.com/?#') == ('http', 'example.com', '/', '', ''));
    • the urlunsplit function now encodes a None query component as an undefined query component and a None fragment component as an undefined fragment component (e.g. urlunsplit(('http', 'example.com', '/', None, None)) == 'http://example.com/'), and now encodes a '' query component as an empty query component and a '' fragment component as an empty fragment component (e.g. urlunsplit(('http', 'example.com', '/', '', '')) == 'http://example.com/?#');
  • add and update the corresponding unit tests in the test.test_urlparse module;

  • update a unit test in the test.test_urllib2 module;

  • update the urllib.parse documentation accordingly.

@geryogam geryogam changed the title bpo-37969: Update parse.py bpo-37969: Correct urllib.parse functions reporting false equivalent URIs Sep 2, 2019
@geryogam geryogam marked this pull request as ready for review September 2, 2019 22:42
@nicktimko
Copy link

nicktimko commented Sep 3, 2019

It's maybe a bit surprising to have some of the tuple fields sometimes be None (typing.Tuple[typing.Optional[str]] instead of typing.Tuple[str]), but I'm not sure of a more obvious solution.

The other alternative I thought about was to just explicitly dump in the delimiter if it's empty (e.g. 'http://example.com/?#''http', 'example.com', '/', '?', '#'), but that's probably more surprising, rebuilding the URL is more complex, and what then if there's a URL like http://example.com/??.

I think you need to also describe the breaking change very clearly (haven't done it before, but I think that's what bedevere/news is for, i.e. these things), and leave hints in the actual documentation about the change ("changed in 3.9")

Housekeeping: I'd squash all the commits.

@geryogam
Copy link
Contributor Author

geryogam commented Sep 3, 2019

Thank you for reviewing this @nicktimko! Yes the None solution for undefined query/fragment seemed the most straightforward and natural to me.

I have updated the PR description to detail the exact changes. Nice suggestion, I will make the news entry, documentation version note and commit squash. But before I would like to fix an issue: the documentation tests in Travis CI failed for an obscure reason (see below). Do you have any idea why?

@nicktimko
Copy link

I don't know, but the docs build looks like it's installing blurb, which might be related to the news, so maybe adding a news item would fix it? Does it run locally? Just guessing though.

@ned-deily ned-deily requested a review from orsenthil September 4, 2019 06:13
@geryogam
Copy link
Contributor Author

geryogam commented Sep 11, 2019

Thanks @nicktimko, I have added a news entry, but documentation tests still fail in Travis-CI.

@geryogam geryogam changed the title bpo-37969: Correct urllib.parse functions reporting false equivalent URIs bpo-37969: Correct urllib.parse functions dropping the delimiters of empty URI components Sep 11, 2019
@orsenthil
Copy link
Member

I don't think the documentation failure is related to the code in this PR. Perhaps this PR needs to be rebased?

Copy link
Member

@orsenthil orsenthil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is going to be a breaking change and will affect a plenty of downstream libraries and frameworks that had been relying upon the previous behavior.

  • I don't have any code comments, and the code changes look good to me.
  • I find the rational ok
  • I will request reviews from more active core developers and want to hear their opinion on this change too.

@geryogam
Copy link
Contributor Author

Thanks for reviewing this @orsenthil.

@orsenthil orsenthil self-assigned this Sep 28, 2019
@geryogam
Copy link
Contributor Author

geryogam commented Sep 29, 2019

@nicktimko @orsenthil I have identified why the documentation tests and code coverage fail in Travis CI when running pip install.

This is because the _remove_md5_fragment function of the __init__ module of the pkg_resources package vendored with pip calls the startswith method on the fragment component of its parsed location parameter. But since with this PR it is no more '' but None, this fails and subsequently raises the AttributeError: 'PathMetadata' object has no attribute 'hashcmp' error.

The pip/_vendor/pkg_resources/__init__.py file should be updated like this to be compliant with this PR:

  def _remove_md5_fragment(location):
      if not location:
          return ''
      parsed = urllib.parse.urlparse(location)
-     if parsed[-1].startswith('md5='):
-         return urllib.parse.urlunparse(parsed[:-1] + ('',))
+     if parsed[-1] is not None and parsed[-1].startswith('md5='):
+         return urllib.parse.urlunparse(parsed[:-1] + (None,))

So as you said @orsenthil, this PR is a breaking change.

@geryogam geryogam requested a review from orsenthil January 14, 2020 21:26
@orsenthil
Copy link
Member

Adding a note again, this breaking change can be very costly

the urlsplit function now decodes an absent '' query component as None and an absent '' fragment component as None

Since a lot of packages depend upon urlparse for parsing and they suddenly geting a None value instead of expected '' empty string will break significant number of packages and applications.

I am still evaluating of this is bringing any benefit or is there a backwards compatible way. If there is none, I will be -1 on this patch.

@geryogam
Copy link
Contributor Author

@orsenthil Yes since this PR requires pip to be patched (cf. my previous comment) and potentially other third-party packages relying on the current urllib.parse behaviour, this PR is very dangerous and should not be merged, unless we have an update strategy for pip or we find a backward compatible way.

What is the result of your analysis? Do you think we should move this PR forward or we should close it because its benefits (RFC 3986 compliance) do not outweigh its costs?

@erlend-aasland
Copy link
Contributor

Based on Senthil's last comment, I suggest closing this PR.

@erlend-aasland erlend-aasland added the pending The issue will be closed if no feedback is provided label Nov 27, 2022
@nicktimko
Copy link

Can always engage in some fence-sitting and support both compatible and compliance options. Pass/set a flag, use a different function/method. How about a new urllib2 😉?

Re: returning None and how it would break things, for compatiblity, some ✨Magic™️ objects (or even just a uniquely referenced sentinel string?) could be used and otherwise work like an empty string but is some sentinel object for checking.

@erlend-aasland
Copy link
Contributor

@nicktimko Yep, discussing possible solutions is always welcome, but please do so on the issue. Discussions on the PR should be focused on implementation specific issues. (Note that I only suggested to close this PR, not the linked issue.)

@geryogam geryogam changed the title bpo-37969: Correct urllib.parse functions dropping the delimiters of empty URI components gh-82150: Correct urllib.parse functions dropping the delimiters of empty URI components Dec 1, 2022
@geryogam geryogam changed the title gh-82150: Correct urllib.parse functions dropping the delimiters of empty URI components gh-82150: Make urllib.parse.urlsplit and urllib.parse.urlunsplit preserve the '?' and '#' delimiters of empty query and fragment components Dec 1, 2022
@geryogam
Copy link
Contributor Author

geryogam commented Dec 3, 2022

Like @erlend-aasland suggested, I am closing this PR because it is backward incompatible. I am also closing the associated issue #82150 which was too restricted (the problem also affects the authority component of URIs, as well as the urlparse() and urldefrag() functions in the urllib.parse module), poorly explained, and could not be edited since I had originally opened it on Round Up, before the migration to GitHub Issue.

I have just opened a new issue describing the problem more generally and clearly with a candidate solution based on @nicktimko’s last comment (feedback is welcome): #99962

Thank you @orsenthil and @nicktimko for reviewing this PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
awaiting review pending The issue will be closed if no feedback is provided
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants