
Correct urllib.parse functions dropping the delimiters of empty URI components #82150

Closed
maggyero mannequin opened this issue Aug 28, 2019 · 5 comments
Labels
3.7 (EOL: end of life) · stdlib (Python modules in the Lib dir) · type-bug (An unexpected behavior, bug, or error)

Comments

maggyero mannequin commented Aug 28, 2019

BPO 37969
Nosy @orsenthil, @jeremyhylton, @nicktimko, @maggyero, @openandclose
PRs
  • gh-82150: Make urllib.parse.urlsplit and urllib.parse.urlunsplit preserve the '?' and '#' delimiters of empty query and fragment components #15642
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    assignee = None
    closed_at = None
    created_at = <Date 2019-08-28.14:54:51.541>
    labels = ['3.7', 'type-bug', 'library']
    title = 'Correct urllib.parse functions dropping the delimiters of empty URI components'
    updated_at = <Date 2020-06-10.11:21:47.483>
    user = 'https://github.com/maggyero'

    bugs.python.org fields:

    activity = <Date 2020-06-10.11:21:47.483>
    actor = 'op368'
    assignee = 'none'
    closed = False
    closed_date = None
    closer = None
    components = ['Library (Lib)']
    creation = <Date 2019-08-28.14:54:51.541>
    creator = 'maggyero'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 37969
    keywords = ['patch']
    message_count = 4.0
    messages = ['350663', '350687', '351043', '371180']
    nosy_count = 5.0
    nosy_names = ['orsenthil', 'Jeremy.Hylton', 'nicktimko', 'maggyero', 'op368']
    pr_nums = ['15642']
    priority = 'normal'
    resolution = None
    stage = 'patch review'
    status = 'open'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue37969'
    versions = ['Python 3.7']

    maggyero mannequin commented Aug 28, 2019

    The Python library documentation of the `urllib.parse.urlunparse <https://docs.python.org/3/library/urllib.parse.html#urllib.parse.urlunparse>`_ and `urllib.parse.urlunsplit <https://docs.python.org/3/library/urllib.parse.html#urllib.parse.urlunsplit>`_ functions states:

    This may result in a slightly different, but equivalent URL, if the URL that was parsed originally had unnecessary delimiters (for example, a ? with an empty query; the RFC states that these are equivalent).
    

    So, with the URI ``http://example.com/?``::

        >>> import urllib.parse
        >>> urllib.parse.urlunparse(urllib.parse.urlparse("http://example.com/?"))
        'http://example.com/'
        >>> urllib.parse.urlunsplit(urllib.parse.urlsplit("http://example.com/?"))
        'http://example.com/'

    But `RFC 3986 <https://tools.ietf.org/html/rfc3986?#section-6.2.3>`_ states the exact opposite:

    Normalization should not remove delimiters when their associated component is empty unless licensed to do so by the scheme specification.  For example, the URI "http://example.com/?" cannot be assumed to be equivalent to any of the examples above.  Likewise, the presence or absence of delimiters within a userinfo subcomponent is usually significant to its interpretation.  The fragment component is not subject to any scheme-based normalization; thus, two URIs that differ only by the suffix "#" are considered different regardless of the scheme.
    

    So maybe the compositions ``urllib.parse.urlunparse ∘ urllib.parse.urlparse`` and ``urllib.parse.urlunsplit ∘ urllib.parse.urlsplit`` are not supposed to be used for `syntax-based normalization <https://tools.ietf.org/html/rfc3986?#section-6>`_ of URIs. But still, both ``urllib.parse.urlparse`` and ``urllib.parse.urlsplit`` lose the "delimiter + empty component" information of the URI string, so they report false equivalences between URIs::

        >>> import urllib.parse
        >>> urllib.parse.urlparse("http://example.com/?") == urllib.parse.urlparse("http://example.com/")
        True
        >>> urllib.parse.urlsplit("http://example.com/?") == urllib.parse.urlsplit("http://example.com/")
        True

    P.-S. — Is there a syntax-based normalization function of URIs in the Python library?
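For illustration, the case-normalization step of RFC 3986 §6.2.2 can be sketched on top of ``urlsplit`` (``case_normalize`` is a hypothetical name; as far as I can tell the stdlib offers no such function, and the sketch inherits the delimiter-dropping behaviour discussed above):

```python
from urllib.parse import urlsplit, urlunsplit


def case_normalize(uri):
    """Sketch of RFC 3986 section 6.2.2.1 case normalization:
    lowercase the scheme and the host, leave everything else alone.

    Caveat: round-tripping through urlunsplit still drops the
    delimiters of empty query/fragment components (this issue).
    """
    parts = urlsplit(uri)  # urlsplit already lowercases the scheme
    netloc = parts.netloc
    if "@" in netloc:
        # Preserve the case of any userinfo subcomponent.
        userinfo, _, host = netloc.rpartition("@")
        netloc = userinfo + "@" + host.lower()
    else:
        netloc = netloc.lower()
    return urlunsplit((parts.scheme, netloc, parts.path,
                       parts.query, parts.fragment))


print(case_normalize("HTTP://User@www.EXAMPLE.com/Path"))
# http://User@www.example.com/Path
```

Percent-encoding normalization (uppercasing hex digits, decoding unreserved characters) would be a separate step and is not shown here.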

    maggyero added the 3.7 (EOL), stdlib and type-bug labels on Aug 28, 2019
    nicktimko mannequin commented Aug 28, 2019

    Looking at the history, the line in the docs used to say

    ... (for example, an empty query (the draft states that these are equivalent).

    which was changed to "the RFC" in April 2006 (ad5177cf8da#diff-5b4cef771c997754f9e2feeae11d3b1eL68-R95).

    The original language was added in February 1995 (a12ef9433baf#diff-5b4cef771c997754f9e2feeae11d3b1eR48-R51).

    So "the draft" probably meant the draft of RFC 1738 (https://tools.ietf.org/html/rfc1738#section-3.3), which is rather vague on the point. It also didn't help that the docs were later reworded to say "the RFC" when 3+ RFCs are referenced in the lib docs, one of which obsoleted another RFC and definitely changed the meaning of the loose "?".

    The drafts of rfc2396bis always seemed to have the opposite wording you point out, at least back to draft 07 (September 2004): https://tools.ietf.org/html/draft-fielding-uri-rfc2396bis-07#section-6.2.3. Draft 06 (April 2004) was silent on the matter: https://tools.ietf.org/html/draft-fielding-uri-rfc2396bis-06#section-6.2.3

    maggyero mannequin commented Sep 2, 2019

    @nicktimko Thanks for the historical track.

    Here is a patch that solves this issue by updating the urlsplit and urlunsplit functions of the urllib.parse module to keep the '?' and '#' delimiters in URIs if present, even if their associated component is empty, as required by RFC 3986: #15642

    That way we get the correct behavior:

        >>> import urllib.parse
        >>> urllib.parse.urlunsplit(urllib.parse.urlsplit("http://example.com/?"))
        'http://example.com/?'
        >>> urllib.parse.urlunsplit(urllib.parse.urlsplit("http://example.com/#"))
        'http://example.com/#'

    Any feedback welcome.

    @maggyero maggyero mannequin changed the title urllib.parse functions reporting false equivalent URIs Correct urllib.parse functions dropping the delimiters of empty URI components Sep 11, 2019
    openandclose mannequin commented Jun 10, 2020

    This is a duplicate of bpo-22852 ('urllib.parse wrongly strips empty #fragment, ?query, //netloc').

    Also note that three alternative solutions have already been proposed:

    (1) Add a 'None' type to the result objects' members, like this one.

    This considers not only query and fragment, but also netloc,
    which may solve many other issues.
    

    (2) Add 'has_netloc', 'has_query' and 'has_fragment' attribute.

    (3) like (1), but conditional on 'allow_none' argument (similar to 'allow_fragments')
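For what it's worth, alternative (2) can be prototyped outside the stdlib with a thin wrapper; ``SplitResultFlags`` and ``urlsplit_flags`` are hypothetical names, and the netloc detection is only a sketch of the RFC 3986 grammar:

```python
from typing import NamedTuple
from urllib.parse import SplitResult, urlsplit


class SplitResultFlags(NamedTuple):
    """Sketch of alternative (2): a SplitResult plus booleans that
    record whether each delimiter was present in the raw string."""
    parts: SplitResult
    has_netloc: bool
    has_query: bool
    has_fragment: bool


def urlsplit_flags(uri):
    before_fragment, fragment_sep, _ = uri.partition("#")
    before_query, query_sep, _ = before_fragment.partition("?")
    # A netloc is introduced by '//' at the start of the reference
    # or immediately after the scheme's ':'.
    scheme, colon, rest = before_query.partition(":")
    hier = rest if colon else scheme
    return SplitResultFlags(urlsplit(uri),
                            hier.startswith("//"),
                            bool(query_sep),
                            bool(fragment_sep))


r = urlsplit_flags("http://example.com/?")
print(r.has_netloc, r.has_query, r.has_fragment)  # True True False
```

This keeps the existing ``SplitResult`` untouched, at the cost of parsing the string twice; a real implementation inside ``urlsplit`` would set the flags during its single pass.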

    erlend-aasland (Contributor) commented Dec 3, 2022

    Closing as superseded by #99962, as per #15642 (comment)

    erlend-aasland closed this as not planned on Dec 3, 2022