ValueError on path remove_dot_segments when there's extra dot-dot (`/../`) segments #536

besfahbod · 2020-10-14T19:45:55Z

🐞 Describe the bug
As described in RFC 3986 § 5.2.4. Remove Dot Segments, the remove_dot_segments algorithm removes any extra /../ parts of the URL, ignoring errors when the stack is empty.

However, yarl.URL() behavior at the moment is to raise an exception, ValueError, when that happens.

💡 To Reproduce
Instantiate URL class with such a URL string:

yarl.URL('https://example.com/alice/../../eve')

💡 Expected behavior
Follow the RFC, and ignore the error, to get it working like this:

In [16]: yarl.URL('https://example.com/alice/../eve')
Out[16]: URL('https://example.com/eve')

In [17]: yarl.URL('https://example.com/alice/../../eve')
Out[17]: URL('https://example.com/eve')

In [18]: yarl.URL('https://example.com/alice/../../../eve')
Out[18]: URL('https://example.com/eve')

We probably want to also test (and fix, if needed), any other parts of the API that would involve path resolution (including remove_dot_segments) steps.

📋 Logs/tracebacks

In [15]: yarl.URL('https://example.com/alice/../../eve')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-15-e72c86c9e132> in <module>()
----> 1 yarl.URL('https://example.com/alice/../../eve')

~/ans/venv/lib/python3.6/site-packages/yarl/_url.py in __new__(cls, val, encoded, strict)
    181                 path = cls._normalize_path(path)
    182
--> 183             cls._validate_authority_uri_abs_path(host=host, path=path)
    184             query = cls._QUERY_REQUOTER(val[3])
    185             fragment = cls._FRAGMENT_REQUOTER(val[4])

~/ans/venv/lib/python3.6/site-packages/yarl/_url.py in _validate_authority_uri_abs_path(host, path)
    675         if len(host) > 0 and len(path) > 0 and not path.startswith("/"):
    676             raise ValueError(
--> 677                 "Path in a URL with authority should start with a slash ('/') if set"
    678             )
    679

ValueError: Path in a URL with authority should start with a slash ('/') if set

📋 Your version of the Python

$ python --version
Python 3.6.10

📋 Your version of the aiohttp/yarl/multidict distributions

$ python -m pip show aiohttp
Name: aiohttp
Version: 3.6.2

$ python -m pip show multidict
Name: multidict
Version: 4.7.6

$ python -m pip show yarl
Name: yarl
Version: 1.6.2

📋 Additional context
The proposed behavior would put yarl.URL on par with DOM's URL object, which already follows the RFC on this:

> String(new URL("https://example.com/alice/../eve"))
< "https://example.com/eve"
> String(new URL("https://example.com/alice/../../eve"))
< "https://example.com/eve"
> String(new URL("https://example.com/alice/../../../eve"))
< "https://example.com/eve"

As well as rust-url library (from the unit tests):

  {
    "input": "http://example.com/foo/bar/../ton/../../a",
    // ...
    "pathname": "/a",
    // ...
  },
  {
    "input": "http://example.com/foo/../../..",
    // ...
    "pathname": "/",
    // ...
  },
  {
    "input": "http://example.com/foo/../../../ton",
    // ...
    "pathname": "/ton",
    // ...
  },

which has tests based on (warning: broken links):

  "# Based on http://trac.webkit.org/browser/trunk/LayoutTests/fast/url/script-tests/segments.js",
  "# AS OF https://github.com/jsdom/whatwg-url/commit/35f04dfd3048cf6362f4398745bb13375c5020c2",

The text was updated successfully, but these errors were encountered:

asvetlov · 2020-11-14T20:06:16Z

Interesting.
I (maybe wrong) read the RFC as "remove_dot_segments is applicable to path joining procedure only".
I agree that the same behavior as mentioned JavaScript and Rust libraries is desirable.

A pull request is welcome!

besfahbod · 2021-01-15T23:15:42Z

So, here's the relevant pieces from the RFC:

And:

And the algorithm under section 5.2.2. Transform References.

I think what we need to pay attention to is the difference between a "Reference URL" and a "Target URL". A Target URL is expected to be resolved and normalized, meaning that it's expected to not have any . or .. sequences.

From that perspective, I believe the input to yarl.URL() is expected to be a Reference URL. Does that sound about right?

We could also consider comparing the rules with https://url.spec.whatwg.org/ , which seem to be more close to the common "URL Library" implementations.

Anyways, since we do desire the compatibility, as suggested above, I'm going to get it to work as suggestion.

mjpieters · 2022-06-14T16:09:02Z

which has tests based on (warning: broken links):

  "# Based on http://trac.webkit.org/browser/trunk/LayoutTests/fast/url/script-tests/segments.js",
  "# AS OF https://github.com/jsdom/whatwg-url/commit/35f04dfd3048cf6362f4398745bb13375c5020c2",

The last revision in the Webkit LayoutTests repository with such a file is 149680; the next rev folded the segments.js file into segments.html. It doesn't actually contain any tests for .. and . handling.

If you were looking at urltestdata.json (on which the Rust tests are actually based), you may have missed that there are multiple # Based on comments, the relevant section is based on paths.js in the same Webkit test suite, nowadays at LayoutTests/fast/url/paths.html

mjpieters · 2022-06-14T17:27:31Z

I've converted those tests from JSON to a small Python script:

from yarl import URL
from urllib.parse import unquote, quote

cases = [
    ("/././foo", "/foo"),
    ("/./.foo", "/.foo"),
    ("/foo/.", "/foo/"),
    ("/foo/./", "/foo/"),
    ("/foo/bar/..", "/foo/"),
    ("/foo/bar/../", "/foo/"),
    ("/foo/..bar", "/foo/..bar"),
    ("/foo/bar/../ton", "/foo/ton"),
    ("/foo/bar/../ton/../../a", "/a"),
    ("/foo/../../..", "/"),
    ("/foo/../../../ton", "/ton"),
    ("/foo/%2e", "/foo/"),
    ("/foo/%2e%2", "/foo/%2e%2"),
    ("/foo/%2e./%2e%2e/.%2e/%2e.bar", "/%2e.bar"),
    ("////../..", "//"),
    ("/foo/bar//../..", "/foo/"),
    ("/foo/bar//..", "/foo/bar/"),
    ("/foo", "/foo"),
    ("/%20foo", "/%20foo"),
    ("/foo%", "/foo%"),
    ("/foo%2", "/foo%2"),
    ("/foo%2zbar", "/foo%2zbar"),
    ("/foo%2Â©zbar", "/foo%2%C3%82%C2%A9zbar"),
    ("/foo%41%7a", "/foo%41%7a"),
    ("/foo\t\u0091%91", "/foo%C2%91%91"),
    ("/foo%00%51", "/foo%00%51"),
    ("/(%28:%3A%29)", "/(%28:%3A%29)"),
    ("/%3A%3a%3C%3c", "/%3A%3a%3C%3c"),
    ("/foo\tbar", "/foobar"),
    ("\\\\foo\\\\bar", "//foo//bar"),
    ("/%7Ffp3%3Eju%3Dduvgw%3Dd", "/%7Ffp3%3Eju%3Dduvgw%3Dd"),
    ("/@asdf%40", "/@asdf%40"),
    ("/你好你好", "/%E4%BD%A0%E5%A5%BD%E4%BD%A0%E5%A5%BD"),
    ("/‥/foo", "/%E2%80%A5/foo"),
    ("//foo", "/%EF%BB%BF/foo"),
    ("/\u202e/foo/\u202d/bar", "/%E2%80%AE/foo/%E2%80%AD/bar"),
]
requote = False
for i, (test, expected) in enumerate(cases, 1):
    if requote:
        expected = quote(
            unquote(expected, encoding="latin1"), safe="/@:()=", encoding="latin1"
        )
    try:
        actual = URL(f"http://example.com{test}").raw_path
    except ValueError as e:
        actual = f"exception: {e}"

    if actual != expected:
        print(f"({i}) \x1b[31mFAIL\x1b[0m: {test}")
        print(f"         {actual}\n      != {expected}")

and this outputs:

(11) FAIL: /foo/../../../ton
         exception: Path in a URL with authority should start with a slash ('/') if set
      != /ton
(13) FAIL: /foo/%2e%2
         /foo/.%252
      != /foo/%2e%2
(14) FAIL: /foo/%2e./%2e%2e/.%2e/%2e.bar
         exception: Path in a URL with authority should start with a slash ('/') if set
      != /%2e.bar
(20) FAIL: /foo%
         /foo%25
      != /foo%
(21) FAIL: /foo%2
         /foo%252
      != /foo%2
(22) FAIL: /foo%2zbar
         /foo%252zbar
      != /foo%2zbar
(23) FAIL: /foo%2Â©zbar
         /foo%252%C3%82%C2%A9zbar
      != /foo%2%C3%82%C2%A9zbar
(24) FAIL: /foo%41%7a
         /fooAz
      != /foo%41%7a
(26) FAIL: /foo%00%51
         /foo%00Q
      != /foo%00%51
(27) FAIL: /(%28:%3A%29)
         /((::))
      != /(%28:%3A%29)
(28) FAIL: /%3A%3a%3C%3c
         /::%3C%3C
      != /%3A%3a%3C%3c
(30) FAIL: \\foo\\bar
         /
      != //foo//bar
(31) FAIL: /%7Ffp3%3Eju%3Dduvgw%3Dd
         /%7Ffp3%3Eju=duvgw=d
      != /%7Ffp3%3Eju%3Dduvgw%3Dd
(32) FAIL: /@asdf%40
         /@asdf@
      != /@asdf%40

Several of these are not actual failures; yarl.URL() gives us equivalent paths with valid %hh substitutions for 13, 20-24, 26-28, 31 and 32, and in addition 28 also replaced %3c with %3C, so uppercase hex instead of lowercase. Those are fine, and those disappear when you set requote to True.

That leaves 11, 14 and 30:

(11) FAIL: /foo/../../../ton
         exception: Path in a URL with authority should start with a slash ('/') if set
      != /ton
(14) FAIL: /foo/%2e./%2e%2e/.%2e/%2e.bar
         exception: Path in a URL with authority should start with a slash ('/') if set
      != /%2e.bar
(30) FAIL: \\foo\\bar
         /
      != //foo//bar

I'm not so sure about turning backslashes into forward slashes nor am I inclined to research that specific issue right now.

The following alteration to the _normalize_path() loop makes 11 and 14 pass:

        for seg in segments:
            if seg not in (".", ".."):
                resolved_path.append(seg)
            elif seg == ".." and (len(resolved_path) > 1 or resolved_path[-1]):
                resolved_path.pop()
            # seg == "."

besfahbod · 2022-06-14T17:41:39Z

Thanks for working on this, @mjpieters! I generally agree with the direction here.

Specifically, I think not-raising-an-exception is the most-important problem to fix here, as it causes processing of user-provided data to fail in a place where no such errors are expected.

mjpieters · 2022-06-14T19:04:25Z

Here's an alternative version, one avoids the (slower) if check for each path segment like the original did:

        prefix, resolved_path = "", []
        if path.startswith("/"):
            # preserve the "/" root element of absolute paths.
            prefix = "/"
            path = path[1:]

        segments = path.split("/")
        for seg in segments:
            if seg == "..":
                # ignore any .. segments that would otherwise cause an
                # IndexError when popped from resolved_path if
                # resolving for rfc3986
                with suppress(IndexError):
                    resolved_path.pop()
            elif seg != ".":
                resolved_path.append(seg)

        if segments and segments[-1] in (".", ".."):
            # do some post-processing here.
            # if the last segment was a relative dir,
            # then we need to append the trailing '/'
            resolved_path.append("")

        return prefix + "/".join(resolved_path)

besfahbod added the bug label Oct 14, 2020

mjpieters mentioned this issue Jun 14, 2022

Keep absolute paths absolute when normalising. #741

Merged

4 tasks

asvetlov closed this as completed in #741 Dec 13, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ValueError on path remove_dot_segments when there's extra dot-dot (`/../`) segments #536

ValueError on path remove_dot_segments when there's extra dot-dot (`/../`) segments #536

besfahbod commented Oct 14, 2020 •

edited

Loading

asvetlov commented Nov 14, 2020

besfahbod commented Jan 15, 2021

mjpieters commented Jun 14, 2022 •

edited

Loading

mjpieters commented Jun 14, 2022 •

edited

Loading

besfahbod commented Jun 14, 2022

mjpieters commented Jun 14, 2022

ValueError on path remove_dot_segments when there's extra dot-dot (/../) segments #536

ValueError on path remove_dot_segments when there's extra dot-dot (/../) segments #536

Comments

besfahbod commented Oct 14, 2020 • edited Loading

asvetlov commented Nov 14, 2020

besfahbod commented Jan 15, 2021

mjpieters commented Jun 14, 2022 • edited Loading

mjpieters commented Jun 14, 2022 • edited Loading

besfahbod commented Jun 14, 2022

mjpieters commented Jun 14, 2022

ValueError on path remove_dot_segments when there's extra dot-dot (`/../`) segments #536

ValueError on path remove_dot_segments when there's extra dot-dot (`/../`) segments #536

besfahbod commented Oct 14, 2020 •

edited

Loading

mjpieters commented Jun 14, 2022 •

edited

Loading

mjpieters commented Jun 14, 2022 •

edited

Loading