Use `urllib.parse.urljoin` when joining paths #88

joouha · 2023-03-24T17:19:44Z

Hello!

Should UPath._make_child replicate the behaviour of pathlib.PurePath._make_child, as it does currently, or should it behave like urllib.parse.urljoin?

>>> UPath("http://example.com/a") / "b/c"
HTTPPath('http://example.com/a/b/c')

>>> UPath("http://example.com/a/") / "b/c"
HTTPPath('http://example.com/a//b/c')  # I think this one is a bug...

>>> urljoin("http://example.com/a", "b/c")
'http://example.com/b/c'

>>> urljoin("http://example.com/a/", "b/c")
'http://example.com/a/b/c'
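
The two joining models can be put side by side using only the standard library. This is a minimal sketch, not UPath's implementation: `pathlib_style` is a hypothetical helper that stands in for pathlib-like joining by applying `PurePosixPath` to the path component alone.

```python
from pathlib import PurePosixPath
from urllib.parse import urljoin


def pathlib_style(base_path: str, child: str) -> str:
    """Hypothetical helper: join the way pathlib's "/" operator does."""
    return str(PurePosixPath(base_path) / child)


# pathlib-style joining ignores the trailing slash on the base.
pathlib_results = [pathlib_style("/a", "b/c"), pathlib_style("/a/", "b/c")]

# urljoin replaces the last segment unless the base ends with a slash.
urljoin_results = [
    urljoin("http://example.com/a", "b/c"),
    urljoin("http://example.com/a/", "b/c"),
]

print(pathlib_results)
print(urljoin_results)
```

With pathlib-style joining the trailing slash on the base makes no difference, whereas with urljoin it decides whether the last segment of the base is kept.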

Personally I would expect it to behave like urljoin.

Thoughts?


normanrz · 2023-03-24T18:53:43Z

> Personally I would expect it to behave like urljoin.

I would agree. Is there actually a use case for double slashes in the middle of a url path?

joouha · 2023-03-25T10:43:19Z

> Is there actually a use case for double slashes in the middle of a url path?

Most web servers will treat a double slash the same as a single slash, but a server is free to return a different response for each, e.g. these two URIs point to different pages:

https://en.wikipedia.org/wiki/Python
https://en.wikipedia.org/wiki//Python
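
As a rough illustration of why this matters for a path library, the sketch below splits the two URLs with `urllib.parse.urlsplit` and then applies `posixpath.normpath`, used here only as a stand-in for filesystem-style normalization. The variable names are illustrative.

```python
import posixpath
from urllib.parse import urlsplit

urls = [
    "https://en.wikipedia.org/wiki/Python",
    "https://en.wikipedia.org/wiki//Python",
]

# The path components differ as long as nothing normalizes them.
url_paths = [urlsplit(u).path for u in urls]

# Filesystem-style normalization collapses the double slash,
# so both URLs would end up with the same path.
collapsed = [posixpath.normpath(p) for p in url_paths]

print(url_paths)
print(collapsed)
```

After normalization the two paths are identical, so the second page could no longer be addressed.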

normanrz · 2023-03-26T10:58:37Z

I guess double slashes would then need to be constructed explicitly. Happy to review a PR, if you want to give the urljoin behaviour a try.

ap-- · 2023-03-28T10:05:58Z

I've been thinking about this for a bit, and I wonder what's the best way to address this.

For me it is easier to think about this in "pathlib-terms" if I rephrase this to: "Should specific file systems support empty path parts?"

If we assume some filesystem that supports "double slashes" I think an intuitive "pathlib-style" way to produce a double slash would be:

>>> UPath("protocol://somepath") / "" / "abc"
UPath("protocol://somepath//abc")

Thinking this through might be a little more involved though, since a lot of users might expect paths to behave similarly across different file systems. For example, on POSIX and Windows, because a directory can't have the same name as a file, users (or at least me 😅) usually expect:

UPath("protocol://somepath") == UPath("protocol://somepath/") == UPath("protocol://somepath//")

which is why stdlib pathlib currently normalizes those paths to the same path. So I guess for supporting empty parts we would actually need to implement behavior like:

>>> UPath("protocol://somepath") / ""
UPath("protocol://somepath//")

>>> assert UPath("protocol://somepath") == UPath("protocol://somepath/")
>>> assert UPath("protocol://somepath") != UPath("protocol://somepath//")

# but on a webserver
>>> UPath("protocol://somepath/a/b") != UPath("protocol://somepath/a/b/")

# --> so we should not normalize trailing slashes on those filesystems, I guess
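
For reference, the stdlib normalization mentioned above can be checked directly with `PurePosixPath`. This sketch only shows current pathlib behaviour; it does not implement the proposed empty-part semantics.

```python
from pathlib import PurePosixPath

base = PurePosixPath("/somepath")

# Trailing slashes are normalized away, so all three compare equal.
same = base == PurePosixPath("/somepath/") == PurePosixPath("/somepath//")

# Empty parts are dropped when joining, so no double slash can be produced.
joined_empty = base / ""
joined_child = base / "" / "abc"

print(same, joined_empty, joined_child)
```

Supporting empty parts would therefore mean departing from this behaviour for the filesystems that need it.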

And regarding the switch to urljoin: I usually find the urljoin behavior unintuitive. For example just check the behavior below:

from urllib.parse import urljoin

roots = [
    "http://example.com",
    "http://example.com/",
    "http://example.com/c",
    "http://example.com/c/",
]

paths = [
    "",
    "a/b",
    "/a/b",
    "//a/b",
    "///a/b",
    "////a/b",
    "/////a/b",
]

for root in roots:
    for path in paths:
        print(f"urljoin({root!r}, {path!r})".ljust(44), "==", repr(urljoin(root, path)))


# output of the above script
urljoin('http://example.com', '')            == 'http://example.com'
urljoin('http://example.com', 'a/b')         == 'http://example.com/a/b'
urljoin('http://example.com', '/a/b')        == 'http://example.com/a/b'
urljoin('http://example.com', '//a/b')       == 'http://a/b'
urljoin('http://example.com', '///a/b')      == 'http://example.com/a/b'
urljoin('http://example.com', '////a/b')     == 'http://example.com//a/b'
urljoin('http://example.com', '/////a/b')    == 'http://example.com///a/b'
urljoin('http://example.com/', '')           == 'http://example.com/'
urljoin('http://example.com/', 'a/b')        == 'http://example.com/a/b'
urljoin('http://example.com/', '/a/b')       == 'http://example.com/a/b'
urljoin('http://example.com/', '//a/b')      == 'http://a/b'
urljoin('http://example.com/', '///a/b')     == 'http://example.com/a/b'
urljoin('http://example.com/', '////a/b')    == 'http://example.com//a/b'
urljoin('http://example.com/', '/////a/b')   == 'http://example.com///a/b'
urljoin('http://example.com/c', '')          == 'http://example.com/c'
urljoin('http://example.com/c', 'a/b')       == 'http://example.com/a/b'
urljoin('http://example.com/c', '/a/b')      == 'http://example.com/a/b'
urljoin('http://example.com/c', '//a/b')     == 'http://a/b'
urljoin('http://example.com/c', '///a/b')    == 'http://example.com/a/b'
urljoin('http://example.com/c', '////a/b')   == 'http://example.com//a/b'
urljoin('http://example.com/c', '/////a/b')  == 'http://example.com///a/b'
urljoin('http://example.com/c/', '')         == 'http://example.com/c/'
urljoin('http://example.com/c/', 'a/b')      == 'http://example.com/c/a/b'
urljoin('http://example.com/c/', '/a/b')     == 'http://example.com/a/b'
urljoin('http://example.com/c/', '//a/b')    == 'http://a/b'
urljoin('http://example.com/c/', '///a/b')   == 'http://example.com/a/b'
urljoin('http://example.com/c/', '////a/b')  == 'http://example.com//a/b'
urljoin('http://example.com/c/', '/////a/b') == 'http://example.com///a/b'
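
Most of the surprising rows follow from how the second argument is parsed before joining. RFC 3986 treats a reference starting with `//` as a network-path reference, so the text after the two slashes is read as the authority (host) rather than as a path. A small sketch with `urlsplit` makes this visible; the `split_view` name is illustrative.

```python
from urllib.parse import urljoin, urlsplit

refs = ["/a/b", "//a/b", "///a/b", "////a/b"]

# Map each reference to the (netloc, path) pair that urlsplit extracts.
split_view = {ref: (urlsplit(ref).netloc, urlsplit(ref).path) for ref in refs}

for ref, (netloc, path) in split_view.items():
    print(f"{ref!r:12} netloc={netloc!r:5} path={path!r}")

# "//a/b" names host "a" with path "/b", so only the scheme of the base survives.
replaced_host = urljoin("http://example.com/c/", "//a/b")
print(replaced_host)
```

With three or more leading slashes the authority is empty, so the base host is kept and the remaining slashes become part of the path.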

I think we should go through all of this using a concrete example and define the behavior beforehand. I would also check how fsspec handles this for http filesystems to make sure that all of this is supported upstream, before introducing special functionality in universal_pathlib. @joouha where did this issue pop up initially?

joouha · 2023-03-30T17:20:30Z

Hi,

For a bit of background, I encountered this issue when trying to load resources from web-pages. I wanted a universal interface to be able to load resources from a range of protocols, so universal pathlib seemed like a good option.

Say I load the page http://www.example.com/a/b/index.html with the following content:

<img src="image.png">
<img src="../image.png">
<img src="/image.png">
<img src="ftp://other.com/image.png">
<img src="//other.com/image.png">

I would expect to be able to join the page's URL with any resource link using the / operator,
and end up at the same resources which a browser would load (which is also urljoin's behaviour):

>>> UPath("http://www.example.com/a/b/index.html") / "image.png?version=1"
HTTPPath("http://www.example.com/a/b/image.png?version=1")

>>> UPath("http://www.example.com/a/b/index.html") / "../image.png"
HTTPPath("http://www.example.com/a/image.png")

>>> UPath("http://www.example.com/a/b/index.html") / "/image.png"
HTTPPath("http://www.example.com/image.png")

>>> UPath("http://www.example.com/a/b/index.html") / "ftp://other.com/image.png"
UPath("ftp://other.com/image.png")

>>> UPath("http://www.example.com/a/b/index.html") / "//other.com/image.png"
HTTPPath("http://other.com/image.png")
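
The urljoin side of this comparison can be reproduced with the standard library alone. The sketch below resolves each link from the example page against the page URL; `resolved` is just an illustrative name.

```python
from urllib.parse import urljoin

page = "http://www.example.com/a/b/index.html"

links = [
    "image.png?version=1",
    "../image.png",
    "/image.png",
    "ftp://other.com/image.png",
    "//other.com/image.png",
]

# Resolve every link relative to the page URL, as a browser would.
resolved = {link: urljoin(page, link) for link in links}

for link, target in resolved.items():
    print(f"{link!r:30} -> {target}")
```

Note that the query string is preserved and that an absolute URL or a `//host` reference replaces the base host entirely.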

Since upath works with URIs, I would expect its behaviour to follow the standards for URIs defined in RFC3986.

I would expect UPath normalization and joining rules to differ from pathlib, since pathlib works with POSIX and Windows paths. These are not URIs - they follow their own behaviour patterns defined elsewhere.

So as a user, I would expect the following posix paths to be equivalent:

PosixPath("/somepath") == PosixPath("/somepath/") == PosixPath("/somepath//")

but I would not expect the following URIs to be equivalent, because RFC3986 states that they might point to different resources:

HTTPPath("https://en.wikipedia.org/wiki/Film") != HTTPPath("https://en.wikipedia.org/wiki/Film/") != HTTPPath("https://en.wikipedia.org/wiki//Film") != HTTPPath("https://en.wikipedia.org/wiki//Film/")

(which they actually do).

RFC3986 defines how many of the methods in universal pathlib should be implemented when dealing with URIs, such as joining URI paths, normalizing URIs, and URI equivalence.

Also, I like this as a way of constructing URI paths with double slashes - very elegant!

>>> UPath("protocol://somepath") / "" / "abc"
UPath("protocol://somepath//abc")

ap-- mentioned this issue Apr 2, 2023

URI query component is ignored when opening a file #92

Closed

ap-- mentioned this issue May 13, 2023

Copy url params in format_parsed_parts #101

Closed

ap-- mentioned this issue Aug 2, 2023

Change UPath.__new__ behavior #125

Merged

ap-- added the bug 🐛 Something isn't working label Aug 28, 2023

ap-- self-assigned this Feb 15, 2024

ap-- added this to the v0.2.1 milestone Feb 15, 2024

ap-- mentioned this issue Feb 18, 2024

Implement UPath.joinuri #189

Merged

ap-- closed this as completed in #189 Feb 18, 2024
