Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use urllib.parse.urljoin when joining paths #88

Closed
joouha opened this issue Mar 24, 2023 · 5 comments · Fixed by #189
Closed

Use urllib.parse.urljoin when joining paths #88

joouha opened this issue Mar 24, 2023 · 5 comments · Fixed by #189
Assignees
Labels
bug 🐛 Something isn't working
Milestone

Comments

@joouha
Copy link
Contributor

joouha commented Mar 24, 2023

Hello!

Should UPath._make_child replicate the behaviour of like pathlib.PurePath._make_child as it does currently, or should it behave like urllib.parse.urljoin?

>>> UPath("http://example.com/a") / "b/c"
HTTPPath('http://example.com/a/b/c')

>>> UPath("http://example.com/a/") / "b/c"
HTTPPath('http://example.com/a//b/c')  # I think this one is a bug...

>>> urljoin("http://example.com/a", "b/c")
'http://example.com/b/c'

>>> urljoin("http://example.com/a/", "b/c")
'http://example.com/a/b/c'

Personally I would expect it to behave like urljoin.

Thoughts?

@normanrz
Copy link
Collaborator

Personally I would expect it to behave like urljoin.

I would agree. Is there actually a use case for double slashes in the middle of a url path?

@joouha
Copy link
Contributor Author

joouha commented Mar 25, 2023

Is there actually a use case for double slashes in the middle of a url path?

Most web servers will treat a double slash the same as a single slash, but a web server could respond with different responses, e.g. these two URIs point to different pages:

https://en.wikipedia.org/wiki/Python
https://en.wikipedia.org/wiki//Python

@normanrz
Copy link
Collaborator

I guess double slashes would then need to be constructed explicitly. Happy to review a PR, if you want to give the urljoin behaviour a try.

@ap--
Copy link
Collaborator

ap-- commented Mar 28, 2023

I've been thinking about this for a bit, and I wonder what's the best way to address this.

For me it is easier to think about this in "pathlib-terms" if I rephrase this to: "Should specific file systems support empty path parts?"

If we assume some filesystem that supports "double slashes" I think an intuitive "pathlib-style" way to produce a double slash would be:

>>> UPath("protocol://somepath") / "" / "abc"
UPath("protocol://somepath//abc")

Thinking this through might be a little more involved though, since a lot of users might expect paths to handle similar between different file systems. For example on posix and windows because directories can't have the same name as a file, users (or at least me 😅) usually expect:

UPath("protocol://somepath") == UPath("protocol://somepath/") == UPath("protocol://somepath//")

which is why stdlib pathlib currently normalizes those paths to the same. So I guess for supporting empty parts we would actually need to implement behavior like:

>>> UPath("protocol://somepath") / ""
UPath("protocol://somepath//")

>>> assert UPath("protocol://somepath") == UPath("protocol://somepath/")
>>> assert UPath("protocol://somepath") != UPath("protocol://somepath//")

# but on a webserver
>>> UPath("protocol://somepath/a/b") != UPath("protocol://somepath/a/b/")

# --> so we should not normalize trailing slashes on those filesystems, I guess

And regarding the switch to urljoin: I usually find the urljoin behavior unintuitive. For example just check the behavior below:

from urllib.parse import urljoin

roots = [
    "http://example.com",
    "http://example.com/",
    "http://example.com/c",
    "http://example.com/c/",
]

paths = [
    "",
    "a/b",
    "/a/b",
    "//a/b",
    "///a/b",
    "////a/b",
    "/////a/b",
]

for root in roots:
    for path in paths:
        print(f"urljoin({root!r}, {path!r})".ljust(44), "==", repr(urljoin(root, path)))


# output of the above script
urljoin('http://example.com', '')            == 'http://example.com'
urljoin('http://example.com', 'a/b')         == 'http://example.com/a/b'
urljoin('http://example.com', '/a/b')        == 'http://example.com/a/b'
urljoin('http://example.com', '//a/b')       == 'http://a/b'
urljoin('http://example.com', '///a/b')      == 'http://example.com/a/b'
urljoin('http://example.com', '////a/b')     == 'http://example.com//a/b'
urljoin('http://example.com', '/////a/b')    == 'http://example.com///a/b'
urljoin('http://example.com/', '')           == 'http://example.com/'
urljoin('http://example.com/', 'a/b')        == 'http://example.com/a/b'
urljoin('http://example.com/', '/a/b')       == 'http://example.com/a/b'
urljoin('http://example.com/', '//a/b')      == 'http://a/b'
urljoin('http://example.com/', '///a/b')     == 'http://example.com/a/b'
urljoin('http://example.com/', '////a/b')    == 'http://example.com//a/b'
urljoin('http://example.com/', '/////a/b')   == 'http://example.com///a/b'
urljoin('http://example.com/c', '')          == 'http://example.com/c'
urljoin('http://example.com/c', 'a/b')       == 'http://example.com/a/b'
urljoin('http://example.com/c', '/a/b')      == 'http://example.com/a/b'
urljoin('http://example.com/c', '//a/b')     == 'http://a/b'
urljoin('http://example.com/c', '///a/b')    == 'http://example.com/a/b'
urljoin('http://example.com/c', '////a/b')   == 'http://example.com//a/b'
urljoin('http://example.com/c', '/////a/b')  == 'http://example.com///a/b'
urljoin('http://example.com/c/', '')         == 'http://example.com/c/'
urljoin('http://example.com/c/', 'a/b')      == 'http://example.com/c/a/b'
urljoin('http://example.com/c/', '/a/b')     == 'http://example.com/a/b'
urljoin('http://example.com/c/', '//a/b')    == 'http://a/b'
urljoin('http://example.com/c/', '///a/b')   == 'http://example.com/a/b'
urljoin('http://example.com/c/', '////a/b')  == 'http://example.com//a/b'
urljoin('http://example.com/c/', '/////a/b') == 'http://example.com///a/b'

I think we should go through all of this using a concrete example and define the behavior beforehand. I would also check and see how fsspec handles this for http filesystems to make sure that this all is supported upstream, before introducing special functionality in universal_pathlib. @joouha where did this issue pop up initially?

@joouha
Copy link
Contributor Author

joouha commented Mar 30, 2023

Hi,

For a bit of background, I encountered this issue when trying to load resources from web-pages. I wanted a universal interface to be able to load resources from a range of protocols, so universal pathlib seemed like a good option.

Say I load the page http://www.example.com/a/b/index.html with the following content:

<img src="image.png">
<img src="../image.png">
<img src="/image.png">
<img src="ftp://other.com/image.png">
<img src="//other.com/image.png">

I would expect to be able to join the page's URL with any resource link using the / operator,
and end up at the same resources which a browser would load (which is also urljoin's behaviour):

>>> UPath("http://www.example.com/a/b/index.html") / "image.png?version=1"
HTTPPath("http://www.example.com/page/image.png?version=1")

>>> UPath("http://www.example.com/a/b/index.html") / "../image.png"
HTTPPath("http://www.example.com/a/image.png")

>>> UPath("http://www.example.com/a/b/index.html") / "/image.png"
HTTPPath("http://www.example.com/image.png")

>>> UPath("http://www.example.com/a/b/index.html") / "ftp://other.com/image.png"
UPath("ftp://other.com/image.png")

>>> UPath("http://www.example.com/a/b/index.html") / "//other.com/image.png"
HTTPPath("http://other.com/image.png")

Since upath works with URIs, I would expect its behaviour to follow the the standards for the URI protocol defined in RCF3986.

I would expect UPath normalization and joining rules to differ from pathlib, since pathlib works with POSIX and Windows paths. These are not URIs - they follow their own behaviour patterns defined elsewhere.

So as a user, I would expect the following posix paths to be equivalent:

PosixPath("/somepath") == PosixPath("//somepath/") == PosixPath("//somepath//")

but I would not expect the following URIs to be equivalent, because RFC3986 states that they might point to different resources:

HTTPPath("https://en.wikipedia.org/wiki/Film") != HTTPPath("https://en.wikipedia.org/wiki/Film/") != HTTPPath("https://en.wikipedia.org/wiki//Film") != HTTPPath("https://en.wikipedia.org/wiki//Film/")

(which they actually do).

RFC3986 defines how many of the methods in universal pathlib should be implemented when dealing with URIs, such a joining URI paths, normalizing URIs, and URI equivalence.


Also, I like this as a way of constructing URI paths with double slashes - very elegant!

>>> UPath("protocol://somepath") / "" / "abc"
UPath("protocol://somepath//abc")

@ap-- ap-- added the bug 🐛 Something isn't working label Aug 28, 2023
@ap-- ap-- self-assigned this Feb 15, 2024
@ap-- ap-- added this to the v0.2.1 milestone Feb 15, 2024
@ap-- ap-- closed this as completed in #189 Feb 18, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug 🐛 Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants