Use urllib.parse.urljoin when joining paths #88
Comments
I would agree. Is there actually a use case for double slashes in the middle of a URL path?
Most web servers will treat a double slash the same as a single slash, but a web server could respond with different responses, e.g. these two URIs point to different pages:
I guess double slashes would then need to be constructed explicitly. Happy to review a PR, if you want to give the
I've been thinking about this for a bit, and I wonder what's the best way to address this. For me it is easier to think about this in "pathlib-terms" if I rephrase this to: "Should specific file systems support empty path parts?" If we assume some filesystem that supports "double slashes" I think an intuitive "pathlib-style" way to produce a double slash would be:

>>> UPath("protocol://somepath") / "" / "abc"
UPath("protocol://somepath//abc")

Thinking this through might be a little more involved though, since a lot of users might expect paths to behave similarly across different file systems. For example, on POSIX and Windows, because directories can't have the same name as a file, users (or at least me 😅) usually expect:

UPath("protocol://somepath") == UPath("protocol://somepath/") == UPath("protocol://somepath//")

which is why stdlib pathlib currently normalizes those paths to the same value. So I guess for supporting empty parts we would actually need to implement behavior like:

>>> UPath("protocol://somepath") / ""
UPath("protocol://somepath//")
>>> assert UPath("protocol://somepath") == UPath("protocol://somepath/")
>>> assert UPath("protocol://somepath") != UPath("protocol://somepath//")
# but on a webserver
>>> UPath("protocol://somepath/a/b") != UPath("protocol://somepath/a/b/")
# --> so we should not normalize trailing slashes on those filesystems, I guess

And regarding the switch to

from urllib.parse import urljoin

roots = [
    "http://example.com",
    "http://example.com/",
    "http://example.com/c",
    "http://example.com/c/",
]
paths = [
    "",
    "a/b",
    "/a/b",
    "//a/b",
    "///a/b",
    "////a/b",
    "/////a/b",
]
# Print every root/path combination next to the URL that urljoin produces.
for root in roots:
    for path in paths:
        print(f"urljoin({root!r}, {path!r})".ljust(44), "==", repr(urljoin(root, path)))

# output of the above script
urljoin('http://example.com', '') == 'http://example.com'
urljoin('http://example.com', 'a/b') == 'http://example.com/a/b'
urljoin('http://example.com', '/a/b') == 'http://example.com/a/b'
urljoin('http://example.com', '//a/b') == 'http://a/b'
urljoin('http://example.com', '///a/b') == 'http://example.com/a/b'
urljoin('http://example.com', '////a/b') == 'http://example.com//a/b'
urljoin('http://example.com', '/////a/b') == 'http://example.com///a/b'
urljoin('http://example.com/', '') == 'http://example.com/'
urljoin('http://example.com/', 'a/b') == 'http://example.com/a/b'
urljoin('http://example.com/', '/a/b') == 'http://example.com/a/b'
urljoin('http://example.com/', '//a/b') == 'http://a/b'
urljoin('http://example.com/', '///a/b') == 'http://example.com/a/b'
urljoin('http://example.com/', '////a/b') == 'http://example.com//a/b'
urljoin('http://example.com/', '/////a/b') == 'http://example.com///a/b'
urljoin('http://example.com/c', '') == 'http://example.com/c'
urljoin('http://example.com/c', 'a/b') == 'http://example.com/a/b'
urljoin('http://example.com/c', '/a/b') == 'http://example.com/a/b'
urljoin('http://example.com/c', '//a/b') == 'http://a/b'
urljoin('http://example.com/c', '///a/b') == 'http://example.com/a/b'
urljoin('http://example.com/c', '////a/b') == 'http://example.com//a/b'
urljoin('http://example.com/c', '/////a/b') == 'http://example.com///a/b'
urljoin('http://example.com/c/', '') == 'http://example.com/c/'
urljoin('http://example.com/c/', 'a/b') == 'http://example.com/c/a/b'
urljoin('http://example.com/c/', '/a/b') == 'http://example.com/a/b'
urljoin('http://example.com/c/', '//a/b') == 'http://a/b'
urljoin('http://example.com/c/', '///a/b') == 'http://example.com/a/b'
urljoin('http://example.com/c/', '////a/b') == 'http://example.com//a/b'
urljoin('http://example.com/c/', '/////a/b') == 'http://example.com///a/b'

I think we should go through all of this using a concrete example and define the behavior beforehand. I would also check and see how fsspec handles this for http filesystems to make sure that this all is supported upstream, before introducing special functionality in universal_pathlib.

@joouha where did this issue pop up initially?
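
For reference, the stdlib normalization mentioned in this comment can be checked directly with PurePosixPath. This is only a small sketch of how pathlib treats plain POSIX-style paths, not of how UPath behaves; the variable names are made up for illustration.

```python
from pathlib import PurePosixPath

# Trailing slashes are dropped, so these three compare equal.
candidates = [
    PurePosixPath("/somepath"),
    PurePosixPath("/somepath/"),
    PurePosixPath("/somepath//"),
]
trailing_equal = all(p == candidates[0] for p in candidates)

# Repeated slashes in the middle of a path are collapsed as well.
interior_equal = PurePosixPath("/a//b") == PurePosixPath("/a/b")

# An empty part is ignored when joining, so no double slash is produced.
empty_part = PurePosixPath("/somepath") / "" / "abc"

# Exactly two leading slashes are the one case pathlib preserves.
leading_double = PurePosixPath("//somepath")

print(trailing_equal, interior_equal, empty_part, leading_double)
```

So with stock pathlib semantics, joining with "" cannot express a double slash; supporting empty parts would need the kind of per-filesystem behavior discussed above.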
Hi,

For a bit of background, I encountered this issue when trying to load resources from web pages. I wanted a universal interface to be able to load resources from a range of protocols, so universal pathlib seemed like a good option.

Say I load the page

<img src="image.png">
<img src="../image.png">
<img src="/image.png">
<img src="ftp://other.com/image.png">
<img src="//other.com/image.png">

I would expect to be able to join the page's URL with any resource link using the

>>> UPath("http://www.example.com/a/b/index.html") / "image.png?version=1"
HTTPPath("http://www.example.com/a/b/image.png?version=1")
>>> UPath("http://www.example.com/a/b/index.html") / "../image.png"
HTTPPath("http://www.example.com/a/image.png")
>>> UPath("http://www.example.com/a/b/index.html") / "/image.png"
HTTPPath("http://www.example.com/image.png")
>>> UPath("http://www.example.com/a/b/index.html") / "ftp://other.com/image.png"
UPath("ftp://other.com/image.png")
>>> UPath("http://www.example.com/a/b/index.html") / "//other.com/image.png"
HTTPPath("http://other.com/image.png")

Since I would expect UPath normalization and joining rules to differ from

So as a user, I would expect the following posix paths to be equivalent:

PosixPath("/somepath") == PosixPath("//somepath/") == PosixPath("//somepath//")

but I would not expect the following URIs to be equivalent, because RFC3986 states that they might point to different resources:

HTTPPath("https://en.wikipedia.org/wiki/Film") != HTTPPath("https://en.wikipedia.org/wiki/Film/") != HTTPPath("https://en.wikipedia.org/wiki//Film") != HTTPPath("https://en.wikipedia.org/wiki//Film/")

(which they actually do).

RFC3986 defines how many of the methods in universal pathlib should be implemented when dealing with URIs, such as joining URI paths, normalizing URIs, and URI equivalence.

Also, I like this as a way of constructing URI paths with double slashes - very elegant!

>>> UPath("protocol://somepath") / "" / "abc"
UPath("protocol://somepath//abc")
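
As a rough illustration of the joining rules described in this comment, the snippet below runs the five kinds of resource link through urllib.parse.urljoin against the page URL used in the examples. It only exercises the stdlib function; it does not show what UPath does today, and the names base, links and resolved are just for this sketch.

```python
from urllib.parse import urljoin

# Page URL taken from the examples above.
base = "http://www.example.com/a/b/index.html"

# One link of each kind: relative, parent-relative, root-relative,
# absolute with another scheme, and scheme-relative.
links = [
    "image.png?version=1",
    "../image.png",
    "/image.png",
    "ftp://other.com/image.png",
    "//other.com/image.png",
]

# Resolve every link against the page URL.
resolved = {link: urljoin(base, link) for link in links}

for link, url in resolved.items():
    print(f"{link!r:32} -> {url}")
```

Note that the last path segment of the base (index.html) is replaced rather than extended, which is the main difference from appending a child segment the pathlib way.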
Hello!

Should UPath._make_child replicate the behaviour of pathlib.PurePath._make_child, as it does currently, or should it behave like urllib.parse.urljoin?

Personally I would expect it to behave like urljoin.

Thoughts?
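
To make the two options concrete, here is a minimal sketch that contrasts them on the same inputs. The helpers join_like_pathlib and join_like_urljoin are hypothetical names invented for this comparison; the first only approximates pathlib-style "append a segment" joining on the URL's path component and is not the actual UPath._make_child implementation.

```python
from pathlib import PurePosixPath
from urllib.parse import urljoin, urlsplit, urlunsplit


def join_like_pathlib(url: str, child: str) -> str:
    """Hypothetical helper: append child to the URL path the way pathlib would."""
    parts = urlsplit(url)
    # Treat the URL path as a POSIX path; query and fragment are dropped here.
    new_path = PurePosixPath(parts.path or "/") / child
    return urlunsplit((parts.scheme, parts.netloc, str(new_path), "", ""))


def join_like_urljoin(url: str, child: str) -> str:
    """Hypothetical helper: resolve child as a relative reference."""
    return urljoin(url, child)


for root in ("http://example.com/c", "http://example.com/c/"):
    print(root, "->", join_like_pathlib(root, "a/b"), "|", join_like_urljoin(root, "a/b"))
```

The two approaches agree when the base ends in a slash and diverge when it does not: pathlib-style joining keeps the last segment ("c"), while urljoin replaces it.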