-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix unencoded url components #90
Conversation
I didn't change the last function of the file |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for the contribution... lgtm overall with minor suggestions
if you have extra time would be nice if you could adjust/amend a test in test_nessie_client.py
or maybe even/also test_nessie_cli.py
to use an exotic (i.e. with /
) branch/tag name.
pynessie/client/_endpoints.py
Outdated
@@ -17,6 +17,7 @@ | |||
|
|||
import os | |||
from typing import Any, Optional, Union, cast | |||
from urllib.parse import quote_plus |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just wondering: why did we use quote_plus
instead of quote
without safe chars?
Like quote(), but also replace spaces by plus signs, as required for quoting HTML form values when building up a query string to go into a URL
so wondering whether it matters for spaces but i guess according to this spaces are not supported so we only use quote_plus
for its empty safe
chars parameter...
maybe we could have a little helper method _url_quote_ref
or the like.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can definitely use quote('', safe='')
indeed!
I'm going in my next push to normalize all the URLs using a new helper function named _sanitize_url
Thanks!
pynessie/client/_endpoints.py
Outdated
@@ -278,7 +279,9 @@ def get_content( | |||
params = {"ref": ref} | |||
if hash_on_ref: | |||
params["hashOnRef"] = hash_on_ref | |||
return cast(dict, _get(base_url + "/contents/{}".format(content_key.to_path_string()), auth, ssl_verify=ssl_verify, params=params)) | |||
return cast( | |||
dict, _get(base_url + "/contents/{}".format(quote_plus(content_key.to_path_string())), auth, ssl_verify=ssl_verify, params=params) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
doc of to_path_string
claims to provide an url encoding already but implementation doesnt look like it really:
Lines 164 to 168 in 332e1a6
def to_path_string(self) -> str: | |
"""Convert this key to a url encoded path string.""" | |
# false positives in pylint | |
# pylint: disable=E1133 | |
return ".".join(i.replace(".", "\00") if i else "" for i in self.elements) |
not sure if we allow /
in content keys, but if yes we should push the quoting inside the method most likely (and add a test in test_nessie_models.py
for it)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/
in content keys is allows if the key has @
(whether followed by a hash specification or not)... However, it is probably easier an safer to encode all special chars using the %XY
notation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code I'm working on will produce this output:
from urllib.parse import quote
def _sanitize_url(url: str, *args: list[str]) -> str:
return url.format(*[quote(arg, safe='') for arg in args])
def to_path_string(elements: list[str]) -> str:
return ".".join([i.replace(".", "\00") if i else "" for i in elements])
print(">> result:", _sanitize_url('/my/path/{}/{}', "my/test", to_path_string(["a/b", "c", "d.e"])))
# >> result: /my/path/my%2Ftest/a%2Fb.c.d%00e
Is it ok?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems OK but part of my original comment was that docs of to_path_string
claim this is already handled internally (now we are handling it externally... we can live with it i guess)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@XN137 I see, but I'm a bit torn on it.
If we change the original to_path_string
function, we should also rewrite this:
url = _sanitize_url(base_url + "/contents/{}", content_key.to_path_string())
into:
url = _sanitize_url(base_url + "/contents/{content_key.to_path_string()}")
which is fine, but it can become a problem in the future if somebody is not fully aware of why we're using the _sanitize_url()
function, copies the block of code for implementing a new endpoint, and then we end up facing the same issue we had before.
At the same time to_path_string
is converting only the .
inside the elements into \00
/%00
, but it's not the standard %2E
(I think %00
is parsed and mapped back by the nessie server once it receives the string), but at the same time, the function is concatenating all the strings using regular dots.
Maybe to_path_string()
should be renamed into something different?
Apologies, crazy days at work :) sure I'll work on it late this morning (CET time). |
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for the update and added tests
lgtm overall with minor suggestions
tests/test_nessie_client.py
Outdated
assert _sanitize_url(base_url + "trees/tree/{}", "my/tag with mixed.types") == base_url + "trees/tree/my%2Ftag%20with%20mixed.types" | ||
assert ( | ||
_sanitize_url(base_url + "trees/tree/{}/{}", "tag/name", "other/string@with-at") | ||
== base_url + "trees/tree/tag%2Fname/other%2Fstring%40with-at" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wondering if we can modify/extend an existing CLI test for this as well (since it runs against an actual server)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, it's missing. I'm on it :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for adding the tests, pytest parameterize usage it good 👍
i probably wasnt clear but i was specifically asking for tests around the special @
part/endpoint to confirm its handled correctly by an actual server
but we have made enough iterations already, its probably working fine atm
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh, @
is not supported by the server, I tested it but all my tests failed (the server accepts only ./_-
):
Reference name must start with a letter, followed by letters, digits, one of the ./_- characters, not end with a slash or dot, not contain '..' - but was: branch/[email protected]
For this reason, I didn't add a specific test for it, but in this specific piece of code I'm just testing that the user is safe, not that it's a valid URL for Nessie
tests/test_nessie_client.py
Outdated
client = init() | ||
base_url = client.get_base_url() | ||
assert _sanitize_url(base_url) == base_url | ||
assert _sanitize_url(base_url + "trees/tree") == base_url + "trees/tree" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just wondering: in actual client we always use leading /
is it not "required" here i.e. get_base_url()
already includes it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems that get_base_url()
get the base_url value using config["endpoint"].get()
. I can add the "/" at the beginning of my URL (to be consistent with the rest of the codebase), but it should not impact the type of test I'm doing here ;)
pynessie/client/_endpoints.py
Outdated
@@ -278,7 +279,9 @@ def get_content( | |||
params = {"ref": ref} | |||
if hash_on_ref: | |||
params["hashOnRef"] = hash_on_ref | |||
return cast(dict, _get(base_url + "/contents/{}".format(content_key.to_path_string()), auth, ssl_verify=ssl_verify, params=params)) | |||
return cast( | |||
dict, _get(base_url + "/contents/{}".format(quote_plus(content_key.to_path_string())), auth, ssl_verify=ssl_verify, params=params) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems OK but part of my original comment was that docs of to_path_string
claim this is already handled internally (now we are handling it externally... we can live with it i guess)
@XN137 done; I started using
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks again for the update, lgtm
@bigluck thanks for the PR! |
it's my pleasure to help out :) |
In May the Spark implementation and the UI were fixed in order to support arbitrary strings like "my/branch" as branch names:
But not
pynessie v0.64.2
fails when invoking.list_keys(ref='big/test')
:In this PR I'm using
urllib.parse.quote_plus
for encoding all user's inputs that dynamically compose the final URL.