Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix unencoded url components #90

Merged
merged 4 commits into from
Nov 15, 2023
Merged

Fix unencoded url components #90

merged 4 commits into from
Nov 15, 2023

Conversation

bigluck
Copy link
Contributor

@bigluck bigluck commented Nov 9, 2023

In May the Spark implementation and the UI were fixed in order to support arbitrary strings like "my/branch" as branch names:

But not pynessie v0.64.2 fails when invoking .list_keys(ref='big/test'):

Entity not found at https://..../api/v1/trees/tree/big/test/entries?maxRecords=500

In this PR I'm using urllib.parse.quote_plus for encoding all user's inputs that dynamically compose the final URL.

@CLAassistant
Copy link

CLAassistant commented Nov 9, 2023

CLA assistant check
All committers have signed the CLA.

@bigluck
Copy link
Contributor Author

bigluck commented Nov 9, 2023

I didn't change the last function of the file get_diff because I'm not sure what should be encoded.

@dimas-b dimas-b requested a review from XN137 November 9, 2023 17:02
XN137
XN137 previously approved these changes Nov 10, 2023
Copy link
Contributor

@XN137 XN137 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the contribution... lgtm overall with minor suggestions

if you have extra time would be nice if you could adjust/amend a test in test_nessie_client.py or maybe even/also test_nessie_cli.py to use an exotic (i.e. with /) branch/tag name.

@@ -17,6 +17,7 @@

import os
from typing import Any, Optional, Union, cast
from urllib.parse import quote_plus
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just wondering: why did we use quote_plus instead of quote without safe chars?

Like quote(), but also replace spaces by plus signs, as required for quoting HTML form values when building up a query string to go into a URL

so wondering whether it matters for spaces but i guess according to this spaces are not supported so we only use quote_plus for its empty safe chars parameter...

maybe we could have a little helper method _url_quote_ref or the like.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can definitely use quote('', safe='') indeed!

I'm going in my next push to normalize all the URLs using a new helper function named _sanitize_url

Thanks!

@@ -278,7 +279,9 @@ def get_content(
params = {"ref": ref}
if hash_on_ref:
params["hashOnRef"] = hash_on_ref
return cast(dict, _get(base_url + "/contents/{}".format(content_key.to_path_string()), auth, ssl_verify=ssl_verify, params=params))
return cast(
dict, _get(base_url + "/contents/{}".format(quote_plus(content_key.to_path_string())), auth, ssl_verify=ssl_verify, params=params)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doc of to_path_string claims to provide an url encoding already but implementation doesnt look like it really:

pynessie/pynessie/model.py

Lines 164 to 168 in 332e1a6

def to_path_string(self) -> str:
"""Convert this key to a url encoded path string."""
# false positives in pylint
# pylint: disable=E1133
return ".".join(i.replace(".", "\00") if i else "" for i in self.elements)

not sure if we allow / in content keys, but if yes we should push the quoting inside the method most likely (and add a test in test_nessie_models.py for it)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/ in content keys is allows if the key has @ (whether followed by a hash specification or not)... However, it is probably easier an safer to encode all special chars using the %XY notation.

Copy link
Contributor Author

@bigluck bigluck Nov 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code I'm working on will produce this output:

from urllib.parse import quote

def _sanitize_url(url: str, *args: list[str]) -> str:
    return url.format(*[quote(arg, safe='') for arg in args])

def to_path_string(elements: list[str]) -> str:
    return ".".join([i.replace(".", "\00") if i else "" for i in elements])

print(">> result:", _sanitize_url('/my/path/{}/{}', "my/test", to_path_string(["a/b", "c", "d.e"])))
# >> result: /my/path/my%2Ftest/a%2Fb.c.d%00e

Is it ok?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems OK but part of my original comment was that docs of to_path_string claim this is already handled internally (now we are handling it externally... we can live with it i guess)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@XN137 I see, but I'm a bit torn on it.

If we change the original to_path_string function, we should also rewrite this:

url = _sanitize_url(base_url + "/contents/{}", content_key.to_path_string())

into:

url = _sanitize_url(base_url + "/contents/{content_key.to_path_string()}")

which is fine, but it can become a problem in the future if somebody is not fully aware of why we're using the _sanitize_url() function, copies the block of code for implementing a new endpoint, and then we end up facing the same issue we had before.

At the same time to_path_string is converting only the . inside the elements into \00/%00, but it's not the standard %2E (I think %00 is parsed and mapped back by the nessie server once it receives the string), but at the same time, the function is concatenating all the strings using regular dots.

Maybe to_path_string() should be renamed into something different?

@bigluck
Copy link
Contributor Author

bigluck commented Nov 14, 2023

Apologies, crazy days at work :) sure I'll work on it late this morning (CET time).

@bigluck
Copy link
Contributor Author

bigluck commented Nov 14, 2023

@XN137 & @dimas-b have a look now

@bigluck
Copy link
Contributor Author

bigluck commented Nov 14, 2023

.pkg: _exit> python /Users/bigluck/d/pynessie/venv/lib/python3.9/site-packages/pyproject_api/_backend.py True setuptools.build_meta __legacy__
  py38: OK (9.16=setup[2.29]+cmd[6.87] seconds)
  py39: OK (8.62=setup[1.84]+cmd[6.77] seconds)
  py310: OK (8.59=setup[1.88]+cmd[6.71] seconds)
  py311: OK (8.92=setup[2.00]+cmd[6.92] seconds)
  format: OK (5.55=setup[5.09]+cmd[0.14,0.33] seconds)
  lint: OK (8.98=setup[1.92]+cmd[1.05,5.40,0.61] seconds)
  safety: OK (2.76=setup[1.70]+cmd[1.05] seconds)
  docs: OK (3.15=setup[2.92]+cmd[0.23] seconds)
  docs_sync_check: OK (1.84=setup[1.83]+cmd[0.01] seconds)
  congratulations :) (57.63 seconds)
(venv) 
[nix-shell:~/d/pynessie]$ 

XN137
XN137 previously approved these changes Nov 14, 2023
Copy link
Contributor

@XN137 XN137 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the update and added tests

lgtm overall with minor suggestions

tests/test_nessie_client.py Outdated Show resolved Hide resolved
assert _sanitize_url(base_url + "trees/tree/{}", "my/tag with mixed.types") == base_url + "trees/tree/my%2Ftag%20with%20mixed.types"
assert (
_sanitize_url(base_url + "trees/tree/{}/{}", "tag/name", "other/string@with-at")
== base_url + "trees/tree/tag%2Fname/other%2Fstring%40with-at"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wondering if we can modify/extend an existing CLI test for this as well (since it runs against an actual server)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, it's missing. I'm on it :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for adding the tests, pytest parameterize usage it good 👍

i probably wasnt clear but i was specifically asking for tests around the special @ part/endpoint to confirm its handled correctly by an actual server

but we have made enough iterations already, its probably working fine atm

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, @ is not supported by the server, I tested it but all my tests failed (the server accepts only ./_-):

Reference name must start with a letter, followed by letters, digits, one of the ./_- characters, not end with a slash or dot, not contain '..' - but was: branch/[email protected]

For this reason, I didn't add a specific test for it, but in this specific piece of code I'm just testing that the user is safe, not that it's a valid URL for Nessie

client = init()
base_url = client.get_base_url()
assert _sanitize_url(base_url) == base_url
assert _sanitize_url(base_url + "trees/tree") == base_url + "trees/tree"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just wondering: in actual client we always use leading / is it not "required" here i.e. get_base_url() already includes it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems that get_base_url() get the base_url value using config["endpoint"].get(). I can add the "/" at the beginning of my URL (to be consistent with the rest of the codebase), but it should not impact the type of test I'm doing here ;)

@@ -278,7 +279,9 @@ def get_content(
params = {"ref": ref}
if hash_on_ref:
params["hashOnRef"] = hash_on_ref
return cast(dict, _get(base_url + "/contents/{}".format(content_key.to_path_string()), auth, ssl_verify=ssl_verify, params=params))
return cast(
dict, _get(base_url + "/contents/{}".format(quote_plus(content_key.to_path_string())), auth, ssl_verify=ssl_verify, params=params)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems OK but part of my original comment was that docs of to_path_string claim this is already handled internally (now we are handling it externally... we can live with it i guess)

@bigluck
Copy link
Contributor Author

bigluck commented Nov 14, 2023

@XN137 done; I started using @pytest.mark.parametrize in some of the key tests to validate how the cli works using different types of branch names.

  py38: OK (11.02=setup[2.03]+cmd[8.99] seconds)
  py39: OK (13.99=setup[4.88]+cmd[9.11] seconds)
  py310: OK (10.84=setup[1.91]+cmd[8.93] seconds)
  py311: OK (11.17=setup[1.75]+cmd[9.43] seconds)
  format: OK (2.40=setup[1.94]+cmd[0.15,0.32] seconds)
  lint: OK (9.13=setup[1.88]+cmd[1.06,5.58,0.60] seconds)
  safety: OK (5.88=setup[1.68]+cmd[4.21] seconds)
  docs: OK (2.08=setup[1.85]+cmd[0.23] seconds)
  docs_sync_check: OK (1.80=setup[1.79]+cmd[0.01] seconds)
  congratulations :) (68.37 seconds)

Copy link
Contributor

@XN137 XN137 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks again for the update, lgtm

@snazy snazy merged commit 059a787 into projectnessie:main Nov 15, 2023
@snazy
Copy link
Member

snazy commented Nov 15, 2023

@bigluck thanks for the PR!

@bigluck
Copy link
Contributor Author

bigluck commented Nov 15, 2023

@bigluck thanks for the PR!

it's my pleasure to help out :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants