Fix unencoded url components #90

bigluck · 2023-11-09T13:37:05Z

In May the Spark implementation and the UI were fixed in order to support arbitrary strings like "my/branch" as branch names:

But not pynessie v0.64.2 fails when invoking .list_keys(ref='big/test'):

Entity not found at https://..../api/v1/trees/tree/big/test/entries?maxRecords=500

In this PR I'm using urllib.parse.quote_plus for encoding all user's inputs that dynamically compose the final URL.

CLAassistant · 2023-11-09T13:37:12Z

All committers have signed the CLA.

bigluck · 2023-11-09T13:50:20Z

I didn't change the last function of the file get_diff because I'm not sure what should be encoded.

XN137

thanks for the contribution... lgtm overall with minor suggestions

if you have extra time would be nice if you could adjust/amend a test in test_nessie_client.py or maybe even/also test_nessie_cli.py to use an exotic (i.e. with /) branch/tag name.

XN137 · 2023-11-10T09:56:14Z

pynessie/client/_endpoints.py

@@ -17,6 +17,7 @@

 import os
 from typing import Any, Optional, Union, cast
+from urllib.parse import quote_plus


just wondering: why did we use quote_plus instead of quote without safe chars?

Like quote(), but also replace spaces by plus signs, as required for quoting HTML form values when building up a query string to go into a URL

so wondering whether it matters for spaces but i guess according to this spaces are not supported so we only use quote_plus for its empty safe chars parameter...

maybe we could have a little helper method _url_quote_ref or the like.

We can definitely use quote('', safe='') indeed!

I'm going in my next push to normalize all the URLs using a new helper function named _sanitize_url

Thanks!

XN137 · 2023-11-10T09:58:57Z

pynessie/client/_endpoints.py

@@ -278,7 +279,9 @@ def get_content(
    params = {"ref": ref}
    if hash_on_ref:
        params["hashOnRef"] = hash_on_ref
-    return cast(dict, _get(base_url + "/contents/{}".format(content_key.to_path_string()), auth, ssl_verify=ssl_verify, params=params))
+    return cast(
+        dict, _get(base_url + "/contents/{}".format(quote_plus(content_key.to_path_string())), auth, ssl_verify=ssl_verify, params=params)


doc of to_path_string claims to provide an url encoding already but implementation doesnt look like it really:

pynessie/pynessie/model.py

Lines 164 to 168 in 332e1a6

def to_path_string(self) -> str:

"""Convert this key to a url encoded path string."""

# false positives in pylint

# pylint: disable=E1133

return ".".join(i.replace(".", "\00") if i else "" for i in self.elements)

not sure if we allow / in content keys, but if yes we should push the quoting inside the method most likely (and add a test in test_nessie_models.py for it)

/ in content keys is allows if the key has @ (whether followed by a hash specification or not)... However, it is probably easier an safer to encode all special chars using the %XY notation.

The code I'm working on will produce this output:

from urllib.parse import quote def _sanitize_url(url: str, *args: list[str]) -> str: return url.format(*[quote(arg, safe='') for arg in args]) def to_path_string(elements: list[str]) -> str: return ".".join([i.replace(".", "\00") if i else "" for i in elements]) print(">> result:", _sanitize_url('/my/path/{}/{}', "my/test", to_path_string(["a/b", "c", "d.e"]))) # >> result: /my/path/my%2Ftest/a%2Fb.c.d%00e

Is it ok?

seems OK but part of my original comment was that docs of to_path_string claim this is already handled internally (now we are handling it externally... we can live with it i guess)

@XN137 I see, but I'm a bit torn on it.

If we change the original to_path_string function, we should also rewrite this:

url = _sanitize_url(base_url + "/contents/{}", content_key.to_path_string())

into:

url = _sanitize_url(base_url + "/contents/{content_key.to_path_string()}")

which is fine, but it can become a problem in the future if somebody is not fully aware of why we're using the _sanitize_url() function, copies the block of code for implementing a new endpoint, and then we end up facing the same issue we had before.

At the same time to_path_string is converting only the . inside the elements into \00/%00, but it's not the standard %2E (I think %00 is parsed and mapped back by the nessie server once it receives the string), but at the same time, the function is concatenating all the strings using regular dots.

Maybe to_path_string() should be renamed into something different?

bigluck · 2023-11-14T06:48:53Z

Apologies, crazy days at work :) sure I'll work on it late this morning (CET time).

bigluck · 2023-11-14T10:36:43Z

@XN137 & @dimas-b have a look now

bigluck · 2023-11-14T10:43:20Z

.pkg: _exit> python /Users/bigluck/d/pynessie/venv/lib/python3.9/site-packages/pyproject_api/_backend.py True setuptools.build_meta __legacy__
  py38: OK (9.16=setup[2.29]+cmd[6.87] seconds)
  py39: OK (8.62=setup[1.84]+cmd[6.77] seconds)
  py310: OK (8.59=setup[1.88]+cmd[6.71] seconds)
  py311: OK (8.92=setup[2.00]+cmd[6.92] seconds)
  format: OK (5.55=setup[5.09]+cmd[0.14,0.33] seconds)
  lint: OK (8.98=setup[1.92]+cmd[1.05,5.40,0.61] seconds)
  safety: OK (2.76=setup[1.70]+cmd[1.05] seconds)
  docs: OK (3.15=setup[2.92]+cmd[0.23] seconds)
  docs_sync_check: OK (1.84=setup[1.83]+cmd[0.01] seconds)
  congratulations :) (57.63 seconds)
(venv) 
[nix-shell:~/d/pynessie]$

XN137

thanks for the update and added tests

lgtm overall with minor suggestions

tests/test_nessie_client.py

XN137 · 2023-11-14T11:03:49Z

tests/test_nessie_client.py

+    assert _sanitize_url(base_url + "trees/tree/{}", "my/tag with mixed.types") == base_url + "trees/tree/my%2Ftag%20with%20mixed.types"
+    assert (
+        _sanitize_url(base_url + "trees/tree/{}/{}", "tag/name", "other/string@with-at")
+        == base_url + "trees/tree/tag%2Fname/other%2Fstring%40with-at"


wondering if we can modify/extend an existing CLI test for this as well (since it runs against an actual server)

yes, it's missing. I'm on it :)

thanks for adding the tests, pytest parameterize usage it good 👍

i probably wasnt clear but i was specifically asking for tests around the special @ part/endpoint to confirm its handled correctly by an actual server

but we have made enough iterations already, its probably working fine atm

oh, @ is not supported by the server, I tested it but all my tests failed (the server accepts only ./_-):

Reference name must start with a letter, followed by letters, digits, one of the ./_- characters, not end with a slash or dot, not contain '..' - but was: branch/[email protected]

For this reason, I didn't add a specific test for it, but in this specific piece of code I'm just testing that the user is safe, not that it's a valid URL for Nessie

XN137 · 2023-11-14T11:07:08Z

tests/test_nessie_client.py

+    client = init()
+    base_url = client.get_base_url()
+    assert _sanitize_url(base_url) == base_url
+    assert _sanitize_url(base_url + "trees/tree") == base_url + "trees/tree"


just wondering: in actual client we always use leading / is it not "required" here i.e. get_base_url() already includes it?

Seems that get_base_url() get the base_url value using config["endpoint"].get(). I can add the "/" at the beginning of my URL (to be consistent with the rest of the codebase), but it should not impact the type of test I'm doing here ;)

XN137 · 2023-11-14T11:08:40Z

pynessie/client/_endpoints.py

@@ -278,7 +279,9 @@ def get_content(
    params = {"ref": ref}
    if hash_on_ref:
        params["hashOnRef"] = hash_on_ref
-    return cast(dict, _get(base_url + "/contents/{}".format(content_key.to_path_string()), auth, ssl_verify=ssl_verify, params=params))
+    return cast(
+        dict, _get(base_url + "/contents/{}".format(quote_plus(content_key.to_path_string())), auth, ssl_verify=ssl_verify, params=params)


seems OK but part of my original comment was that docs of to_path_string claim this is already handled internally (now we are handling it externally... we can live with it i guess)

bigluck · 2023-11-14T12:00:41Z

@XN137 done; I started using @pytest.mark.parametrize in some of the key tests to validate how the cli works using different types of branch names.

  py38: OK (11.02=setup[2.03]+cmd[8.99] seconds)
  py39: OK (13.99=setup[4.88]+cmd[9.11] seconds)
  py310: OK (10.84=setup[1.91]+cmd[8.93] seconds)
  py311: OK (11.17=setup[1.75]+cmd[9.43] seconds)
  format: OK (2.40=setup[1.94]+cmd[0.15,0.32] seconds)
  lint: OK (9.13=setup[1.88]+cmd[1.06,5.58,0.60] seconds)
  safety: OK (5.88=setup[1.68]+cmd[4.21] seconds)
  docs: OK (2.08=setup[1.85]+cmd[0.23] seconds)
  docs_sync_check: OK (1.80=setup[1.79]+cmd[0.01] seconds)
  congratulations :) (68.37 seconds)

XN137

thanks again for the update, lgtm

snazy · 2023-11-15T09:28:42Z

@bigluck thanks for the PR!

bigluck · 2023-11-15T09:43:41Z

@bigluck thanks for the PR!

it's my pleasure to help out :)

Fix unencoded url components

6385eb7

dimas-b requested a review from XN137 November 9, 2023 17:02

XN137 previously approved these changes Nov 10, 2023

View reviewed changes

bigluck added 2 commits November 14, 2023 11:29

All the urls are sanitized using _sanitize_url

4b93904

Added unit tests

02ed801

bigluck dismissed XN137’s stale review via 02ed801 November 14, 2023 10:36

XN137 previously approved these changes Nov 14, 2023

View reviewed changes

Extended and improved unit tests against branch names containing "/"

d715e14

bigluck dismissed XN137’s stale review via d715e14 November 14, 2023 11:44

XN137 approved these changes Nov 14, 2023

View reviewed changes

snazy merged commit 059a787 into projectnessie:main Nov 15, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix unencoded url components #90

Fix unencoded url components #90

bigluck commented Nov 9, 2023

CLAassistant commented Nov 9, 2023 •

edited

Loading

bigluck commented Nov 9, 2023

XN137 left a comment

XN137 Nov 10, 2023

bigluck Nov 14, 2023

XN137 Nov 10, 2023

dimas-b Nov 14, 2023

bigluck Nov 14, 2023 •

edited

Loading

XN137 Nov 14, 2023

bigluck Nov 14, 2023

bigluck commented Nov 14, 2023

bigluck commented Nov 14, 2023

bigluck commented Nov 14, 2023

XN137 left a comment

XN137 Nov 14, 2023

bigluck Nov 14, 2023

XN137 Nov 14, 2023

bigluck Nov 14, 2023

XN137 Nov 14, 2023

bigluck Nov 14, 2023

XN137 Nov 14, 2023

bigluck commented Nov 14, 2023

XN137 left a comment

snazy commented Nov 15, 2023

bigluck commented Nov 15, 2023

	def to_path_string(self) -> str:
	"""Convert this key to a url encoded path string."""
	# false positives in pylint
	# pylint: disable=E1133
	return ".".join(i.replace(".", "\00") if i else "" for i in self.elements)

Fix unencoded url components #90

Fix unencoded url components #90

Conversation

bigluck commented Nov 9, 2023

CLAassistant commented Nov 9, 2023 • edited Loading

bigluck commented Nov 9, 2023

XN137 left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

bigluck Nov 14, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

bigluck commented Nov 14, 2023

bigluck commented Nov 14, 2023

bigluck commented Nov 14, 2023

XN137 left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

bigluck commented Nov 14, 2023

XN137 left a comment

Choose a reason for hiding this comment

snazy commented Nov 15, 2023

bigluck commented Nov 15, 2023

CLAassistant commented Nov 9, 2023 •

edited

Loading

bigluck Nov 14, 2023 •

edited

Loading