Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Python 3: Convert some unicode/bytes uses #3569

Merged
merged 28 commits into from
Aug 1, 2018
Merged

Conversation

hawkowl
Copy link
Contributor

@hawkowl hawkowl commented Jul 20, 2018

Also fixes #3306 by actually normalising passwords. This has the potential for breaking people's passwords, but no more than the issue of using a different browser or device breaking someone's password.

@hawkowl hawkowl requested a review from a team July 20, 2018 12:24
return bcrypt.hashpw(password.encode('utf8') + self.hs.config.password_pepper,
bcrypt.gensalt(self.bcrypt_rounds))
# Ensure that we normalise the password
if isinstance(password, bytes):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't we decide if it's meant to be a str or a bytes? or document it as such if it's really meant to be either?

(also, shouldn't this be written:

if isinstance(password, bytes):
    password = password.decode('utf8')
pw = unicodedata.normalize("NFKC", password)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would be a nicer way of writing it, but password isn't in a writable scope (because it's inside another function).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wrt the type -- it's being passed in as str, but since it's not clear that it would only be decoded on py2, I've changed it to be a Py3 check.

@@ -876,8 +885,12 @@ def validate_hash(self, password, stored_hash):
"""

def _do_validate_hash():
if isinstance(password, bytes):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as above

if PY3:
return urllib.parse.unquote(s)
else:
return urllib.parse.unquote(s.encode('utf8')).decode('utf8')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are we now encoding this when we weren't before? And why doesn't it need to happen on py3?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We aren't encoding this because we now parse the incoming arguments to Unicode in _get_handler_for_request.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -296,8 +298,14 @@ def _async_render(self, request):
# here. If it throws an exception, that is handled by the wrapper
# installed by @request_handler.

def _parse(s):
if PY3:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why don't we decode when we have PY3?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given the URL encoded sequence string "%E2%98%83", encoded as a unicode string:

Python2:

>>> urllib.unquote(u"%E2%98%83")
u'\xe2\x98\x83'

This is wrong, it returns Unicode but the ASCII escaped character codes.

Python 2:

urllib.unquote(u"%E2%98%83".encode('ascii')).decode('utf8')
u'\u2603'

Correct, returns the Unicode literal, but escaped for display as Py2 will usually not print real Unicode characters by itself.

Python 3:

>>> urllib.parse.unquote(u"%E2%98%83")
'☃'

Correct, returns the Unicode literal (not escaped, as Python 3 has the correct terminal encoding set up).

@@ -327,7 +335,7 @@ def _get_handler_for_request(self, request):
# Loop through all the registered callbacks to check if the method
# and path regex match
for path_entry in self.path_regexs.get(request.method, []):
m = path_entry.pattern.match(request.path)
m = path_entry.pattern.match(request.path.decode())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

decode() without an explicit encoding makes me twitchy.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@@ -383,7 +391,7 @@ def __init__(self, path):
self.url = path

def render_GET(self, request):
return redirectTo(self.url, request)
return redirectTo(self.url.encode(), request)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

likewise encode()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

if PY3:
return frozendict_json_encoder.encode(json_object)
else:
return frozendict_json_encoder.encode(json_object).decode("utf-8")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we only do this for py2?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

json/simplejson returns str (so unicode on py3, bytes on py2). We want it as unicode, so we have to decode it on Py2.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(also the decodes in the functions below have been moved to this one spot)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok comments please to explain this.

else:
done = {}
for k, v in txn:
if not isinstance(v, bytes):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why does it vary?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It shouldn't, so I removed the if.

done = {}
for k, v in txn:
if not isinstance(v, bytes):
done[k] = v.encode('ascii')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we update the docstring to note that this returns a dict[str,bytes] please?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@hawkowl hawkowl requested a review from a team July 25, 2018 10:59
@@ -626,6 +629,10 @@ def validate_login(self, username, login_submission):
# special case to check for "password" for the check_password interface
# for the auth providers
password = login_submission.get("password")

if password and PY2:
password = password.decode('utf8')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this necessary? surely login_submission is coming from the json body so should already have been un-utf8'ed?

(and if it is necessary, why don't we do it on PY3?)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(should it not just be password = unicode(password) ?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

json.loads() does not do any decoding, it returns a dict with str keys/value, so bytes on Python 2 and unicode on Py3. We then need to decode it to utf8 to get the Unicode which we then need to pass to unicodedata.normalize.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unicode(password) also will decode it with ASCII, as it's forced decoding.

@@ -708,6 +715,7 @@ def _check_local_password(self, user_id, password):

Args:
user_id (str): complete @user:id
password (unicode): the provided password
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it might be worth clarifying that it will be a str on python3. Something like:

password (str|unicode): the provided password. On python2, *must* be a unicode.

(and similar on validate_hash etc etc)

Of course what's really happening is that elsewhere we have been sloppy about saying str when we mean str|unicode, though I don't suggest we change that...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's clearer to use bytes (flat bytes), str (str on either platform), or unicode (unicode), even though there is no such thing as unicode on Python 3.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, as discussed, unicode seems the right answer here. However, the difference with user_id and the return type is now very striking; suggest you update them too.

@@ -131,7 +131,7 @@ def register(
Args:
localpart : The local part of the user ID to register. If None,
one will be generated.
password (str) : The password to assign to this user so they can
password (unicode) : The password to assign to this user so they can

This comment was marked as resolved.

@@ -296,8 +298,14 @@ def _async_render(self, request):
# here. If it throws an exception, that is handled by the wrapper
# installed by @request_handler.

def _unquote(s):
if PY3:
return urllib.parse.unquote(s)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we not utf-8 decode here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's already decoded by _get_handler_for_request.

if PY3:
return urllib.parse.unquote(s)
else:
return urllib.parse.unquote(s.encode('ascii')).decode('utf8')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we encode('ascii') here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

URLs are 7-bit ASCII.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes they are, but that's not the point.

the question is why we do the encoding here and not for python3. Likewise, I still don't understand why we decode the result on py2 and not on python3.

Is it that urllib.parse.unquote takes and returns raw bytes on python2 and unicode on python3?

whatever the answer is, could we have some explanation in comments?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ugh, what a mess. thanks for clarifying this.

@@ -175,6 +175,7 @@ def on_POST(self, request):
from synapse.rest.client.v2_alpha.register import RegisterRestServlet
register = RegisterRestServlet(self.hs)

password = password.decode('utf-8')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if there's a reason for doing password.decode('utf-8') rather than body["password"], can you add a comment to say what it is?

@@ -165,6 +166,9 @@ def on_POST(self, request):
assert_params_in_dict(params, ["new_password"])
new_password = params['new_password']

if PY2:
new_password = new_password.decode('utf8')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

again, why not unicode(new_password) ?

new_password = params.get("password", None)

if new_password and isinstance(new_password, bytes):
# We may not need to decode the password, if it came from the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this smells wrong. Surely we should be storing the same type in the session as is in the original params?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also, haven't params already been decoded?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original params are str (so bytes on Python 2) because json.loads returns strs, not Unicode, we don't want that.

if PY3:
return frozendict_json_encoder.encode(json_object)
else:
return frozendict_json_encoder.encode(json_object).decode("utf-8")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok comments please to explain this.

@hawkowl hawkowl dismissed richvdh’s stale review July 26, 2018 11:32

fixed the utf8 shenanegans

@hawkowl hawkowl requested a review from a team July 26, 2018 12:02
@hawkowl hawkowl removed their assignment Jul 26, 2018
@@ -849,14 +852,19 @@ def hash(self, password):
"""Computes a secure hash of password.

Args:
password (str): Password to hash.
password (unicode): Password to hash.

Returns:
Deferred(str): Hashed password.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/str/bytes/ ?

@@ -868,16 +876,19 @@ def validate_hash(self, password, stored_hash):
"""Validates that self.hash(password) == stored_hash.

Args:
password (str): Password to hash.
password (unicode): Password to hash.
stored_hash (str): Expected hash value.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/str/bytes/ ?

if PY3:
return urllib.parse.unquote(s)
else:
return urllib.parse.unquote(s.encode('ascii')).decode('utf8')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes they are, but that's not the point.

the question is why we do the encoding here and not for python3. Likewise, I still don't understand why we decode the result on py2 and not on python3.

Is it that urllib.parse.unquote takes and returns raw bytes on python2 and unicode on python3?

whatever the answer is, could we have some explanation in comments?

@@ -74,15 +74,21 @@ def _get_event_reference_hashes_txn(self, txn, event_id):
txn (cursor):
event_id (str): Id for the Event.
Returns:
A dict of algorithm -> hash.
A dict[str, bytes] of algorithm -> hash.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/str/unicode/ ?

"""
query = (
"SELECT algorithm, hash"
" FROM event_reference_hashes"
" WHERE event_id = ?"
)
txn.execute(query, (event_id, ))
return {k: v for k, v in txn}
if six.PY2:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we not use the PY3 path under py2 as well?

else:
done = {}
for k, v in txn:
done[k] = v.encode('ascii')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fwiw I'd find a dict comprehension clearer here:

return {k: v.encode('ascii') for k, v in txn}

@@ -66,5 +66,7 @@ def _handle_frozendict(obj):

# A JSONEncoder which is capable of encoding frozendics without barfing
frozendict_json_encoder = json.JSONEncoder(
ensure_ascii=False,
encoding="utf8",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is the default <shrug>

@@ -66,5 +66,7 @@ def _handle_frozendict(obj):

# A JSONEncoder which is capable of encoding frozendics without barfing
frozendict_json_encoder = json.JSONEncoder(
ensure_ascii=False,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this should be necessary (and it makes simplejson much slower). Could you add comments to explain why it is necessary?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as discussed, let's just coerce the result to unicode

@hawkowl hawkowl requested a review from a team July 27, 2018 14:42
@hawkowl hawkowl removed their assignment Jul 27, 2018
@richvdh richvdh self-assigned this Jul 31, 2018
Copy link
Member

@richvdh richvdh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good except frozendict shenanigans

@@ -66,5 +66,7 @@ def _handle_frozendict(obj):

# A JSONEncoder which is capable of encoding frozendics without barfing
frozendict_json_encoder = json.JSONEncoder(
ensure_ascii=False,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as discussed, let's just coerce the result to unicode

@richvdh richvdh assigned hawkowl and unassigned richvdh Jul 31, 2018
@hawkowl hawkowl merged commit da77851 into develop Aug 1, 2018
@hawkowl hawkowl deleted the hawkowl/bytes-clean-2 branch August 1, 2018 14:54
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants