-
-
Notifications
You must be signed in to change notification settings - Fork 2.1k
Python 3: Convert some unicode/bytes uses #3569
Conversation
synapse/handlers/auth.py
Outdated
return bcrypt.hashpw(password.encode('utf8') + self.hs.config.password_pepper, | ||
bcrypt.gensalt(self.bcrypt_rounds)) | ||
# Ensure that we normalise the password | ||
if isinstance(password, bytes): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't we decide if it's meant to be a str or a bytes? or document it as such if it's really meant to be either?
(also, shouldn't this be written:
if isinstance(password, bytes):
password = password.decode('utf8')
pw = unicodedata.normalize("NFKC", password)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would be a nicer way of writing it, but password isn't in a writable scope (because it's inside another function).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wrt the type -- it's being passed in as str, but since it's not clear that it would only be decoded on py2, I've changed it to be a Py3 check.
synapse/handlers/auth.py
Outdated
@@ -876,8 +885,12 @@ def validate_hash(self, password, stored_hash): | |||
""" | |||
|
|||
def _do_validate_hash(): | |||
if isinstance(password, bytes): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as above
synapse/http/server.py
Outdated
if PY3: | ||
return urllib.parse.unquote(s) | ||
else: | ||
return urllib.parse.unquote(s.encode('utf8')).decode('utf8') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why are we now encoding this when we weren't before? And why doesn't it need to happen on py3?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We aren't encoding this because we now parse the incoming arguments to Unicode in _get_handler_for_request
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(as to why we encode it on py2, see https://github.com/matrix-org/synapse/pull/3569/files/4831ead29a575a299d8e377f26bc8fc35dc49cc2#r205017485 )
synapse/http/server.py
Outdated
@@ -296,8 +298,14 @@ def _async_render(self, request): | |||
# here. If it throws an exception, that is handled by the wrapper | |||
# installed by @request_handler. | |||
|
|||
def _parse(s): | |||
if PY3: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why don't we decode when we have PY3?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given the URL encoded sequence string "%E2%98%83", encoded as a unicode string:
Python2:
>>> urllib.unquote(u"%E2%98%83")
u'\xe2\x98\x83'
This is wrong, it returns Unicode but the ASCII escaped character codes.
Python 2:
urllib.unquote(u"%E2%98%83".encode('ascii')).decode('utf8')
u'\u2603'
Correct, returns the Unicode literal, but escaped for display as Py2 will usually not print real Unicode characters by itself.
Python 3:
>>> urllib.parse.unquote(u"%E2%98%83")
'☃'
Correct, returns the Unicode literal (not escaped, as Python 3 has the correct terminal encoding set up).
synapse/http/server.py
Outdated
@@ -327,7 +335,7 @@ def _get_handler_for_request(self, request): | |||
# Loop through all the registered callbacks to check if the method | |||
# and path regex match | |||
for path_entry in self.path_regexs.get(request.method, []): | |||
m = path_entry.pattern.match(request.path) | |||
m = path_entry.pattern.match(request.path.decode()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
decode()
without an explicit encoding makes me twitchy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
synapse/http/server.py
Outdated
@@ -383,7 +391,7 @@ def __init__(self, path): | |||
self.url = path | |||
|
|||
def render_GET(self, request): | |||
return redirectTo(self.url, request) | |||
return redirectTo(self.url.encode(), request) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
likewise encode()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
synapse/storage/events.py
Outdated
if PY3: | ||
return frozendict_json_encoder.encode(json_object) | ||
else: | ||
return frozendict_json_encoder.encode(json_object).decode("utf-8") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we only do this for py2?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
json/simplejson returns str
(so unicode on py3, bytes on py2). We want it as unicode, so we have to decode it on Py2.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(also the decodes in the functions below have been moved to this one spot)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok comments please to explain this.
synapse/storage/signatures.py
Outdated
else: | ||
done = {} | ||
for k, v in txn: | ||
if not isinstance(v, bytes): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why does it vary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It shouldn't, so I removed the if.
synapse/storage/signatures.py
Outdated
done = {} | ||
for k, v in txn: | ||
if not isinstance(v, bytes): | ||
done[k] = v.encode('ascii') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we update the docstring to note that this returns a dict[str,bytes]
please?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
synapse/handlers/auth.py
Outdated
@@ -626,6 +629,10 @@ def validate_login(self, username, login_submission): | |||
# special case to check for "password" for the check_password interface | |||
# for the auth providers | |||
password = login_submission.get("password") | |||
|
|||
if password and PY2: | |||
password = password.decode('utf8') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this necessary? surely login_submission is coming from the json body so should already have been un-utf8'ed?
(and if it is necessary, why don't we do it on PY3?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(should it not just be password = unicode(password)
?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
json.loads() does not do any decoding, it returns a dict with str keys/value, so bytes on Python 2 and unicode on Py3. We then need to decode it to utf8 to get the Unicode which we then need to pass to unicodedata.normalize
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unicode(password)
also will decode it with ASCII, as it's forced decoding.
synapse/handlers/auth.py
Outdated
@@ -708,6 +715,7 @@ def _check_local_password(self, user_id, password): | |||
|
|||
Args: | |||
user_id (str): complete @user:id | |||
password (unicode): the provided password |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it might be worth clarifying that it will be a str
on python3. Something like:
password (str|unicode): the provided password. On python2, *must* be a unicode.
(and similar on validate_hash
etc etc)
Of course what's really happening is that elsewhere we have been sloppy about saying str
when we mean str|unicode
, though I don't suggest we change that...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's clearer to use bytes
(flat bytes), str
(str on either platform), or unicode
(unicode), even though there is no such thing as unicode
on Python 3.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, as discussed, unicode
seems the right answer here. However, the difference with user_id
and the return type is now very striking; suggest you update them too.
@@ -131,7 +131,7 @@ def register( | |||
Args: | |||
localpart : The local part of the user ID to register. If None, | |||
one will be generated. | |||
password (str) : The password to assign to this user so they can | |||
password (unicode) : The password to assign to this user so they can |
This comment was marked as resolved.
This comment was marked as resolved.
Sorry, something went wrong.
synapse/http/server.py
Outdated
@@ -296,8 +298,14 @@ def _async_render(self, request): | |||
# here. If it throws an exception, that is handled by the wrapper | |||
# installed by @request_handler. | |||
|
|||
def _unquote(s): | |||
if PY3: | |||
return urllib.parse.unquote(s) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we not utf-8 decode here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's already decoded by _get_handler_for_request
.
synapse/http/server.py
Outdated
if PY3: | ||
return urllib.parse.unquote(s) | ||
else: | ||
return urllib.parse.unquote(s.encode('ascii')).decode('utf8') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we encode('ascii')
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
URLs are 7-bit ASCII.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes they are, but that's not the point.
the question is why we do the encoding here and not for python3. Likewise, I still don't understand why we decode the result on py2 and not on python3.
Is it that urllib.parse.unquote
takes and returns raw bytes on python2 and unicode on python3?
whatever the answer is, could we have some explanation in comments?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ugh, what a mess. thanks for clarifying this.
synapse/rest/client/v1/admin.py
Outdated
@@ -175,6 +175,7 @@ def on_POST(self, request): | |||
from synapse.rest.client.v2_alpha.register import RegisterRestServlet | |||
register = RegisterRestServlet(self.hs) | |||
|
|||
password = password.decode('utf-8') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if there's a reason for doing password.decode('utf-8')
rather than body["password"]
, can you add a comment to say what it is?
@@ -165,6 +166,9 @@ def on_POST(self, request): | |||
assert_params_in_dict(params, ["new_password"]) | |||
new_password = params['new_password'] | |||
|
|||
if PY2: | |||
new_password = new_password.decode('utf8') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
again, why not unicode(new_password)
?
new_password = params.get("password", None) | ||
|
||
if new_password and isinstance(new_password, bytes): | ||
# We may not need to decode the password, if it came from the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this smells wrong. Surely we should be storing the same type in the session as is in the original params?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also, haven't params
already been decoded?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The original params are str
(so bytes on Python 2) because json.loads returns strs, not Unicode, we don't want that.
synapse/storage/events.py
Outdated
if PY3: | ||
return frozendict_json_encoder.encode(json_object) | ||
else: | ||
return frozendict_json_encoder.encode(json_object).decode("utf-8") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok comments please to explain this.
synapse/handlers/auth.py
Outdated
@@ -849,14 +852,19 @@ def hash(self, password): | |||
"""Computes a secure hash of password. | |||
|
|||
Args: | |||
password (str): Password to hash. | |||
password (unicode): Password to hash. | |||
|
|||
Returns: | |||
Deferred(str): Hashed password. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/str/bytes/ ?
synapse/handlers/auth.py
Outdated
@@ -868,16 +876,19 @@ def validate_hash(self, password, stored_hash): | |||
"""Validates that self.hash(password) == stored_hash. | |||
|
|||
Args: | |||
password (str): Password to hash. | |||
password (unicode): Password to hash. | |||
stored_hash (str): Expected hash value. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/str/bytes/ ?
synapse/http/server.py
Outdated
if PY3: | ||
return urllib.parse.unquote(s) | ||
else: | ||
return urllib.parse.unquote(s.encode('ascii')).decode('utf8') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes they are, but that's not the point.
the question is why we do the encoding here and not for python3. Likewise, I still don't understand why we decode the result on py2 and not on python3.
Is it that urllib.parse.unquote
takes and returns raw bytes on python2 and unicode on python3?
whatever the answer is, could we have some explanation in comments?
synapse/storage/signatures.py
Outdated
@@ -74,15 +74,21 @@ def _get_event_reference_hashes_txn(self, txn, event_id): | |||
txn (cursor): | |||
event_id (str): Id for the Event. | |||
Returns: | |||
A dict of algorithm -> hash. | |||
A dict[str, bytes] of algorithm -> hash. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/str/unicode/ ?
synapse/storage/signatures.py
Outdated
""" | ||
query = ( | ||
"SELECT algorithm, hash" | ||
" FROM event_reference_hashes" | ||
" WHERE event_id = ?" | ||
) | ||
txn.execute(query, (event_id, )) | ||
return {k: v for k, v in txn} | ||
if six.PY2: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we not use the PY3 path under py2 as well?
synapse/storage/signatures.py
Outdated
else: | ||
done = {} | ||
for k, v in txn: | ||
done[k] = v.encode('ascii') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fwiw I'd find a dict comprehension clearer here:
return {k: v.encode('ascii') for k, v in txn}
synapse/util/frozenutils.py
Outdated
@@ -66,5 +66,7 @@ def _handle_frozendict(obj): | |||
|
|||
# A JSONEncoder which is capable of encoding frozendics without barfing | |||
frozendict_json_encoder = json.JSONEncoder( | |||
ensure_ascii=False, | |||
encoding="utf8", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is the default <shrug>
synapse/util/frozenutils.py
Outdated
@@ -66,5 +66,7 @@ def _handle_frozendict(obj): | |||
|
|||
# A JSONEncoder which is capable of encoding frozendics without barfing | |||
frozendict_json_encoder = json.JSONEncoder( | |||
ensure_ascii=False, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this should be necessary (and it makes simplejson much slower). Could you add comments to explain why it is necessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as discussed, let's just coerce the result to unicode
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good except frozendict shenanigans
synapse/util/frozenutils.py
Outdated
@@ -66,5 +66,7 @@ def _handle_frozendict(obj): | |||
|
|||
# A JSONEncoder which is capable of encoding frozendics without barfing | |||
frozendict_json_encoder = json.JSONEncoder( | |||
ensure_ascii=False, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as discussed, let's just coerce the result to unicode
Also fixes #3306 by actually normalising passwords. This has the potential for breaking people's passwords, but no more than the issue of using a different browser or device breaking someone's password.