Python 3: Convert some unicode/bytes uses #3569

hawkowl · 2018-07-20T06:56:12Z

Also fixes #3306 by actually normalising passwords. This has the potential for breaking people's passwords, but no more than the issue of using a different browser or device breaking someone's password.

richvdh · 2018-07-23T15:11:59Z

synapse/handlers/auth.py

-            return bcrypt.hashpw(password.encode('utf8') + self.hs.config.password_pepper,
-                                 bcrypt.gensalt(self.bcrypt_rounds))
+            # Ensure that we normalise the password
+            if isinstance(password, bytes):


shouldn't we decide if it's meant to be a str or a bytes? or document it as such if it's really meant to be either?

also, shouldn't this be written:

    if isinstance(password, bytes):
        password = password.decode('utf8')
    pw = unicodedata.normalize("NFKC", password)

That would be a nicer way of writing it, but password isn't in a writable scope (because it's inside another function).

wrt the type -- it's being passed in as str, but since it's not clear that it would only be decoded on py2, I've changed it to be a Py3 check.
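A minimal sketch of the normalisation step under discussion, assuming the password may arrive as either bytes or text. The helper name `normalise_password` is hypothetical and is not taken from the PR; the bcrypt hashing is left out so that the snippet runs with only the standard library.

```python
import unicodedata


def normalise_password(password):
    """Return the NFKC-normalised text form of a password.

    Hypothetical helper: accepts bytes (assumed to be UTF-8) or str.
    """
    if isinstance(password, bytes):
        password = password.decode("utf8")
    return unicodedata.normalize("NFKC", password)


# Two spellings of the same visible password: a precomposed e-acute
# versus "e" followed by a combining acute accent.
composed = normalise_password("caf\u00e9")
decomposed = normalise_password("cafe\u0301".encode("utf8"))
```

Both inputs normalise to the same text, which is the kind of mismatch between clients that normalising before hashing is meant to avoid.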

richvdh · 2018-07-23T15:12:08Z

synapse/handlers/auth.py

@@ -876,8 +885,12 @@ def validate_hash(self, password, stored_hash):
        """

        def _do_validate_hash():
+            if isinstance(password, bytes):


richvdh · 2018-07-23T15:15:57Z

synapse/http/server.py

+            if PY3:
+                return urllib.parse.unquote(s)
+            else:
+                return urllib.parse.unquote(s.encode('utf8')).decode('utf8')


why are we now encoding this when we weren't before? And why doesn't it need to happen on py3?

We aren't encoding this because we now parse the incoming arguments to Unicode in _get_handler_for_request.

(as to why we encode it on py2, see https://github.com/matrix-org/synapse/pull/3569/files/4831ead29a575a299d8e377f26bc8fc35dc49cc2#r205017485 )

richvdh · 2018-07-23T15:16:15Z

synapse/http/server.py

@@ -296,8 +298,14 @@ def _async_render(self, request):
        # here. If it throws an exception, that is handled by the wrapper
        # installed by @request_handler.

+        def _parse(s):
+            if PY3:


why don't we decode when we have PY3?

Given the URL encoded sequence string "%E2%98%83", encoded as a unicode string:

Python 2:

    >>> urllib.unquote(u"%E2%98%83")
    u'\xe2\x98\x83'

This is wrong: it returns a unicode string, but each escaped byte has become its own code point instead of being decoded as UTF-8.

Python 2:

    >>> urllib.unquote(u"%E2%98%83".encode('ascii')).decode('utf8')
    u'\u2603'

Correct: it returns the Unicode character, shown escaped because Py2 will usually not print real Unicode characters by itself.

Python 3:

    >>> urllib.parse.unquote(u"%E2%98%83")
    '☃'

Correct: it returns the Unicode character (not escaped, as Python 3 has the correct terminal encoding set up).
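The Python 3 side of this comparison can be reproduced with the standard library alone. This is a sketch for illustration, not code from the PR; the variable names are made up.

```python
from urllib.parse import unquote, unquote_to_bytes

quoted = "%E2%98%83"

# Default behaviour: the percent-escapes are collected into bytes and
# decoded as UTF-8, giving a single character.
as_text = unquote(quoted)

# The raw bytes behind the escapes.
as_bytes = unquote_to_bytes(quoted)

# Decoding each byte as its own code point (Latin-1) reproduces the
# three-character result shown for Python 2's urllib.unquote on unicode.
per_byte = unquote(quoted, encoding="latin-1")
```

Here `as_text` is the snowman character, while `per_byte` is the three-code-point string from the first Python 2 example.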

richvdh · 2018-07-23T15:18:31Z

synapse/http/server.py

@@ -327,7 +335,7 @@ def _get_handler_for_request(self, request):
        # Loop through all the registered callbacks to check if the method
        # and path regex match
        for path_entry in self.path_regexs.get(request.method, []):
-            m = path_entry.pattern.match(request.path)
+            m = path_entry.pattern.match(request.path.decode())


decode() without an explicit encoding makes me twitchy.

richvdh · 2018-07-23T15:18:43Z

synapse/http/server.py

@@ -383,7 +391,7 @@ def __init__(self, path):
        self.url = path

    def render_GET(self, request):
-        return redirectTo(self.url, request)
+        return redirectTo(self.url.encode(), request)


likewise encode()
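For context on the two comments above: a bare `decode()` or `encode()` uses ASCII on Python 2 and UTF-8 on Python 3, so the behaviour depends on the interpreter. A sketch of spelling the encoding out, assuming ASCII is the intended choice for paths and redirect URLs; the helper names and the example paths are hypothetical.

```python
def path_to_text(raw_path):
    """Decode a request path, stating the encoding explicitly."""
    return raw_path.decode("ascii")


def url_to_bytes(url):
    """Encode a redirect target, stating the encoding explicitly."""
    return url.encode("ascii")


decoded = path_to_text(b"/_matrix/client/r0/login")
encoded = url_to_bytes("/_matrix/static/")

# Non-ASCII input now fails loudly on either interpreter instead of
# depending on a default encoding.
try:
    path_to_text("/caf\u00e9".encode("utf8"))
    rejected = False
except UnicodeDecodeError:
    rejected = True
```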

richvdh · 2018-07-23T15:20:10Z

synapse/storage/events.py

+    if PY3:
+        return frozendict_json_encoder.encode(json_object)
+    else:
+        return frozendict_json_encoder.encode(json_object).decode("utf-8")


why do we only do this for py2?

json/simplejson returns str (so unicode on py3, bytes on py2). We want it as unicode, so we have to decode it on Py2.

(also the decodes in the functions below have been moved to this one spot)

ok comments please to explain this.
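One way to make the intent visible without a version check is to coerce on the result type and say why in a comment. This is only a sketch using the standard `json` module; `encode_json_text` is a hypothetical name and the real code uses `frozendict_json_encoder`.

```python
import json


def encode_json_text(json_object):
    """Encode to JSON and always return text (hypothetical helper)."""
    out = json.JSONEncoder().encode(json_object)
    if isinstance(out, bytes):
        # Only reachable where the encoder hands back a byte string,
        # as described for Python 2 in the reply above.
        out = out.decode("utf8")
    return out


encoded = encode_json_text({"type": "m.room.message"})
```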

richvdh · 2018-07-23T15:21:11Z

synapse/storage/signatures.py

+        else:
+            done = {}
+            for k, v in txn:
+                if not isinstance(v, bytes):


why does it vary?

It shouldn't, so I removed the if.

richvdh · 2018-07-23T15:21:56Z

synapse/storage/signatures.py

+            done = {}
+            for k, v in txn:
+                if not isinstance(v, bytes):
+                    done[k] = v.encode('ascii')


can we update the docstring to note that this returns a dict[str,bytes] please?

…wkowl/bytes-clean-2

fixed

richvdh · 2018-07-25T12:33:57Z

synapse/handlers/auth.py

@@ -626,6 +629,10 @@ def validate_login(self, username, login_submission):
        # special case to check for "password" for the check_password interface
        # for the auth providers
        password = login_submission.get("password")
+
+        if password and PY2:
+            password = password.decode('utf8')


why is this necessary? surely login_submission is coming from the json body so should already have been un-utf8'ed?

(and if it is necessary, why don't we do it on PY3?)

(should it not just be password = unicode(password) ?)

json.loads() does not do any decoding; it returns a dict with str keys and values, so bytes on Python 2 and unicode on Py3. On Python 2 we then need to decode it from UTF-8 to get the Unicode that we pass to unicodedata.normalize.

unicode(password) would also decode it as ASCII, since it forces an implicit decode.
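The difference between an ASCII decode and a UTF-8 decode can be sketched on Python 3, where `str(raw, "ascii")` stands in for the implicit ASCII decode that `unicode(password)` performs on Python 2. The sample password is made up for illustration.

```python
# Bytes as they might arrive for a password containing a non-ASCII letter.
raw = "p\u00e4ssword".encode("utf8")

# Decoding as UTF-8 recovers the original text.
text = raw.decode("utf8")

# Decoding as ASCII fails on the first non-ASCII byte.
try:
    str(raw, "ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False
```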

richvdh · 2018-07-25T12:48:49Z

synapse/handlers/auth.py

@@ -708,6 +715,7 @@ def _check_local_password(self, user_id, password):

        Args:
            user_id (str): complete @user:id
+            password (unicode): the provided password


I think it might be worth clarifying that it will be a str on python3. Something like:

password (str|unicode): the provided password. On python2, *must* be a unicode.

(and similar on validate_hash etc etc)

Of course what's really happening is that elsewhere we have been sloppy about saying str when we mean str|unicode, though I don't suggest we change that...

It's clearer to use bytes (flat bytes), str (str on either platform), or unicode (unicode), even though there is no such thing as unicode on Python 3.

ok, as discussed, unicode seems the right answer here. However, the difference with user_id and the return type is now very striking; suggest you update them too.

synapse/handlers/register.py

@@ -131,7 +131,7 @@ def register(
        Args:
            localpart : The local part of the user ID to register. If None,
              one will be generated.
-            password (str) : The password to assign to this user so they can
+            password (unicode) : The password to assign to this user so they can


richvdh · 2018-07-25T12:52:14Z

synapse/http/server.py

@@ -296,8 +298,14 @@ def _async_render(self, request):
        # here. If it throws an exception, that is handled by the wrapper
        # installed by @request_handler.

+        def _unquote(s):
+            if PY3:
+                return urllib.parse.unquote(s)


why do we not utf-8 decode here?

It's already decoded by _get_handler_for_request.

richvdh · 2018-07-25T12:52:46Z

synapse/http/server.py

+            if PY3:
+                return urllib.parse.unquote(s)
+            else:
+                return urllib.parse.unquote(s.encode('ascii')).decode('utf8')


why do we encode('ascii') here?

URLs are 7-bit ASCII.

yes they are, but that's not the point.

the question is why we do the encoding here and not for python3. Likewise, I still don't understand why we decode the result on py2 and not on python3.

Is it that urllib.parse.unquote takes and returns raw bytes on python2 and unicode on python3?

whatever the answer is, could we have some explanation in comments?

ugh, what a mess. thanks for clarifying this.

richvdh · 2018-07-25T12:55:59Z

synapse/rest/client/v1/admin.py

@@ -175,6 +175,7 @@ def on_POST(self, request):
        from synapse.rest.client.v2_alpha.register import RegisterRestServlet
        register = RegisterRestServlet(self.hs)

+        password = password.decode('utf-8')


if there's a reason for doing password.decode('utf-8') rather than body["password"], can you add a comment to say what it is?

richvdh · 2018-07-25T12:56:42Z

synapse/rest/client/v2_alpha/account.py

@@ -165,6 +166,9 @@ def on_POST(self, request):
        assert_params_in_dict(params, ["new_password"])
        new_password = params['new_password']

+        if PY2:
+            new_password = new_password.decode('utf8')


again, why not unicode(new_password) ?

richvdh · 2018-07-25T12:59:24Z

synapse/rest/client/v2_alpha/register.py

+            new_password = params.get("password", None)
+
+            if new_password and isinstance(new_password, bytes):
+                # We may not need to decode the password, if it came from the


this smells wrong. Surely we should be storing the same type in the session as is in the original params?

also, haven't params already been decoded?

The original params are str (so bytes on Python 2) because json.loads returns strs, not Unicode, which we don't want.

richvdh · 2018-07-25T13:00:09Z

synapse/storage/events.py

+    if PY3:
+        return frozendict_json_encoder.encode(json_object)
+    else:
+        return frozendict_json_encoder.encode(json_object).decode("utf-8")


ok comments please to explain this.

…wkowl/bytes-clean-2

fixed the utf8 shenanigans

richvdh · 2018-07-26T16:47:21Z

synapse/handlers/auth.py

@@ -849,14 +852,19 @@ def hash(self, password):
        """Computes a secure hash of password.

        Args:
-            password (str): Password to hash.
+            password (unicode): Password to hash.

        Returns:
            Deferred(str): Hashed password.


s/str/bytes/ ?

richvdh · 2018-07-26T16:48:12Z

synapse/handlers/auth.py

@@ -868,16 +876,19 @@ def validate_hash(self, password, stored_hash):
        """Validates that self.hash(password) == stored_hash.

        Args:
-            password (str): Password to hash.
+            password (unicode): Password to hash.
            stored_hash (str): Expected hash value.


s/str/bytes/ ?

richvdh · 2018-07-26T17:00:53Z

synapse/storage/signatures.py

@@ -74,15 +74,21 @@ def _get_event_reference_hashes_txn(self, txn, event_id):
            txn (cursor):
            event_id (str): Id for the Event.
        Returns:
-            A dict of algorithm -> hash.
+            A dict[str, bytes] of algorithm -> hash.


s/str/unicode/ ?

richvdh · 2018-07-26T17:02:08Z

synapse/storage/signatures.py

        """
        query = (
            "SELECT algorithm, hash"
            " FROM event_reference_hashes"
            " WHERE event_id = ?"
        )
        txn.execute(query, (event_id, ))
-        return {k: v for k, v in txn}
+        if six.PY2:


can we not use the PY3 path under py2 as well?

richvdh · 2018-07-26T17:03:04Z

synapse/storage/signatures.py

+        else:
+            done = {}
+            for k, v in txn:
+                done[k] = v.encode('ascii')


fwiw I'd find a dict comprehension clearer here:

return {k: v.encode('ascii') for k, v in txn}
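A runnable sketch of the suggested comprehension, with a plain list of tuples standing in for the database cursor `txn`. The function name and the sample row are hypothetical.

```python
def rows_to_hash_dict(rows):
    """Build a dict[str, bytes] of algorithm -> hash from (k, v) rows."""
    return {k: v.encode("ascii") for k, v in rows}


# Stand-in for iterating over the cursor after txn.execute(...).
rows = [("sha256", "abc123")]
hashes = rows_to_hash_dict(rows)
```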

richvdh · 2018-07-26T17:08:10Z

synapse/util/frozenutils.py

@@ -66,5 +66,7 @@ def _handle_frozendict(obj):

 # A JSONEncoder which is capable of encoding frozendics without barfing
 frozendict_json_encoder = json.JSONEncoder(
+    ensure_ascii=False,
+    encoding="utf8",


this is the default <shrug>

richvdh · 2018-07-26T17:08:13Z

synapse/util/frozenutils.py

@@ -66,5 +66,7 @@ def _handle_frozendict(obj):

 # A JSONEncoder which is capable of encoding frozendics without barfing
 frozendict_json_encoder = json.JSONEncoder(
+    ensure_ascii=False,


I don't think this should be necessary (and it makes simplejson much slower). Could you add comments to explain why it is necessary?

as discussed, let's just coerce the result to unicode

fixed
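For reference, the effect of `ensure_ascii` can be seen with the standard `json` module on Python 3. This sketch does not measure the simplejson slowdown mentioned above; it only shows the two output forms.

```python
import json

payload = {"body": "\u2603"}

# Default: non-ASCII characters are written as \uXXXX escapes.
escaped = json.JSONEncoder().encode(payload)

# ensure_ascii=False: the character is emitted as-is.
literal = json.JSONEncoder(ensure_ascii=False).encode(payload)
```

Both forms parse back to the same object, so the choice affects only the encoded representation.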

richvdh

looks good except frozendict shenanigans

richvdh · 2018-07-31T11:31:23Z

synapse/util/frozenutils.py

@@ -66,5 +66,7 @@ def _handle_frozendict(obj):

 # A JSONEncoder which is capable of encoding frozendics without barfing
 frozendict_json_encoder = json.JSONEncoder(
+    ensure_ascii=False,


as discussed, let's just coerce the result to unicode

hawkowl added 11 commits July 20, 2018 16:51

fix up bytestrings throughout

cb6b689

changelog

8801cb8

fix import

e729bdf

scoping is heck

a14df28

py2 compat

ede1ace

py3 import

f8172ef

update to fix urllib

df8f3e3

encode the hash, too

e0bf614

fixes

f0a00f0

fix

35a41ab

isort

4831ead

hawkowl requested a review from a team July 20, 2018 12:24

richvdh previously requested changes Jul 23, 2018

richvdh assigned hawkowl Jul 23, 2018

hawkowl added 5 commits July 25, 2018 17:55

Merge branch 'develop' of ssh://github.com/matrix-org/synapse into ha…

f04033e

…wkowl/bytes-clean-2

review comments

e1bdb58

make auth completely unicode for passwords

521a920

encodings

58df4a0

cleanups

f152316

hawkowl requested a review from a team July 25, 2018 10:59

richvdh previously requested changes Jul 25, 2018

hawkowl added 6 commits July 26, 2018 20:51

Merge branch 'develop' of ssh://github.com/matrix-org/synapse into ha…

9fc33fd

…wkowl/bytes-clean-2

do unicode properly

6040be4

return Unicode directly from the JSON encoder

6745999

stylistic cleanups

ed08bcb

fix sytests

da3502a

pep8

616864e

hawkowl requested a review from a team July 26, 2018 12:02

hawkowl removed their assignment Jul 26, 2018

richvdh previously requested changes Jul 26, 2018

richvdh assigned hawkowl Jul 26, 2018

hawkowl added 3 commits July 27, 2018 23:21

type cleanups

e876bcd

fixes

d5b735e

Merge remote-tracking branch 'origin/develop' into hawkowl/bytes-clean-2

3cc58ea

hawkowl requested a review from a team July 27, 2018 14:42

hawkowl removed their assignment Jul 27, 2018

richvdh self-assigned this Jul 31, 2018

richvdh suggested changes Jul 31, 2018

richvdh assigned hawkowl and unassigned richvdh Jul 31, 2018

hawkowl added 3 commits August 2, 2018 00:41

decode so we always put unicode into the db

b3a8de6

Merge remote-tracking branch 'origin/develop' into hawkowl/bytes-clean-2

bfe288c

docstring

df8c45a

hawkowl merged commit da77851 into develop Aug 1, 2018

hawkowl deleted the hawkowl/bytes-clean-2 branch August 1, 2018 14:54
