This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Python 3: Convert some unicode/bytes uses #3569

Merged on Aug 1, 2018. 28 commits from the hawkowl/bytes-clean-2 branch; the changes shown below are from 22 of the 28 commits.

Commits

cb6b689
fix up bytestrings throughout
hawkowl Jul 20, 2018
8801cb8
changelog
hawkowl Jul 20, 2018
e729bdf
fix import
hawkowl Jul 20, 2018
a14df28
scoping is heck
hawkowl Jul 20, 2018
ede1ace
py2 compat
hawkowl Jul 20, 2018
f8172ef
py3 import
hawkowl Jul 20, 2018
df8f3e3
update to fix urllib
hawkowl Jul 20, 2018
e0bf614
encode the hash, too
hawkowl Jul 20, 2018
f0a00f0
fixes
hawkowl Jul 20, 2018
35a41ab
fix
hawkowl Jul 20, 2018
4831ead
isort
hawkowl Jul 20, 2018
f04033e
Merge branch 'develop' of ssh://github.com/matrix-org/synapse into ha…
hawkowl Jul 25, 2018
e1bdb58
review comments
hawkowl Jul 25, 2018
521a920
make auth completely unicode for passwords
hawkowl Jul 25, 2018
58df4a0
encodings
hawkowl Jul 25, 2018
f152316
cleanups
hawkowl Jul 25, 2018
9fc33fd
Merge branch 'develop' of ssh://github.com/matrix-org/synapse into ha…
hawkowl Jul 26, 2018
6040be4
do unicode properly
hawkowl Jul 26, 2018
6745999
return Unicode directly from the JSON encoder
hawkowl Jul 26, 2018
ed08bcb
stylistic cleanups
hawkowl Jul 26, 2018
da3502a
fix sytests
hawkowl Jul 26, 2018
616864e
pep8
hawkowl Jul 26, 2018
e876bcd
type cleanups
hawkowl Jul 27, 2018
d5b735e
fixes
hawkowl Jul 27, 2018
3cc58ea
Merge remote-tracking branch 'origin/develop' into hawkowl/bytes-clean-2
hawkowl Jul 27, 2018
b3a8de6
decode so we always put unicode into the db
hawkowl Aug 1, 2018
bfe288c
Merge remote-tracking branch 'origin/develop' into hawkowl/bytes-clean-2
hawkowl Aug 1, 2018
df8c45a
docstring
hawkowl Aug 1, 2018
1 change: 1 addition & 0 deletions changelog.d/3569.bugfix
@@ -0,0 +1 @@
Unicode passwords are now normalised before hashing, preventing the instance where two different devices or browsers might send a different UTF-8 sequence for the password.
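
A minimal illustration (not part of the diff) of why the normalisation matters: the same visible password can reach the server as different Unicode code-point sequences, and without normalisation those sequences hash differently. The sketch assumes only the bcrypt and unicodedata modules already used in this PR.

import unicodedata

import bcrypt

password_nfc = u"caf\u00e9"   # "café" with a precomposed é
password_nfd = u"cafe\u0301"  # "café" as "e" followed by a combining acute

# The two forms encode to different UTF-8 byte sequences...
assert password_nfc.encode("utf8") != password_nfd.encode("utf8")

# ...but NFKC normalisation maps both to the same sequence.
norm_a = unicodedata.normalize("NFKC", password_nfc).encode("utf8")
norm_b = unicodedata.normalize("NFKC", password_nfd).encode("utf8")
assert norm_a == norm_b

# So a password hashed from one form validates against the other.
hashed = bcrypt.hashpw(norm_a, bcrypt.gensalt())
assert bcrypt.checkpw(norm_b, hashed)
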
4 changes: 2 additions & 2 deletions synapse/api/auth.py
@@ -252,10 +252,10 @@ def _get_appservice_user_id(self, request):
if ip_address not in app_service.ip_range_whitelist:
defer.returnValue((None, None))

if "user_id" not in request.args:
if b"user_id" not in request.args:
defer.returnValue((app_service.sender, app_service))

user_id = request.args["user_id"][0]
user_id = request.args[b"user_id"][0].decode('utf8')
if app_service.sender == user_id:
defer.returnValue((app_service.sender, app_service))

2 changes: 1 addition & 1 deletion synapse/federation/transport/server.py
@@ -165,7 +165,7 @@ def _parse_auth_header(header_bytes):
param_dict = dict(kv.split("=") for kv in params)

def strip_quotes(value):
if value.startswith(b"\""):
if value.startswith("\""):
return value[1:-1]
else:
return value
21 changes: 16 additions & 5 deletions synapse/handlers/auth.py
@@ -15,6 +15,7 @@
# limitations under the License.

import logging
import unicodedata

import attr
import bcrypt
@@ -626,6 +627,7 @@ def validate_login(self, username, login_submission):
# special case to check for "password" for the check_password interface
# for the auth providers
password = login_submission.get("password")

if login_type == LoginType.PASSWORD:
if not self._password_enabled:
raise SynapseError(400, "Password login has been disabled.")
@@ -708,6 +710,7 @@ def _check_local_password(self, user_id, password):

Args:
user_id (str): complete @user:id
password (unicode): the provided password

Member:

I think it might be worth clarifying that it will be a str on python3. Something like:

password (str|unicode): the provided password. On python2, *must* be a unicode.

(and similar on validate_hash etc etc)

Of course what's really happening is that elsewhere we have been sloppy about saying str when we mean str|unicode, though I don't suggest we change that...

Contributor Author:

It's clearer to use bytes (flat bytes), str (str on either platform), or unicode (unicode), even though there is no such thing as unicode on Python 3.

Member:

ok, as discussed, unicode seems the right answer here. However, the difference with user_id and the return type is now very striking; suggest you update them too.

Returns:
(str) the canonical_user_id, or None if unknown user / bad password
"""
@@ -849,14 +852,19 @@ def hash(self, password):
"""Computes a secure hash of password.

Args:
password (str): Password to hash.
password (unicode): Password to hash.

Returns:
Deferred(str): Hashed password.

Member:

s/str/bytes/ ?

"""
def _do_hash():
return bcrypt.hashpw(password.encode('utf8') + self.hs.config.password_pepper,
bcrypt.gensalt(self.bcrypt_rounds))
# Normalise the Unicode in the password
pw = unicodedata.normalize("NFKC", password)

return bcrypt.hashpw(
pw.encode('utf8') + self.hs.config.password_pepper.encode("utf8"),
bcrypt.gensalt(self.bcrypt_rounds),
)

return make_deferred_yieldable(
threads.deferToThreadPool(
@@ -868,16 +876,19 @@ def validate_hash(self, password, stored_hash):
"""Validates that self.hash(password) == stored_hash.

Args:
password (str): Password to hash.
password (unicode): Password to hash.
stored_hash (str): Expected hash value.

Member:

s/str/bytes/ ?


Returns:
Deferred(bool): Whether self.hash(password) == stored_hash.
"""

def _do_validate_hash():
# Normalise the Unicode in the password
pw = unicodedata.normalize("NFKC", password)

return bcrypt.checkpw(
password.encode('utf8') + self.hs.config.password_pepper,
pw.encode('utf8') + self.hs.config.password_pepper.encode("utf8"),
stored_hash.encode('utf8')
)

2 changes: 1 addition & 1 deletion synapse/handlers/register.py
@@ -131,7 +131,7 @@ def register(
Args:
localpart : The local part of the user ID to register. If None,
one will be generated.
password (str) : The password to assign to this user so they can
password (unicode) : The password to assign to this user so they can

login again. This can be None which means they cannot login again
via a password (e.g. the user is an application service user).
generate_token (bool): Whether a new access token should be
24 changes: 17 additions & 7 deletions synapse/http/server.py
@@ -13,12 +13,13 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import cgi
import collections
import logging
import urllib

from six.moves import http_client
from six import PY3
from six.moves import http_client, urllib

from canonicaljson import encode_canonical_json, encode_pretty_printed_json, json

@@ -264,6 +265,7 @@ def __init__(self, hs, canonical_json=True):
self.hs = hs

def register_paths(self, method, path_patterns, callback):
method = method.encode("utf-8") # method is bytes on py3
for path_pattern in path_patterns:
logger.debug("Registering for %s %s", method, path_pattern.pattern)
self.path_regexs.setdefault(method, []).append(
@@ -296,8 +298,14 @@ def _async_render(self, request):
# here. If it throws an exception, that is handled by the wrapper
# installed by @request_handler.

def _unquote(s):
if PY3:

Member:

why don't we decode when we have PY3?

Contributor Author:

Given the URL-encoded sequence "%E2%98%83", passed in as a unicode string:

Python 2:

>>> urllib.unquote(u"%E2%98%83")
u'\xe2\x98\x83'

This is wrong: it returns unicode, but containing the raw UTF-8 byte values as code points rather than the decoded character.

Python 2:

>>> urllib.unquote(u"%E2%98%83".encode('ascii')).decode('utf8')
u'\u2603'

Correct: returns the Unicode character, though escaped for display, as Py2 will usually not print real Unicode characters by itself.

Python 3:

>>> urllib.parse.unquote(u"%E2%98%83")
'☃'

Correct, returns the Unicode literal (not escaped, as Python 3 has the correct terminal encoding set up).

return urllib.parse.unquote(s)

Member:

why do we not utf-8 decode here?

Contributor Author:

It's already decoded by _get_handler_for_request.

else:
return urllib.parse.unquote(s.encode('ascii')).decode('utf8')

Member:

why do we encode('ascii') here?

Contributor Author:

URLs are 7-bit ASCII.

Member:

yes they are, but that's not the point.

the question is why we do the encoding here and not for python3. Likewise, I still don't understand why we decode the result on py2 and not on python3.

Is it that urllib.parse.unquote takes and returns raw bytes on python2 and unicode on python3?

whatever the answer is, could we have some explanation in comments?

Member:

ugh, what a mess. thanks for clarifying this.
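
To summarise the outcome of this thread, here is the helper from the diff again with the explanation folded into comments (a sketch only, assuming the six.moves urllib alias imported above):

from six import PY3
from six.moves import urllib

def _unquote(s):
    if PY3:
        # On Python 3, urllib.parse.unquote accepts and returns unicode (str)
        # and performs the UTF-8 decoding of percent-escapes itself, so no
        # further decoding is needed.
        return urllib.parse.unquote(s)
    else:
        # On Python 2, unquote operates on bytestrings: encode the (7-bit
        # ASCII) input to bytes, unquote it, then decode the resulting UTF-8
        # bytes back to unicode.
        return urllib.parse.unquote(s.encode('ascii')).decode('utf8')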


kwargs = intern_dict({
name: urllib.unquote(value).decode("UTF-8") if value else value
name: _unquote(value) if value else value
for name, value in group_dict.items()
})

@@ -327,7 +335,7 @@ def _get_handler_for_request(self, request):
# Loop through all the registered callbacks to check if the method
# and path regex match
for path_entry in self.path_regexs.get(request.method, []):
m = path_entry.pattern.match(request.path)
m = path_entry.pattern.match(request.path.decode('ascii'))
if m:
# We found a match!
return path_entry.callback, m.groupdict()
@@ -383,7 +391,7 @@ def __init__(self, path):
self.url = path

def render_GET(self, request):
return redirectTo(self.url, request)
return redirectTo(self.url.encode('ascii'), request)

def getChild(self, name, request):
if len(name) == 0:
@@ -404,12 +412,14 @@ def respond_with_json(request, code, json_object, send_cors=False,
return

if pretty_print:
json_bytes = encode_pretty_printed_json(json_object) + "\n"
json_bytes = (encode_pretty_printed_json(json_object) + "\n"
).encode("utf-8")
else:
if canonical_json or synapse.events.USE_FROZEN_DICTS:
# canonicaljson already encodes to bytes
json_bytes = encode_canonical_json(json_object)
else:
json_bytes = json.dumps(json_object)
json_bytes = json.dumps(json_object).encode("utf-8")

return respond_with_json_bytes(
request, code, json_bytes,
10 changes: 9 additions & 1 deletion synapse/http/servlet.py
@@ -171,8 +171,16 @@ def parse_json_value_from_request(request, allow_empty_body=False):
if not content_bytes and allow_empty_body:
return None

# Decode to Unicode so that simplejson will return Unicode strings on
# Python 2
try:
content = json.loads(content_bytes)
content_unicode = content_bytes.decode('utf8')
except UnicodeDecodeError:
logger.warn("Unable to decode UTF-8")
raise SynapseError(400, "Content not JSON.", errcode=Codes.NOT_JSON)

try:
content = json.loads(content_unicode)
except Exception as e:
logger.warn("Unable to parse JSON: %s", e)
raise SynapseError(400, "Content not JSON.", errcode=Codes.NOT_JSON)
22 changes: 15 additions & 7 deletions synapse/rest/client/v1/admin.py
@@ -18,6 +18,7 @@
import hmac
import logging

from six import text_type
from six.moves import http_client

from twisted.internet import defer
@@ -131,7 +132,10 @@ def on_POST(self, request):
400, "username must be specified", errcode=Codes.BAD_JSON,
)
else:
if (not isinstance(body['username'], str) or len(body['username']) > 512):
if (
not isinstance(body['username'], text_type)
or len(body['username']) > 512
):
raise SynapseError(400, "Invalid username")

username = body["username"].encode("utf-8")
@@ -143,7 +147,10 @@ def on_POST(self, request):
400, "password must be specified", errcode=Codes.BAD_JSON,
)
else:
if (not isinstance(body['password'], str) or len(body['password']) > 512):
if (
not isinstance(body['password'], text_type)
or len(body['password']) > 512
):
raise SynapseError(400, "Invalid password")

password = body["password"].encode("utf-8")
@@ -166,17 +173,18 @@ def on_POST(self, request):
want_mac.update(b"admin" if admin else b"notadmin")
want_mac = want_mac.hexdigest()

if not hmac.compare_digest(want_mac, got_mac):
raise SynapseError(
403, "HMAC incorrect",
)
if not hmac.compare_digest(want_mac, got_mac.encode('ascii')):
raise SynapseError(403, "HMAC incorrect")

# Reuse the parts of RegisterRestServlet to reduce code duplication
from synapse.rest.client.v2_alpha.register import RegisterRestServlet

register = RegisterRestServlet(self.hs)

(user_id, _) = yield register.registration_handler.register(
localpart=username.lower(), password=password, admin=bool(admin),
localpart=body['username'].lower(),
password=body["password"],
admin=bool(admin),
generate_token=False,
)

12 changes: 6 additions & 6 deletions synapse/rest/client/v2_alpha/register.py
@@ -193,15 +193,15 @@ def __init__(self, hs):
def on_POST(self, request):
body = parse_json_object_from_request(request)

kind = "user"
if "kind" in request.args:
kind = request.args["kind"][0]
kind = b"user"
if b"kind" in request.args:
kind = request.args[b"kind"][0]

if kind == "guest":
if kind == b"guest":
ret = yield self._do_guest_registration(body)
defer.returnValue(ret)
return
elif kind != "user":
elif kind != b"user":
raise UnrecognizedRequestError(
"Do not understand membership kind: %s" % (kind,)
)
@@ -389,8 +389,8 @@ def on_POST(self, request):
assert_params_in_dict(params, ["password"])

desired_username = params.get("username", None)
new_password = params.get("password", None)
guest_access_token = params.get("guest_access_token", None)
new_password = params.get("password", None)

if desired_username is not None:
desired_username = desired_username.lower()
2 changes: 1 addition & 1 deletion synapse/rest/media/v1/media_storage.py
@@ -177,7 +177,7 @@ def ensure_media_is_in_local_cache(self, file_info):
if res:
with res:
consumer = BackgroundFileConsumer(
open(local_path, "w"), self.hs.get_reactor())
open(local_path, "wb"), self.hs.get_reactor())
yield res.write_to_consumer(consumer)
yield consumer.wait()
defer.returnValue(local_path)
2 changes: 1 addition & 1 deletion synapse/state.py
@@ -577,7 +577,7 @@ def _make_state_cache_entry(

def _ordered_events(events):
def key_func(e):
return -int(e.depth), hashlib.sha1(e.event_id.encode()).hexdigest()
return -int(e.depth), hashlib.sha1(e.event_id.encode('ascii')).hexdigest()

return sorted(events, key=key_func)

16 changes: 6 additions & 10 deletions synapse/storage/events.py
@@ -64,10 +64,6 @@
"synapse_storage_events_state_delta_reuse_delta", "")


def encode_json(json_object):
return frozendict_json_encoder.encode(json_object)


class _EventPeristenceQueue(object):
"""Queues up events so that they can be persisted in bulk with only one
concurrent transaction per room.
@@ -1053,9 +1049,9 @@ def _update_outliers_txn(self, txn, events_and_contexts):
logger.exception("")
raise

metadata_json = encode_json(
metadata_json = frozendict_json_encoder.encode(
event.internal_metadata.get_dict()
).decode("UTF-8")
)

sql = (
"UPDATE event_json SET internal_metadata = ?"
@@ -1167,10 +1163,10 @@ def event_dict(event):
{
"event_id": event.event_id,
"room_id": event.room_id,
"internal_metadata": encode_json(
"internal_metadata": frozendict_json_encoder.encode(
event.internal_metadata.get_dict()
).decode("UTF-8"),
"json": encode_json(event_dict(event)).decode("UTF-8"),
),
"json": frozendict_json_encoder.encode(event_dict(event)),
}
for event, _ in events_and_contexts
],
@@ -1189,7 +1185,7 @@ def event_dict(event):
"type": event.type,
"processed": True,
"outlier": event.internal_metadata.is_outlier(),
"content": encode_json(event.content).decode("UTF-8"),
"content": frozendict_json_encoder.encode(event.content),
"origin_server_ts": int(event.origin_server_ts),
"received_ts": self._clock.time_msec(),
"sender": event.sender,
10 changes: 8 additions & 2 deletions synapse/storage/signatures.py
@@ -74,15 +74,21 @@ def _get_event_reference_hashes_txn(self, txn, event_id):
txn (cursor):
event_id (str): Id for the Event.
Returns:
A dict of algorithm -> hash.
A dict[str, bytes] of algorithm -> hash.

Member:

s/str/unicode/ ?

"""
query = (
"SELECT algorithm, hash"
" FROM event_reference_hashes"
" WHERE event_id = ?"
)
txn.execute(query, (event_id, ))
return {k: v for k, v in txn}
if six.PY2:

Member:

can we not use the PY3 path under py2 as well?

return {k: v for k, v in txn}
else:
done = {}
for k, v in txn:
done[k] = v.encode('ascii')

Member:

fwiw I'd find a dict comprehension clearer here:

return {k: v.encode('ascii') for k, v in txn}

return done


class SignatureStore(SignatureWorkerStore):
2 changes: 1 addition & 1 deletion synapse/types.py
@@ -137,7 +137,7 @@ def __deepcopy__(self, memo):
@classmethod
def from_string(cls, s):
"""Parse the string given by 's' into a structure object."""
if len(s) < 1 or s[0] != cls.SIGIL:
if len(s) < 1 or s[0:1] != cls.SIGIL:
raise SynapseError(400, "Expected %s string to start with '%s'" % (
cls.__name__, cls.SIGIL,
))
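
A side note on this last hunk (not from the PR discussion): on Python 3, indexing a bytes object yields an int while slicing yields a length-1 bytes, so s[0:1] behaves consistently whether s arrives as str or bytes. A quick illustration:

s_bytes = b"@user:example.com"
s_text = u"@user:example.com"

s_bytes[0]    # 64 on Python 3 (an int); "@" on Python 2
s_bytes[0:1]  # b"@" on both
s_text[0]     # "@"
s_text[0:1]   # "@"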