Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

trio.ssl handling for Unicode (IDNA) domains is deeply broken #11

Open
njsmith opened this issue Jan 22, 2017 · 19 comments
Open

trio.ssl handling for Unicode (IDNA) domains is deeply broken #11

njsmith opened this issue Jan 22, 2017 · 19 comments

Comments

@njsmith
Copy link
Member

njsmith commented Jan 22, 2017

trio.ssl relies on the stdlib ssl module for hostname checking. Unfortunately, when it comes to non-ASCII domain names, it's totally broken. Which means that trio.ssl is also totally broken. In other ways, trio handles IDNA well (better than the stdlib). But this is hard to work around.

We don't have a lot of great options here. We could:

  • Substantially rewrite trio.ssl to use PyOpenSSL instead. But this requires adding a bunch of tricky code, and also would break our API. (We currently take a stdlib ssl.SSLContext object as input, and it's impossible to read the SNI callback and other critical attributes off of an ssl.SSLContext object, which means that it's impossible to correctly convert a stdlib SSLContext into a PyOpenSSL equivalent. See bpo-32359.)

  • Hope that upstream ssl gets better, or fix it ourselves. Not impossible, but maybe not a thing to hold our breath on either.

  • Hope that PEP 543 comes along and saves us.

  • ....?

I'm not sure what to do, but at least now we have a bug to track the problem...

Original text:


IDNA support in trio.socket

Help I have no idea what I'm doing here.

Probably also touches on TLS support, #9.

@njsmith
Copy link
Member Author

njsmith commented May 26, 2017

I have learned a little bit:

For hostname lookup, the versions of python we care about will automatically idna-encode unicode strings passed to socket.getaddrinfo, so that's nice. However, they use an old version of idna that doesn't always work.

The idna package on pypi supports a newer version of idna. I don't know if this always works, but requests seems to use it, so that's a vote of confidence. For hostname lookups it looks like we can just import idna; getaddrinfo(idna.encode(...), ...) and it should work.

For SSL hostname verification...... we certainly can't assume that stdlib ssl will do the right thing. (I guess at least it fails closed?) And it's not terribly flexible. ...maybe @Lukasa's new PEP will save us? Like everything involving SSL, this appears to be not at all trivial: https://github.com/kennethreitz/requests/blob/362da46e9a46da6e86e1907f03014384ab210151/requests/packages/urllib3/contrib/pyopenssl.py#L141

@Lukasa
Copy link

Lukasa commented May 27, 2017

For hostname lookups it looks like we can just import idna; getaddrinfo(idna.encode(...), ...) and it should work.

You'll need to make sure that the domain name actually is an IDNA, otherwise that'll blow up. The idna module is pretty strict about its input. ;)

@njsmith
Copy link
Member Author

njsmith commented May 27, 2017

Are there things that idna blows up on where we don't want getaddrinfo to blow up? From a quick check it happily accepts "google.com" and "127.0.0.1"... I guess we do need to check that it's a string, because bytestrings and None are also valid hostnames to getaddrinfo... and maybe enable this uts46=True thing...

@njsmith
Copy link
Member Author

njsmith commented May 27, 2017

Oh, but idna.encode("::1") gives InvalidCodepoint: Codepoint U+0027 at position 2 of "b'::1'" not allowed, which just gets more worrisome the more I look at it. It... encoded my text to a bytestring, then took the repr of that bytestring, then got cranky at the apostrophe? do you even python 3.

This is making step 2, "figure out how to make idna and x.509 play together" look extra exciting let me tell you. how does anyone do anything.

@Lukasa
Copy link

Lukasa commented May 27, 2017

It's not cranky at the apostrophe, it's the colon it doesn't like.

@Lukasa
Copy link

Lukasa commented May 27, 2017

The error message is confusing, but the ultimate issue is that the colon isn't allowed.

@njsmith
Copy link
Member Author

njsmith commented May 27, 2017

...ok, idna.encode("::1", uts46=True) gives the more reasonable but still problematic InvalidCodepoint: Codepoint U+003A not allowed at position 1 in '::1'. So as usual Cory is right and we do need to do some sort of check to "make sure that the domain name actually is an IDNA". Or at least my guess is he's right, but I have no idea what those words actually mean in practice.

Why is there no documentation on how you look up domain names.

@njsmith
Copy link
Member Author

njsmith commented May 27, 2017

Yeah, I figured, but it's one of those bugs that makes you nervous about what else is going on in there :-)

@njsmith
Copy link
Member Author

njsmith commented May 27, 2017

It looks like I had idna 2.1 installed, and upgrading to 2.5 replaces the really weird error message with a more sensible error message, so that's OK.

OK, how to actually look up hostnames: from looking at requests it looks like their rule is that if the string has non-ascii characters, they try calling idna.encode(..., uts46=True), and if that fails they give up. Otherwise, if it's all-ascii, they treat it as good and carry on... except if it starts with *, in which case they specifically blow up.

This bug has a lot of context. This PR landed the workarounds for idna's strictness. This particular strategy appears to be based as much on trial-and-error as anything else, but at least I can learn from their suffering.

The one thing that bothers me then is that I do not at all understand the special case ruling out * and haven't yet found any rationale for it. I assume it somehow somewhere has to do with TLS validation? From reading through PR 3695 it looks like the check was added to keep passing some pre-existing tests that asserted that requests would request URLs like http://*.google.com. I would like to think that getaddrinfo will also reject *.google.com as a hostname, and don't know why this case is so special it needs to be checked for explicitly, but what do I know.

They also make sure to convert back to unicode before continuing, and if you pass a unicode to socket.getaddrinfo it calls .encode("idna") on it, which is an extra chance for things to go wrong so we should probably skip it and stick with bytes once we have them. (I'm not sure whether they eventually pass the unicode to getaddrinfo or convert back first; the actual getaddrinfo call is way downstream of the URL validation code that I'm looking at in requests/models.py.)

Conclusion:

def getaddrinfo(host, ...):
    if isinstance(host, str):
        try:
            host = host.encode("ascii")
        except UnicodeEncodeError:
            host = idna.encode(host, uts46=True)
    return real_getaddrinfo(host, ...)

and then write tests where host is: None, bytes, and str containing: boring ascii hostname, valid IDN hostname, ipv4 address, ipv6 address, already-stringprepped xn-- hostname

...and figure out why/whether * is special.

@njsmith
Copy link
Member Author

njsmith commented May 27, 2017

It looks like the tests that make sure http://*.google.com errors out were added to check that idna was correctly rejecting them in this commit. So... maybe the * thing is just a leftover? I should probably ask @Lukasa, ideally at some later date after he's found time to review #107 and has stopped being annoyed at me for spamming him with my notes-to-self on this bug.

It also looks like the stdlib ssl module might do the right thing if we hand it a hostname that's been pre-encoded to xn--whatever, but then ascii-decoded back into a str - it definitely blows up if it gets the server name as a bytes object. But I need to double-check this with someone who knows what they're doing. [Edit: And also check whether the answer remains the same after @tiran's pending rewrite to use openssl's built-in hostname validation. And whether passing in an xn-- name will do the right thing with SNI. Though it sounds like openssl wants xn-- names for SNI purposes]

@njsmith
Copy link
Member Author

njsmith commented May 27, 2017

And it looks like @tiran's new way of doing hostname checking ultimately comes down to a call to X509_check_host, whose man page says "Per section 6.4.2 of RFC 6125, name values representing international domain names must be given in A-label form", where "A-label form" means one that starts with "xn--".

@tiran
Copy link

tiran commented May 27, 2017

Python's stdlib isn't a good example. IDNA is a messy, here are some fun examples and guide lines

  • there are multiple IDNA encoding variants that are partly incompatible (e.g. ß is encoded differently)
  • IDNA variant depends on TLD, e.g. .de is always IDNA 2008, .com may still be IDNA 2003 (unverified)
  • user facing applications should use UTS46 instead of IDNA 2008. UTS46 normalizes upper case chars to lower case chars.
  • TLDs often have additional restrictions to avoid homoglyphic confusion attacks
  • IDNA encoding/decoding is not a bijective mapping, Python stdlib treats encoding/decoding as bijective
  • IDNA for partial wildcard matches is not defined
  • partial wildcards (www*.example.org) is deprecated and no longer supported by some browsers
  • DNS and X.509 always use IDN A-labels (punycoded ASCII representation), user facing code is usually IDN U-label (unicode)

For IDNA 2008 / UTS46 you have to split the FQDN on ., encode each label, ignore trailing * and join them together again.

@njsmith
Copy link
Member Author

njsmith commented Jun 8, 2017

Oh, here's an obstacle to doing our own IDNA handling while continuing to use the stdlib ssl module: the stdlib ssl module does hostname_bytes.decode("idna") before calling the SNI callback, so currently a trio server doesn't get to see the actual SNI hostname that was requested, only the result of running it through IDNA 2003 decoding.

I think this might just be an annoying bug that doesn't stop us from upgrading IDNA in other contexts, though.

@njsmith
Copy link
Member Author

njsmith commented Jun 8, 2017

ah nice, in fact if you run a server at faß.com, and someone connects and correctly sets their SNI to xn--fa-hia.com, then it looks like the ssl module will error out and fail the handshake without even invoking the SNI callback:

In [3]: idna.encode("faß.com").decode("idna")
UnicodeError: ('IDNA does not round-trip', b'xn--fa-hia', b'fass')

@njsmith
Copy link
Member Author

njsmith commented Jun 8, 2017

I was wrong, actually the stdlib ssl module's IDN handling is just completely broken; basically it's just impossible to connect to IDNs using ssl. Not only does it insist on decoding SNI using IDNA2003, but if you pass in server_hostname= as a pre-encoded byte string then it will first decode it using IDNA 2003 and then re-encode it using 2003, so ssl_object.server_hostname is always a U-label. Furthermore, the match_hostname logic then compares this U-label directly against the A-labels in the certificate (!!!!!), which will never work (except perhaps accidentally via the CN field; this is that thing where the CN field can contain anything which is a security bug so they're being removed, but in particular they sometimes contain the U-label because why not).

This is not something we can fix; as long as ssl works like this, and we use ssl, we just can't handle IDNA in ssl. I guess we could give a nicer error message if someone tries?

What we can do now is switch to using IDNA 2008 correctly in getaddrinfo, so at least if someone tries to connect to https://faß.de then they'll go to the right place, and then fail the certificate check, instead of going to the wrong place and passing the certificate check. (And also go to the right place if they try http://faß.de.) [Also, getaddrinfo with AI_CANONNAME, and getnameinfo both return hostnames -- but they appear to always return A-labels packed into strs. Which is good.]

And in the longer run, I guess either ssl needs to get fixed (maybe 3.7?), or we need to switch to PyOpenSSL (but I'm very nervous about this because that means taking on a ton of really tricky responsibilities), or maybe PEP 543 will save us all.

@Lukasa
Copy link

Lukasa commented Jun 8, 2017

The above post will inevitably interest @tiran.

@njsmith
Copy link
Member Author

njsmith commented Jun 8, 2017

References:

  • ssl.match_hostname is totally borked for IDNs: bpo-28414
  • lack of IDNA2008 support in the stdlib: bpo-17305

njsmith added a commit to njsmith/trio that referenced this issue Jun 12, 2017
njsmith added a commit to njsmith/trio that referenced this issue Jun 12, 2017
Specifically:

- Use IDNA 2008 in getaddrinfo

- Require pre-resolved names in getnameinfo (b/c otherwise it
  makes an implicit call to getaddrinfo using IDNA 2003; also, this
  makes it more consistent with the rest of trio.socket).

This fixes part, but not all, of python-triogh-11.
@njsmith
Copy link
Member Author

njsmith commented Jun 12, 2017

#206 makes it so that we now correctly handle IDNA at the trio.socket layer. So I'm going to repurpose this bug as being specifically about broken IDNA in trio.ssl.

@njsmith njsmith changed the title International domain (IDNA) handling in trio.socket? trio.ssl handling for Unicode (IDNA) domains is deeply broken Jun 12, 2017
@njsmith
Copy link
Member Author

njsmith commented Feb 24, 2018

IDNA handling in ssl is fixed in 3.7, so I think we just need to make sure we're handling it correctly (in particular, encoding things correctly before setting server_hostname), and then if anyone runs into this tell them they need to upgrade their Python.

njsmith added a commit to njsmith/trio that referenced this issue Jul 3, 2018
SSLContext.sni_callback is a better replacement for
set_servername_callback (that was added in part because of my whining
about trio bug python-triogh-11). Let's link to it instead.

This also fixes a silly issue where we were using :meth: to link to
SSLContext.set_servername_callback, but in the 3.7 docs that's
documented as being an attribute. Even though it's a method. Oh well,
whatever, if we don't link to it we don't have to care.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants