Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bad Request when sending utf-8 encoded http path under python3 #1577

Closed
rslinckx opened this issue Aug 24, 2017 · 8 comments
Closed

Bad Request when sending utf-8 encoded http path under python3 #1577

rslinckx opened this issue Aug 24, 2017 · 8 comments

Comments

@rslinckx
Copy link
Contributor

This is the sent data:

import socket
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(('127.0.0.1', 8080))
print(s.send('GET /à%20k HTTP/1.0\r\n\r\n'.encode('utf-8')))
print(s.recv(100000).decode('ascii'))

This is the response

HTTP/1.1 400 Bad Request
Connection: close
Content-Type: text/html
Content-Length: 181

<html>
  <head>
    <title>Bad Request</title>
  </head>
  <body>
    <h1><p>Bad Request</p></h1>
    Invalid HTTP Version 'Invalid HTTP Version: '%20k HTTP/1.0''
  </body>
</html>

This is because the request line is first decoded as latin1 using the _compat:bytes_to_str this causes the "à" to be returned as "\xc3\xa0", then the request line is split using line.split(None, 2) which will consider the \xa0 (non breaking space) as whitespace and strip it, thus rendering the request line invalid.

A first attempt would be to use line.split(' ', 2) but then the split will no longer eat up all consecutive whitespaces and may introduce other bugs.

I'm not sure what would be the best solution here.

@rslinckx
Copy link
Contributor Author

Note that python2 is unaffected because bytes_to_str is a no-op in this case:

python2:

>>> 'GET /\xc3\xa0%20 HTTP'.split(None, 2)
['GET', '/\xc3\xa0%20', 'HTTP']

python3:

>>> b'GET /\xc3\xa0%20 HTTP'.decode('latin1').split(None, 2)
['GET', '/Ã', '%20 HTTP']

@tilgovi
Copy link
Collaborator

tilgovi commented Aug 24, 2017

Could we split before decoding? [part.decode('latin1') for part in b'GET /\xc3\xa0%20 HTTP'.split(None, 2)]?

@rslinckx
Copy link
Contributor Author

That would at least preserve the python2 behavior, i don't know if there's an encoding that might break if split on space, i guess not.

@tilgovi
Copy link
Collaborator

tilgovi commented Aug 24, 2017

I would think not or else the HTTP request line would be malformed, since the encoding there is pretty strict.

@tilgovi
Copy link
Collaborator

tilgovi commented Aug 24, 2017

Thanks for raising this issue. Would you be willing to make a PR for it?

@benoitc
Copy link
Owner

benoitc commented Aug 24, 2017

mmm shouldn't the start line be encoded in us-ascii though ?

http://httpwg.org/specs/rfc7230.html#rfc.section.3

@rslinckx
Copy link
Contributor Author

This section describe exactly the problem gunicorn is having in this case:

A recipient MUST parse an HTTP message as a sequence of octets in an encoding that is a superset of US-ASCII [USASCII]. Parsing an HTTP message as a stream of Unicode characters, without regard for the specific encoding, creates security vulnerabilities due to the varying ways that string processing libraries handle invalid multibyte character sequences that contain the octet LF (%x0A). String-based parsers can only be safely used within protocol elements after the element has been extracted from the message, such as within a header field-value after message parsing has delineated the individual fields.

In any case the parsing of the request line should be done on bytes.

It's debatable if the request line should ascii-only. The spec is a bit unclear. However there are clients in the wild that just send those request lines and I'd rather handle them in the application which can do further processing to handle them properly.

@rslinckx
Copy link
Contributor Author

I made a pull request with a test case which failed under py3 and now passes.

I'm not sure about bytes_to_str() when raising InvalidRequestLine. I fear that having byte strings as message is not expected hence the conversion.

@benoitc benoitc closed this as completed in 15e901a Sep 2, 2017
benoitc added a commit that referenced this issue Sep 2, 2017
http/message: Split request line as bytes to avoid splitting on 0x0A. Fixes #1577
mjjbell pushed a commit to mjjbell/gunicorn that referenced this issue Mar 16, 2018
mjjbell pushed a commit to mjjbell/gunicorn that referenced this issue Mar 16, 2018
http/message: Split request line as bytes to avoid splitting on 0x0A. Fixes benoitc#1577
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants