Unicode character parsing #1329

sjsadowski · 2018-09-26T12:38:06Z

Per #539 there may be a continued problem in parsing unicode characters due to a dependency.

Per @vltr: /中国 or even /jacaré would break on request.py, mostly because of httptools.parse_url that receives bytes and is unable to parse non-ascii chars. I'll need to dig deeper to see if this is a restraint in httptools itself or just in the Python bindings.

vltr · 2018-09-26T14:23:21Z

Just passing my eyes through Node's http-parser source code (base of httptools), I've seen in their tests that some level of unicode characters should be possible to use. So, the next step would be to test httptools itself - in a next chapter 😉

yunstanford · 2018-09-27T03:47:07Z

I remembered there was a PR addressing this #1081

ahopkins · 2018-09-27T07:51:31Z

@huge-success/sanic-release-managers add this to 18.12LTS milestone?

sjsadowski · 2018-09-27T13:21:38Z

We need to re-test, per #1081 this may have been solved as it was merged back in january.

yunstanford · 2018-09-28T02:42:21Z

there is a unit test for verifying.

vltr · 2018-09-28T12:07:19Z

@yunstanford would this be a valid test?

from sanic.request import Request

uri = "/jacaré"

Request(
    url_bytes=uri.encode("utf-8"),
    headers={},
    version=None,
    method='GET',
    transport=None
)

The result:

$ python test-sanic-unicode.py
Traceback (most recent call last):
  File "test-sanic-unicode.py", line 10, in <module>
    transport=None
  File "/home/richard/.pyenv/versions/playground-3.7.0/lib/python3.7/site-packages/sanic/request.py", line 57, in __init__
    self._parsed_url = parse_url(url_bytes)
  File "httptools/parser/parser.pyx", line 468, in httptools.parser.parser.parse_url
httptools.parser.errors.HttpParserInvalidURLError: invalid url b'/jacar\xc3\xa9'

sjsadowski · 2018-09-28T13:00:00Z

Test looks valid to me.

After dugging, it looks like that the issue still exists in the dependency. It actually seems to exist upstream from there in http-tools (see here, where they are still relying on RFC2616 for valid url characters - which are only latin-1). After digging, while RFC2616 was updated by RFC3986 and obsoleted by RFC7230, the latin-1 character sets (backreferenced to RFC3986) and percent encoding remain the only official valid characters for URIs.

So we have two problems, one of which I don't care about. i18n and non-latin-1 characters are still technically disallowed - don't care. W3C allows internationalization of domain names using non latin-1 characters, and so should we.

The second part is that our dependency chain DOES care. The http-tools parser will and does fail if it's trying to parse any portion of the schema that non-latin-1 encoded. And I don't think we can fix this unless we eliminate the dependency, and I'm sure that's going to be a problem. So I do care about this issue.

I think there is a workaround though, but I think it will be costly: if the unicode characters are intercepted before parsing and translated to their %characters, that solves the url parsing. But we also have to do that for route registration. I'd like to get some more eyes on this issue to make sure I'm not crazy. @ahopkins @r0fls @seemethere @yunstanford

sjsadowski · 2018-09-28T13:21:10Z

I'm also concerned because @vltr has a test that is clearly failing yet the test from #1081 passes

vltr · 2018-09-28T13:30:51Z

@sjsadowski for this simple script I made, using latin1 provides the same error:

from sanic.request import Request

uri = "/jacaré"

Request(
    url_bytes=uri.encode("latin1"),
    headers={},
    version=None,
    method='GET',
    transport=None
)

Result:

$ python test-sanic-unicode.py
Traceback (most recent call last):
  File "test-sanic-unicode.py", line 10, in <module>
    transport=None
  File "/home/richard/.pyenv/versions/playground-3.7.0/lib/python3.7/site-packages/sanic/request.py", line 57, in __init__
    self._parsed_url = parse_url(url_bytes)
  File "httptools/parser/parser.pyx", line 468, in httptools.parser.parser.parse_url
httptools.parser.errors.HttpParserInvalidURLError: invalid url b'/jacar\xe9'

httptools version:

$ pip freeze | grep httptools
httptools==0.0.11

But, even if I create a simple server:

from sanic import Sanic
from sanic.response import text

app = Sanic(__name__)


@app.route("/test/<hello>")
async def myroute(request, hello):
    return text("hello, {}!".format(hello))


if __name__ == '__main__':
    app.run(host="0.0.0.0", port=8000, workers=1)

It will raise (basically) the same error.

Server:

$ python test-sanic-unicode.py
[2018-09-28 10:28:54 -0300] [8411] [INFO] Goin' Fast @ http://0.0.0.0:8000
[2018-09-28 10:28:54 -0300] [8411] [INFO] Starting worker [8411]
[2018-09-28 10:29:05 -0300] [8411] [ERROR] Traceback (most recent call last):
  File "/home/richard/.pyenv/versions/playground-3.7.0/lib/python3.7/site-packages/sanic/server.py", line 216, in data_received
    self.parser.feed_data(data)
  File "httptools/parser/parser.pyx", line 193, in httptools.parser.parser.HttpParser.feed_data
httptools.parser.errors.HttpParserInvalidURLError: invalid URL

cURL:

$ curl -v http://127.0.0.1:8000/test/jacaré
*   Trying 127.0.0.1...
* TCP_NODELAY set
* Connected to 127.0.0.1 (127.0.0.1) port 8000 (#0)
> GET /test/jacaré HTTP/1.1
> Host: 127.0.0.1:8000
> User-Agent: curl/7.61.1
> Accept: */*
> 
< HTTP/1.1 400 Bad Request
< Connection: close
< Content-Length: 18
< Content-Type: text/plain; charset=utf-8
< 
* Closing connection 0
Error: Bad Request

sjsadowski · 2018-09-28T13:34:56Z

I think the encoding is supposed to be 'latin-1' but either way é is not valid in RFC2616.

here is a direct link to the character table for RFC2616 that is implemented in http-tools

vltr · 2018-09-28T13:43:53Z

Oh. Sorry, my bad. I'll pay more attention to the RFC 😁

Regarding the dash on latin-1, it works with or without:

>>> "é".encode("utf-8")
b'\xc3\xa9'

>>> "é".encode("utf8")
b'\xc3\xa9'

>>> "é".encode("latin1")
b'\xe9'

>>> "é".encode("latin-1")
b'\xe9'

sjsadowski · 2018-09-28T13:45:30Z

Yeah it's just dumb. It's very outdated (1999 - almost 20 years) and even though it's been superseded twice, they haven't updated the "legal" characters for a URI.

vltr · 2018-09-28T13:45:50Z

We need to see if any updated RFC brings this question forward, since 2616 is obsoleted and updated.

vltr · 2018-09-28T13:47:05Z

@sjsadowski wow, almost at the same time (the comments). I'll take a look into these other RFCs just for precaution.

sjsadowski · 2018-09-28T13:48:01Z

@vltr you want 3986 which is an update and 7230 which obsoletes 2616

vltr · 2018-09-28T13:53:31Z

Thanks, @sjsadowski, I'll read them carefully when I get some free time and will get back here with (or without) new conclusions 👍

ahopkins · 2018-09-28T15:00:06Z

I have not had a full chance to review the issue.... but my first thought when I read the idea of catching non Latin characters and translating them would be the potential performance hit compared to the use case. I'll come back after the weekend with some more thoughts.

vltr · 2018-09-28T15:16:53Z

@ahopkins indeed, depending on how to address this issue. I would not want to address this inside Sanic because of the performance hit. If possible, I would like to address this on httptools because it seems to work on Node's http-parser (see here).

yunstanford · 2018-09-29T02:54:59Z

didn't get time to dig into this issue, will take a look when getting a chance.

If you guys've investigated and thought it's because of httptools, we can pull @1st1 in for discussion

sjsadowski · 2018-10-01T13:24:22Z

yeah it's not httptools, it's the upstream parser it relies on http-tools. I don't think the httptools package can implement a fix.

I wonder if we want to consider a workaround that would enable unicode parsing with a flag.

vltr · 2018-10-01T13:36:56Z

Everyone, I did some more digging into this and things got a little more complicated. From the RFC 3986, see the topic "2.4. When to Encode or Decode".

From what I could understand:

For every non US-ASCII character (let's pick a simple example that I used a lot here, é) that is used, it should be encoded to its "percent-encoded octet", which in this example would be %C3%A9;
By doing this, httptools would correctly parse everything and all is good.

What happens:

Some clients (like cURL) doesn't encode anything prior to the request, it just sends the request "as is", that would of course break inside Sanic for not being encoded properly first;
We can see that some other web frameworks can handle this kind of values, but the question is: are they doing it to "embrace the world" and make users happy, instead of properly handle the issue?

What can be done:

Nothing. Sanic is working strictly under the terms of the RFC, so if anything comes wrong, a bad request will be raised;
Half embrace the world [1]: try to encode these "strange chars" (using urllib.parse.quote) prior to send to httptools, adding some overhead;
Half embrace the world [2]: try do decode these "strange chars" (using urllib.parse.unquote) after what httptools provides us (if it not breaks); my personal opinion is that this should be considered (I can send /%C3%A9 and read /é); or
Full embrace the world: try to encode everything weird before and after, to get the whole world satisfied.

Thoughts?

sjsadowski · 2018-10-01T13:53:13Z

I'd lean towards nothing. If a client is supposed to be % encoding non-ascii characters that are in the RFC2616 charmap, then bare unicode characters should not be parsed, and httptools/http-tools are doing the right thing.

So long as routes that have unicode characters in them are getting % encoded as well, we have done (what I think) are our jobs, as the routes will match the correct client encoding.

vltr · 2018-10-01T14:06:34Z

@sjsadowski that would be my opinion too. We can perhaps add a server option to unquote after httptools have returned correctly, but that's just an idea -- the developer can implement this as well. It may have some more trouble than expected because it needs to overwrite Request.__init__ and use the newer class in the application constructor if intended to be used with the Router, so perhaps this should be a matter of heavy documentation on the subject?

sjsadowski · 2018-10-01T14:15:11Z

@vltr unless we get a chorus of 'no' I'm going to say we just need to document the hell out of it. A special section on unicode characters or something for handling unicode in Request.

vltr · 2018-10-01T14:16:42Z

@sjsadowski I completely agree with you.

vltr · 2018-10-01T14:19:08Z

Perhaps we can even provide some pre-built examples of extending the Request class to deal with all these questions - even though I personally think that these kind of parsing should be done inside the server, but anyway ...

sjsadowski · 2019-02-08T15:26:58Z

I'm actually closing this per #1424

vltr · 2019-02-08T15:43:06Z

@sjsadowski I'm sorry, I really forgot to document some examples with the Request class. Please, leave it open and I'll create a PR for this in a couple of hours - I just have to revise what I have wrote in here already and etc. I'm really sorry.

sjsadowski · 2019-02-08T16:02:15Z

@vltr no worries, just trying to do some house keeping!

vltr · 2019-02-08T16:06:04Z

@sjsadowski I know 😉 And I'm glad you're doing it because I'm so full of stuff to do that this went straight through my todo list and I completely forgot about.

vltr · 2019-02-08T18:46:12Z

@sjsadowski I made a mess in my mind here - I thought this was related to httptools on the Request object, but it is in fact in the HttpProtocol. It's better to close this or take the bumpy road: add optional (and configurable) hooks to the protocol, which might not even solve the issue 😕

sjsadowski · 2019-02-08T19:09:10Z

I'm going to close it. Solving the issue itself is going to require significant work and should be tracked separately - if we choose to take it on - and I think this issue number has run its course.

sjsadowski added the needs investigation label Sep 26, 2018

sjsadowski assigned vltr Sep 26, 2018

vltr mentioned this issue Sep 26, 2018

Incomplete URL path parsing #539

Closed

sjsadowski mentioned this issue Sep 27, 2018

upload file error when file's name is Chinese #1123

Closed

vltr mentioned this issue Dec 15, 2018

Documentation Enhancements #1424

Merged

11 tasks

vltr mentioned this issue Jan 28, 2019

Please describe how to allow forward-slash encoded as %2F within a route #1473

Closed

sjsadowski closed this as completed Feb 8, 2019

sjsadowski reopened this Feb 8, 2019

sjsadowski closed this as completed Feb 8, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Unicode character parsing #1329

Unicode character parsing #1329

sjsadowski commented Sep 26, 2018

vltr commented Sep 26, 2018

yunstanford commented Sep 27, 2018

ahopkins commented Sep 27, 2018

sjsadowski commented Sep 27, 2018

yunstanford commented Sep 28, 2018

vltr commented Sep 28, 2018

sjsadowski commented Sep 28, 2018

sjsadowski commented Sep 28, 2018

vltr commented Sep 28, 2018

sjsadowski commented Sep 28, 2018 •

edited

Loading

vltr commented Sep 28, 2018

sjsadowski commented Sep 28, 2018

vltr commented Sep 28, 2018

vltr commented Sep 28, 2018

sjsadowski commented Sep 28, 2018

vltr commented Sep 28, 2018

ahopkins commented Sep 28, 2018

vltr commented Sep 28, 2018

yunstanford commented Sep 29, 2018

sjsadowski commented Oct 1, 2018

vltr commented Oct 1, 2018

sjsadowski commented Oct 1, 2018

vltr commented Oct 1, 2018

sjsadowski commented Oct 1, 2018

vltr commented Oct 1, 2018

vltr commented Oct 1, 2018

sjsadowski commented Feb 8, 2019

vltr commented Feb 8, 2019

sjsadowski commented Feb 8, 2019

vltr commented Feb 8, 2019

vltr commented Feb 8, 2019

sjsadowski commented Feb 8, 2019

Unicode character parsing #1329

Unicode character parsing #1329

Comments

sjsadowski commented Sep 26, 2018

vltr commented Sep 26, 2018

yunstanford commented Sep 27, 2018

ahopkins commented Sep 27, 2018

sjsadowski commented Sep 27, 2018

yunstanford commented Sep 28, 2018

vltr commented Sep 28, 2018

sjsadowski commented Sep 28, 2018

sjsadowski commented Sep 28, 2018

vltr commented Sep 28, 2018

sjsadowski commented Sep 28, 2018 • edited Loading

vltr commented Sep 28, 2018

sjsadowski commented Sep 28, 2018

vltr commented Sep 28, 2018

vltr commented Sep 28, 2018

sjsadowski commented Sep 28, 2018

vltr commented Sep 28, 2018

ahopkins commented Sep 28, 2018

vltr commented Sep 28, 2018

yunstanford commented Sep 29, 2018

sjsadowski commented Oct 1, 2018

vltr commented Oct 1, 2018

sjsadowski commented Oct 1, 2018

vltr commented Oct 1, 2018

sjsadowski commented Oct 1, 2018

vltr commented Oct 1, 2018

vltr commented Oct 1, 2018

sjsadowski commented Feb 8, 2019

vltr commented Feb 8, 2019

sjsadowski commented Feb 8, 2019

vltr commented Feb 8, 2019

vltr commented Feb 8, 2019

sjsadowski commented Feb 8, 2019

sjsadowski commented Sep 28, 2018 •

edited

Loading