Unicode character parsing #1329
Comments
Just passing my eyes through Node's |
I remembered there was a PR addressing this #1081 |
@huge-success/sanic-release-managers add this to 18.12LTS milestone? |
We need to re-test; per #1081 this may have been solved, as it was merged back in January. |
there is a unit test for verifying. |
@yunstanford would this be a valid test?
The result:
|
Test looks valid to me. After digging, it looks like the issue still exists in the dependency. It actually seems to exist upstream from there in http-tools (see here, where they are still relying on RFC2616 for valid url characters - which are only latin-1). While RFC2616 was updated by RFC3986 and obsoleted by RFC7230, the latin-1 character sets (backreferenced to RFC3986) and percent encoding remain the only officially valid characters for URIs.
So we have two problems, one of which I don't care about. The first: i18n and non-latin-1 characters are still technically disallowed - don't care. W3C allows internationalization of domain names using non-latin-1 characters, and so should we.
The second is that our dependency chain DOES care. The http-tools parser will and does fail if it tries to parse any portion of the schema that is not latin-1 encoded. I don't think we can fix this unless we eliminate the dependency, and I'm sure that's going to be a problem. So I do care about this issue.
I think there is a workaround, though I think it will be costly: if the unicode characters are intercepted before parsing and translated to their percent-encoded (%XX) form, that solves the url parsing. But we also have to do that for route registration.
I'd like to get some more eyes on this issue to make sure I'm not crazy. @ahopkins @r0fls @seemethere @yunstanford |
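For illustration, the interception step described in this comment could look roughly like the sketch below. It uses only the standard library; `percent_encode_path` is a hypothetical helper name, not something Sanic or httptools provides, and it assumes the path should be encoded as UTF-8 before percent-encoding.

```python
from urllib.parse import quote


def percent_encode_path(path: str) -> bytes:
    """Hypothetical helper: return a pure-ASCII, percent-encoded path."""
    # "/" is kept so segments stay separated; "%" is kept so sequences that
    # are already encoded are not encoded twice. Note that this also means a
    # literal "%" in the input is NOT escaped.
    return quote(path, safe="/%").encode("ascii")


print(percent_encode_path("/jacaré"))  # b'/jacar%C3%A9'
print(percent_encode_path("/中国"))    # b'/%E4%B8%AD%E5%9B%BD'
```

The cost mentioned above would come from running a step like this on every incoming request, and again for every registered route.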
@sjsadowski for this simple script I made, using
from sanic.request import Request

uri = "/jacaré"
Request(
    url_bytes=uri.encode("latin1"),
    headers={},
    version=None,
    method='GET',
    transport=None
)
Result:
But, even if I create a simple server:
from sanic import Sanic
from sanic.response import text

app = Sanic(__name__)

@app.route("/test/<hello>")
async def myroute(request, hello):
    return text("hello, {}!".format(hello))

if __name__ == '__main__':
    app.run(host="0.0.0.0", port=8000, workers=1)
It will raise (basically) the same error. Server:
cURL:
|
I think the encoding is supposed to be 'latin-1', but either way é is not valid in RFC2616. Here is a direct link to the character table for RFC2616 that is implemented in http-tools |
Oh. Sorry, my bad. I'll pay more attention to the RFC 😁 Regarding the dash on
>>> "é".encode("utf-8")
b'\xc3\xa9'
>>> "é".encode("utf8")
b'\xc3\xa9'
>>> "é".encode("latin1")
b'\xe9'
>>> "é".encode("latin-1")
b'\xe9' |
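As a small standard-library sketch of what the session above implies: the dashed and undashed spellings resolve to the same codec, but the choice of charset changes the percent-encoded form, and characters such as 中国 cannot be represented in latin-1 at all. The variable names are only for illustration.

```python
import codecs
from urllib.parse import quote

# "latin1"/"latin-1" and "utf8"/"utf-8" are aliases for the same codecs.
same_latin1 = codecs.lookup("latin1").name == codecs.lookup("latin-1").name
same_utf8 = codecs.lookup("utf8").name == codecs.lookup("utf-8").name
print(same_latin1, same_utf8)  # True True

# The charset chosen before percent-encoding changes the bytes on the wire.
latin1_form = quote("é", encoding="latin-1")
utf8_form = quote("é")  # UTF-8 is the default encoding for quote()
print(latin1_form, utf8_form)  # %E9 %C3%A9

# Characters outside latin-1 cannot be encoded with that codec.
try:
    "中国".encode("latin-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False
print(latin1_ok)  # False
```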
Yeah it's just dumb. It's very outdated (1999 - almost 20 years) and even though it's been superseded twice, they haven't updated the "legal" characters for a URI. |
We need to see if any updated RFC brings this question forward, since 2616 is obsoleted and updated. |
@sjsadowski wow, almost at the same time (the comments). I'll take a look into these other RFCs just for precaution. |
@vltr you want 3986 which is an update and 7230 which obsoletes 2616 |
Thanks, @sjsadowski, I'll read them carefully when I get some free time and will get back here with (or without) new conclusions 👍 |
I have not had a full chance to review the issue.... but my first thought when I read the idea of catching non Latin characters and translating them would be the potential performance hit compared to the use case. I'll come back after the weekend with some more thoughts. |
Didn't get time to dig into this issue, will take a look when I get a chance. If you guys have investigated and think it's because of httptools, we can pull @1st1 in for discussion |
Yeah, it's not httptools, it's the upstream parser it relies on, http-tools. I don't think the httptools package can implement a fix. I wonder if we want to consider a workaround that would enable unicode parsing with a flag. |
Everyone, I did some more digging into this and things got a little more complicated. From the RFC 3986, see the topic "2.4. When to Encode or Decode". From what I could understand:
What happens:
What can be done:
Thoughts? |
I'd lean towards nothing. If a client is supposed to be % encoding non-ascii characters that are not in the RFC2616 charmap, then bare unicode characters should not be parsed, and httptools/http-tools are doing the right thing. So long as routes that have unicode characters in them are getting % encoded as well, we have done what I think is our job, as the routes will match the correct client encoding. |
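To make the matching argument concrete, here is a minimal sketch with a hypothetical route table. `routes`, `register` and `dispatch` are illustrative names and are not Sanic's router; the sketch assumes clients percent-encode as UTF-8 with uppercase hex digits.

```python
from urllib.parse import quote, unquote

routes = {}  # hypothetical table: percent-encoded path -> handler


def register(path, handler):
    # Store the route under its percent-encoded form.
    routes[quote(path, safe="/")] = handler


def dispatch(raw_path: bytes):
    # A compliant client sends ASCII only; bare non-ASCII bytes are rejected.
    try:
        key = raw_path.decode("ascii")
    except UnicodeDecodeError:
        return 400, "bare non-ASCII bytes in request path"
    handler = routes.get(key)
    if handler is None:
        return 404, "not found"
    # Decode only after matching, so the handler sees readable text.
    return 200, handler(unquote(key))


register("/jacaré", lambda path: "hello from {}".format(path))

print(dispatch(b"/jacar%C3%A9"))             # (200, 'hello from /jacaré')
print(dispatch("/jacaré".encode("utf-8")))   # (400, ...)
```

A real router would also need to normalize the case of the hex digits (`%c3%a9` and `%C3%A9` are equivalent) and leave route parameter markers such as `<hello>` unencoded; both are left out here to keep the sketch short.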
@sjsadowski that would be my opinion too. We can perhaps add a server option to |
@vltr unless we get a chorus of 'no' I'm going to say we just need to document the hell out of it. A special section on unicode characters or something for handling unicode in |
@sjsadowski I completely agree with you. |
Perhaps we can even provide some pre-built examples of extending the |
I'm actually closing this per #1424 |
@sjsadowski I'm sorry, I really forgot to document some examples with the |
@vltr no worries, just trying to do some house keeping! |
@sjsadowski I know 😉 And I'm glad you're doing it, because I'm so full of stuff to do that this went straight through my todo list and I completely forgot about it. |
@sjsadowski I made a mess in my mind here - I thought this was related to |
I'm going to close it. Solving the issue itself is going to require significant work and should be tracked separately - if we choose to take it on - and I think this issue number has run its course. |
Per #539 there may be a continued problem in parsing unicode characters due to a dependency.
Per @vltr: /中国 or even /jacaré would break on request.py, mostly because of httptools.parse_url, which receives bytes and is unable to parse non-ascii chars. I'll need to dig deeper to see if this is a limitation in httptools itself or just in the Python bindings.
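A quick way to see which bytes of such a path fall outside ASCII, without calling the parser at all, is sketched below. `non_ascii_offsets` is a hypothetical helper, and the sketch assumes the path reaches the server as UTF-8 bytes.

```python
def non_ascii_offsets(url_bytes: bytes) -> list:
    """Hypothetical helper: offsets of bytes outside the ASCII range."""
    return [i for i, b in enumerate(url_bytes) if b >= 0x80]


print(non_ascii_offsets("/jacaré".encode("utf-8")))  # [6, 7]
print(non_ascii_offsets("/中国".encode("utf-8")))    # [1, 2, 3, 4, 5, 6]
print(non_ascii_offsets(b"/test/hello"))             # []
```

An empty list means the path is plain ASCII; any other result marks the bytes that would have to be percent-encoded first.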