JSONParser (and CharField) let malformed strings (isolated surrogate code points) pass through to the application… to then cause late 500 errors #7026
Comments
Your input encoding should be provided by the client. If none was sent, DRF will fall back to Django's default. Since you are expecting the string to be ASCII, you could:
|
@xordoquy I think there is a misunderstanding here: My "That input is ASCII technically" was trying to emphasize that there are no funky bytes in the request payload: Things only get funky once the JSON parser comes into play. The string
# ipython3
[..]
In [1]: b'{"title": "\\ud83d"}'.decode('ascii')
Out[1]: '{"title": "\\ud83d"}'
In [2]: b'{"title": "\\ud83d"}'.decode('utf-8')
Out[2]: '{"title": "\\ud83d"}'
The client does send a UTF-8 encoding header.
If you want to see the bug in action:
$ curl --data '{"username":"drfbug7026","password":"\ud83d"}' --header 'Content-Type: application/json;charset=UTF-8' http://demo.drfdocs.com/accounts/login/
Internal Server Error
The server encountered an unexpected internal server error
(generated by waitress)
Could you re-open this ticket please? Thank you! |
curl --data '{"username":"drfbug7026","password":"\\ud83d"}' --header 'Content-Type: application/json;charset=UTF-8' http://demo.drfdocs.com/accounts/login/
{"non_field_errors":["Unable to log in with provided credentials."]}
|
You have one backslash too many in there. Please copy my curl line 1:1. |
Scratch my previous comment. |
What troubles me is:
json.dumps({"title": chr(0xd83d)}).encode('utf8')
b'{"title": "\\ud83d"}'
Which means the double backslash is expected somehow though and I trust Python for those. |
The double backslash on Python level makes a single backslash on JSON level, so it's only displayed as two:
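A minimal sketch of that point, using only the standard library: `repr()` applies Python's own string escaping on top of the JSON text, while `print()` shows the JSON text as it would go over the wire. The variable names are illustrative only.

```python
import json

# Serialize a string that consists of one isolated surrogate code point.
payload = json.dumps({"title": chr(0xD83D)})

# repr() adds Python-level escaping, so the backslash is displayed doubled.
print(repr(payload))
# print() shows the actual JSON text, which contains a single backslash.
print(payload)

# Count the backslash characters that are really in the JSON text.
backslash_count = payload.count("\\")
# Decoding the JSON text brings the isolated surrogate back.
round_tripped = json.loads(payload)["title"]
```

With the default `ensure_ascii=True`, the encoder escapes the surrogate instead of trying to emit it as raw bytes.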
It's interesting that the JSON encoder is waterproof to this problem. |
There is a pull request #7028 now. It has a different approach to a fix, happy to discuss. |
Python upstream knows about this issue and mentions it in their Python 3 JSON documentation:
So as far as Python is concerned, this is a feature and not a bug. (This Python issue/fix seems related: https://bugs.python.org/issue17906)
I think for django-rest-framework that means that if we want to protect users from this issue we can (a) detect and deny such input or (b) auto-correct it (to the extent possible). I can think of three different approaches/places to attack this problem:
What do you think? |
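To make options (a) and (b) more concrete, here is a rough standard-library sketch. The helper names `contains_surrogates` and `replace_surrogates` are hypothetical and not part of DRF, and substituting U+FFFD is only one possible auto-correction policy.

```python
import re

# Matches any code point from the surrogates block (U+D800 to U+DFFF).
SURROGATES = re.compile("[\ud800-\udfff]")


def contains_surrogates(value: str) -> bool:
    """Option (a): detect surrogates so that the caller can deny the input."""
    return SURROGATES.search(value) is not None


def replace_surrogates(value: str, replacement: str = "\ufffd") -> str:
    """Option (b): auto-correct by substituting a replacement character."""
    return SURROGATES.sub(replacement, value)
```

Characters outside the Basic Multilingual Plane, such as emoji, are single code points in a Python 3 `str` and are left alone by both helpers.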
Any thoughts? |
I'd be interested to start by seeing how this would look if we just applied it against CharField, and nothing else. It seems to me that there's a reasonable validation rule that we might want to apply at that layer of interface which is something like "this string must be able to encode into x", where x defaults to utf-8, but could also be some other character set. |
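A rule of the form "this string must be able to encode into x" might look roughly like the sketch below. It is plain Python, not DRF's validator API: the function name `validate_encodable` and the use of `ValueError` are assumptions for illustration.

```python
def validate_encodable(value: str, encoding: str = "utf-8") -> str:
    """Return value unchanged if it can be encoded, else raise ValueError."""
    try:
        value.encode(encoding)
    except UnicodeEncodeError as exc:
        # exc.start is the index of the first character that failed to encode.
        code_point = ord(value[exc.start])
        raise ValueError(
            f"String cannot be encoded as {encoding}: "
            f"U+{code_point:04X} at index {exc.start}"
        ) from exc
    return value
```

Making the encoding a parameter means the same check also covers stricter targets such as `ascii` or `latin-1`.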
I have just created a pull request #7067 to demo and for discussion. |
Any news? |
Any thoughts? |
Happy new year! Any thoughts about pull request #7067 for a fix? |
I didn't spam. I felt like the guy is not having a proper response or attention for the work he is doing. So I thought it might be better to mention the core maintainer of the project who is sponsored to work on this particular project for part/full-time. |
Pinging folks directly without adding anything of value is spammy behavior. FWIW I'm still working flat-out: https://github.com/tomchristie/ That time happens to be largely on httpx right at this moment, which our sponsors are fully aware of, via the monthly reports. |
Speaking for myself, there were more than 7 days between my pings which I would consider okay in my own projects. (If that frequency of pinging is too high for DRF, please share what frequency of pinging is considered okay.) |
It can occasionally be useful to get a helpful nudge, but generally it never adds value and if it's done repeatedly it's not likely to be appreciated. There's a follow-up on #7067 now, I'd suggest any further discussions should be against that. |
I am myself a sponsored open source maintainer. And I found your comment very disrespectful. We never consider anyone mentioning us and don't show ego. Please don't try to teach me what is spammy behavior and what is not. I do participate in more open-source projects but never got triggered by someone mentioning me directly. I rather found your comment useless and adding no real value. |
With #7067 merged — is this ready to be closed? |
Yup, thanks! |
* CharField: Detect and prohibit surrogate characters
* CharField: Cover handling of surrogate characters
Hi!
I'm working on a DRF-based backend and we get 500 errors caused by specific Unicode characters…
I'm opening an issue here because it's not specific to our code or setup. I would also like to share a workaround and hear your thoughts about it. I'm aware of #6895 and #6633 and checked that DRF 3.10.3 is still affected by this issue.
The problem
Make sure you use rest_framework.parsers.JSONParser in REST_FRAMEWORK['DEFAULT_PARSER_CLASSES'] in settings. Now pass JSON to an API endpoint of yours that looks like this: {"title": "\ud83d"}. Instead of title use some field that your serializer supports and that is backed by a CharField, explicitly or implicitly. That input is ASCII technically but the JSON decoder will interpret \ud83d and turn it into a str instance equal to chr(0xd83d), i.e. a string with a code point from the surrogates block, which cannot be encoded to UTF-8 (or UTF-16 or ..) because "Isolated surrogate code points have no general interpretation", see:
So my CharField title now contains a Python 3 string '\ud83d' and the code in the serializer starts working with it, and we will only learn that we received malformed data in the first place once we try to store it into a database or when we use it while rendering the reply. That's rather late, maybe too late?
To write a test for this case for your own API, you could do something like this:
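A minimal sketch of such a test, using only the standard library. Here `json.loads` stands in for the API's JSON parser, which is an assumption for illustration; a real test would post the body to an endpoint and assert on the response status. The class name is hypothetical.

```python
import json
import unittest


class IsolatedSurrogateInputTest(unittest.TestCase):
    # Raw request body: pure ASCII bytes, including a literal backslash.
    BODY = b'{"title": "\\ud83d"}'

    def test_body_is_plain_ascii(self):
        # The bytes decode identically as ASCII and as UTF-8.
        self.assertEqual(self.BODY.decode("ascii"), self.BODY.decode("utf-8"))

    def test_parsed_title_cannot_be_encoded(self):
        # json.loads stands in for the API's JSON parser (assumption).
        title = json.loads(self.BODY)["title"]
        self.assertEqual(title, chr(0xD83D))
        # The isolated surrogate only fails once something encodes it.
        with self.assertRaises(UnicodeEncodeError):
            title.encode("utf-8")
```

Saved as a module, this can be run with `python -m unittest`.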
Workaround
One way to work around this problem globally and deny malformed input from even getting to your serializers is to use a derived JSON parser for
REST_FRAMEWORK['DEFAULT_PARSER_CLASSES']
like this:
I have not measured the performance penalty of this approach yet. The upside is that only one single place in the code needs to be touched to get all API endpoints on dry ground.
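The core check behind such a parser could be sketched as follows, using only the standard library. This is not the original parser class: `parse_json_strict` is a hypothetical helper, and rejecting input by re-encoding the parsed data is an assumed strategy. In DRF, one would presumably call similar logic from a subclass of rest_framework.parsers.JSONParser that overrides parse() and raises a ParseError.

```python
import json


def parse_json_strict(raw: bytes):
    """Parse UTF-8 JSON and reject data that cannot be re-encoded as UTF-8."""
    data = json.loads(raw.decode("utf-8"))
    try:
        # ensure_ascii=False leaves characters unescaped, so encoding fails
        # on any isolated surrogate found in a key or a value.
        json.dumps(data, ensure_ascii=False).encode("utf-8")
    except UnicodeEncodeError as exc:
        raise ValueError(f"Malformed string in JSON input: {exc}") from exc
    return data
```

The extra serialization pass is where a performance cost would come from. A properly paired escape such as `\ud83d\ude00` is combined into one code point by the decoder and passes the check.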
Discussion
I would love to hear how you handled this situation in your backend, if this is something you expect DRF users to handle themselves or would want to protect against upstream, and what other approaches come to your mind.
Many thanks in advance,
Sebastian