Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JSONParser (and CharField) let malformed strings (isolated surrogate code points) pass through to the application… to then cause late 500 errors #7026

Closed
moseb opened this issue Oct 30, 2019 · 24 comments

Comments

@moseb
Copy link

moseb commented Oct 30, 2019

Hi!

I'm working on a DRF-based backend and we get 500 errors caused by specific Unicode characters…
I'm opening an issue here because it's not specific to our code or setup. I would also like to share a workaround and hear your thoughts about it. I'm aware of #6895 and #6633 and checked that DRF 3.10.3 is still affected by this issue.

The problem

Make sure you use rest_framework.parsers.JSONParser in REST_FRAMEWORK['DEFAULT_PARSER_CLASSES'] in settings. Now pass JSON to an API endpoint of yours that looks like this: {"title": "\ud83d"}. Instead of title use some field that your serializer supports and that is backed by a CharField, explicitly or implicitly. That input is ASCII technically but the JSON decoder will interpret \ud83d and turn it into an str instance equal to chr(0xd83d), i.e. a string with is a code point from the surrogates block which cannot be encoded to UTF-8 (or UTF-16 or ..) — because Isolated surrogate code points have no general interpretation —, see:

In [1]: import sys; sys.version_info.major                                                                                                                                                                           
Out[1]: 3

In [2]: chr(0xd83d).encode('utf-8')                                                                                                                                                                                  
[..]
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud83d' in position 0: surrogates not allowed

In [3]: import json; json.loads('{"title": "\\ud83d"}')['title'].encode('utf-8')                                                                                                                                     
[..]
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud83d' in position 0: surrogates not allowed

So my CharField title now contains a Python 3 string '\ud83d' and the code in the serializer starts working with it and we will only learn that we received malformed data in the first place once we try to store it into a database or when we use it while rendering the reply. That's rather late — maybe too late?

To write a test for this case for your own API, you could do something like this:

def test_..............(self):
    detail_url = reverse(...................)
    data = '{"some-charfield-of-yours": "\\ud83d"}'

    response = self.client.patch(detail_url, data, content_type='application/json')

    ...........

Workaround

One way to workaround this problem globally and deny malformed input from even getting to your serializers is to use a derived JSON parser for REST_FRAMEWORK['DEFAULT_PARSER_CLASSES'] like this:

# This code is from https://github.com/encode/django-rest-framework/issues/7026#issue-514872212
# Licensed under the BSD license as DRF itself

import os

from rest_framework.exceptions import ParseError
from rest_framework.parsers import JSONParser
from rest_framework.utils import json


class JsonParserThatRejectsSurrogateCodePoints(JSONParser):

    @staticmethod
    def _reject_surrogate_code_points(parsed):
        try:
            with open(os.devnull, 'w') as f:
                json.dump(parsed, f, ensure_ascii=False)
        except UnicodeEncodeError as e:
            raise ParseError(f'Parsed JSON contains surrogates (code points 0xD800 to 0xDFFF) - {e}')

    def parse(self, stream, media_type=None, parser_context=None):
        parsed = super().parse(stream, media_type=media_type, parser_context=parser_context)
        self._reject_surrogate_code_points(parsed)
        return parsed

I have not measured the performance penalty if this approach, yet. The upside is that only once single place of code needs to be touched to get all API endpoints on dry ground.

Discussion

I would love to hear how you handled this situation in your backend, if this is something you expect DRF users to handle themselves or would want to protect against upstream, and what other approaches come to your mind.

Many thanks in advance,

Sebastian

@xordoquy
Copy link
Collaborator

That input is ASCII technically

Your input encoding should be provided by the client. If none was sent, DRF will fall back to Django's default DEFAULT_CHARSET which - unless overridden - is utf-8.

Since you are expecting string to be ASCII, you could:

  • enforce the client to be sending an explicit encoding matching the request body
  • configure Django to use ASCII as fallback
  • override the JSONParser to use another default encoding.

@moseb
Copy link
Author

moseb commented Oct 31, 2019

@xordoquy I think there is a misunderstanding here: My "That input is ASCII technically" was trying to emphasize that there are no funky bytes in the request payload: Things only get funky once the JSON parser comes into play. The string {"title": "\\ud83d"} is both well-formed ASCII and well-formed UTF-8 (due to use of a rather limited set of characters). Let me demonstrate:

# ipython3
[..]
In [1]: b'{"title": "\\ud83d"}'.decode('ascii')
Out[1]: '{"title": "\\ud83d"}'

In [2]: b'{"title": "\\ud83d"}'.decode('utf-8')
Out[2]: '{"title": "\\ud83d"}'

The client does send a UTF-8 encoding header Content-Type: application/json;charset=UTF-8 matching its payload and the server also has DEFAULT_CHARSET set to default UTF-8, so the setup is sane and mainstream.

If you want to see the bug in action:

$ curl --data '{"username":"drfbug7026","password":"\ud83d"}' --header 'Content-Type: application/json;charset=UTF-8' http://demo.drfdocs.com/accounts/login/
Internal Server Error

The server encountered an unexpected internal server error

(generated by waitress)

Could you re-open this ticket please? Thank you!

@xordoquy
Copy link
Collaborator

xordoquy commented Oct 31, 2019

curl --data '{"username":"drfbug7026","password":"\\ud83d"}' --header 'Content-Type: application/json;charset=UTF-8' http://demo.drfdocs.com/accounts/login/
{"non_field_errors":["Unable to log in with provided credentials."]}

"\ud83d" is not a valid utf8 content.

@moseb
Copy link
Author

moseb commented Oct 31, 2019

You have one backslash too many in there. Please copy my curl line 1:1.

@xordoquy
Copy link
Collaborator

Scratch my previous comment.

@xordoquy xordoquy reopened this Oct 31, 2019
@xordoquy
Copy link
Collaborator

What troubles me is:

json.dumps({"title": chr(0xd83d)}).encode('utf8')
b'{"title": "\\ud83d"}'

Which means the double backslash is expected somehow though and I trust Python for those.

@moseb
Copy link
Author

moseb commented Oct 31, 2019

The double backslash on Python level makes a single backslash on JSON level, so it's only displayed as two:

In [7]: len(b'\\')                                                                                                                                                                                                   
Out[7]: 1

It's interesting that the JSON encoder is waterproof to this problem.

@moseb
Copy link
Author

moseb commented Oct 31, 2019

There is a pull request #7028 now. It has a different approach to a fix, happy to discuss.

@moseb
Copy link
Author

moseb commented Nov 13, 2019

Python upstream knows about this issue and mentions it in their Python 3 JSON documentation:

The RFC does not explicitly forbid JSON strings which contain byte sequences that don’t correspond to valid Unicode characters (e.g. unpaired UTF-16 surrogates), but it does note that they may cause interoperability problems. By default, this module accepts and outputs (when present in the original str) code points for such sequences.

So as far as Python is concerned, this is a feature and not a bug. (This Python issue/fix seems related: https://bugs.python.org/issue17906)


I think for django-rest-framework that means that if we want to protect users from this issue we can (a) detect and deny such input or (b) auto-correct it (to the extend possible).

I can think of three different approaches/places to attack this problem:

  • Globally, before parsing JSON (e.g. using a regex replace loop) — the rejecting version is demo'ed in Reject surrogate code points in JSONParser (issue #7026) #7028
  • Globally, while parsing JSON (but object_pairs_hook and object_hook during parsing are not enough to cover all cases)
  • Globally, after parsing JSON (e.g. using a recursive post-processor)
  • Selectively, during input validation in most field classes in rest_framework.fields (not just CharField) (e.g. using a code-point string post-processor)

What do you think?

@moseb
Copy link
Author

moseb commented Nov 19, 2019

Any thoughts?

@tomchristie
Copy link
Member

I'd be interested to start by seeing how this would look if we just applied it against CharField, and nothing else. It seems to me that there's a reasonable validation rule that we might want to apply at that layer of interface which is something like "this string must be able to encode into x", where x defaults to utf-8, but could also be some other character set.

@hartwork
Copy link
Contributor

I'd be interested to start by seeing how this would look if we just applied it against CharField, and nothing else. [..]

I have just created a pull request #7067 to demo and for discussion.

@moseb
Copy link
Author

moseb commented Dec 10, 2019

Any news?

@moseb
Copy link
Author

moseb commented Dec 17, 2019

Any thoughts?

@moseb
Copy link
Author

moseb commented Jan 2, 2020

Happy new year! Any thoughts about pull request #7067 for a fix?

@auvipy
Copy link
Member

auvipy commented Jan 2, 2020

@tomchristie

@tomchristie
Copy link
Member

@moseb @auvipy Please don't spam issue trackers with direct requests like this.

@auvipy
Copy link
Member

auvipy commented Jan 2, 2020

@moseb @auvipy Please don't spam issue trackers with direct requests like this.

I didn't spam. I felt like the guy is not having a proper response or attention for the work he is doing. So I thought it might be better to mention the core maintainer of the project who is sponsored to work on this particular project for part/full-time.

@tomchristie
Copy link
Member

Pinging folks directly without adding anything of value is spammy behavior.
You have a track record here, and I'm asking you to please be more aware of that.

FWIW I'm still working flat-out: https://github.com/tomchristie/

Screenshot 2020-01-02 at 13 20 41

That time happens to be largely on httpx right at this moment, which our sponsors are fully aware of, via the monthly reports.

@moseb
Copy link
Author

moseb commented Jan 2, 2020

@moseb @auvipy Please don't spam issue trackers with direct requests like this.

Speaking for myself, there were more than 7 days between my pings which I would consider okay in my own projects. (If that frequency of pinging is too high for DRF, please share what frequency of pinging is considered okay.)

@tomchristie
Copy link
Member

It can occasionally be useful to get a helpful nudge, but generally it never adds value and if it's done repeatedly it's not likely to be appreciated.

There's a follow-up on #7067 now, I'd suggest any further discussions should be against that.

@auvipy
Copy link
Member

auvipy commented Jan 2, 2020

It can occasionally be useful to get a helpful nudge, but generally it never adds value and if it's done repeatedly it's not likely to be appreciated.

There's a follow-up on #7067 now, I'd suggest any further discussions should be against that.

I do myself a sponsored open source maintainer. And I found your comment very disrespectful. We never consider any one mentioning us and don't show ego. please don't try to teach me what is spammy behavior and what is not. I do participate in more open-source projects but never got triggered by someone mentioning me directly. I rather found your comment useless and adding no real value.

tomchristie pushed a commit that referenced this issue Jan 6, 2020
* CharField: Detect and prohibit surrogate characters

* CharField: Cover handling of surrogate characters
@hartwork
Copy link
Contributor

hartwork commented Jan 6, 2020

With #7067 merged — is this ready to be closed?

@tomchristie
Copy link
Member

Yup, thanks!

pchiquet pushed a commit to pchiquet/django-rest-framework that referenced this issue Nov 17, 2020
* CharField: Detect and prohibit surrogate characters

* CharField: Cover handling of surrogate characters
sigvef pushed a commit to sigvef/django-rest-framework that referenced this issue Dec 3, 2022
* CharField: Detect and prohibit surrogate characters

* CharField: Cover handling of surrogate characters
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants