JSONParser (and CharField) let malformed strings (isolated surrogate code points) pass through to the application… to then cause late 500 errors #7026

moseb · 2019-10-30T17:59:56Z

Hi!

I'm working on a DRF-based backend and we get 500 errors caused by specific Unicode characters…
I'm opening an issue here because it's not specific to our code or setup. I would also like to share a workaround and hear your thoughts about it. I'm aware of #6895 and #6633 and checked that DRF 3.10.3 is still affected by this issue.

The problem

Make sure you use rest_framework.parsers.JSONParser in REST_FRAMEWORK['DEFAULT_PARSER_CLASSES'] in settings. Now pass JSON to an API endpoint of yours that looks like this: {"title": "\ud83d"}. Instead of title use some field that your serializer supports and that is backed by a CharField, explicitly or implicitly. That input is ASCII technically but the JSON decoder will interpret \ud83d and turn it into an str instance equal to chr(0xd83d), i.e. a string with is a code point from the surrogates block which cannot be encoded to UTF-8 (or UTF-16 or ..) — because Isolated surrogate code points have no general interpretation —, see:

In [1]: import sys; sys.version_info.major                                                                                                                                                                           
Out[1]: 3

In [2]: chr(0xd83d).encode('utf-8')                                                                                                                                                                                  
[..]
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud83d' in position 0: surrogates not allowed

In [3]: import json; json.loads('{"title": "\\ud83d"}')['title'].encode('utf-8')                                                                                                                                     
[..]
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud83d' in position 0: surrogates not allowed

So my CharField title now contains a Python 3 string '\ud83d' and the code in the serializer starts working with it and we will only learn that we received malformed data in the first place once we try to store it into a database or when we use it while rendering the reply. That's rather late — maybe too late?

To write a test for this case for your own API, you could do something like this:

def test_..............(self):
    detail_url = reverse(...................)
    data = '{"some-charfield-of-yours": "\\ud83d"}'

    response = self.client.patch(detail_url, data, content_type='application/json')

    ...........

Workaround

One way to workaround this problem globally and deny malformed input from even getting to your serializers is to use a derived JSON parser for REST_FRAMEWORK['DEFAULT_PARSER_CLASSES'] like this:

# This code is from https://github.com/encode/django-rest-framework/issues/7026#issue-514872212
# Licensed under the BSD license as DRF itself

import os

from rest_framework.exceptions import ParseError
from rest_framework.parsers import JSONParser
from rest_framework.utils import json


class JsonParserThatRejectsSurrogateCodePoints(JSONParser):

    @staticmethod
    def _reject_surrogate_code_points(parsed):
        try:
            with open(os.devnull, 'w') as f:
                json.dump(parsed, f, ensure_ascii=False)
        except UnicodeEncodeError as e:
            raise ParseError(f'Parsed JSON contains surrogates (code points 0xD800 to 0xDFFF) - {e}')

    def parse(self, stream, media_type=None, parser_context=None):
        parsed = super().parse(stream, media_type=media_type, parser_context=parser_context)
        self._reject_surrogate_code_points(parsed)
        return parsed

I have not measured the performance penalty if this approach, yet. The upside is that only once single place of code needs to be touched to get all API endpoints on dry ground.

Discussion

I would love to hear how you handled this situation in your backend, if this is something you expect DRF users to handle themselves or would want to protect against upstream, and what other approaches come to your mind.

Many thanks in advance,

Sebastian

The text was updated successfully, but these errors were encountered:

xordoquy · 2019-10-30T18:17:38Z

That input is ASCII technically

Your input encoding should be provided by the client. If none was sent, DRF will fall back to Django's default DEFAULT_CHARSET which - unless overridden - is utf-8.

Since you are expecting string to be ASCII, you could:

enforce the client to be sending an explicit encoding matching the request body
configure Django to use ASCII as fallback
override the JSONParser to use another default encoding.

moseb · 2019-10-31T09:45:33Z

@xordoquy I think there is a misunderstanding here: My "That input is ASCII technically" was trying to emphasize that there are no funky bytes in the request payload: Things only get funky once the JSON parser comes into play. The string {"title": "\\ud83d"} is both well-formed ASCII and well-formed UTF-8 (due to use of a rather limited set of characters). Let me demonstrate:

# ipython3
[..]
In [1]: b'{"title": "\\ud83d"}'.decode('ascii')
Out[1]: '{"title": "\\ud83d"}'

In [2]: b'{"title": "\\ud83d"}'.decode('utf-8')
Out[2]: '{"title": "\\ud83d"}'

The client does send a UTF-8 encoding header Content-Type: application/json;charset=UTF-8 matching its payload and the server also has DEFAULT_CHARSET set to default UTF-8, so the setup is sane and mainstream.

If you want to see the bug in action:

$ curl --data '{"username":"drfbug7026","password":"\ud83d"}' --header 'Content-Type: application/json;charset=UTF-8' http://demo.drfdocs.com/accounts/login/
Internal Server Error

The server encountered an unexpected internal server error

(generated by waitress)

Could you re-open this ticket please? Thank you!

xordoquy · 2019-10-31T10:07:03Z

curl --data '{"username":"drfbug7026","password":"\\ud83d"}' --header 'Content-Type: application/json;charset=UTF-8' http://demo.drfdocs.com/accounts/login/
{"non_field_errors":["Unable to log in with provided credentials."]}

"\ud83d" is not a valid utf8 content.

moseb · 2019-10-31T10:08:45Z

You have one backslash too many in there. Please copy my curl line 1:1.

xordoquy · 2019-10-31T10:09:11Z

Scratch my previous comment.

xordoquy · 2019-10-31T10:14:09Z

What troubles me is:

json.dumps({"title": chr(0xd83d)}).encode('utf8')
b'{"title": "\\ud83d"}'

Which means the double backslash is expected somehow though and I trust Python for those.

moseb · 2019-10-31T10:41:43Z

The double backslash on Python level makes a single backslash on JSON level, so it's only displayed as two:

In [7]: len(b'\\')                                                                                                                                                                                                   
Out[7]: 1

It's interesting that the JSON encoder is waterproof to this problem.

moseb · 2019-10-31T11:45:54Z

There is a pull request #7028 now. It has a different approach to a fix, happy to discuss.

moseb · 2019-11-13T16:54:33Z

Python upstream knows about this issue and mentions it in their Python 3 JSON documentation:

The RFC does not explicitly forbid JSON strings which contain byte sequences that don’t correspond to valid Unicode characters (e.g. unpaired UTF-16 surrogates), but it does note that they may cause interoperability problems. By default, this module accepts and outputs (when present in the original str) code points for such sequences.

So as far as Python is concerned, this is a feature and not a bug. (This Python issue/fix seems related: https://bugs.python.org/issue17906)

I think for django-rest-framework that means that if we want to protect users from this issue we can (a) detect and deny such input or (b) auto-correct it (to the extend possible).

I can think of three different approaches/places to attack this problem:

Globally, before parsing JSON (e.g. using a regex replace loop) — the rejecting version is demo'ed in Reject surrogate code points in JSONParser (issue #7026) #7028
Globally, while parsing JSON (but object_pairs_hook and object_hook during parsing are not enough to cover all cases)
Globally, after parsing JSON (e.g. using a recursive post-processor)
Selectively, during input validation in most field classes in rest_framework.fields (not just CharField) (e.g. using a code-point string post-processor)

What do you think?

moseb · 2019-11-19T10:00:52Z

Any thoughts?

tomchristie · 2019-11-19T10:16:09Z

I'd be interested to start by seeing how this would look if we just applied it against CharField, and nothing else. It seems to me that there's a reasonable validation rule that we might want to apply at that layer of interface which is something like "this string must be able to encode into x", where x defaults to utf-8, but could also be some other character set.

hartwork · 2019-11-23T21:41:49Z

I'd be interested to start by seeing how this would look if we just applied it against CharField, and nothing else. [..]

I have just created a pull request #7067 to demo and for discussion.

moseb · 2019-12-10T10:16:33Z

Any news?

moseb · 2019-12-17T13:51:51Z

Any thoughts?

moseb · 2020-01-02T10:31:34Z

Happy new year! Any thoughts about pull request #7067 for a fix?

auvipy · 2020-01-02T12:33:09Z

@tomchristie

tomchristie · 2020-01-02T12:48:00Z

@moseb @auvipy Please don't spam issue trackers with direct requests like this.

auvipy · 2020-01-02T13:09:37Z

@moseb @auvipy Please don't spam issue trackers with direct requests like this.

I didn't spam. I felt like the guy is not having a proper response or attention for the work he is doing. So I thought it might be better to mention the core maintainer of the project who is sponsored to work on this particular project for part/full-time.

tomchristie · 2020-01-02T13:21:16Z

Pinging folks directly without adding anything of value is spammy behavior.
You have a track record here, and I'm asking you to please be more aware of that.

FWIW I'm still working flat-out: https://github.com/tomchristie/

That time happens to be largely on httpx right at this moment, which our sponsors are fully aware of, via the monthly reports.

moseb · 2020-01-02T13:31:31Z

@moseb @auvipy Please don't spam issue trackers with direct requests like this.

Speaking for myself, there were more than 7 days between my pings which I would consider okay in my own projects. (If that frequency of pinging is too high for DRF, please share what frequency of pinging is considered okay.)

tomchristie · 2020-01-02T13:43:41Z

It can occasionally be useful to get a helpful nudge, but generally it never adds value and if it's done repeatedly it's not likely to be appreciated.

There's a follow-up on #7067 now, I'd suggest any further discussions should be against that.

auvipy · 2020-01-02T14:14:33Z

It can occasionally be useful to get a helpful nudge, but generally it never adds value and if it's done repeatedly it's not likely to be appreciated.

There's a follow-up on #7067 now, I'd suggest any further discussions should be against that.

I do myself a sponsored open source maintainer. And I found your comment very disrespectful. We never consider any one mentioning us and don't show ego. please don't try to teach me what is spammy behavior and what is not. I do participate in more open-source projects but never got triggered by someone mentioning me directly. I rather found your comment useless and adding no real value.

* CharField: Detect and prohibit surrogate characters * CharField: Cover handling of surrogate characters

hartwork · 2020-01-06T17:21:19Z

With #7067 merged — is this ready to be closed?

tomchristie · 2020-01-06T17:41:07Z

Yup, thanks!

* CharField: Detect and prohibit surrogate characters * CharField: Cover handling of surrogate characters

xordoquy closed this as completed Oct 30, 2019

xordoquy reopened this Oct 31, 2019

This was referenced Oct 31, 2019

UnicodeEncodeError #6895

Closed

Reject surrogate code points in JSONParser (issue #7026) #7028

Closed

hartwork mentioned this issue Nov 23, 2019

Make CharField prohibit surrogate characters (#7026) #7067

Merged

tomchristie pushed a commit that referenced this issue Jan 6, 2020

Make CharField prohibit surrogate characters (#7026) (#7067)

373e521

* CharField: Detect and prohibit surrogate characters * CharField: Cover handling of surrogate characters

tomchristie closed this as completed Jan 6, 2020

This was referenced Feb 20, 2020

segfault on surrogate unicode python-rapidjson/python-rapidjson#81

Open

UltraJson doesn't behave the same way as json.JSONEncoder for unicode chars ultrajson/ultrajson#156

Closed

pchiquet pushed a commit to pchiquet/django-rest-framework that referenced this issue Nov 17, 2020

Make CharField prohibit surrogate characters (encode#7026) (encode#7067)

380498a

* CharField: Detect and prohibit surrogate characters * CharField: Cover handling of surrogate characters

sigvef pushed a commit to sigvef/django-rest-framework that referenced this issue Dec 3, 2022

Make CharField prohibit surrogate characters (encode#7026) (encode#7067)

4c1a8b1

* CharField: Detect and prohibit surrogate characters * CharField: Cover handling of surrogate characters

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

JSONParser (and CharField) let malformed strings (isolated surrogate code points) pass through to the application… to then cause late 500 errors #7026

JSONParser (and CharField) let malformed strings (isolated surrogate code points) pass through to the application… to then cause late 500 errors #7026

moseb commented Oct 30, 2019 •

edited

Loading

xordoquy commented Oct 30, 2019

moseb commented Oct 31, 2019

xordoquy commented Oct 31, 2019 •

edited

Loading

moseb commented Oct 31, 2019

xordoquy commented Oct 31, 2019

xordoquy commented Oct 31, 2019

moseb commented Oct 31, 2019

moseb commented Oct 31, 2019

moseb commented Nov 13, 2019 •

edited

Loading

moseb commented Nov 19, 2019

tomchristie commented Nov 19, 2019

hartwork commented Nov 23, 2019

moseb commented Dec 10, 2019

moseb commented Dec 17, 2019

moseb commented Jan 2, 2020

auvipy commented Jan 2, 2020

tomchristie commented Jan 2, 2020

auvipy commented Jan 2, 2020

tomchristie commented Jan 2, 2020

moseb commented Jan 2, 2020

tomchristie commented Jan 2, 2020

auvipy commented Jan 2, 2020

hartwork commented Jan 6, 2020

tomchristie commented Jan 6, 2020

JSONParser (and CharField) let malformed strings (isolated surrogate code points) pass through to the application… to then cause late 500 errors #7026

JSONParser (and CharField) let malformed strings (isolated surrogate code points) pass through to the application… to then cause late 500 errors #7026

Comments

moseb commented Oct 30, 2019 • edited Loading

The problem

Workaround

Discussion

xordoquy commented Oct 30, 2019

moseb commented Oct 31, 2019

xordoquy commented Oct 31, 2019 • edited Loading

moseb commented Oct 31, 2019

xordoquy commented Oct 31, 2019

xordoquy commented Oct 31, 2019

moseb commented Oct 31, 2019

moseb commented Oct 31, 2019

moseb commented Nov 13, 2019 • edited Loading

moseb commented Nov 19, 2019

tomchristie commented Nov 19, 2019

hartwork commented Nov 23, 2019

moseb commented Dec 10, 2019

moseb commented Dec 17, 2019

moseb commented Jan 2, 2020

auvipy commented Jan 2, 2020

tomchristie commented Jan 2, 2020

auvipy commented Jan 2, 2020

tomchristie commented Jan 2, 2020

moseb commented Jan 2, 2020

tomchristie commented Jan 2, 2020

auvipy commented Jan 2, 2020

hartwork commented Jan 6, 2020

tomchristie commented Jan 6, 2020

moseb commented Oct 30, 2019 •

edited

Loading

xordoquy commented Oct 31, 2019 •

edited

Loading

moseb commented Nov 13, 2019 •

edited

Loading