GET and POST behavior w.r.t. utf-8 decoding errors #161

dairiki · 2014-09-30T15:50:56Z

The way things stand

Current behavior on badly encoded GET and POST params is

Request.GET raises UnicodeDecodeError:

>>> from webob import Request
>>> req = Request.blank('/?f%FC=123')
>>> req.GET
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/srv/w/users/dairiki/git/github/webob/webob/request.py", line 838, in GET
    vars = GetDict(data, env)
  File "/srv/w/users/dairiki/git/github/webob/webob/multidict.py", line 287, in __init__
    MultiDict.__init__(self, data)
  File "/srv/w/users/dairiki/git/github/webob/webob/multidict.py", line 38, in __init__
    items = list(args[0])
  File "/srv/w/users/dairiki/git/github/webob/webob/compat.py", line 113, in parse_qsl_text
    yield (name.decode(encoding), value.decode(encoding))
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfc in position 1: invalid start byte

Request.POST essentially does the utf-8 decoding with errors='replace':

>>> from io import BytesIO
>>> from webob import Request
>>> body = b'f\xfc=bar'
>>> environ = {
...     'wsgi.input': BytesIO(body),
...     'REQUEST_METHOD': 'POST',
...     'CONTENT_LENGTH': len(body),
... }
>>> Request(environ).POST
MultiDict([('f�', 'bar')])

Behavior with Content-Type: multipart/form-data is similar. [Edit: actually if the bad bytes are in the body of one of the subparts then UnicodeDecodeError is raised. If the bad bytes are in the headers of the subparts, they are decoded with errors='replace'.]
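
The difference between the two behaviors can be reproduced without webob. What follows is a minimal sketch using only the standard library's urllib.parse.parse_qsl, whose errors parameter selects between strict decoding (which raises, as Request.GET does) and errors='replace' decoding (which substitutes U+FFFD, as Request.POST effectively does). The helper names parse_strict and parse_replace are made up for illustration, and this does not exercise webob's own code paths.

```python
from urllib.parse import parse_qsl

QUERY = "f%FC=123"  # 0xFC is not a valid UTF-8 start byte


def parse_strict(qs):
    """Strict decoding: undecodable bytes raise UnicodeDecodeError."""
    return parse_qsl(qs, keep_blank_values=True,
                     encoding="utf-8", errors="strict")


def parse_replace(qs):
    """Lenient decoding: undecodable bytes become U+FFFD."""
    return parse_qsl(qs, keep_blank_values=True,
                     encoding="utf-8", errors="replace")


strict_error = None
try:
    parse_strict(QUERY)
except UnicodeDecodeError as exc:
    strict_error = exc
    print("strict:", exc.reason)

print("replace:", parse_replace(QUERY))
```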

Gripes

1. Having request.GET raise UnicodeDecodeError is inconvenient. If one doesn't want way too many entries in one's exception log when a pentester is set loose on one's site, one must check for errors on every access to request.GET (or request.params). Also it seems, IMO, bad form (or unexpected, at least) for a property to raise a UnicodeDecodeError.
2. This is inconsistent. Request.GET and Request.POST should (IMO) behave similarly w.r.t. how they handle improperly encoded characters.

Possible Solutions

1. Change request.GET and request.POST so that they return a NoVars instance on parameter decode errors. The reason attribute of the return value would describe the decoding error. (This is similar to how it is suggested to handle non multipart/form-data bodies in #149.)
2. Change Request.GET to use errors='replace' semantics when decoding (so that it no longer raises UnicodeDecodeError and matches the behavior of Request.POST).
3. Screw it! No API changes.

At the moment, I vote for option #1.
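
As a rough illustration of what option 1 could look like from the caller's side, here is a sketch that does not use webob at all. DecodeFailure is a hypothetical stand-in for webob.multidict.NoVars (an empty mapping carrying a reason attribute), and get_params is a made-up helper. The real change would live inside Request.GET and Request.POST.

```python
from urllib.parse import parse_qsl


class DecodeFailure(dict):
    """Hypothetical stand-in for NoVars: always empty, carries a reason."""

    def __init__(self, reason):
        super().__init__()
        self.reason = reason


def get_params(query_string):
    """Return decoded params, or an empty DecodeFailure on bad encoding."""
    try:
        pairs = parse_qsl(query_string, keep_blank_values=True,
                          encoding="utf-8", errors="strict")
    except UnicodeDecodeError as exc:
        return DecodeFailure("Could not decode query string: %s" % exc)
    # A plain dict drops repeated keys; a real implementation would
    # keep them in a multidict.
    return dict(pairs)


good = get_params("a=1&b=2")
bad = get_params("f%FC=123")
print(good)
print(bad, getattr(bad, "reason", None))
```

With this shape, code that only reads parameters keeps working on a bad request (it just sees no parameters), and code that cares can inspect the reason attribute.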

(I'm not quite sure, however, how easy it will be to implement for POST under python 2. Py3k's cgi.FieldStorage has an explicit errors parameter to control how character decoding errors are handled. Python 2's FieldStorage appears to lack this control.)

If there is consensus on what needs doing, I’d be happy to (attempt to) come up with a PR.

digitalresistor · 2015-06-01T03:54:25Z

I'd be happy with a PR against master for solution number 1. There are some changes that were made to fix #149 so I wonder how that changes things with regards to your possible solution.

dairiki · 2015-06-10T21:06:59Z

In comments on #198, I wrote

I’m working on a pull request which will have request.GET return a webob.multidict.NoVars instance when QUERY_STRING is mis-encoded.
(this is solution 1, from above)

To which, @mmerickel responded

This will likely not be accepted in favor of something like #115 (which you could definitely contribute to).

Moving discussion here, since there is a bit more context here.

dairiki · 2015-06-10T21:15:06Z

@mmerickel Okay, I was proceeding based on the comment from @bertjwregeer, above. I'll abort for now, until there is a clear consensus.

digitalresistor · 2015-06-10T21:23:20Z

I think that #115 is still important, but I am not sure that accessing the request.GET should raise an error. I would love to have @mmerickel's input on this, to see what he thinks.

dairiki · 2015-06-10T21:59:52Z

As a bit of an aside, looking at the charset decoding of multipart/form-data bodies was giving me a headache anyway, particularly with respect to non-ascii control names. I'm including what I think I've figured out about this here for posterity...

RFC2388 (which describes multipart/form-data) seems to say that RFC2047 (e.g. =?UTF-8?Q?Foo?=) should be used to encode non-ascii names. Later it says that RFC2231 (e.g. filename*=utf-8'en-us'Foo) should be used to encode non-ascii filenames. (This doesn't make much sense to me, since control names and filenames are both attributes on the Content-Disposition header — why shouldn't they both be encoded using the same mechanism?)
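
To make the two encodings concrete, here is a small sketch that decodes one value of each kind with the standard library's email package. The helper names decode_rfc2047 and decode_rfc2231_value are invented for this example, and the sample values are illustrative rather than captured from a real browser.

```python
from email.header import decode_header
from email.utils import decode_rfc2231
from urllib.parse import unquote


def decode_rfc2047(value):
    """Decode an RFC 2047 encoded-word such as =?UTF-8?Q?f=C3=BC?=."""
    parts = []
    for chunk, charset in decode_header(value):
        # Encoded words come back as bytes plus a charset name;
        # unencoded text comes back as str with charset None.
        if isinstance(chunk, bytes):
            chunk = chunk.decode(charset or "ascii")
        parts.append(chunk)
    return "".join(parts)


def decode_rfc2231_value(value):
    """Decode an RFC 2231 extended value such as utf-8'en-us'f%C3%BC."""
    charset, _language, encoded = decode_rfc2231(value)
    # decode_rfc2231 only splits the value; the percent-escapes
    # still have to be undone with the declared charset.
    return unquote(encoded, encoding=charset or "ascii", errors="strict")


print(decode_rfc2047("=?UTF-8?Q?f=C3=BC?="))
print(decode_rfc2231_value("utf-8'en-us'f%C3%BC"))
```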

Preliminary testing with Google Chrome, however, seems to indicate that Chrome simply encodes non-ASCII control names to bytes using the encoding specified by the accept-charset attribute of the <form> element (or the character set of the document, if no accept-charset is specified). Furthermore, I have not found a way to determine the charset of the encoding from the HTTP request headers, so it appears that one needs external information to properly decode these.

Also of note, cgi.FieldStorage (which is what webob currently uses to parse multipart/form-data bodies) appears to support neither RFC2047 nor RFC2231, so it appears that implementing support for either of those will be non-trivial (see also #165).

invisibleroads · 2015-10-23T06:38:40Z

My Linux test server has been getting a lot of these kinds of requests lately...

ERROR [waitress:339][waitress] Exception when serving /�.�./�.�./�.�./�.�./winnt/win.ini
Traceback (most recent call last):
  File "xxx/waitress-0.8.10-py2.7.egg/waitress/channel.py", line 336, in service
    task.service()
  File "xxx/waitress-0.8.10-py2.7.egg/waitress/task.py", line 169, in service
    self.execute()
  File "xxx/waitress-0.8.10-py2.7.egg/waitress/task.py", line 388, in execute
    app_iter = self.channel.server.application(env, start_response)
  File "xxx/pyramid/router.py", line 242, in __call__
    response = self.invoke_subrequest(request, use_tweens=True)
  File "xxx/pyramid/router.py", line 217, in invoke_subrequest
    response = handle_request(request)
  File "xxx/pyramid/tweens.py", line 21, in excview_tween
    response = handler(request)
  File "xxx/pyramid_tm/__init__.py", line 82, in tm_tween
    reraise(*exc_info)
  File "xxx/pyramid_tm/__init__.py", line 62, in tm_tween
    t.note(request.path_info)
  File "xxx/webob/descriptors.py", line 68, in fget
    return req.encget(key, encattr=encattr)
  File "xxx/webob/request.py", line 178, in encget
    return val.decode(encoding)
  File "xxx/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc0 in position 1: invalid start byte

I'm debating whether to use @Gijutsu's sanitization approach.

willmcgugan · 2016-03-12T16:48:53Z

Are these decoding errors always indicative of a badly formatted request?

I get regular unicode decode tracebacks from request.POST, which I'm pretty sure is due to a bot trying to post garbage to my comment system. Although I'm not certain; it could be someone commenting in Chinese.

digitalresistor · 2016-03-12T21:05:10Z

It is indicative of the remote sending content in a non-UTF-8 format. Browsers send the data in the format of the page by default (UTF-8 if you set the charset for the page to UTF-8). Otherwise it's up to the browser and the user's settings, IIRC.
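
A small sketch of why the page charset matters, assuming the browser percent-encodes a field using the charset of the page: the same two-character name is submitted as different bytes, and only one of them decodes as UTF-8. The helper decodes_as_utf8 is made up for illustration.

```python
from urllib.parse import quote, unquote

name = "f\u00fc"  # the two characters "f" and "u with diaeresis"

# The same text, percent-encoded as two different page charsets would send it.
as_utf8 = quote(name, encoding="utf-8")
as_latin1 = quote(name, encoding="latin-1")
print(as_utf8, as_latin1)


def decodes_as_utf8(percent_encoded):
    """Return True if the percent-encoded text is valid UTF-8."""
    try:
        unquote(percent_encoded, encoding="utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


print(decodes_as_utf8(as_utf8), decodes_as_utf8(as_latin1))
```

Text in any language decodes cleanly as long as it was sent as UTF-8. It is the mismatch between the sender's charset and the server's assumption that triggers the error, not the presence of non-ASCII characters.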

digitalresistor · 2016-07-27T15:50:11Z

Looking over this issue again, I do think a property should be able to raise. Simply returning NoVars when clearly there are vars, just not ones we happen to like, is a bad idea. It doesn't give the programmer a chance to let the user know they did something wrong.

For .POST we may be stuck with the existing way things are done, but we can do better for .GET. Having a URLDecodeError that is raised would allow calling applications to handle it appropriately.

digitalresistor · 2016-09-28T19:55:05Z

This affects OpenStack: https://bugs.launchpad.net/neutron/+bug/1613901

Natim · 2017-04-25T12:32:46Z

We are having the same issue with some public Kinto instances running with Python 3 (refs #164). I am a bit puzzled to see that the Python 2 and Python 3 code are so different.

digitalresistor · 2017-04-25T16:38:49Z

They are different due to differences between what the WSGI environment provides and requires on Python 3 vs Python 2.

https://www.python.org/dev/peps/pep-3333/#a-note-on-string-types

This is the reason why the code is so different and why these differences exist between Python 2 and 3.
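
For readers unfamiliar with that section of the PEP: on Python 3 the server hands the application native str values that are really bytes decoded as latin-1, so the application has to re-encode them to get the original bytes back before decoding them as UTF-8. On Python 2 the values are already byte strings. A minimal sketch of the Python 3 round trip (to_wsgi_str and from_wsgi_str are made-up helper names, not webob functions):

```python
def to_wsgi_str(raw):
    """What a PEP 3333 server does on Python 3: bytes -> latin-1 str."""
    return raw.decode("latin-1")


def from_wsgi_str(value, encoding="utf-8", errors="strict"):
    """Recover the intended text from a PEP 3333 native string."""
    return value.encode("latin-1").decode(encoding, errors)


good = to_wsgi_str(b"/caf\xc3\xa9")
bad = to_wsgi_str(b"/\xc0.\xc0./winnt/win.ini")

print(repr(good), "->", repr(from_wsgi_str(good)))

bad_error = None
try:
    from_wsgi_str(bad)
except UnicodeDecodeError as exc:
    bad_error = exc
    print("undecodable:", exc.reason)
```

The latin-1 step never fails, since every byte maps to a code point. The failure only shows up in the second step, when the application decodes the recovered bytes.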

If you can figure out a good way to bring the two back together and have the code be similar, I am all game.

Natim · 2017-04-26T07:42:39Z

Thank you for the reference to the WSGI PEP. In my opinion, the PEP's use of latin-1 is the root of our problem here. Also, the choice of using str in both Python 2 and Python 3, when the two do not mean the same thing, is really confusing. I will try to investigate that.

seanbudd · 2019-08-20T02:56:23Z

Not an active dev here, but as a consumer facing this issue, option 1 would be ideal. Option 2 could result in unintended side effects. As for option 3, it is not hard to add some sort of middleware to your application that tests whether a request can be decoded as UTF-8 before proceeding and returns a 400 otherwise, which I assume is why this issue hasn't been addressed yet.
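
One possible shape for such a middleware is sketched below. It is only an illustration, not code from webob or Pyramid: the class name RejectBadUTF8 is made up, it assumes a Python 3 server that follows PEP 3333 (latin-1 native strings in the environ), and it only checks PATH_INFO and QUERY_STRING, not headers or the request body.

```python
from urllib.parse import unquote_to_bytes


def _is_utf8(raw):
    """Return True if the byte string is valid UTF-8."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


class RejectBadUTF8:
    """Hypothetical WSGI middleware: answer 400 for non-UTF-8 URLs."""

    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        # PATH_INFO arrives already percent-decoded, as a latin-1 str.
        path = environ.get("PATH_INFO", "").encode("latin-1")
        # QUERY_STRING is still percent-encoded, so unquote it first.
        query = unquote_to_bytes(
            environ.get("QUERY_STRING", "").encode("latin-1"))
        if not (_is_utf8(path) and _is_utf8(query)):
            body = b"Bad Request: URL is not valid UTF-8\n"
            start_response("400 Bad Request", [
                ("Content-Type", "text/plain; charset=utf-8"),
                ("Content-Length", str(len(body))),
            ])
            return [body]
        return self.app(environ, start_response)
```

It would be used by wrapping the WSGI application before handing it to the server, e.g. application = RejectBadUTF8(application).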

merwok · 2020-07-02T19:47:32Z

What are the next steps here? Can I help?

jon-betts · 2020-08-20T14:10:01Z

We are hitting this issue (mostly from pen-testing as well). The problem it causes us is that we can't ignore UnicodeDecodeErrors, but we also can't assume they come from the user and return 400 Bad Request, as there are many other potential causes.

Could I suggest catching and raising a child of UnicodeDecodeError, e.g. ParamUnicodeDecodeError? This would let callers distinguish between this specific issue and unicode errors in general whilst maintaining backwards compatibility.
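
A minimal sketch of that idea, outside webob: the decode failure is caught at the point where request parameters are parsed and re-raised as a subclass. ParamUnicodeDecodeError is the name suggested above; parse_params is a hypothetical helper standing in for wherever webob does the decoding.

```python
from urllib.parse import parse_qsl


class ParamUnicodeDecodeError(UnicodeDecodeError):
    """Hypothetical subclass marking a decode failure in request params."""


def parse_params(query_string):
    """Parse a query string, re-raising decode errors as the subclass."""
    try:
        return parse_qsl(query_string, keep_blank_values=True,
                         encoding="utf-8", errors="strict")
    except UnicodeDecodeError as exc:
        # UnicodeDecodeError takes exactly these five arguments.
        raise ParamUnicodeDecodeError(
            exc.encoding, exc.object, exc.start, exc.end, exc.reason
        ) from exc


caught = None
try:
    parse_params("f%FC=123")
except ParamUnicodeDecodeError as exc:
    caught = exc
    print("client sent undecodable params:", exc)
```

Existing handlers that catch UnicodeDecodeError still see the error, while new code can catch only the subclass and answer with a 400.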

mmerickel · 2020-08-25T15:45:55Z

Could we just define a RequestDecodeError? Or do we want a type hierarchy or multiple types? I think the issues are in headers, url path, query string, and body and we could potentially identify them all separately or we could just call it a request decode error as they all indicate a client-side issue and we pretty much want to just return a 400.

class RequestDecodeError(HTTPBadRequest, UnicodeDecodeError):
    pass

Once we decide on this API, someone just needs to pepper it around the code and add docs/tests.

merwok · 2020-08-25T15:51:39Z

One exception class + a parameter holding the source of the problem (a string that’s one of "headers", "path", "params", "query", "body") seems nice and clean to me.

digitalresistor · 2020-08-26T01:34:29Z

If someone wants to do that work, I'd accept it. @mmerickel's suggestion is the one I was working towards in my head as well.

I'd prefer it to have unique exceptions with RequestDecodeError being the top-level. Not a fan of an exception which holds a parameter as a mechanism because you may want different handling if it's a header that failed vs URL params, for example, and I don't like the idea of people writing code like it's OSError/errno.

However, please do that work against the webob-ng branch I started (#390), which is py3 only, as I would prefer not to port it later. I am not likely to accept this change against Python 2 at this time.

I do plan on trying to get some work done on that PR over the coming days to get it merged to master, so that will help everyone involved.

ztane · 2020-09-01T12:10:11Z

If there is an exception hierarchy, it would still be nice to have one attribute giving the info out.
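
The two preferences are not mutually exclusive. Below is a sketch of a hierarchy with RequestDecodeError at the top and a class-level source attribute on each subclass. All subclass names, the source attribute, and the decode_part helper are hypothetical. To stay self-contained the sketch derives only from UnicodeDecodeError and leaves out the HTTPBadRequest base proposed earlier.

```python
class RequestDecodeError(UnicodeDecodeError):
    """Hypothetical top-level error for undecodable request data."""
    source = "request"


class HeaderDecodeError(RequestDecodeError):
    source = "headers"


class PathDecodeError(RequestDecodeError):
    source = "path"


class QueryDecodeError(RequestDecodeError):
    source = "query"


class BodyDecodeError(RequestDecodeError):
    source = "body"


def decode_part(raw, error_cls, encoding="utf-8"):
    """Decode raw bytes, re-raising failures as the given subclass."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError as exc:
        raise error_cls(exc.encoding, exc.object, exc.start,
                        exc.end, exc.reason) from exc


failed_source = None
try:
    decode_part(b"f\xfc", QueryDecodeError)
except RequestDecodeError as exc:
    failed_source = exc.source
    print("400 Bad Request: could not decode", exc.source)
```

Callers that want different handling per source can catch the specific subclass, while a generic handler can catch RequestDecodeError and read source for logging.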

digitalresistor added this to the Version 1.5 milestone Mar 23, 2015

dairiki mentioned this issue Jun 3, 2015

UnicodeDecodeError within webob when parsing query strings with invalid URL encoded parameters #195

Closed

digitalresistor removed this from the Version 1.5 milestone Sep 6, 2015

Gijutsu mentioned this issue Oct 2, 2015

Sanitize user input and take care of illegal UTF-8, which is not properly handled in webob SUNET/eduid-dashboard#44

Merged

digitalresistor mentioned this issue Oct 28, 2015

latin encoded urls raise 500 Pylons/pyramid#2047

Closed

rmichaelis mentioned this issue Apr 15, 2016

Encoding issue Geoportail-Luxembourg/geoportailv3#1208

Closed

mmerickel mentioned this issue Jul 26, 2016

Adding an invalid UTF-8 sequence in the URL make the server to crash. Pylons/pyramid#2725

Closed

digitalresistor mentioned this issue Jul 27, 2016

encget fails with invalid parameter. #268

Closed

digitalresistor mentioned this issue Jul 31, 2016

url quote/unquote with python 3 broken #164

Closed

digitalresistor mentioned this issue Nov 20, 2016

httpexception rendering crashes if wsgi environ contains non-utf8 bytestrings Pylons/pyramid#1374

Open

leplatrem mentioned this issue Dec 17, 2016

Fix crash on redirection when path contains unicode characters Kinto/kinto#982

Merged

Natim mentioned this issue Apr 25, 2017

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc1 in position 10: invalid start byte Kinto/kinto#1195

Closed

mmerickel mentioned this issue Oct 24, 2018

UnicodeDecodeError on request.localizer with bogus GET params Pylons/pyramid#3399

Closed

lyzadanger mentioned this issue Jan 9, 2019

UnicodeDecodeError: 'utf8' codec can't decode bytes... in search API hypothesis/h#5312

Closed

snarfed mentioned this issue Jan 20, 2020

Handle UnicodeDecodeError on bad UTF-8 URLs GoogleCloudPlatform/webapp2#152

Open

jon-betts mentioned this issue Aug 20, 2020

Any code which uses requests.GET can be made to crash with malformed query params hypothesis/h#6157

Closed

phillbaker added a commit to phillbaker/routes that referenced this issue Aug 28, 2020

Add graceful fallback for invalid character encoding

017f598

For context see Pylons/webob#161

phillbaker mentioned this issue Aug 28, 2020

Add graceful fallback for invalid character encoding bbangert/routes#94

Merged

Kami mentioned this issue Mar 13, 2021

Cannot view rule with UTF-8 character in name StackStorm/st2#5188

Closed

reimannf added a commit to sapcc/openstack-watcher-middleware that referenced this issue Nov 10, 2022

req.path fails on not proper encoded URLs

016e750

See Pylons/webob#161 Recognized with Swift, when not proper encoded object names causing a HTTP 500 error

reimannf added a commit to sapcc/openstack-watcher-middleware that referenced this issue Nov 10, 2022

req.path fails on not proper encoded URLs

0a8d26d

See Pylons/webob#161 Recognized with Swift, when not proper encoded object names causing a HTTP 500 error

reimannf mentioned this issue Nov 14, 2022

req.path fails on not proper encoded URLs sapcc/openstack-watcher-middleware#9

Merged
