Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GET and POST behavior w.r.t. utf-8 decoding errors #161

Open
dairiki opened this issue Sep 30, 2014 · 20 comments
Open

GET and POST behavior w.r.t. utf-8 decoding errors #161

dairiki opened this issue Sep 30, 2014 · 20 comments

Comments

@dairiki
Copy link
Contributor

dairiki commented Sep 30, 2014

The way things stand

Current behavior on badly encoded GET and POST params is

Request.GET raises UnicodeDecodeError:

>>> from webob import Request
>>> req = Request.blank('/?f%FC=123')
>>> req.GET
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/srv/w/users/dairiki/git/github/webob/webob/request.py", line 838, in GET
    vars = GetDict(data, env)
  File "/srv/w/users/dairiki/git/github/webob/webob/multidict.py", line 287, in __init__
    MultiDict.__init__(self, data)
  File "/srv/w/users/dairiki/git/github/webob/webob/multidict.py", line 38, in __init__
    items = list(args[0])
  File "/srv/w/users/dairiki/git/github/webob/webob/compat.py", line 113, in parse_qsl_text
    yield (name.decode(encoding), value.decode(encoding))
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfc in position 1: invalid start byte

Request.POST essentially does the utf-8 decoding with errors='replace':

>>> from io import BytesIO
>>> from webob import Request
>>> body = b'f\xfc=bar'
>>> environ = {
...     'wsgi.input': BytesIO(body),
...     'REQUEST_METHOD': 'POST',
...     'CONTENT_LENGTH': len(body),
... }
>>> Request(environ).POST
MultiDict([('f�', 'bar')])

Behavior with Content-Type: multipart/form-data is similar. [Edit: actually if the bad bytes are in the body of one of the subparts then UnicodeDecodeError is raised. If the bad bytes are in the headers of the subparts, they are decoded with errors='replace'.]

Gripes

  1. Having request.GET raise UnicodeDecodeError is inconvenient. If one doesn't want way too many entries in ones exception log when a pentester is set loose on ones site, one must check for errors from every request.GET (or request.params). Also it seems, IMO, bad form — or unexpected, at least — for a property to raise aUnicodeDecodeError.
  2. This is inconsistent. Request.GET and request.POST should (IMO) behave similarly w.r.t. how they handle improperly encoded characters.

Possible Solutions

  1. Change request.GET and request.POST so that they return a NoVars instance on parameter decode errors. The reason attribute of the return value would describe the decoding error. (This is similar to how it is suggested to handle non multipart/form-data bodies in #149.)
  2. Change Request.GET to use errors='replace' semantics when decoding (so that it no longer raises UnicodeDecodeErrors and matches the behavior of Request.POST.
  3. Screw it! No API changes.

At the moment, I vote for option #1.

(I'm not quite sure, however, how easy it will be to implement for POST under python 2. Py3k's cgi.FieldStorage has an explicit errors parameter to control how character decoding errors are handled. Python 2's FieldStorage appears to lack this control.)


If there is consensus on what needs doing, I’d be happy to (attempt to) come up with a PR.

@digitalresistor digitalresistor added this to the Version 1.5 milestone Mar 23, 2015
@digitalresistor
Copy link
Member

I'd be happy with a PR against master for solution number 1. There are some changes that were made to fix #149 so I wonder how that changes things with regards to your possible solution.

@dairiki
Copy link
Contributor Author

dairiki commented Jun 10, 2015

In comments on #198, I wrote

I’m working on a pull request which will have request.GET return a webob.multidict.NoVars instance when QUERY_STRING is mis-encoded.
(this is solution 1, from above)

To which, @mmerickel responded

This will likely not be accepted in favor of something like #115 (which you could definitely contribute to).

Moving discussion here, since there is a bit more context here.

@dairiki
Copy link
Contributor Author

dairiki commented Jun 10, 2015

@mmerickel Okay, I was proceeding base on the comment from @bertjwregeer, above — I'll abort for now, until there is a clear consensus.

@digitalresistor
Copy link
Member

I think that #115 is still important, but I am not sure that accessing the request.GET should raise an error. I would love to have @mmerickel's input on this, to see what he thinks.

@dairiki
Copy link
Contributor Author

dairiki commented Jun 10, 2015

As a bit of an aside, looking at the charset decoding of multipart/form-data bodies was giving me a headache anyway, particularly with respect to non-ascii control names. I'm including what I think I've figured out about this here for posterity...

RFC2388 (which describes multipart/form-data) seems to say that RFC2047 (e.g. =?UTF-8?Q?Foo?=) should be used to encode non-ascii names. Later it says that RFC2231 (e.g. filename*=utf-8'en-us'Foo) should be used to encode non-ascii filenames. (This doesn't make much sense to me, since control names and filenames are both attributes on the Content-Disposition header — why shouldn't they both be encoded using the same mechanism?)

Preliminary testing with google chrome, however, seems to indicate the chrome simply encodes non-ascii control names to bytes using the encoding specified by the accept-charset attribute of the <form> element (or the character set of the document, if no accept-charset is specified.) Furthermore, I have not found a way to determine the charset of the encoding from the HTTP request headers (so it appears that one needs external information to properly decode these.)

Also of note, cgi.FieldStorage (which is what webob currently uses to parse multipart/form-data bodies) appears to support neither RFC2047 nor RFC2231, so it appears that implementing support for either of those will be non-trivial (see also #165).

@invisibleroads
Copy link

My linux test server has been getting a lot of these kinds of requests lately...

ERROR [waitress:339][waitress] Exception when serving /�.�./�.�./�.�./�.�./winnt/win.ini
Traceback (most recent call last):
File "xxx/waitress-0.8.10-py2.7.egg/waitress/channel.py", line 336, in service
    task.service()
File "xxx/waitress-0.8.10-py2.7.egg/waitress/task.py", line 169, in service
    self.execute()
File "xxx/waitress-0.8.10-py2.7.egg/waitress/task.py", line 388, in execute
    app_iter = self.channel.server.application(env, start_response)
File "xxx/pyramid/router.py", line 242, in __call__
    response = self.invoke_subrequest(request, use_tweens=True)
File "xxx/pyramid/router.py", line 217, in invoke_subrequest
    response = handle_request(request)
File "xxx/pyramid/tweens.py", line 21, in excview_tween
    response = handler(request)
File "xxx/pyramid_tm/__init__.py", line 82, in tm_tween
    reraise(*exc_info)
File "xxx/pyramid_tm/__init__.py", line 62, in tm_tween
    t.note(request.path_info)
File "xxx/webob/descriptors.py", line 68, in fget
    return req.encget(key, encattr=encattr)
File "xxx/webob/request.py", line 178, in encget
    return val.decode(encoding)
File "xxx/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc0 in position 1: invalid start byte

I'm debating whether to use @Gijutsu's sanitization approach.

@willmcgugan
Copy link

Are these decoding errors always indicative of a badly formatted request?

I get regular unicode decode tracebacks from request.POST, which I'm pretty sure is due to a bot trying to post garbage to my comment system. Although I'm not certain; it could be someone commenting in Chinese.

@digitalresistor
Copy link
Member

It is indicative of the remote sending content in a non-UTF8 format. Browsers send the data in the format of the page by default (UTF-8 if you set the charset for the page to UTF-8). Otherwise its up to the browser and the users settings IIRC.

@digitalresistor
Copy link
Member

Looking over this issue again, I do think a property should be able to raise. Simply returning NoVars when clearly there are vars, just not ones we happen to like is a bad idea. It doesn't give the programmer a chance to let the user know they did something wrong.

for .POST we may be stuck with the existing way things are done, but we can do better for .GET. Having a URLDecodeError that is raised would allow calling applications to handle it appropriately.

@digitalresistor
Copy link
Member

This affects OpenStack: https://bugs.launchpad.net/neutron/+bug/1613901

@Natim
Copy link

Natim commented Apr 25, 2017

We are having the same issue with some public Kinto instances running with Python3 (Refs #164) I am a bit puzzled to see that the Python2 and Python3 code are so different.

@digitalresistor
Copy link
Member

digitalresistor commented Apr 25, 2017

They are different due to differences between what the WSGI environment provides and requires on Python 3 vs Python 2.

https://www.python.org/dev/peps/pep-3333/#a-note-on-string-types

This is the reason why the code is so different and why these differences exist between Python 2 and 3.

Iff you can figure out a good way to bring the two back together and have the code be similar, I am all game.

@Natim
Copy link

Natim commented Apr 26, 2017

Thank you for the reference in the WSGI PEP. I think the pep is the root of our problem here when they talk about latin-1 in my opinion. Also the choice of using str in both Python2 and Python3 while they are not talking about the same thing is really confusing. I will try to investigate that.

@seanbudd
Copy link

Not an active dev here, but as a consumer facing this issue option 1 would be ideal. Option 2 could result in unintended side-effect. In terms of option 3, it is not hard to add some sort of middleware to your application to test if a request can be encoded in utf-8 before proceeding, and throw a 400 otherwise, which is why I assume this issue hasn't been addressed yet.

@merwok
Copy link

merwok commented Jul 2, 2020

What are the next steps here? Can I help?

@jon-betts
Copy link

jon-betts commented Aug 20, 2020

We are hitting this issue (mostly from pen-testing as well), and the problem it is causing us is we can't ignore UnicodeDecodeError's, but we also can't assume they are from the user and return 400 Bad Request either, as there are many other potential causes.

Could I suggest catching and raising a child of UnicodeDecodeError, e.g. ParamUnicodeDecodeError. This would let callers distinguish between this specific issue and unicode errors in general whilst maintaining backwards compatibility.

@mmerickel
Copy link
Member

mmerickel commented Aug 25, 2020

Could we just define a RequestDecodeError? Or do we want a type hierarchy or multiple types? I think the issues are in headers, url path, query string, and body and we could potentially identify them all separately or we could just call it a request decode error as they all indicate a client-side issue and we pretty much want to just return a 400.

class RequestDecodeError(HTTPBadRequest, UnicodeDecodeError):

One we decide on this api, someone just needs to pepper it around the code and add docs/tests.

@merwok
Copy link

merwok commented Aug 25, 2020

One exception class + a parameter holding the source of the problem (a string that’s one of `"headers", "path", "params", "query", "body") seems nice and clean to me.

@digitalresistor
Copy link
Member

digitalresistor commented Aug 26, 2020

If someone wants to do that work, I'd accept it. @mmerickel's suggestion is the one I was working towards in my head as well.

I'd prefer it to have unique exceptions with RequestDecodeError being the top-level. Not a fan of an exception which holds a parameter as a mechanism because you may want different handling if its a header that failed vs url params for example, and I don't like the idea of people writing code like it's OSError/errno.

However please do that work against the webob-ng (which is py3 only) branch I started (#390) as I would prefer not to port it later. I am not likely to accept this change against Python 2 at this time.

I do plan on trying to get some work done on that PR over the next coming days to get it merged to master, so that will help everyone involved.

@ztane
Copy link

ztane commented Sep 1, 2020

If there is exception hierarchy, it would be still nice to have one attr giving the info out.

reimannf added a commit to sapcc/openstack-watcher-middleware that referenced this issue Nov 10, 2022
See Pylons/webob#161

Recognized with Swift, when not proper encoded object names causing a HTTP 500 error
reimannf added a commit to sapcc/openstack-watcher-middleware that referenced this issue Nov 10, 2022
See Pylons/webob#161
Recognized with Swift, when not proper encoded object names causing a HTTP 500 error
auhlig added a commit to sapcc/openstack-watcher-middleware that referenced this issue Nov 14, 2022
See Pylons/webob#161
Recognized with Swift, when not proper encoded object names causing a HTTP 500 error

Co-authored-by: Arno Uhlig <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

10 participants