Fix crash on multipart/form-data post #1743

Merged
merged 3 commits into from
Mar 24, 2017

Conversation

hubo1016
Contributor

@hubo1016 hubo1016 commented Mar 23, 2017

What do these changes do?

When multipart/form-data data is posted to an aiohttp server and processed by request.post(), any field that has neither a filename nor a "Content-Type" header crashes the request at the content_type.startswith("text/") check. Many browsers and tools generate this kind of post data.
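For illustration, the failure mode can be reproduced without aiohttp. This is a minimal sketch, assuming the part's content type arrives as None when the header is missing; `decode_field` is a hypothetical helper, not aiohttp code.

```python
def decode_field(value: bytes, content_type):
    """Hypothetical helper that mirrors the check as it was before this fix."""
    if content_type.startswith('text/'):
        return value.decode('utf-8')
    return value


# A part without a Content-Type header is modelled here as content_type=None.
try:
    decode_field(b'v1', None)
except AttributeError as exc:
    error = str(exc)
    print('crash:', error)
```

The check itself raises, because None has no startswith method, so the handler never gets to decide how to treat the value.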

Are there changes in behavior for the user?

No. There may be different opinions on whether to decode the data to a Unicode string or leave it as bytes, but either should be better than crashing.

Related issue number

Checklist

  • I think the code is well written
  • Unit tests for the changes exist
  • Documentation reflects the changes
  • If you provide code modification, please add yourself to CONTRIBUTORS.txt
    • The format is <Name> <Surname>.
    • Please keep alphabetical order, the file is sorted by names.
  • Add a new entry to CHANGES.rst
    • Choose any open position to avoid merge conflicts with other PRs.
    • Add a link to the issue you are fixing (if any) using #issue_number format at the end of changelog message. Use Pull Request number if there are no issues for PR or PR covers the issue only partially.

@hubo1016 hubo1016 force-pushed the 2.0 branch 3 times, most recently from 1e4c3ba to f48499c Compare March 23, 2017 13:59
@@ -409,7 +409,8 @@ def post(self):
                     out.add(field.name, ff)
                 else:
                     value = yield from field.read(decode=True)
-                    if content_type.startswith('text/'):
+                    if content_type is None or \
+                            content_type.startswith('text/'):
Member

I think in the case of None, you really cannot be sure whether it is safe to decode or not. Better to leave the data as is.

Contributor Author

Yes, I was thinking about this too, but most post() use cases do not return bytes. When a user calls post(), he probably wants the same kind of value returned for both multipart and url-encoded data.

If the user cares about the raw data (bytes), he can call multipart() directly and process the post data himself.

Contributor Author

@hubo1016 hubo1016 Mar 23, 2017

When you post a form with fields like textboxes in a browser like Firefox, e.g.

<form method="post" enctype="multipart/form-data">
  <input type="hidden" name="p1" value="v1"/>
  <input type="submit"/>
</form>

The browser usually does not set Content-Type for the subparts of the post.
Files are not affected by this commit.
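To make the wire format concrete, here is a rough sketch of what such a request body can look like, parsed with the standard library `email` package instead of aiohttp. The body is hand-written for illustration (the boundary `XX` is made up, not captured from a browser).

```python
from email import message_from_bytes

# Hand-written multipart/form-data body; the part carries only a
# Content-Disposition header, no Content-Type.
body = (
    b'Content-Type: multipart/form-data; boundary=XX\r\n'
    b'\r\n'
    b'--XX\r\n'
    b'Content-Disposition: form-data; name="p1"\r\n'
    b'\r\n'
    b'v1\r\n'
    b'--XX--\r\n'
)

msg = message_from_bytes(body)
part = msg.get_payload()[0]

field_name = part.get_param('name', header='content-disposition')
raw_header = part.get('Content-Type')    # the header is absent on this part
default_type = part.get_content_type()   # the stdlib parser falls back to a default

print(field_name, raw_header, default_type)
```

A parser that reads the header directly gets nothing back for this part, which is the situation post() has to handle.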

Member

I agree with @kxepal, we should not decode data if we do not know the content type.
It would be very hard to reason about an exception if one occurs in this code.

Contributor Author

@hubo1016 hubo1016 Mar 24, 2017

It is a hard decision, and I have an open mind about this. There are three ways to handle this situation:

  1. When no Content-Type is provided, assume it is a utf-8 string
  2. When no Content-Type is provided, always keep it as bytes
  3. When no Content-Type is provided, first try to parse it as a utf-8 string (with "strict"), and when an exception occurs, return the raw bytes
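As a rough sketch, the three options for a part without a Content-Type header could look like this. The function names are hypothetical and only illustrate the return types involved.

```python
def decode_assume_utf8(raw: bytes) -> str:
    """Option 1: always treat the value as a utf-8 string (may raise)."""
    return raw.decode('utf-8')


def keep_bytes(raw: bytes) -> bytes:
    """Option 2: never decode, hand the raw bytes to the caller."""
    return raw


def decode_or_fallback(raw: bytes):
    """Option 3: try a strict utf-8 decode, fall back to the raw bytes."""
    try:
        return raw.decode('utf-8', errors='strict')
    except UnicodeDecodeError:
        return raw


text_value = 'v1'.encode('utf-8')
binary_value = b'\xff\xfe'  # not valid utf-8

print(decode_assume_utf8(text_value))
print(keep_bytes(text_value))
print(decode_or_fallback(text_value), decode_or_fallback(binary_value))
```

Option 3 never raises, but the caller then has to handle two possible return types for the same field.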

Each has its own advantages and disadvantages. Looking at the code that processes application/x-www-form-urlencoded data, it is:

            data = yield from self.read()
            if data:
                charset = self.charset or 'utf-8'
                out.extend(
                    parse_qsl(
                        data.rstrip().decode(charset),
                        encoding=charset))

Notice that this piece of code assumes the charset to be utf-8 when no charset is provided through the Content-Type header (notice that a %NN encoded character is really a byte). It always decodes the data into a string. So I suggest using the same strategy for the multipart/form-data format.
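The url-encoded behaviour described above can be tried in isolation with the standard library. This is a minimal sketch; `parse_urlencoded` is a hypothetical stand-alone function that follows the quoted snippet, with the request object replaced by plain arguments.

```python
from urllib.parse import parse_qsl


def parse_urlencoded(data: bytes, charset=None):
    """Decode an application/x-www-form-urlencoded body into (name, value) pairs."""
    charset = charset or 'utf-8'  # fall back to utf-8 when no charset was sent
    return parse_qsl(data.rstrip().decode(charset), encoding=charset)


# %C3%A9 is the utf-8 byte sequence for U+00E9, sent as percent-escapes.
pairs = parse_urlencoded(b'p1=v1&name=%C3%A9\n')
print(pairs)
```

Every value comes back as str, never bytes, even though no charset was supplied.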

As I have said, a lot of browsers do not send a Content-Type header for the sub parts of the form data; most of the time, the values are indeed encoded as utf-8. There is nothing a developer can do about this. If multipart/form-data post data is parsed into bytes, a developer is forced to check the data type of the post() result every time if he wants to accept both formats. Deciding not to decode a bytes object is easy, but the user may be surprised to see that the return types for multipart/form-data and application/x-www-form-urlencoded are so different. He would also have a hard time when some tools or browsers actually do provide the Content-Type header.

After we reach a conclusion, maybe we should add it to the documentation.

Member

> When no Content-Type is provided, always keep it as bytes

Will be fine in all cases. Browsers are just another kind of HTTP client with their own specifics.

> As I have said, a lot of browsers do not send Content-Type header for sub parts of the form data - in most times, they are indeed encoded into utf-8.

They actually do this for simple input fields, not file inputs. I'm worried about the "in most times" part of your post, but anyway, there is no reason here to give browsers any preference.

Contributor Author

Well, according to RFC 7578

4.4. Content-Type Header Field for Each Part

Each part MAY have an (optional) "Content-Type" header field, which
defaults to "text/plain". If the contents of a file are to be sent,
the file data SHOULD be labeled with an appropriate media type, if
known, or "application/octet-stream".

It really SHOULD be considered as "text/plain"... And if "text/plain" is decoded to unicode with utf-8 as the default encoding, it should be the same for content without a Content-Type header.
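A minimal sketch of that reading of RFC 7578 section 4.4, assuming a missing header shows up as None. Both function names are hypothetical, and the utf-8 default charset is an assumption carried over from the url-encoded code path, not something section 4.4 states.

```python
def effective_content_type(header_value):
    """Apply the RFC 7578 section 4.4 default when the part has no Content-Type."""
    return header_value if header_value is not None else 'text/plain'


def read_form_value(raw: bytes, header_value, charset='utf-8'):
    """Decode text parts to str; leave every other media type as bytes."""
    if effective_content_type(header_value).startswith('text/'):
        return raw.decode(charset)
    return raw


print(read_form_value(b'v1', None))
print(read_form_value(b'v1', 'text/plain'))
print(read_form_value(b'\x00\x01', 'application/octet-stream'))
```

With the default applied first, a missing header and an explicit "text/plain" take the same branch.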

I also tested the simple HTML page with Firefox, Internet Explorer and Edge; they all send the text without a Content-Type header, even when the input field contains non-ASCII characters.

Anyway, if you do not change your mind, I don't mind changing the logic to what you are proposing.

@kxepal

Contributor Author

FYI

https://tools.ietf.org/html/rfc7578#page-5

and also these chapters

5.1.2. Interpreting Forms and Creating multipart/form-data Data

Some applications of this specification will supply a character
encoding to be used for interpretation of the multipart/form-data
body. In particular, HTML 5 [W3C.REC-html5-20141028] uses

o the content of a "_charset_" field, if there is one;

o the value of an accept-charset attribute of the <form> element, if
there is one;

o the character encoding of the document containing the form, if it
is US-ASCII compatible;

o otherwise, UTF-8.

5.1.3. Parsing and Interpreting Form Data

While this specification provides guidance for the creation of
multipart/form-data, parsers and interpreters should be aware of the
variety of implementations. File systems differ as to whether and
how they normalize Unicode names, for example. The matching of form
elements to form-data parts may rely on a fuzzier match. In
particular, some multipart/form-data generators might have followed
the previous advice of [RFC2388] and used the "encoded-word" method
of encoding non-ASCII values, as described in [RFC2047]:

  encoded-word = "=?" charset "?" encoding "?" encoded-text "?="

Others have been known to follow [RFC2231], to send unencoded UTF-8,
or even to send strings encoded in the form-charset.

For this reason, interpreting multipart/form-data (even from
conforming generators) may require knowing the charset used in form
encoding in cases where the _charset_ field value or a charset
parameter of a "text/plain" Content-Type header field is not
supplied.

Member

Love RFC references! Thanks for them. I guess RFC 7578 section 4.4 instructs pretty clearly what to do in this case, so we can follow it.

@fafhrd91 are you ok with this as well?

Member

I am ok

@hubo1016 hubo1016 force-pushed the 2.0 branch 8 times, most recently from 3c618da to a6dae6d Compare March 23, 2017 15:00
@fafhrd91
Member

@hubo1016 please add yourself to contributors list

@hubo1016
Contributor Author

@fafhrd91 Done. Where and what should I add to CHANGES.rst?

@fafhrd91
Member

add it to the 2.0 branch, I am planning to release 2.0.3 today

thanks!

@fafhrd91
Member

@hubo1016 do not worry about the changes entry, I will add it

@fafhrd91 fafhrd91 merged commit a666d7f into aio-libs:2.0 Mar 24, 2017
@lock

lock bot commented Oct 29, 2019

This thread has been automatically locked since there has not been
any recent activity after it was closed. Please open a new issue for
related bugs.

If you feel like there are important points made in this discussion,
please include those excerpts in that new issue.

@lock lock bot added the outdated label Oct 29, 2019
@lock lock bot locked as resolved and limited conversation to collaborators Oct 29, 2019
4 participants