Misinterpretation of rfc2616 in response.dart #186
Comments
As you say, there is some interpretation involved when reading RFCs, so the way I see it, ambiguity is the problem here. I've just posted a write-up on #175 for a very similar issue, but let me elaborate specifically on the RFCs you quote. I agree that the first part of the quote only concerns subtypes of the "text" type.
However the next sentence says: "Data in character sets other than 'ISO-8859-1' or its subsets MUST be labeled with an appropriate charset value."
Which I interpret to say that media types that are not ISO-8859-1 text must set the charset, which is to say ISO-8859-1 is the default when no charset is set. The later RFC 7231 goes on to amend RFC 2616 with: "The default charset of ISO-8859-1 for text media types has been removed; the default is now whatever the media type definition says."
Which I frankly don't think clarifies much, since the IANA entries for e.g. text/html don't specify any default encoding, AFAICT. As also discussed in #175, JSON was later amended to use UTF-8 by default in RFC 8259, but whether that retroactively affects the IANA media type is unclear to me. When all is said and done, as you mention, behaviour when handling anything other than text subtypes is left open, as also discussed further on #175. I hope that clarifies why we're doing what we're doing.
The title of the section is "Canonicalization and Text Defaults". This section is talking specifically about media subtypes of the "text" type. Nothing in this paragraph applies to application/json. If you open the IANA definition for application/json, there are two crucial things to observe: the media type was defined by RFC 8259, and there is the note at the end of the document, which I quote here: "Note: No 'charset' parameter is defined for this registration. Adding one really has no effect on compliant recipients."
RFC 8259 talks about character encoding in section 8.1: "JSON text exchanged between systems that are not part of a closed ecosystem MUST be encoded using UTF-8 [RFC3629]."
To use
In my opinion, you can't claim the charset needs to be present when it's clear from the media type definition that it's not even an optional parameter. It is also not reasonable to default the automatic charset parsing to something that was never defined as the default for this particular media type. That is probably inconvenient for the majority of users, since most of the libraries in other languages I've tried so far behave differently (and there is even a concrete example with Postman on #175).
This was unclear to me. Do you mean the current implementation only applies to text subtypes? That doesn't seem to be the case, as application/json is also being decoded as ISO-8859-1.

I was able to solve the issue at the application level, but I don't believe that's the right place to do it. My goal here is more than solving the issue I had at hand. I also don't think the discussion is over, and therefore the issue should be reopened. As you saw from the other issue, this will be very common, because there is a large number of APIs that don't provide the charset parameter and are not required to.

I'm not fully aware of the impact, but it might be worth considering that, since this package hasn't reached a stable version (1.0) yet, this might be the perfect time to make such a change before even more applications rely on this library. One thing that I'm very sure about is that at least the comment in response.dart should be revisited.
As said, "there is some interpretation involved", and I absolutely do not mean to say my interpretation is necessarily the one and only interpretation. However, to me the central point here is what the standard itself actually pins down. Ultimately, the HTTP standard does not define how client implementations should handle anything but text subtypes.
For that reason I do believe the documentation on body is correct. I hope that makes sense.
I agree with that, but I wouldn't have to do anything in my application layer if the body was provided as is, instead of being automatically decoded as latin1.
There is nothing in RFC 2616 that supports the text above from the body documentation. RFC 2616 refers only to the case where the sender provides the charset, and most of the senders using application/json don't provide one. I'm guessing another reason for resisting this change would be: Line 27 in 5f0d557
The fact that the body is stored as a Uint8List forces a later charset decode. My knowledge is limited here, and my question is whether this is really necessary, or whether there is a way to represent it in a format that doesn't require later processing. The functions are just trying to figure out which charset should be used. Instead of doing that, we would be able to apply the charset decode ONLY in two cases:
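One way to avoid the eager decode discussed above can be sketched as follows. This is a hypothetical wrapper, not http package API (RawResponse and bodyAsString are illustrative names), assuming the caller picks the encoding explicitly instead of relying on a latin1 default:

```dart
import 'dart:convert';
import 'dart:typed_data';

// Hypothetical sketch: keep the raw bytes and decode only on demand,
// letting the caller choose the encoding instead of defaulting to latin1.
class RawResponse {
  final Uint8List bodyBytes;
  final Map<String, String> headers;

  RawResponse(this.bodyBytes, this.headers);

  // Decode lazily; UTF-8 here is the caller's choice, not a library default.
  String bodyAsString({Encoding encoding = utf8}) =>
      encoding.decode(bodyBytes);
}

void main() {
  final response = RawResponse(
    Uint8List.fromList(utf8.encode('{"name":"José"}')),
    {'content-type': 'application/json'},
  );
  // Decoded as UTF-8 because the caller asked for it; no charset guessing.
  print(response.bodyAsString());
}
```

The point of the sketch is that no decode happens at construction time, so nothing is lost if the caller only ever needs the bytes.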
The response is internally represented as raw bytes. The contentious point then is that RFC 2616 goes on to define special handling for text subtypes, and the behaviour of body follows from that. I understand that to many users having good support for JSON would be really convenient. That said, I do concede there are things we could do to try to make this easier.
I'm not proposing that, and I also don't think it's a good idea. I like to have things where they belong, and you are right that RFC 8259 might not belong there. That also means I don't believe it's a good idea to have something like bodyJSON. If no better solution can be found, however, I'd rather extract the logic of handling the encoding from the response class and allow different logic to be applied depending on the media type.
I see that, but the specification never says this should be applied as a default to ANY media type. The tricky thing here is the Uint8List representation, which you necessarily need to decode to be able to have the body as a String. How do you create a strategy that works for other media types without being a hassle for developers?

Again, in all the experience I've had so far with other languages and libraries, I've never had a case where I had to decode latin1 to utf8 in the application layer when using JSON. I would highlight particularly PHP; although you might have strong opinions on the language, the way HTTP is handled in PHP was actually a community-driven effort and it lives outside of the language. I believe the interfaces designed there might be a useful source of inspiration for this library (for example, I think a response should not have the request as a property).

I understand that the representation in bytes is due to the nature of what HTTP transports, but it's clear this library is handling charset detection and body decoding differently from the rest. I'm trying to do additional research to be able to come up with a suggestion.

PS: Did you happen to come across the observation on top of this w3c document? Maybe those could be used as reference instead?
Ah, yes, you should never have to convert from bytes to latin1 to UTF-8, especially because that is potentially lossy. Instead, decode the raw bytes directly:

```dart
import 'dart:convert';
import 'package:http/http.dart' as http;

void main() {
  var client = http.Client();
  // myUrl is whatever Uri you are fetching.
  client.get(myUrl)
      .then((response) => utf8.decode(response.bodyBytes))
      .then((bodyStr) {
    print(bodyStr);
  });
}
```

Developers wanting to use yet other media types will likewise have to implement their own application-specific handling outside of the library.
I agree with several points in the conversation:
I checked how Python requests works here, and it exposes several options, like the fields raw and text, and functions like json(). json() is just a wrapper and may fail (raise an exception), but it is very convenient. I think it all comes down to giving the user the tools to work effectively without going against the established standards. Compliant JSON servers do not send the charset, and compliant clients should not decode UTF-8 by default when no charset is present, as that is not written anywhere. Clients, however, could provide the tools to make this painless and avoid unnecessary verbosity.
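A similar convenience could be sketched in Dart as a small extension; jsonBody below is a hypothetical helper, not part of the http package:

```dart
import 'dart:convert';

import 'package:http/http.dart' as http;

// Hypothetical analogue of Python requests' json(): decode the raw bytes
// as UTF-8 and parse them, throwing a FormatException if either step fails.
extension JsonBody on http.Response {
  dynamic jsonBody() => jsonDecode(utf8.decode(bodyBytes));
}
```

Like requests' json(), this can throw on invalid UTF-8 or malformed JSON, which makes the failure mode explicit instead of silently mis-decoding as latin1.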
Can the issue be reopened? I suggest two paths here:
I tend to prefer the second one for the sake of convenience and a single API, but I'd be satisfied with the fairness of the first.
@cskau-g ?
I encourage you to open an issue for the improvements to the docs if you'd like to follow up on that. This issue, however, will most likely remain closed, as the original premise remains contested. To address your two suggestions above:
Thanks
It could be deprecated and removed in the next major version, and the library hasn't reached 1.0 yet. I think this library follows semantic versioning, and per its rules it is OK to break backwards compatibility at any time before 1.0. People should not be relying on API stability at 0.x.
This is again my recommended and favourite approach. If you are concerned about the technical effort, it is similar, but there is a big conceptual difference between extracting the behaviour related to each media type into a separate class and adding thousands of bodyX methods to this file. Anyway, I propose to make the necessary changes myself. I'm just asking beforehand so I don't waste my time, so it would be nice to get an agreement from you.
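For concreteness, the extraction could look roughly like this. It is only a minimal sketch with hypothetical names (BodyCodec, codecFor), not an agreed design for the package:

```dart
import 'dart:convert';

// Hypothetical strategy objects: each media type supplies its own default
// encoding instead of the response class hard-coding latin1.
abstract class BodyCodec {
  Encoding encodingFor(String? charset);
}

class TextBodyCodec implements BodyCodec {
  // RFC 2616 3.7.1: text/* defaults to ISO-8859-1 when no charset is given.
  @override
  Encoding encodingFor(String? charset) =>
      (charset == null ? null : Encoding.getByName(charset)) ?? latin1;
}

class JsonBodyCodec implements BodyCodec {
  // RFC 8259: JSON is UTF-8 and defines no charset parameter.
  @override
  Encoding encodingFor(String? charset) => utf8;
}

// Resolve a codec from the media type; unknown types could expose bytes only.
BodyCodec codecFor(String mediaType) =>
    mediaType == 'application/json' ? JsonBodyCodec() : TextBodyCodec();
```

Keeping the per-media-type rules in their own classes would let the latin1 rule stay exactly where RFC 2616 puts it, text subtypes, without leaking into application/json.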
This fixes the Unicode encoding issue caused by Dart's broken unicode support (dart-lang/http#186)
Dart http is annoyingly defaulting to ISO-8859-1 when the charset parameter is not present in the Content-Type header (Content-Type: application/json). The explanation is provided in response.dart.

The section 3.7.1 Canonicalization and Text Defaults talks specifically about subtypes of TEXT, I quote here: "When no explicit charset parameter is provided by the sender, media subtypes of the 'text' type are defined to have a default charset value of 'ISO-8859-1' when received via HTTP." Since application/json is NOT a subtype of text, the above rule does NOT apply.

From section 3.7 Media Types it follows that each type must define whether a parameter like charset is required, and to know about application/json we need to check the IANA definition, which can be found here.

As there is no other rule in RFC 2616 that deals particularly with application subtypes, and the media type definition does not even mention charset among the optional parameters (only in the note at the end), I think relying on it being present is a really bad idea. Most HTTP libraries actually default to UTF-8, and most of the communication that happens using this media type uses UTF-8, so a saner option could be to default to UTF-8 as well.