UTF8JsonGenerator writes supplementary characters as a surrogate pair -- should use 4-byte encoding #223
Comments
Yes. Unfortunately this is how the JSON specification mandates escaping of these characters: "To escape a code point that is not in the Basic Multilingual Plane, the character is represented as a twelve character sequence, encoding the UTF-16 surrogate pair. So for example, a string containing only the G clef character (U+1D11E) may be represented as "\uD834\uDD1E"." (Section 9, "String", http://www.ecma-international.org/publications/files/ECMA-ST/ECMA-404.pdf -- same as what earlier JSON specifications have said.) So although a native UTF-8 representation would use a 4-byte sequence (and one that I personally agree would be the obvious correct choice), my understanding is that the JSON specification requires different handling. If there are other interpretations or specifications wrt this issue I would be interested in those. I would be open to addition of
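For reference, the twelve-character escape quoted from the spec can be derived with plain JDK calls. This is only a sketch to make the quoted example concrete; the class and method names (SurrogateEscape, escape) are made up for illustration and are not part of Jackson.

```java
final class SurrogateEscape {
    // Builds the twelve-character JSON escape (two 6-character escapes)
    // for a code point outside the Basic Multilingual Plane.
    static String escape(int codePoint) {
        if (!Character.isSupplementaryCodePoint(codePoint)) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        char high = Character.highSurrogate(codePoint); // first half of the UTF-16 pair
        char low = Character.lowSurrogate(codePoint);   // second half of the UTF-16 pair
        return String.format("\\u%04X\\u%04X", (int) high, (int) low);
    }

    public static void main(String[] args) {
        // G clef, U+1D11E, the example used in ECMA-404
        System.out.println(escape(0x1D11E));
    }
}
```

For U+1D11E this prints \uD834\uDD1E, matching the spec's example.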
True, on closer reading that is how the spec requires them to be escaped if you choose to escape them, but supplementary characters are not required to be escaped so an option to control it would be reasonable. |
From that same section of the spec:
|
Hmmh. Ok, fair enough. I am not a fan of using somewhat broken escaping anyway... Thank you for reporting this! |
That sounds perfectly fair. Both forms would parse to the same result for a spec-compliant parser so it's definitely not a critical bug. |
Quick note: hoping to fix this before 2.7.0-rc1 goes out, which is to happen soon (ideally within a week or so but we'll see). |
Ok. Turns out that the fix is not quite as easy as I had hoped. I forgot that the thing that makes this complex is the requirement to have access to 2 chars instead of a single one; that requires propagating the input as well as a return value to indicate an "extra" character getting consumed.
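To illustrate the shape of the problem described above (not the actual UTF8JsonGenerator code, which is structured differently), here is a minimal sketch of a char-to-UTF-8 encoder. The per-character step needs to see the input array so it can look one char ahead, and it returns how many chars it consumed. All names here (Utf8Sketch, encodeAt) are hypothetical, and writing '?' for an unpaired surrogate is just an assumption made for this sketch.

```java
import java.io.ByteArrayOutputStream;

final class Utf8Sketch {
    // Encodes the whole input by repeatedly encoding one code point at a time.
    static byte[] encode(char[] in) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < in.length) {
            i += encodeAt(in, i, out); // advance by 1 char, or 2 for a surrogate pair
        }
        return out.toByteArray();
    }

    // Encodes the code point starting at in[pos] and returns the number of
    // chars consumed: 2 when a valid surrogate pair was combined, else 1.
    static int encodeAt(char[] in, int pos, ByteArrayOutputStream out) {
        char c = in[pos];
        if (c < 0x80) {                       // 1-byte sequence (ASCII)
            out.write(c);
            return 1;
        }
        if (c < 0x800) {                      // 2-byte sequence
            out.write(0xC0 | (c >> 6));
            out.write(0x80 | (c & 0x3F));
            return 1;
        }
        if (Character.isHighSurrogate(c) && pos + 1 < in.length
                && Character.isLowSurrogate(in[pos + 1])) {
            int cp = Character.toCodePoint(c, in[pos + 1]);
            out.write(0xF0 | (cp >> 18));     // 4-byte sequence
            out.write(0x80 | ((cp >> 12) & 0x3F));
            out.write(0x80 | ((cp >> 6) & 0x3F));
            out.write(0x80 | (cp & 0x3F));
            return 2;                         // the "extra" char was consumed
        }
        if (Character.isSurrogate(c)) {       // unpaired surrogate (sketch-only choice)
            out.write('?');
            return 1;
        }
        out.write(0xE0 | (c >> 12));          // 3-byte sequence
        out.write(0x80 | ((c >> 6) & 0x3F));
        out.write(0x80 | (c & 0x3F));
        return 1;
    }
}
```

The return value of encodeAt is the piece that a single-char escaping method cannot express, which is why the change has to touch the calling loops as well.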
Sort of related, #307. |
As already discussed above, the ECMA specification allows (but does not mandate) using \uHHHH escaping for Unicode characters (including ones that are represented with surrogate pairs in UTF-16). Note that using \uHHHH, though correct and valid, has 2 big deficiencies:
To me the \uHHHH support in JSON serves to escape characters that are not representable in the encoding being used (say ASCII). This has no meaningful use now that JSON should be UTF-8 and SHALL be UTF-8/16/32 according to the latest RFC, http://www.rfc-editor.org/rfc/rfc7159.txt. All that said, I do think the option to use \uHHHH is required, and this option must default to not using such escaping.
Hi, |
@fiserro I haven't had time to work on this and have nothing planned. |
hi @cowtowncoder , |
@rkinabhi It is something that would be nice to resolve but I am not actively working on it at this point (I try to add |
To print emoji while using ByteOutputStream:
|
That's all very well for values you completely control, but you would need to manually take care of escaping everything else apart from the supplementary characters according to JSON rules ( |
I fixed what you pointed out about the JSON rules and added JSON rule processing:
|
Fixed via #1335, although it needs to be explicitly enabled. Will default to enabled when merged in 3.0.
Should this be closed? |
@pjfanning yes, thanks! |
When outputting a string value containing a supplementary Unicode code point, UTF8JsonGenerator encodes the supplementary character as a pair of \uNNNN escapes representing the two halves of the surrogate pair that would denote the code point in UTF-16, instead of using the correct multi-byte UTF-8 encoding of the character. The following Groovy script demonstrates the behaviour:

When generating to a Writer rather than an OutputStream (and letting Java handle the UTF-8 byte conversion), the supplementary character U+1F602 is encoded as the correct UTF-8 four-byte sequence f0 9f 98 82.
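The Writer half of that comparison can be checked without Jackson at all. The sketch below uses only the JDK and shows the bytes Java's own UTF-8 encoder produces for U+1F602. The class and method names (WriterUtf8Demo, hexOfUtf8) are made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

final class WriterUtf8Demo {
    // Writes text through a UTF-8 Writer and returns the resulting bytes as
    // space-separated lowercase hex.
    static String hexOfUtf8(String text) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(bytes, StandardCharsets.UTF_8)) {
            w.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e); // not expected for an in-memory stream
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : bytes.toByteArray()) {
            if (hex.length() > 0) {
                hex.append(' ');
            }
            hex.append(String.format("%02x", b & 0xFF));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        String emoji = new String(Character.toChars(0x1F602));
        System.out.println(hexOfUtf8(emoji)); // f0 9f 98 82
    }
}
```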