UTF8JsonGenerator writes supplementary characters as a surrogate pair -- should use 4-byte encoding #223

Closed
ianroberts opened this issue Oct 11, 2015 · 19 comments

@ianroberts

When outputting a string value containing a supplementary Unicode code point, UTF8JsonGenerator encodes the supplementary character as a pair of \uNNNN escapes (the two halves of the UTF-16 surrogate pair for that code point) instead of using the correct multi-byte UTF-8 encoding of the character. The following Groovy script demonstrates the behaviour:

@Grab(group='com.fasterxml.jackson.core', module='jackson-core', version='2.6.2')
import com.fasterxml.jackson.core.JsonFactory

def factory = new JsonFactory()
def bytes1 = new ByteArrayOutputStream()
def gen1 = factory.createGenerator(bytes1) // UTF8JsonGenerator
gen1.writeStartObject()
gen1.writeStringField("test", new String(Character.toChars(0x1F602)))
gen1.writeEndObject()
gen1.close()
System.out.write(bytes1.toByteArray())
println ""
// prints {"test":"\uD83D\uDE02"}


def bytes2 = new ByteArrayOutputStream()
new OutputStreamWriter(bytes2, "UTF-8").withWriter { w ->
  def gen2 = factory.createGenerator(w) // WriterBasedJsonGenerator
  gen2.writeStartObject()
  gen2.writeStringField("test", new String(Character.toChars(0x1F602)))
  gen2.writeEndObject()
  gen2.close()
}
System.out.write(bytes2.toByteArray())
println ""
// prints {"test":"😂"}

When generating to a Writer rather than an OutputStream (and letting Java handle the UTF-8 byte conversion), the supplementary character U+1F602 is encoded as the correct four-byte UTF-8 sequence f0 9f 98 82.
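
For reference, a quick plain-Java check (not part of the original report) confirms that U+1F602 does encode to the four-byte UTF-8 sequence f0 9f 98 82:

import java.nio.charset.StandardCharsets;

public class Utf8ByteCheck {
    public static void main(String[] args) {
        // Encode the single code point U+1F602 with the JDK's UTF-8 encoder and dump the bytes
        String emoji = new String(Character.toChars(0x1F602));
        StringBuilder hex = new StringBuilder();
        for (byte b : emoji.getBytes(StandardCharsets.UTF_8)) {
            hex.append(String.format("%02x ", b));
        }
        System.out.println(hex.toString().trim()); // prints: f0 9f 98 82
    }
}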

@cowtowncoder
Member

Yes. Unfortunately this is how the JSON specification mandates escaping of these characters:

"To escape a code point that is not in the Basic Multilingual Plane, the character is represented as a twelve character sequence, encoding the UTF-16 surrogate pair. So for example, a string containing only the G clef character (U+1D11E) may be represented as "\uD834\uDD1E". "

(Section 9, "String", http://www.ecma-international.org/publications/files/ECMA-ST/ECMA-404.pdf -- same as what earlier JSON specifications have said)

So although the native UTF-8 representation would use a 4-byte sequence (and one that I personally agree would be the obvious correct choice), my understanding is that the JSON specification requires different handling. If there are other interpretations or specifications regarding this issue, I would be interested in them.

I would be open to adding a JsonGenerator.Feature that would allow the more natural UTF-8 encoding to be used.

@ianroberts
Author

True, on closer reading that is how the spec requires them to be escaped if you choose to escape them, but supplementary characters are not required to be escaped, so an option to control this would be reasonable.

@ianroberts
Author

From that same section of the spec:

All characters may be placed within the quotation marks except for the characters that must be escaped: quotation mark (U+0022), reverse solidus (U+005C), and the control characters U+0000 to U+001F.

@cowtowncoder
Member

Hmmh. Ok, fair enough. I am not a fan of the somewhat broken escaping anyway...
So it does seem like the output could be changed.
But just in case some code out there would find the change unpalatable (it may seem unlikely, but there always tends to be some user somewhere who reports a problem), I think this needs to go in 2.7. I could then add a JsonGenerator.Feature, but default it so that native UTF-8 encoding is used unless the feature is changed to force escaping.

Thank you for reporting this!

@cowtowncoder cowtowncoder changed the title UTF8JsonGenerator writes supplementary characters as a surrogate pair (2.7) UTF8JsonGenerator writes supplementary characters as a surrogate pair -- should use 4-byte encoding Oct 12, 2015
@ianroberts
Author

That sounds perfectly fair. Both forms would parse to the same result for a spec-compliant parser so it's definitely not a critical bug.

@cowtowncoder
Member

Quick note: hoping to fix this before 2.7.0-rc1 goes out, which is to happen soon (ideally within a week or so but we'll see).

cowtowncoder added a commit that referenced this issue Nov 24, 2015
@cowtowncoder
Member

Ok. Turns out that the fix is not quite as easy as I had hoped. I forgot that what makes this complex is the requirement to have access to 2 chars instead of a single one; that in turn requires propagating the input, as well as a return value to indicate that an "extra" character was consumed.
So I added failing tests for this handling, but have not yet been able to improve the code.
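
For anyone curious, the combination step itself is simple once both chars are available; here is a minimal illustrative sketch (not Jackson's actual implementation) of turning a surrogate pair into the four-byte UTF-8 sequence, including the "two chars consumed" aspect mentioned above:

public class SurrogateToUtf8 {
    // Illustrative only: combine a high/low surrogate pair into a code point and
    // append its 4-byte UTF-8 encoding to 'out'; the caller must also remember that
    // TWO input chars were consumed, which is the awkward part inside the generator.
    static int appendSurrogatePairAsUtf8(char high, char low, byte[] out, int offset) {
        int cp = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
        out[offset]     = (byte) (0xF0 | (cp >> 18));
        out[offset + 1] = (byte) (0x80 | ((cp >> 12) & 0x3F));
        out[offset + 2] = (byte) (0x80 | ((cp >> 6) & 0x3F));
        out[offset + 3] = (byte) (0x80 | (cp & 0x3F));
        return offset + 4;
    }

    public static void main(String[] args) {
        byte[] buf = new byte[4];
        appendSurrogatePairAsUtf8('\uD83D', '\uDE02', buf, 0);
        for (byte b : buf) {
            System.out.printf("%02x ", b); // prints: f0 9f 98 82
        }
        System.out.println();
    }
}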

@cowtowncoder cowtowncoder changed the title (2.7) UTF8JsonGenerator writes supplementary characters as a surrogate pair -- should use 4-byte encoding UTF8JsonGenerator writes supplementary characters as a surrogate pair -- should use 4-byte encoding Dec 12, 2015
@cowtowncoder
Member

Sort of related, #307.

@mtsvetanov

mtsvetanov commented Nov 21, 2016

As already discussed above, the ECMA specification allows (but does not mandate) using \uHHHH escaping for Unicode characters (including ones that are represented with surrogate pairs in UTF-16).

Note that using \uHHHH, though correct and valid, has two big deficiencies:

  1. it bloats the binary size of the serialized JSON; e.g. the surrogate pair discussed here takes 12 bytes to represent as an escape, while it is actually a single Unicode code point, which requires no more than 4 bytes to represent natively in UTF-8 (a small size check follows at the end of this comment)

  2. the serialized JSON is completely unreadable, which defeats one of its main advantages

To me, \uHHHH support in JSON serves to escape characters that are not representable in the encoding used (say, ASCII). This has little meaningful use nowadays, given that JSON should be UTF-8, and SHALL be UTF-8/16/32 according to the latest RFC: http://www.rfc-editor.org/rfc/rfc7159.txt

All that said, I do think an option to use \uHHHH escaping is needed, but it must default to not escaping.
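
To make the size point (1) above concrete, a small plain-Java check (illustrative only, not part of the Jackson API):

import java.nio.charset.StandardCharsets;

public class EscapeSizeCheck {
    public static void main(String[] args) {
        String escaped = "\\uD83D\\uDE02"; // the 12-character escape sequence as written into the JSON
        String raw = new String(Character.toChars(0x1F602)); // the same code point, unescaped
        System.out.println(escaped.getBytes(StandardCharsets.UTF_8).length); // 12
        System.out.println(raw.getBytes(StandardCharsets.UTF_8).length);     // 4
    }
}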

@fiserro

fiserro commented Apr 20, 2017

Hi,
is there any chance that this issue will be fixed soon?
Robert

@cowtowncoder
Member

@fiserro I haven't had time to work on this and have nothing planned.

@abhijeethp

Hi @cowtowncoder,
Is this still a work in progress?

@cowtowncoder
Member

cowtowncoder commented Nov 12, 2019

@rkinabhi It is something that would be nice to resolve, but I am not actively working on it at this point (I try to add the active label to things I am working on).

@gymnopedy01

To print an emoji while using a ByteArrayOutputStream, it can be written with the writeRawValue(cbuf, offset, len) method.

import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonGenerator;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Wrap the emoji in quotes ourselves and write it as a raw (unescaped) value
String emoji = new String(Character.toChars(0x1F602));
char[] charArray = String.format("\"%s\"", emoji).toCharArray();
JsonFactory factory = new JsonFactory();
ByteArrayOutputStream bytes3 = new ByteArrayOutputStream();
JsonGenerator gen3 = factory.createGenerator(bytes3); // UTF8JsonGenerator
gen3.writeStartObject();
// gen3.writeStringField("test", emoji); // would be escaped as \uD83D\uDE02
gen3.writeFieldName("test");
gen3.writeRawValue(charArray, 0, charArray.length);
gen3.writeEndObject();
gen3.close();
System.out.write(bytes3.toByteArray());
System.out.println(new String(bytes3.toByteArray(), StandardCharsets.UTF_8));
// each print statement outputs {"test":"😂"}

@ianroberts
Author

To print an emoji while using a ByteArrayOutputStream, it can be written with the writeRawValue(cbuf, offset, len) method.

That's all very well for values you completely control, but you would need to manually take care of escaping everything else apart from the supplementary characters according to JSON rules (\ => \\, " => \", newline => \n, etc. etc.).

@gymnopedy01

That's all very well for values you completely control, but you would need to manually take care of escaping everything else apart from the supplementary characters according to JSON rules (\ => \\, " => \", newline => \n, etc. etc.).

I've addressed the JSON escaping rules you mentioned by adding escaping via JsonStringEncoder:

import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonGenerator;
import com.fasterxml.jackson.core.io.JsonStringEncoder;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Escape backslash, quote and newline per the JSON rules, then add the surrounding quotes
String emoji4 = new String(Character.toChars(0x1F602));
char[] cEmoji4 = JsonStringEncoder.getInstance().quoteAsString(String.format("\\\"\n{%s}", emoji4));
char[] charArray4 = new char[cEmoji4.length + 2];
System.arraycopy(cEmoji4, 0, charArray4, 1, cEmoji4.length);
charArray4[0] = '"';
charArray4[charArray4.length - 1] = '"';

JsonFactory factory = new JsonFactory();
ByteArrayOutputStream bytes4 = new ByteArrayOutputStream();
JsonGenerator gen4 = factory.createGenerator(bytes4); // UTF8JsonGenerator
gen4.writeStartObject();
gen4.writeFieldName("test");
gen4.writeRawValue(charArray4, 0, charArray4.length);
gen4.writeEndObject();
gen4.close();

System.out.write(bytes4.toByteArray());
System.out.println(new String(bytes4.toByteArray(), StandardCharsets.UTF_8));

// prints {"test":"\\\"\n{😂}"}{"test":"\\\"\n{😂}"} (once via write(), once via println())

@cowtowncoder
Member

Fixed via #1335, although it needs to be explicitly enabled with JsonWriteFeature.COMBINE_UNICODE_SURROGATES_IN_UTF8 for 2.x.

Will default to enabled when merged in 3.0.
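
For anyone picking this up on 2.18 or later, enabling the new behaviour might look roughly like the sketch below (assuming the JsonFactory.builder() API; not taken from #1335 itself):

import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonGenerator;
import com.fasterxml.jackson.core.json.JsonWriteFeature;
import java.io.ByteArrayOutputStream;

public class CombineSurrogatesDemo {
    public static void main(String[] args) throws Exception {
        // Opt in to the 2.x feature; in 3.0 this is expected to be the default
        JsonFactory factory = JsonFactory.builder()
                .enable(JsonWriteFeature.COMBINE_UNICODE_SURROGATES_IN_UTF8)
                .build();

        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        JsonGenerator gen = factory.createGenerator(bytes); // UTF8JsonGenerator
        gen.writeStartObject();
        gen.writeStringField("test", new String(Character.toChars(0x1F602)));
        gen.writeEndObject();
        gen.close();

        // With the feature enabled the emoji should come out as raw 4-byte UTF-8: {"test":"😂"}
        System.out.println(bytes.toString("UTF-8"));
    }
}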

@pjfanning
Member

Should this be closed?

@cowtowncoder cowtowncoder added this to the 2.18.0 milestone Dec 4, 2024
@cowtowncoder
Member

@pjfanning yes, thanks!
