`UTF8JsonGenerator` writes supplementary characters as a surrogate pair -- should use 4-byte encoding #223

ianroberts · 2015-10-11T20:20:48Z

When outputting a string value containing a supplementary Unicode code point, UTF8JsonGenerator encodes the character as a pair of \uNNNN escapes (the two halves of the UTF-16 surrogate pair for that code point) instead of using the correct multi-byte UTF-8 encoding of the character. The following Groovy script demonstrates the behaviour:

@Grab(group='com.fasterxml.jackson.core', module='jackson-core', version='2.6.2')
import com.fasterxml.jackson.core.JsonFactory

def factory = new JsonFactory()
def bytes1 = new ByteArrayOutputStream()
def gen1 = factory.createGenerator(bytes1) // UTF8JsonGenerator
gen1.writeStartObject()
gen1.writeStringField("test", new String(Character.toChars(0x1F602)))
gen1.writeEndObject()
gen1.close()
System.out.write(bytes1.toByteArray())
println ""
// prints {"test":"\uD83D\uDE02"}


def bytes2 = new ByteArrayOutputStream()
new OutputStreamWriter(bytes2, "UTF-8").withWriter { w ->
  def gen2 = factory.createGenerator(w) // WriterBasedJsonGenerator
  gen2.writeStartObject()
  gen2.writeStringField("test", new String(Character.toChars(0x1F602)))
  gen2.writeEndObject()
  gen2.close()
}
System.out.write(bytes2.toByteArray())
println ""
// prints {"test":"😂"}

When generating to a Writer rather than an OutputStream (and letting Java handle the UTF-8 byte conversion), the supplementary character U+1F602 is encoded as the correct four-byte UTF-8 sequence f0 9f 98 82.
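For reference, the four-byte sequence mentioned above can be derived by hand from the standard UTF-8 bit layout. The following is a minimal sketch using only the JDK; the class and method names are illustrative and are not part of Jackson.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class Utf8FourByteDemo {
    /** Encodes a supplementary code point (U+10000..U+10FFFF) as four UTF-8 bytes. */
    static byte[] encodeSupplementary(int codePoint) {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        return new byte[] {
            (byte) (0xF0 | (codePoint >> 18)),           // 11110xxx: top 3 bits
            (byte) (0x80 | ((codePoint >> 12) & 0x3F)),  // 10xxxxxx: next 6 bits
            (byte) (0x80 | ((codePoint >> 6) & 0x3F)),   // 10xxxxxx: next 6 bits
            (byte) (0x80 | (codePoint & 0x3F))           // 10xxxxxx: low 6 bits
        };
    }

    /** Formats bytes as space-separated lowercase hex. */
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] manual = encodeSupplementary(0x1F602);
        byte[] jdk = new String(Character.toChars(0x1F602)).getBytes(StandardCharsets.UTF_8);
        System.out.println(toHex(manual));              // f0 9f 98 82
        System.out.println(Arrays.equals(manual, jdk)); // true
    }
}
```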


cowtowncoder · 2015-10-12T02:18:41Z

Yes. Unfortunately this is how the JSON specification mandates escaping of these characters:

"To escape a code point that is not in the Basic Multilingual Plane, the character is represented as a twelve character sequence, encoding the UTF-16 surrogate pair. So for example, a string containing only the G clef character (U+1D11E) may be represented as "\uD834\uDD1E". "

(Section 9, "String", http://www.ecma-international.org/publications/files/ECMA-ST/ECMA-404.pdf -- same as what earlier JSON specifications have said)
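As an illustration of the quoted rule, the twelve-character escape can be computed from a code point with the usual UTF-16 surrogate arithmetic. This is a sketch with hypothetical names, not code taken from Jackson.

```java
final class SurrogateEscapeDemo {
    /** Returns the two-escape (twelve-character) JSON form of a supplementary code point. */
    static String escapeSupplementary(int codePoint) {
        if (!Character.isSupplementaryCodePoint(codePoint)) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        int v = codePoint - 0x10000;              // 20-bit offset into the supplementary planes
        char high = (char) (0xD800 + (v >> 10));  // top 10 bits -> high surrogate
        char low = (char) (0xDC00 + (v & 0x3FF)); // low 10 bits -> low surrogate
        return String.format("\\u%04X\\u%04X", (int) high, (int) low);
    }

    public static void main(String[] args) {
        // Prints the escape sequence for the G clef, U+1D11E
        System.out.println(escapeSupplementary(0x1D11E));
    }
}
```

For U+1D11E this yields the same `\uD834\uDD1E` sequence that the specification uses as its example.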

So although the native UTF-8 representation would use a 4-byte sequence (which I personally agree would be the obvious correct choice), my understanding is that the JSON specification requires different handling. If there are other interpretations or specifications wrt this issue, I would be interested in those.

I would be open to the addition of a JsonGenerator.Feature that would allow the more natural UTF-8 encoding to be used.

ianroberts · 2015-10-12T08:12:56Z

True, on closer reading that is how the spec requires them to be escaped if you choose to escape them, but supplementary characters are not required to be escaped so an option to control it would be reasonable.

ianroberts · 2015-10-12T08:15:06Z

From that same section of the spec:

> All characters may be placed within the quotation marks except for the characters that must be escaped: quotation mark (U+0022), reverse solidus (U+005C), and the control characters U+0000 to U+001F.
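The rule quoted here can be sketched as a small escaper that touches only the mandatory characters and passes everything else, including surrogate pairs, through unchanged. The names below are hypothetical and the code is a simplified illustration, not Jackson's implementation.

```java
final class MinimalJsonEscaper {
    /** True only for the characters the specification says must be escaped. */
    static boolean mustEscape(char c) {
        return c == '"' || c == '\\' || c <= 0x1F;
    }

    /** Wraps the input in quotes, escaping only what is mandatory. */
    static String quote(String s) {
        StringBuilder sb = new StringBuilder(s.length() + 2).append('"');
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (!mustEscape(c)) {
                sb.append(c); // includes both halves of any surrogate pair
                continue;
            }
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n");  break;
                case '\r': sb.append("\\r");  break;
                case '\t': sb.append("\\t");  break;
                // Remaining control characters use the generic four-hex-digit escape
                default:   sb.append(String.format("\\u%04x", (int) c));
            }
        }
        return sb.append('"').toString();
    }
}
```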

cowtowncoder · 2015-10-12T16:55:33Z

Hmmh. Ok, fair enough. I am not a fan of using somewhat broken escaping anyway...
So it does seem like output could be changed.
But just in case some code out there would find the change unpalatable (may seem unlikely, but there always tends to be some user somewhere that does report problems), I think this needs to go in 2.7. I could then add a JsonGenerator.Feature, but default it so that native UTF-8 encoding is used unless feature is changed to force escaping.

Thank you for reporting this!

ianroberts · 2015-10-12T17:10:15Z

That sounds perfectly fair. Both forms would parse to the same result for a spec-compliant parser so it's definitely not a critical bug.
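To illustrate why both forms are equivalent for a compliant parser, here is a deliberately limited sketch that decodes only backslash-u escapes (no other JSON escapes, no validation) and compares the result with the natively encoded string. The class name is hypothetical.

```java
final class UnicodeEscapeDecoder {
    /** Decodes four-hex-digit escapes in a JSON string body; other text is copied as-is. */
    static String decode(String body) {
        StringBuilder sb = new StringBuilder();
        int i = 0;
        while (i < body.length()) {
            if (body.startsWith("\\u", i) && i + 6 <= body.length()) {
                // Each escape yields one UTF-16 code unit; two of them form a surrogate pair
                sb.append((char) Integer.parseInt(body.substring(i + 2, i + 6), 16));
                i += 6;
            } else {
                sb.append(body.charAt(i));
                i++;
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String nativeForm = new String(Character.toChars(0x1F602));
        String decoded = decode("\\uD83D\\uDE02");
        System.out.println(decoded.equals(nativeForm)); // true
    }
}
```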

cowtowncoder · 2015-11-16T04:39:51Z

Quick note: hoping to fix this before 2.7.0-rc1 goes out, which is to happen soon (ideally within a week or so but we'll see).

cowtowncoder · 2015-11-24T06:51:52Z

Ok. Turns out that the fix is not quite as easy as I had hoped. I forgot that the thing that makes this complex is the requirement to have access to 2 chars instead of single one; and that requires propagation of input as well as return value to indicate an "extra" character getting consumed.
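The difficulty described here, needing to look at two chars and to report that an extra one was consumed, can be sketched as follows. This is an assumption-laden illustration with hypothetical names; it ignores JSON escaping and buffering entirely and is not how Jackson's generator is actually structured.

```java
import java.io.ByteArrayOutputStream;

final class SurrogateCombiningSketch {
    /** Encodes the whole array, advancing by however many chars each step consumed. */
    static byte[] encodeUtf8(char[] in) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < in.length) {
            i += encodeOne(in, i, out);
        }
        return out.toByteArray();
    }

    /** Encodes the character at index i and returns the number of input chars consumed (1 or 2). */
    static int encodeOne(char[] in, int i, ByteArrayOutputStream out) {
        char c = in[i];
        if (c < 0x80) {                       // 1 byte: ASCII
            out.write(c);
            return 1;
        }
        if (c < 0x800) {                      // 2 bytes
            out.write(0xC0 | (c >> 6));
            out.write(0x80 | (c & 0x3F));
            return 1;
        }
        if (!Character.isSurrogate(c)) {      // 3 bytes: rest of the BMP
            out.write(0xE0 | (c >> 12));
            out.write(0x80 | ((c >> 6) & 0x3F));
            out.write(0x80 | (c & 0x3F));
            return 1;
        }
        // A surrogate needs its partner, so the encoder must see the next char as well
        if (Character.isHighSurrogate(c) && i + 1 < in.length
                && Character.isLowSurrogate(in[i + 1])) {
            int cp = Character.toCodePoint(c, in[i + 1]);
            out.write(0xF0 | (cp >> 18));     // 4 bytes: supplementary code point
            out.write(0x80 | ((cp >> 12) & 0x3F));
            out.write(0x80 | ((cp >> 6) & 0x3F));
            out.write(0x80 | (cp & 0x3F));
            return 2;                         // signal the extra consumed char to the caller
        }
        throw new IllegalArgumentException("unpaired surrogate at index " + i);
    }
}
```

Returning the consumed count is one possible way to propagate the "extra" character; a real streaming encoder would also have to handle a pair split across buffer boundaries.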
So I added failing tests for handling, but have not been able to improve code.

cowtowncoder · 2016-08-11T05:44:54Z

Sort of related, #307.

mtsvetanov · 2016-11-21T10:33:19Z

As already discussed above, the ECMA specification allows (but does not mandate) using \uHHHH escaping for Unicode characters (including ones that are represented with surrogate pairs in UTF-16).

Note that using \uHHHH, though correct and valid has 2 big deficiencies:

- It bloats the binary size of the serialized JSON: the surrogate pair discussed here takes 12 bytes when escaped, while it is actually a single Unicode code point that requires no more than 4 bytes to be represented natively in UTF-8.
- The serialized JSON is completely unreadable, which defeats one of its main advantages.
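The size difference in the first point is easy to check with the JDK alone. A minimal sketch, with an illustrative class name:

```java
import java.nio.charset.StandardCharsets;

final class EscapeSizeDemo {
    /** Number of bytes the string occupies once encoded as UTF-8. */
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String nativeForm = new String(Character.toChars(0x1F602));
        String escapedForm = "\\uD83D\\uDE02"; // twelve ASCII characters
        System.out.println(utf8Length(nativeForm));  // 4
        System.out.println(utf8Length(escapedForm)); // 12
    }
}
```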

To me the \uHHHH support in JSON serves to escape characters that are not representable in the used encoding (say ASCII). This has no meaningful usage nowadays that JSON should be UTF-8 and SHALL be UTF-8/16/32 according to the latest RFC http://www.rfc-editor.org/rfc/rfc7159.txt

All that said, I do think the option to use \uHHHH is required and this option must default to not using such escaping.

fiserro · 2017-04-20T09:27:32Z

Hi,
is there any chance that this issue will be fixed soon?
Robert

cowtowncoder · 2017-04-20T15:13:23Z

@fiserro I haven't had time to work on this and have nothing planned.

abhijeethp · 2019-11-12T11:33:28Z

hi @cowtowncoder ,
Is this still a work in progress?

cowtowncoder · 2019-11-12T22:07:14Z

@rkinabhi It is something that would be nice to resolve but I am not actively working on it at this point (I try to add active label on things I do work on).

gymnopedy01 · 2024-01-10T08:56:54Z

To print an emoji when writing to a ByteArrayOutputStream:
It can be printed using the writeRawValue(cbuf, offset, len) function.

String emoji = new String(Character.toChars(0x1F602));
char[] charArray = String.format("\"%s\"", emoji).toCharArray();
JsonFactory factory = new JsonFactory();
ByteArrayOutputStream bytes3 = new ByteArrayOutputStream();
JsonGenerator gen3 = factory.createGenerator(bytes3); // UTF8JsonGenerator
gen3.writeStartObject();
//gen1.writeStringField("test", new String(Character.toChars(0x1F602)));
gen3.writeFieldName("test");
gen3.writeRawValue(charArray, 0, charArray.length);
gen3.writeEndObject();
gen3.close();
System.out.write(bytes3.toByteArray());
System.out.println(new String(bytes3.toByteArray()));
// prints {"test":"😂"}{"test":"😂"}

ianroberts · 2024-01-10T11:15:36Z

> It can be printed using the writeRawValue(cbuf, offset, len) function.

That's all very well for values you completely control, but you would need to manually take care of escaping everything else apart from the supplementary characters according to JSON rules (\ => \\, " => \", newline => \n, etc. etc.).

gymnopedy01 · 2024-01-11T10:45:50Z

> That's all very well for values you completely control, but you would need to manually take care of escaping everything else apart from the supplementary characters according to JSON rules (\ => \\, " => \", newline => \n, etc. etc.).

I have addressed the JSON escaping rules you mentioned.

Adding JSON rule processing:

String emoji4 = new String(Character.toChars(0x1F602));
char[] cEmoji4 = JsonStringEncoder.getInstance().quoteAsString(String.format("\\\"\n{%s}", emoji4));
char[] charArray4 = new char[cEmoji4.length + 2];
System.arraycopy(cEmoji4, 0, charArray4, 1, cEmoji4.length);
charArray4[0] = '"';
charArray4[charArray4.length - 1] = '"';

ByteArrayOutputStream bytes4 = new ByteArrayOutputStream();
JsonGenerator gen4 = factory.createGenerator(bytes4); // UTF8JsonGenerator
gen4.writeStartObject();
gen4.writeFieldName("test");
gen4.writeRawValue(charArray4, 0, charArray4.length);
gen4.writeEndObject();
gen4.close();

System.out.write(bytes4.toByteArray());
System.out.println(new String(bytes4.toByteArray()));

// prints {"test":"\\\"\n{😂}"}{"test":"\\\"\n{😂}"}

cowtowncoder · 2024-09-18T01:53:32Z

Fixed via #1335, although it needs to be explicitly enabled with JsonWriteFeature.COMBINE_UNICODE_SURROGATES_IN_UTF8 for 2.x.

Will default to enabled when merged in 3.0.

pjfanning · 2024-12-04T16:44:14Z

Should this be closed?

cowtowncoder · 2024-12-04T18:21:20Z

@pjfanning yes, thanks!

cowtowncoder changed the title ~~UTF8JsonGenerator writes supplementary characters as a surrogate pair~~ (2.7) UTF8JsonGenerator writes supplementary characters as a surrogate pair -- should use 4-byte encoding Oct 12, 2015

cowtowncoder added a commit that referenced this issue Nov 24, 2015

Add a failing test for #223

ec560f3

cowtowncoder changed the title ~~(2.7) UTF8JsonGenerator writes supplementary characters as a surrogate pair -- should use 4-byte encoding~~ UTF8JsonGenerator writes supplementary characters as a surrogate pair -- should use 4-byte encoding Dec 12, 2015

lseeker mentioned this issue Jan 15, 2016

implements Feature.ESCAPE_UTF8_SURROGATES #244

Closed

osheroff mentioned this issue Jan 25, 2017

How to properly deal with emoji / 4 byte utf8? #348

Closed

fiserro mentioned this issue Apr 21, 2017

Allow pars escaped surrogate pairs creationix/jsonparse#32

Closed

ghost mentioned this issue Aug 28, 2018

why do I get different result when use different ways to serialize emoji char. FasterXML/jackson-databind#2123

Closed

gadams00 mentioned this issue Apr 7, 2021

TreeSerialization issue with certain unicode character literals openrewrite/rewrite#405

Closed

apatrida mentioned this issue Nov 8, 2021

V5.2 fix emoji ravendb/ravendb-jvm-client#31

Closed

simonbasle mentioned this issue Jan 16, 2023

Spring escapes Emoji's on json-marshalling spring-projects/spring-framework#29819

Closed

This was referenced Sep 11, 2024

[CAMEL-21199] Camel-jackson not properly marshalling 4-byte characters apache/camel#15515

Closed

Write 4-byte characters (surrogate pairs) instead of escapes #1335

Merged

cowtowncoder changed the title ~~UTF8JsonGenerator writes supplementary characters as a surrogate pair -- should use 4-byte encoding~~ UTF8JsonGenerator writes supplementary characters as a surrogate pair -- should use 4-byte encoding Sep 17, 2024

cowtowncoder added a commit that referenced this issue Sep 21, 2024

Fix unit test regression wrt #223-related change to defaults

bd08605

cowtowncoder mentioned this issue Nov 13, 2024

Non-surrogate characters being incorrectly combined when JsonWriteFeature.COMBINE_UNICODE_SURROGATES_IN_UTF8 is enabled #1359

Closed

cowtowncoder added this to the 2.18.0 milestone Dec 4, 2024

cowtowncoder closed this as completed Dec 4, 2024
