Table: Need to encode special characters when serializing entity to XML #111

joostdenijs · 2012-07-16T21:13:48Z

Dev Estimate: 5
Test Estimate: 1

If you try to insert an entity containing a "bad" character (decimal values 0-8, 11, 12, 14-31), the XML is malformed and is rejected by the server.

Comparing what this code emits with what the .NET Table Convenience layer emits, we see:

1. Troublesome characters are encoded with &#x_; encoding, so char(0x0E) becomes &#xE;.
2. Strings with whitespace characters (like 0x0B) have the xmlns:space="preserve" attribute added to their property:

<m:properties>
  <d:Email xmlns:space="preserve">&#xB;</d:Email>
 ...
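
As a rough sketch of the encoding described in point 1 (not the SDK's or the .NET layer's actual code), the hypothetical helper below writes each "bad" control character as a hexadecimal character reference. The class and method names are assumptions made purely for illustration.

```java
// Hypothetical helper; names are illustrative, not part of the SDK.
final class ControlCharEncoder {
    // True for the "bad" decimal values listed above: 0-8, 11, 12, 14-31.
    static boolean isBadControl(char c) {
        return c < 0x20 && c != 0x09 && c != 0x0A && c != 0x0D;
    }

    // Replaces each bad control character with &#xN; (uppercase hex, no padding).
    static String encode(String value) {
        StringBuilder sb = new StringBuilder(value.length());
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            if (isBadControl(c)) {
                sb.append("&#x").append(Integer.toHexString(c).toUpperCase()).append(';');
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode("a\u000Eb")); // prints a&#xE;b
    }
}
```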

gcheng · 2012-07-29T22:36:39Z

http://blogs.msdn.com/b/gongcheng/archive/2010/03/30/how-to-encode-non-ascii-characters-in-xml.aspx

gcheng · 2012-07-30T00:27:36Z

It is actually harder than it looks like to implement :(

gcheng · 2012-07-30T17:15:15Z

fix ready!

jcookems · 2012-10-30T18:40:48Z

Found that the fix incorrectly converts characters outside of the "bad" character set: 9, 10, 13. These should be allowed to flow through unconverted.

jcookems · 2012-10-30T18:50:30Z

See https://github.com/WindowsAzure/azure-sdk-for-java/pull/143/files#r1296044

The code is too aggressive:

if (charArray[index] < 0x20 || charArray[index] > 0x7f)

The following excerpt from http://en.wikipedia.org/wiki/Valid_characters_in_XML explains that

Unicode code points in the following ranges are valid in XML 1.0 documents:[1]
U+0009, U+000A, U+000D: these are the only C0 controls accepted in XML 1.0;
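
To make the difference concrete, here is a small comparison of the quoted condition against a narrower check that only flags the C0 controls XML 1.0 rejects. Both method names are hypothetical, and the narrower check is only one possible fix, not necessarily the one that was merged.

```java
// Illustrative comparison; method names are hypothetical.
final class EscapeCondition {
    // The condition quoted above: flags every C0 control and everything above ASCII.
    static boolean aggressive(char c) {
        return c < 0x20 || c > 0x7f;
    }

    // Narrower check: only C0 controls other than tab, line feed, and carriage return.
    static boolean illegalC0Only(char c) {
        return c < 0x20 && c != 0x09 && c != 0x0A && c != 0x0D;
    }

    public static void main(String[] args) {
        char[] samples = { '\t', '\n', '\r', '\u0005', 'A', '\u4E2D' };
        for (char c : samples) {
            System.out.printf("U+%04X aggressive=%b narrow=%b%n",
                    (int) c, aggressive(c), illegalC0Only(c));
        }
    }
}
```

With the quoted condition, tab, line feed, carriage return, and every non-ASCII character such as U+4E2D are flagged; the narrower check flags only U+0005 among these samples.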

jcookems · 2012-11-15T01:46:48Z

This also breaks GB18030.

christav · 2013-02-22T01:04:49Z

The current codebase cheats. It encodes, for example, the byte 0x05 in the XML as &amp;#x5;. Note that the leading '&' character is double-encoded.

This is problematic on a couple of levels. First, the corresponding reader code doesn't decode the string, so you put in the Java literal string "\u0005", and what you end up with is the string "&#x5;"; it doesn't round-trip correctly. The second issue is that there's no guarantee that other SDKs will do the same thing and interpret the resulting strings correctly.

Having said that, having the string &#x5; in an XML document is illegal; the Java stuff will throw an exception when it sees it. It doesn't even have to be the raw byte, just a reference to that particular code point.

For now I'll leave it in place, and just restrict the range of code points to be escaped. We should consider doing the correct unescaping either as part of this WI or as a new one.
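
For reference, a reader-side unescaping step could look roughly like the sketch below. This is an assumption about what "correct unescaping" might involve, not the SDK's code; the class name, method name, and the choice to handle only hexadecimal references are all hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical reader-side helper; not the SDK's actual code.
final class CharRefDecoder {
    private static final Pattern HEX_REF = Pattern.compile("&#x([0-9A-Fa-f]{1,6});");

    // Replaces each &#xH; reference with the code point it names.
    static String decode(String value) {
        Matcher m = HEX_REF.matcher(value);
        StringBuilder sb = new StringBuilder();
        int last = 0;
        while (m.find()) {
            sb.append(value, last, m.start());
            int codePoint = Integer.parseInt(m.group(1), 16);
            if (Character.isValidCodePoint(codePoint)) {
                sb.appendCodePoint(codePoint);
            } else {
                sb.append(m.group()); // leave out-of-range references untouched
            }
            last = m.end();
        }
        sb.append(value, last, value.length());
        return sb.toString();
    }

    public static void main(String[] args) {
        String decoded = decode("a&#x5;b");
        System.out.println(decoded.equals("a\u0005b")); // prints true
    }
}
```

Note that a decoder like this cannot tell an escaped control character apart from a user string that literally contained the text &#x5;, which is one reason the scheme is fragile.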

jcookems · 2013-02-26T00:15:31Z

The code now works for a majority of the characters required for GB-18030, but still fails for Chinese Extension-B characters (which are represented with surrogate characters in UTF-16). I updated the new test with this:

Entity entity = new Entity().setPartitionKey("001").setRowKey("insertEntityEscapeCharactersWorks")
...
     .setProperty("test8", EdmType.STRING, "surrogate pair \uD840\uDC00");
...
String actualTest8 = (String) result.getEntity().getProperty("test8").getValue();
assertEquals("surrogate pair \uD840\uDC00", actualTest8);

jcookems · 2013-02-26T00:16:41Z

That uses a Chinese Ext-B character "��" (GitHub doesn't support Ext-B?), U+20000, which is represented in UTF-16 by the surrogate pair 0xD840 0xDC00. However, the test fails with:

org.junit.ComparisonFailure: 
    expected:<surrogate pair [X]> 
     but was:<surrogate pair [&#xd840;&#xdc00;]>
    at com.microsoft.windowsazure.services.table.TableServiceIntegrationTest.insertEntityEscapeCharactersWorks(TableServiceIntegrationTest.java:397)

The problem seems to stem from the difference between the Unicode code point for a character and the representation of that character in a particular encoding (UTF-16/UCS2). As @christav pointed out:

Based on the XML spec at http://www.w3.org/TR/xml/#charsets, the valid code points are:

Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

This explicitly excludes the surrogate character ranges.

Those are the Unicode code points, but Java doesn't present an easy way to get those for characters in a string. Java does allow you to get the "characters" from a string (char[] charArray = value.toCharArray();), but those "characters" are really the 16-bit code units of the string's UTF-16 representation.
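
For what it's worth, the standard library does expose code points through String.codePointAt and Character.charCount (both present since Java 5), even if the API is less convenient than toCharArray(). A minimal sketch, with a hypothetical class name:

```java
// Illustrative only: walks a string by code point instead of by char.
final class CodePointWalk {
    // Returns the code points of value in order.
    static int[] codePoints(String value) {
        int[] result = new int[value.codePointCount(0, value.length())];
        int n = 0;
        for (int i = 0; i < value.length(); ) {
            int cp = value.codePointAt(i);
            result[n++] = cp;
            i += Character.charCount(cp); // 2 for a surrogate pair, else 1
        }
        return result;
    }

    public static void main(String[] args) {
        String s = "x\uD840\uDC00";
        System.out.println(s.length());                            // 3 (UTF-16 code units)
        System.out.println(codePoints(s).length);                  // 2 (code points)
        System.out.println(Integer.toHexString(codePoints(s)[1])); // 20000
    }
}
```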

To come up with a rule that can be applied to Java char values, we need to convert the rule for code points to UTF-16 values. The rule is that if the code point S is 0x10000 or greater, then it needs to be split into a high surrogate H and a low surrogate L (see Unicode Standard 3.0, section 3.7):

H = (S - 0x10000) / 0x400 + 0xD800
L = (S - 0x10000) % 0x400 + 0xDC00
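
As a quick sanity check, the formula can be applied to the U+20000 character from the test and compared with what Character.toChars returns. The class and method names below are hypothetical.

```java
// Illustrative check of the split formula against the JDK's own conversion.
final class SurrogateSplit {
    // High (leading) surrogate for a supplementary code point s >= 0x10000.
    static char high(int s) {
        return (char) ((s - 0x10000) / 0x400 + 0xD800);
    }

    // Low (trailing) surrogate for the same code point.
    static char low(int s) {
        return (char) ((s - 0x10000) % 0x400 + 0xDC00);
    }

    public static void main(String[] args) {
        int s = 0x20000;
        char[] jdk = Character.toChars(s);
        System.out.printf("%X %X%n", (int) high(s), (int) low(s)); // D840 DC00
        System.out.println(jdk[0] == high(s) && jdk[1] == low(s));   // true
    }
}
```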

Also, it appears that there are no Unicode characters with code points in the surrogate range, because those values are reserved for surrogates.

Applying this to the validity rule from the XML spec, we find that [#x10000-#x10FFFF] maps to surrogate values in [0xD800-0xDFFF], which means that the valid chars (as UTF-16 code units) are

Char       ::=      #x9 | #xA | #xD | [#x20-#xFFFD]
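
Expressed over Java char values, the derived rule might be sketched as follows. The names are hypothetical, and this is just one way to write the check.

```java
// Sketch of the derived per-char rule; names are hypothetical.
final class Utf16XmlChars {
    // True if the UTF-16 code unit may appear unescaped under the rule above.
    static boolean isAllowed(char c) {
        return c == 0x9 || c == 0xA || c == 0xD || (c >= 0x20 && c <= 0xFFFD);
    }

    // Index of the first disallowed char, or -1 if every char passes.
    static int firstDisallowed(String value) {
        for (int i = 0; i < value.length(); i++) {
            if (!isAllowed(value.charAt(i))) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(firstDisallowed("surrogate pair \uD840\uDC00")); // -1
        System.out.println(firstDisallowed("bad\u0005"));                   // 3
        System.out.println(firstDisallowed(String.valueOf((char) 0xFFFE))); // 0
    }
}
```

One caveat: a per-char rule like this also accepts unpaired surrogates, which a stricter check on code points would reject.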

This meshes well with what I found from looking at the result of the .NET client. I created a test app to send this string:

StringBuilder sb = new StringBuilder();
// utf32: 21-bit code points from U+0 through U+10FFFF
for (int i = 0; i <= 0x10FFFF; i++) {
    if (0xD800 <= i && i <= 0xDFFF) {
        // excluding the surrogate pair range
        continue;
    }
    sb.Append(char.ConvertFromUtf32(i) + "\n");
}

I then intercepted the output with Fiddler and searched for ampersands. There were fewer than I expected:

The C0 control characters, except U+0009 HORIZONTAL TABULATION [tab] and U+000A LINE FEED, which are left unescaped.
&, <, > (the usual characters to escape)
U+FFFE, U+FFFF (the Unicode noncharacters)

This is the same as the rule derived above; the only difference is that .NET chooses to escape 0x0D, which is not necessary.

jcookems · 2013-02-27T18:21:45Z

Looks good.

ghost assigned gcheng Jul 16, 2012

gcheng closed this as completed Aug 7, 2012

jcookems reopened this Oct 30, 2012

joostdenijs pushed a commit to joostdenijs/azure-sdk-for-java that referenced this issue Jan 18, 2013

Merge pull request Azure#111 from loudej/issue-95

d39eaf5

Renaming domain data objects. fixes Azure#95 fixes Azure#108

ghost assigned christav Feb 21, 2013

christav mentioned this issue Feb 22, 2013

Table: Reduce characters that get encoded to just illegal in XML ones #258

Merged

ghost assigned jcookems Feb 25, 2013

This was referenced Feb 26, 2013

Fixed test for surrogate pair, removed encoding #259

Closed

Table encoding fixes #260

Merged

jcookems closed this as completed Feb 27, 2013

guangyang mentioned this issue Apr 5, 2013

Queue messages should be base 64 encoded Azure/azure-sdk-for-ruby#6

Closed

gcheng pushed a commit to gcheng/azure-sdk-for-java that referenced this issue May 17, 2013

Merge pull request Azure#111 from gcheng/updateProperties

8f51a94

Update properties of QueueInfo

joostdenijs unassigned jcookems Apr 1, 2014

github-actions bot locked and limited conversation to collaborators Apr 13, 2023
