Table: Need to encode special characters when serializing entity to XML #111

Closed
joostdenijs opened this issue Jul 16, 2012 · 10 comments

Comments

@joostdenijs
Contributor

Dev Estimate: 5
Test Estimate: 1

If you try to insert an entity containing a "bad" character (decimal values 0-8, 11, 12, 14-31), the XML is malformed and gets rejected by the server.

Comparing what this code emits with what the .NET Table Convenience layer emits, we see:

1. Troublesome characters are encoded with &#x_; encoding, so char(0x0E) becomes &#xE;.
2. Strings with whitespace characters (like 0x0B) have the xmlns:space="preserve" attribute added to their property:

<m:properties>
  <d:Email xmlns:space="preserve">&#xB;</d:Email>
 ...
@ghost ghost assigned gcheng Jul 16, 2012
@gcheng

gcheng commented Jul 30, 2012

It is actually harder to implement than it looks :(

@gcheng

gcheng commented Jul 30, 2012

fix ready!

@gcheng gcheng closed this as completed Aug 7, 2012
@jcookems jcookems reopened this Oct 30, 2012
@jcookems
Contributor

Found that the fix incorrectly converts characters outside of the "bad" character set: 9, 10, 13. These should be allowed to flow through unconverted.

@jcookems
Contributor

See https://github.com/WindowsAzure/azure-sdk-for-java/pull/143/files#r1296044

The code is too aggressive:

if (charArray[index] < 0x20 || charArray[index] > 0x7f)

The following excerpt from http://en.wikipedia.org/wiki/Valid_characters_in_XML explains that

Unicode code points in the following ranges are valid in XML 1.0 documents:[1]
U+0009, U+000A, U+000D: these are the only C0 controls accepted in XML 1.0;

@jcookems
Contributor

This also breaks GB18030.

joostdenijs pushed a commit to joostdenijs/azure-sdk-for-java that referenced this issue Jan 18, 2013
Renaming domain data objects. fixes Azure#95 fixes Azure#108
@ghost ghost assigned christav Feb 21, 2013
@christav
Contributor

The current codebase cheats. It encodes, for example, the byte 0x05 in the XML as &amp;#x5;. Note the leading '&' character is double-encoded.

This is problematic on a couple of levels. First, the corresponding reader code doesn't decode the string, so you put in the Java literal string "\u0005", and what you end up with is &#x5;; it doesn't round-trip correctly. Second, there's no guarantee that other SDKs will do the same thing and interpret the resulting strings correctly.

Having said that, having the string &#x5; in an XML document is illegal; the Java stuff will throw an exception when it sees it; it doesn't even have to be straight binary, just that particular code point.

For now I'll leave it in place, and just restrict the range of code points to be escaped. We should consider doing the correct unescaping either as part of this WI or as a new one.

@jcookems
Contributor

The code now works for a majority of the characters required for GB-18030, but still fails for Chinese Extension-B characters (which are represented with surrogate characters in UTF-16). I updated the new test with this:

Entity entity = new Entity().setPartitionKey("001").setRowKey("insertEntityEscapeCharactersWorks")
...
     .setProperty("test8", EdmType.STRING, "surrogate pair \uD840\uDC00");
...
String actualTest8 = (String) result.getEntity().getProperty("test8").getValue();
assertEquals("surrogate pair \uD840\uDC00", actualTest8);

@jcookems
Contributor

That uses a Chinese Ext-B character "��" (GitHub doesn't support Ext-B?), U+20000, which is represented in UTF-16 using the surrogate pair 0xD840 0xDC00. However, the test fails with:

org.junit.ComparisonFailure: 
    expected:<surrogate pair [X]> 
     but was:<surrogate pair [&#xd840;&#xdc00;]>
    at com.microsoft.windowsazure.services.table.TableServiceIntegrationTest.insertEntityEscapeCharactersWorks(TableServiceIntegrationTest.java:397)

The problem seems to stem from the difference between the Unicode code point for a character and the representation of that character in a particular encoding (UTF-16/UCS2). As @christav pointed out:

Based on the XML spec at http://www.w3.org/TR/xml/#charsets, the valid code points are:

Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

This explicitly excludes the surrogate character ranges.

Those are the Unicode code points, but Java doesn't present an easy way to get those for characters in a string. Java does allow you to get the "characters" from a string (char[] charArray = value.toCharArray();), but those "characters" are really the 16-bit code units of the UTF-16 representation of the string.
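For comparison, the standard library does offer String.codePointAt and Character.charCount (both present since Java 5), which can walk a string one code point at a time. A minimal sketch of that approach follows; the class and method names are hypothetical and are not taken from the SDK:

```java
public class CodePointWalk {
    // Hypothetical helper: returns the Unicode code points of a string,
    // combining each surrogate pair into a single code point.
    static int[] codePointsOf(String value) {
        int[] out = new int[value.codePointCount(0, value.length())];
        int n = 0;
        for (int i = 0; i < value.length(); ) {
            int cp = value.codePointAt(i);
            out[n++] = cp;
            // Advance by 1 char for BMP code points, 2 for supplementary ones.
            i += Character.charCount(cp);
        }
        return out;
    }

    public static void main(String[] args) {
        String s = "A\uD840\uDC00"; // 'A' followed by U+20000
        System.out.println(s.toCharArray().length); // 3 UTF-16 code units
        for (int cp : codePointsOf(s)) {
            System.out.println(Integer.toHexString(cp)); // 41, then 20000
        }
    }
}
```

The char array has three elements, but the string contains only two code points.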

To come up with a rule that can be applied to Java char values, we need to convert the rule for code points to UTF-16 values. The rule is that if the code point S is greater than or equal to 0x10000, then it needs to be split into a high surrogate H and a low surrogate L (see Unicode Standard 3.0, section 3.7):

H = (S - 0x10000) / 0x400 + 0xD800
L = (S - 0x10000) % 0x400 + 0xDC00 
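As a quick check of the formula, here is a small sketch that applies it to U+20000 (the character from the failing test) and compares the result with the standard Character.toChars. The class and method names are made up for illustration:

```java
import java.util.Arrays;

public class SurrogateSplit {
    // Hypothetical helper: splits a supplementary code point into its
    // UTF-16 high and low surrogates using the formula above.
    static char[] split(int s) {
        if (s < 0x10000 || s > 0x10FFFF) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        char h = (char) ((s - 0x10000) / 0x400 + 0xD800);
        char l = (char) ((s - 0x10000) % 0x400 + 0xDC00);
        return new char[] { h, l };
    }

    public static void main(String[] args) {
        char[] pair = split(0x20000);
        System.out.println(Integer.toHexString(pair[0])); // d840
        System.out.println(Integer.toHexString(pair[1])); // dc00
        // The standard library performs the same conversion.
        System.out.println(Arrays.equals(pair, Character.toChars(0x20000))); // true
    }
}
```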

Also, it appears that there are no Unicode characters with code points in the surrogate range, because those values are reserved for surrogates.

Applying this to the validity rule from the XML spec, we find that [#x10000-#x10FFFF] goes to [0xD800-0xDFFF], which means that the valid chars (in UTF-16) are

Char       ::=      #x9 | #xA | #xD | [#x20-#xFFFD]
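A rough sketch of how that per-char rule might look in Java. The class and method names are hypothetical and this is not the SDK's actual code. It only mirrors the &#x_; style observed from the .NET client; it does not handle &, < or >, and a character reference to a disallowed code point is still not legal XML 1.0, so the round-trip concern raised earlier still applies:

```java
import java.util.Locale;

public class XmlCharEscaper {
    // True if this UTF-16 code unit may appear unescaped, per the rule above.
    // Surrogates (0xD800-0xDFFF) pass, so supplementary characters survive.
    static boolean isValidXmlChar(char c) {
        return c == 0x9 || c == 0xA || c == 0xD || (c >= 0x20 && c <= 0xFFFD);
    }

    // Replaces every other code unit with an &#x..; reference
    // (an assumption modelled on the .NET output, not a confirmed format).
    static String escapeInvalid(String value) {
        StringBuilder sb = new StringBuilder(value.length());
        for (char c : value.toCharArray()) {
            if (isValidXmlChar(c)) {
                sb.append(c);
            } else {
                sb.append("&#x")
                  .append(Integer.toHexString(c).toUpperCase(Locale.ROOT))
                  .append(';');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeInvalid("a\u0005b")); // a&#x5;b
        // The surrogate pair for U+20000 is left untouched.
        System.out.println(escapeInvalid("\uD840\uDC00").equals("\uD840\uDC00")); // true
    }
}
```

Unlike the original `< 0x20 || > 0x7f` check, this leaves tab, line feed, carriage return, and everything above ASCII unconverted.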

This meshes well with what I found from looking at the result of the .NET client. I created a test app to send this string:

StringBuilder sb = new StringBuilder();
// utf32: 21-bit code points from U+0 through U+10FFFF
for (int i = 0; i <= 0x10FFFF; i++) {
    if (0xD800 <= i && i <= 0xDFFF) {
        // excluding the surrogate pair range
        continue;
    }
    sb.Append(char.ConvertFromUtf32(i) + "\n");
}

I intercepted the output with Fiddler and searched for ampersands. There were fewer than I expected:

  • &#x0; - &#x8;, &#xB; - &#x1F; (C0 control characters). The U+0009 HORIZONTAL TABULATION [tab] and U+000A LINE FEED are left unescaped.
  • &amp;, &lt;, &gt; (the usual characters to escape)
  • &#xFFFE;, &#xFFFF; (the Unicode noncharacters)

This is the same as the rule derived above; the only difference is that .NET chooses to escape 0x0D, which is not necessary.

@jcookems
Contributor

Looks good.

gcheng pushed a commit to gcheng/azure-sdk-for-java that referenced this issue May 17, 2013
@github-actions github-actions bot locked and limited conversation to collaborators Apr 13, 2023