Table: Need to encode special characters when serializing entity to XML #111
Comments
This is actually harder to implement than it looks :( |
fix ready! |
Found that the fix incorrectly converts characters outside of the "bad" character set: 9, 10, 13. These should be allowed to flow through unconverted. |
See https://github.com/WindowsAzure/azure-sdk-for-java/pull/143/files#r1296044
The code is too aggressive:
if (charArray[index] < 0x20 || charArray[index] > 0x7f)
The following excerpt from http://en.wikipedia.org/wiki/Valid_characters_in_XML explains that
|
This also breaks GB18030. |
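To make the difference concrete, here is a minimal sketch contrasting the condition quoted above with a narrower one. The class and method names (`EscapeRange`, `tooAggressive`, `needsEscaping`) are made up for illustration and are not from the SDK. The narrower check only assumes that tab (9), LF (10), CR (13) and everything at or above 0x20 should flow through unconverted.

```java
final class EscapeRange {
    // The condition quoted above: flags every C0 control character
    // and also everything above 0x7f, including all CJK text.
    static boolean tooAggressive(char c) {
        return c < 0x20 || c > 0x7f;
    }

    // Hypothetical narrower check: only C0 control characters other
    // than tab (0x09), line feed (0x0A) and carriage return (0x0D).
    static boolean needsEscaping(char c) {
        return c < 0x20 && c != 0x09 && c != 0x0A && c != 0x0D;
    }

    public static void main(String[] args) {
        char cjk = '\u4E2D'; // an ordinary Chinese character
        System.out.println(tooAggressive(cjk));        // true
        System.out.println(needsEscaping(cjk));        // false
        System.out.println(needsEscaping('\t'));       // false
        System.out.println(needsEscaping((char) 0x0E)); // true
    }
}
```

With the narrower check, non-ASCII text such as GB18030 content is left alone and only the control characters that break the XML are touched.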
The current codebase cheats. It encodes, for example, the byte This is problematic on a couple of levels. First, the corresponding reader code doesn't decode the string, so you put in the Java literal string Having said that, having the string � in an XML document is illegal; the Java stuff will throw an exception when it sees it; it doesn't even have to be straight binary, just that particular code point. For now I'll leave it in place, and just restrict the range of code points to be escaped. We should consider doing the correct unescaping either as part of this WI or as a new one. |
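The point about Java throwing can be checked locally with a small sketch. It uses the JDK's built-in `DocumentBuilder` as a stand-in parser (the server's parser is not visible from here), and `&#xE;` is just an assumed example of a numeric reference to a forbidden code point. The class name `XmlWellFormedCheck` and the method `parses` are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

final class XmlWellFormedCheck {
    // Returns true if the JDK's default XML 1.0 parser accepts the document,
    // false if it reports a parse error. The parser may log to stderr.
    static boolean parses(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // not well-formed
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Raw control character in element content.
        System.out.println(parses("<a>" + (char) 0x0E + "</a>"));
        // Numeric character reference to the same code point.
        System.out.println(parses("<a>&#xE;</a>"));
        // Tab is allowed, both raw and as a reference.
        System.out.println(parses("<a>&#x9;</a>"));
    }
}
```

If both of the first two calls are rejected, that matches the observation above: it is the code point itself that is forbidden in XML 1.0, not just its raw byte form.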
The code now works for the majority of the characters required for GB18030, but still fails for Chinese Extension-B characters (which are represented with surrogate pairs in UTF-16). I updated the new test with this:
Entity entity = new Entity().setPartitionKey("001").setRowKey("insertEntityEscapeCharactersWorks")
...
.setProperty("test8", EdmType.STRING, "surrogate pair \uD840\uDC00");
...
String actualTest8 = (String) result.getEntity().getProperty("test8").getValue();
assertEquals("surrogate pair \uD840\uDC00", actualTest8); |
That uses a Chinese Ext-B character "��" (GitHub doesn't support Ext-B?),
The problem seems to stem from the difference between the Unicode code point for a character and the representation of that character in a particular encoding (UTF-16/UCS-2). As @christav pointed out:
Those are the Unicode code points, but Java doesn't present an easy way to get those for characters in a string. Java does allow you to get the "characters" from a string, ( To come up with a rule that can be applied to Java
Also, it appears that there are no Unicode characters with code points in the surrogate range, because those are reserved for surrogates. Applying this to the validity rule from the XML spec, we find that
This meshes well with what I found from looking at the result of the .NET client. I created a test app to send this string:
StringBuilder sb = new StringBuilder();
// utf32: 21-bit code points from U+0 through U+10FFFF
for (int i = 0; i <= 0x10FFFF; i++) {
    if (0xD800 <= i && i <= 0xDFFF) {
        // excluding the surrogate pair range
        continue;
    }
    sb.Append(char.ConvertFromUtf32(i) + "\n");
}
I intercepted the output with Fiddler and searched for ampersands. There were fewer than I expected:
This is the same as the rule derived above; the only difference is that .NET chooses to escape 0x0D, which is not necessary. |
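As a rough sketch of how the validity rule from the XML spec could look in Java: walking the string by code point (via `String.codePointAt` and `Character.charCount`) sidesteps the surrogate problem, because a well-formed surrogate pair comes back as a single code point in U+10000 through U+10FFFF. The ranges below are the `Char` production from the XML 1.0 spec. The class and method names are hypothetical, and writing a `&#x` reference for rejected code points only mirrors the approach discussed in this thread, not necessarily what the SDK ends up doing. Markup characters such as `&` and `<` are deliberately out of scope here.

```java
import java.util.Locale;

final class XmlCharRule {
    // XML 1.0 Char production, expressed on Unicode code points.
    static boolean isValidXmlCodePoint(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Copies valid code points unchanged and writes the rest as &#x..;
    // An unpaired surrogate is seen as a code point in 0xD800-0xDFFF
    // and therefore treated as invalid.
    static String escapeInvalid(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (isValidXmlCodePoint(cp)) {
                sb.appendCodePoint(cp);
            } else {
                sb.append("&#x")
                  .append(Integer.toHexString(cp).toUpperCase(Locale.ROOT))
                  .append(';');
            }
            i += Character.charCount(cp); // 2 for a surrogate pair, else 1
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeInvalid("a" + (char) 0x0E + "b")); // a&#xE;b
        System.out.println(isValidXmlCodePoint(0x20000));           // true
        // The Ext-B test string from earlier in the thread is left untouched.
        String extB = "surrogate pair \uD840\uDC00";
        System.out.println(escapeInvalid(extB).equals(extB));       // true
    }
}
```

Note that 0x0D passes through unescaped here, which is the one place this differs from what the .NET client was seen to emit.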
Looks good. |
Update properties of QueueInfo
Dev Estimate: 5
Test Estimate: 1
If you try to insert an entity with a "bad" character (decimal values 0-8, 11, 12, 14-31), the XML is malformed and gets rejected by the server.
Comparing what this code emits with what the .NET Table Convenience layer emits, we see:
1. Troublesome characters are encoded with &#x_; encoding, so char(0x0E) becomes ;.
2. Strings with whitespace characters (like 0x0B) have the xml:space="preserve" attribute added to their property: