Add coverage for supplementary plane code points #3
Preemptively copied from jmespath-community/jmespath.test#3
@gibson042 Thanks a lot for that.
Please, can you also highlight the composite characters in the compliance tests?
I think JMESPath should not be in the business of normalizing strings.
I would expect compliance tests for cases such as:
'é'
- string with one codepoint (U+00E9 LATIN SMALL LETTER E WITH ACUTE) vs
'é'
- string with two codepoints (U+0065 LATIN SMALL LETTER E, U+0301 COMBINING ACUTE ACCENT)
Specifically, if we view a string as pure sequence of Unicode code points, I think we should have something like:
Given | Code Points | Expression | Result |
---|---|---|---|
"é" | U+00E9 LATIN SMALL LETTER E WITH ACUTE | length(@) | 1 |
"e◌́" | U+0065 LATIN SMALL LETTER E, U+0301 COMBINING ACUTE ACCENT | length(@) | 2 (this feels wrong 🤔) |
"𝌆" | U+1D306 TETRAGRAM FOR CENTER | length(@) | 1 |
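For reference, the pure code point view in the table can be reproduced outside JMESPath. This is a minimal Python sketch, not the behavior of any particular JMESPath implementation; the helper name `utf16_length` is hypothetical and is only there to show what an implementation counting UTF-16 code units would report for the supplementary plane case.

```python
# Python's str is a sequence of Unicode code points, so len() matches the
# "pure sequence of code points" reading of length(@) in the table.
composed = "\u00e9"         # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"      # U+0065 followed by U+0301 COMBINING ACUTE ACCENT
tetragram = "\U0001d306"    # U+1D306 TETRAGRAM FOR CENTER (supplementary plane)


def utf16_length(s: str) -> int:
    """Hypothetical helper: number of UTF-16 code units needed for s."""
    return len(s.encode("utf-16-le")) // 2


for label, value in [("composed", composed),
                     ("decomposed", decomposed),
                     ("tetragram", tetragram)]:
    print(label, "code points:", len(value), "UTF-16 units:", utf16_length(value))
```

The tetragram is one code point but two UTF-16 code units, which is the kind of discrepancy the supplementary plane tests are meant to catch.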
The downside to the second line is that reverse() will break the string by reversing the two codepoints. ☹
This seems clearly unacceptable.
What do you think?
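To make the concern concrete, here is a small Python sketch of a reversal done strictly by code point. The function name `reverse_code_points` is an assumption for illustration and is not taken from any JMESPath implementation.

```python
def reverse_code_points(s: str) -> str:
    """Reverse a string one code point at a time (no grapheme awareness)."""
    return s[::-1]


word = "ae\u0301"  # "a", then "e" followed by U+0301 COMBINING ACUTE ACCENT
flipped = reverse_code_points(word)

# The combining accent now comes first, with no base character before it,
# and the "e" it used to modify is left bare.
print([f"U+{ord(ch):04X}" for ch in flipped])
```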
As described in a later post below, I took the liberty to push a fix for precomposed characters.
This means that JMESPath now correctly handles:
Given | Code Points | Expression | Result |
---|---|---|---|
"é" | U+00E9 LATIN SMALL LETTER E WITH ACUTE | length(@) | 1 |
"e◌́" | U+0065 LATIN SMALL LETTER E, U+0301 COMBINING ACUTE ACCENT | length(@) | 1 |
"𝌆" | U+1D306 TETRAGRAM FOR CENTER | length(@) | 1 |
This also means that the reverse() function will not break the sequence of codepoints in the second line.
I uploaded fixes to the
I took the liberty to handle composite characters as a single unbreakable entity, which I feel is more intuitive to users.
JMESPath treats strings as a sequence of Unicode characters. For the purpose of this specification, a character is defined like so:
Some Unicode codepoints represent precomposed characters. A precomposed character can also be defined as a sequence of one or more codepoints and typically represents a letter with a diacritical mark. For instance, the character
These two representations are treated as an equivalent single character by JMESPath, even though their internal representations use different sequences of codepoints.
I think we also need to update the specification for the
Please, let me know what you think.
Done, as described below.
I disagree that it is unacceptable—anyone wanting normalization to either composed or decomposed forms (both of which have their independent uses) can and should apply the relevant transformations on their own, although it may be reasonable to include such normalization in the standard library. Note that not all combinations even have a composed form (e.g., U+0067 LATIN SMALL LETTER G + U+0308 COMBINING DIAERESIS "g̈" has no simpler form), and that higher order segmentation into e.g. grapheme clusters is locale-dependent (such as "ch" being a single cluster in Slovak but two in English). There are well-known hazards with reversing/sorting/etc. arbitrary Unicode strings, which can be warned about but not really avoided. In my opinion, treating strings as opaque sequences of code points is the right line to draw.
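As an illustration of normalization being applied by the caller rather than by the query language, here is a minimal Python sketch using the standard `unicodedata` module. The wrapper names `to_composed` and `to_decomposed` are hypothetical; nothing here is part of JMESPath.

```python
import unicodedata


def to_composed(s: str) -> str:
    """Normalize to NFC (canonical composition)."""
    return unicodedata.normalize("NFC", s)


def to_decomposed(s: str) -> str:
    """Normalize to NFD (canonical decomposition)."""
    return unicodedata.normalize("NFD", s)


e_acute = "e\u0301"      # has a precomposed form, U+00E9
g_diaeresis = "g\u0308"  # has no precomposed form in Unicode

print(len(to_composed(e_acute)))      # composes to a single code point
print(len(to_composed(g_diaeresis)))  # stays as two code points
print(len(to_decomposed("\u00e9")))   # decomposes to two code points
```

Normalizing to NFC therefore makes some, but not all, combining sequences count as one code point.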
But only in such special cases. I'm afraid that kind of cure is worse than the disease. 🤕
What is the locale-independent definition of "composite character", and how does it relate to the [locale-dependent] UAX #29?
I think this instead needs to align with JSON, which permits unpaired surrogates:
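One way to observe this is with Python's standard `json` module, which accepts an escaped lone surrogate and returns it as a single code point. This is only a sketch of how one JSON parser behaves; other parsers may reject or replace such input.

```python
import json

# A lone high surrogate written as a JSON escape sequence.
lone = json.loads('"\\ud800"')

# A proper surrogate pair is combined into one supplementary code point.
pair = json.loads('"\\ud834\\udf06"')

print(len(lone), hex(ord(lone)))
print(len(pair), hex(ord(pair)))

# The lone surrogate is a valid str element but cannot be encoded as UTF-8.
try:
    lone.encode("utf-8")
    encodable = True
except UnicodeEncodeError:
    encodable = False
print("encodable as UTF-8:", encodable)
```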
As explained above, I am opposed to this.
sort is already documented with "Sorting strings is based on code points. Locale is not taken into account.", which looks fine to me. Likewise for
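A short Python sketch of why "based on code points" matters for supplementary plane characters: ordering by code point and ordering by UTF-16 code unit can disagree. Both function names are hypothetical, and the UTF-16 variant is shown only for contrast, not as something the specification asks for.

```python
def sort_by_code_point(strings):
    """Python compares str values code point by code point."""
    return sorted(strings)


def sort_by_utf16_units(strings):
    """Contrast case: order by big-endian UTF-16 code units instead."""
    return sorted(strings, key=lambda s: s.encode("utf-16-be"))


halfwidth_stop = "\uff61"   # U+FF61, in the Basic Multilingual Plane
tetragram = "\U0001d306"    # U+1D306, encoded in UTF-16 as D834 DF06

print(sort_by_code_point([tetragram, halfwidth_stop]))
print(sort_by_utf16_units([halfwidth_stop, tetragram]))
```

By code point, U+FF61 sorts before U+1D306. By UTF-16 code unit, the leading surrogate 0xD834 sorts before 0xFF61, so the order flips.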
@gibson042 I understand your position and mostly agree, with one important caveat: the notion of combining characters.
However, in my mind, combining characters should be exactly that: marks that combine with the previous code point(s) in the sequence. I do not see that as being locale-dependent, as you imply. In fact, as you correctly outline, most sequences with combining characters do not have an equivalent precomposed codepoint in the Unicode standard. This leads me to believe that it would be counterintuitive to break the sequence after a
In fact, the .NET Unicode-aware classes that are used internally in JMESPath.NET do treat any sequence of a base character followed by any number of combining marks as a single "TextElement". It seems Apple's iOS software stack does the same. It seems like a common request in most programming languages.
Can you be more specific about the liabilities you would see in replicating this in JMESPath?
The more I read the section of UAX #29 you linked to, the more I think JMESPath should at least be able to handle legacy grapheme clusters. Are you really suggesting we should not even attempt to do this?
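A rough Python sketch of the "base character followed by any number of combining marks" grouping described above. This is a simplification for illustration only: it is neither the .NET "TextElement" algorithm nor UAX #29 segmentation, and the function name `combining_sequences` is hypothetical. It treats any code point whose General_Category starts with "M" as attaching to the preceding cluster.

```python
import unicodedata


def combining_sequences(s: str) -> list[str]:
    """Group each base code point with the mark code points that follow it."""
    clusters: list[str] = []
    for ch in s:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch   # attach the mark to the previous cluster
        else:
            clusters.append(ch)  # start a new cluster (also for a leading mark)
    return clusters


def reverse_by_cluster(s: str) -> str:
    """Reverse cluster by cluster so marks stay with their base."""
    return "".join(reversed(combining_sequences(s)))


print(len(combining_sequences("e\u0301")))   # one cluster
print(len(combining_sequences("g\u0308")))   # one cluster, no precomposed form needed
print(reverse_by_cluster("ae\u0301") == "e\u0301a")

# Regional indicator pairs such as U+1F1FA U+1F1F8 are not marks,
# so this simple rule still splits them into two clusters.
print(len(combining_sequences("\U0001f1fa\U0001f1f8")))
```

The last case shows that mark-based grouping alone does not cover every kind of combination.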
There are multiple kinds of combination—General_Category=Mark as in ȩ̷́̈ (with many special cases such as enclosing 1⃣ and variation selectors like ︎🍰︎), zero width joiner as in 🏳️⚧️, zero width non joiner as in র্যাঁদা, conjoining jamo as in 각, Devanagari vowels as in षि, regional indicators like 🇺🇸, not to mention defective combining character sequences. And yes, much of which is locale-sensitive.
In short: a dramatic and harmful decrease in the predictability of standard library behavior.
Yes, at least not without explicit opt-in via a separate (and possibly even configurable) interface. Comparison operators,
@gibson042 I will trust your judgement on this one 😉. Thank you for your contributions.
Fixes #2