Add coverage for supplementary plane code points #3

gibson042
Contributor

Fixes #2

gibson042 added a commit to gibson042/jmespath.js that referenced this pull request Oct 17, 2022
Contributor

@springcomp springcomp left a comment

@gibson042 Thanks a lot for that.

Please, can you also highlight the composite characters in the compliance tests?
I think JMESPath should not be in the job of normalizing strings.

I would expect compliance tests for cases such as:

'é' - string with one codepoint (U+00E9 LATIN SMALL LETTER E WITH ACUTE) vs
'é' - string with two codepoints (U+0065 LATIN SMALL LETTER E, U+0301 COMBINING ACUTE ACCENT)

Specifically, if we view a string as a pure sequence of Unicode code points, I think we should have something like:

| Given | Code Points | Expression | Result |
| --- | --- | --- | --- |
| "é" | U+00E9 LATIN SMALL LETTER E WITH ACUTE | length(@) | 1 |
| "e◌́" | U+0065 LATIN SMALL LETTER E, U+0301 COMBINING ACUTE ACCENT | length(@) | 2 (this feels wrong 🤔) |
| "𝌆" | U+1D306 TETRAGRAM FOR CENTER | length(@) | 1 |

The downside to the second line is that reverse() will break the string by reversing the two codepoints. ☹
This seems clearly unacceptable.
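
To make the table concrete, here is a minimal sketch in Python, chosen only because its `len()` and slicing already operate on code points. It is not the jmespath.js implementation, and the helper names `code_point_length` and `code_point_reverse` are hypothetical.

```python
# Illustration only: Python strings are sequences of code points,
# so len() and slicing follow the "pure code point" model.
precomposed = "\u00e9"      # é as a single code point
decomposed = "e\u0301"      # e followed by COMBINING ACUTE ACCENT
tetragram = "\U0001D306"    # 𝌆, a supplementary plane code point


def code_point_length(s: str) -> int:
    """Count code points (hypothetical stand-in for length(@))."""
    return len(s)


def code_point_reverse(s: str) -> str:
    """Reverse code point by code point (hypothetical stand-in for reverse(@))."""
    return s[::-1]


lengths = [code_point_length(s) for s in (precomposed, decomposed, tetragram)]
print(lengths)  # [1, 2, 1]

# Reversing "a" + decomposed é moves the accent to the front, away from the "e".
reversed_text = code_point_reverse("a" + decomposed)
print([f"U+{ord(c):04X}" for c in reversed_text])  # ['U+0301', 'U+0065', 'U+0061']
```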

What do you think?

As described in a later post below, I took the liberty of pushing a fix for precomposed characters.
This means that JMESPath now correctly handles:

| Given | Code Points | Expression | Result |
| --- | --- | --- | --- |
| "é" | U+00E9 LATIN SMALL LETTER E WITH ACUTE | length(@) | 1 |
| "e◌́" | U+0065 LATIN SMALL LETTER E, U+0301 COMBINING ACUTE ACCENT | length(@) | 1 |
| "𝌆" | U+1D306 TETRAGRAM FOR CENTER | length(@) | 1 |

This also means that the reverse() function will not break the sequence of codepoints in the second line.

@springcomp
Contributor

springcomp commented Oct 18, 2022

@gibson042

I uploaded fixes to the length(), reverse(), sort() and sort_by() functions to the JMESPath Community Preview page.

I took the liberty of handling composite characters as a single unbreakable entity, which I feel is more intuitive to users.
If that’s OK for you, I think we can add a paragraph to the specification to explicitly describe JMESPath’s standards-compliant behaviour which is:

JMESPath treats strings as a sequence of Unicode characters. For the purpose of this specification, a character is defined like so:

  • A single numerical code point in the range U+0000 to U+FFFF excluding the surrogate range U+D800 to U+DFFF.
  • A surrogate pair of two numerical code units (high U+D800 to U+DBFF and low U+DC00 to U+DFFF).

Some Unicode codepoints represent precomposed characters. A precomposed character can also be defined as a sequence of one or more codepoints and typically represents a letter with a diacritical mark. For instance, the character é can be defined as two equivalent representations:

  • é U+00E9 LATIN SMALL LETTER E WITH ACUTE or
  • e+◌́ U+0065 LATIN SMALL LETTER E, U+0301 COMBINING ACUTE ACCENT.

These two representations are treated as an equivalent single character by JMESPath, even though their internal representations use different sequences of codepoints.
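
One way to picture "a base character plus its combining marks as one unit" is the sketch below. It is an assumption-laden illustration, not the code pushed to the preview: `combining_sequences` and `mark_aware_reverse` are hypothetical names, and the test for "combining" uses `unicodedata.combining()` (the canonical combining class), which is zero for some marks and therefore misses them.

```python
import unicodedata


def combining_sequences(s: str) -> list[str]:
    """Group each code point with the combining marks that follow it.

    Simplification: a code point is treated as combining when its canonical
    combining class is non-zero. This is much narrower than UAX #29.
    """
    units: list[str] = []
    for ch in s:
        if units and unicodedata.combining(ch) != 0:
            units[-1] += ch      # attach the mark to the preceding unit
        else:
            units.append(ch)     # start a new unit
    return units


def mark_aware_reverse(s: str) -> str:
    """Reverse unit by unit so marks stay attached to their base."""
    return "".join(reversed(combining_sequences(s)))


print(len(combining_sequences("e\u0301")))         # 1
print(mark_aware_reverse("ae\u0301") == "e\u0301a")  # True

# The two representations of é are canonically equivalent under NFC.
print(unicodedata.normalize("NFC", "e\u0301") == "\u00e9")  # True
```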

I think we also need to update the specification for the sort() function which sorts based on the sequence of codepoints as well as for the == comparison operator.

Please, let me know what you think.

@gibson042
Contributor Author

> Please, can you also highlight the composite characters in the compliance tests? … I would expect compliance tests for cases such as:
>
> 'é' - string with one codepoint (U+00E9 LATIN SMALL LETTER E WITH ACUTE) vs 'é' - string with two codepoints (U+0065 LATIN SMALL LETTER E, U+0301 COMBINING ACUTE ACCENT)

Done, as described below.

> Specifically, if we view a string as a pure sequence of Unicode code points, I think we should have something like:
>
> | Given | Code Points | Expression | Result |
> | --- | --- | --- | --- |
> | "é" | U+00E9 LATIN SMALL LETTER E WITH ACUTE | length(@) | 1 |
> | "e◌́" | U+0065 LATIN SMALL LETTER E, U+0301 COMBINING ACUTE ACCENT | length(@) | 2 (this feels wrong 🤔) |
> | "𝌆" | U+1D306 TETRAGRAM FOR CENTER | length(@) | 1 |
>
> The downside to the second line is that reverse() will break the string by reversing the two codepoints. ☹ This seems clearly unacceptable.

I disagree that it is unacceptable—anyone wanting normalization to either composed or decomposed forms (both of which have their independent uses) can and should apply the relevant transformations on their own, although it may be reasonable to include such normalization in the standard library. Note that not all combinations even have a composed form (e.g., U+0067 LATIN SMALL LETTER G + U+0308 COMBINING DIAERESIS "g̈" has no simpler form), and that higher order segmentation into e.g. grapheme clusters is locale-dependent (such as "ch" being a single cluster in Slovak but two in English). There are well-known hazards with reversing/sorting/etc. arbitrary Unicode strings, which can be warned about but not really avoided.

In my opinion, treating strings as opaque sequences of code points is the right line to draw.
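
The point that composition cannot hide every combining sequence can be checked with a short sketch. Python's standard `unicodedata.normalize` is used here purely for illustration; `nfc_length` is a hypothetical helper, not part of JMESPath.

```python
import unicodedata


def nfc_length(s: str) -> int:
    """Code point count after NFC (canonical composition)."""
    return len(unicodedata.normalize("NFC", s))


# e + COMBINING ACUTE ACCENT composes to the single code point U+00E9.
print(nfc_length("e\u0301"))  # 1

# g + COMBINING DIAERESIS has no precomposed form, so it stays two code points.
print(nfc_length("g\u0308"))  # 2
```

So a normalization-based `length()` would report 1 for "é" but 2 for "g̈", even though both render as one letter with one mark.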

> | Given | Code Points | Expression | Result |
> | --- | --- | --- | --- |
> | "é" | U+00E9 LATIN SMALL LETTER E WITH ACUTE | length(@) | 1 |
> | "e◌́" | U+0065 LATIN SMALL LETTER E, U+0301 COMBINING ACUTE ACCENT | length(@) | 1 |
> | "𝌆" | U+1D306 TETRAGRAM FOR CENTER | length(@) | 1 |
>
> This also means that the reverse() function will not break the sequence of codepoints in the second line.

But only in such special cases. I'm afraid that kind of cure is worse than the disease. 🤕

> I took the liberty of handling composite characters as a single unbreakable entity, which I feel is more intuitive to users.

What is the locale-independent definition of "composite character", and how does it relate to the [locale-dependent] UAX #29?

> If that’s OK for you, I think we can add a paragraph to the specification to explicitly describe JMESPath’s standards-compliant behaviour which is:
>
> JMESPath treats strings as a sequence of Unicode characters. For the purpose of this specification, a character is defined like so:
>
>   • A single numerical code point in the range U+0000 to U+FFFF excluding the surrogate range U+D800 to U+DFFF.
>   • A surrogate pair of two numerical code units (high U+D800 to U+DBFF and low U+DC00 to U+DFFF).

I think this instead needs to align with JSON, which permits unpaired surrogates:
JMESPath treats strings as a sequence of Unicode code points. For the purpose of this specification and in contrast with Unicode, the term "character" refers to any arbitrary code point. Note that, as in JSON, some code points can be expressed as a surrogate pair of two UTF-16 code units (e.g., "\uD834\uDD1E" represents U+1D11E MUSICAL SYMBOL G CLEF 𝄞) and strings can also contain lone surrogate code points such as "\uDEAD".
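
As an illustration of the JSON behaviour described here, Python's standard `json` module (used only as a convenient JSON parser, not as a statement about any JMESPath implementation) combines an escaped surrogate pair into one code point and accepts a lone surrogate as a one-code-point string:

```python
import json

# An escaped surrogate pair decodes to one supplementary plane code point.
clef = json.loads(r'"\uD834\uDD1E"')
print(len(clef), f"U+{ord(clef):04X}")  # 1 U+1D11E

# A lone surrogate is accepted and kept as a single code point.
lone = json.loads(r'"\uDEAD"')
print(len(lone), f"U+{ord(lone):04X}")  # 1 U+DEAD
```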

> Some Unicode codepoints represent precomposed characters. A precomposed character can also be defined as a sequence of one or more codepoints and typically represents a letter with a diacritical mark. For instance, the character é can be defined as two equivalent representations:
>
>   • é U+00E9 LATIN SMALL LETTER E WITH ACUTE or
>   • e+◌́ U+0065 LATIN SMALL LETTER E, U+0301 COMBINING ACUTE ACCENT.
>
> These two representations are treated as an equivalent single character by JMESPath, even though their internal representations use different sequences of codepoints.

As explained above, I am opposed to this.

> I think we also need to update the specification for the sort() function which sorts based on the sequence of codepoints as well as for the == comparison operator.

sort is already documented with "Sorting strings is based on code points. Locale is not taken into account.", which looks fine to me. Likewise for ==: "A string is equal to another string if they they have the exact sequence of code points."
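
A small sketch of what "based on code points" implies, again in Python for illustration only. The sample strings are made up for the example. Comparing UTF-16 code units instead (emulated here by sorting on the UTF-16-BE bytes) orders a supplementary plane character before U+FF21, which is the kind of difference this pull request's tests are meant to catch.

```python
# Sample values (hypothetical): fullwidth A, 𝌆, z, precomposed é, decomposed é.
words = ["\uFF21", "\U0001D306", "z", "\u00e9", "e\u0301"]

# Python compares strings code point by code point.
by_code_point = sorted(words)

# Emulate a UTF-16 code unit comparison via the big-endian byte encoding.
by_utf16_unit = sorted(words, key=lambda s: s.encode("utf-16-be"))

print(by_code_point == ["e\u0301", "z", "\u00e9", "\uFF21", "\U0001D306"])  # True
print(by_utf16_unit == ["e\u0301", "z", "\u00e9", "\U0001D306", "\uFF21"])  # True

# Equality on exact code point sequences: the two forms of é are not equal.
print("\u00e9" == "e\u0301")  # False
```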

@springcomp
Contributor

springcomp commented Oct 18, 2022

@gibson042 I understand your position and mostly agree, with one important caveat: the notion of combining characters.

  • I do agree that we should mostly process strings as sequences of Unicode code points.
  • I agree that sorting and comparisons are usually locale-dependent. In the context of JMESPath, performing those tasks based on the numerical value of the sequence of codepoints seems appropriate.

However, in my mind, combining characters should be exactly that: combining with the previous code point(s) in the sequence. I do not see that as being locale-dependent as you imply. In fact, as you correctly point out, most sequences with combining characters do not have an equivalent precomposed codepoint in the Unicode standard. This leads me to believe that it would be counterintuitive to break the sequence after a reverse() function (or even a split() in the future).

In fact, the Unicode-aware .NET classes used internally in JMESPath.NET treat any sequence of a base character followed by any number of combining marks as a single "TextElement". It seems the Apple iOS software stack does the same, and this appears to be a common request in most programming languages.

Can you be more specific about the liabilities you see in replicating this in JMESPath?

The more I read this section of UAX #29 you linked to, the more I think JMESPath should at least be able to handle legacy grapheme clusters. Are you really suggesting we should not even attempt to do this?

@gibson042
Contributor Author

gibson042 commented Oct 18, 2022

> However, in my mind, combining characters should be exactly that: combining with the previous code point(s) in the sequence. I do not see that as being locale-dependent as you imply. In fact, as you correctly point out, most sequences with combining characters do not have an equivalent precomposed codepoint in the Unicode standard. This leads me to believe that it would be counterintuitive to break the sequence after a reverse() function (or even a split() in the future).

There are multiple kinds of combination—General_Category=Mark as in ȩ̷́̈ (with many special cases such as enclosing 1⃣ and variation selectors like ︎🍰︎), zero width joiner as in 🏳️‍⚧️, zero width non joiner as in র‌্যাঁদা, conjoining jamo as in 각, Devanagari vowels as in षि, regional indicators like 🇺🇸, not to mention defective combining character sequences. And yes, much of which is locale-sensitive.

> In fact, the Unicode-aware .NET classes used internally in JMESPath.NET treat any sequence of a base character followed by any number of combining marks as a single "TextElement". It seems the Apple iOS software stack does the same, and this appears to be a common request in most programming languages.
>
> Can you be more specific about the liabilities you see in replicating this in JMESPath?

In short: a dramatic and harmful decrease in the predictability of standard library behavior.

> The more I read this section of UAX #29 you linked to, the more I think JMESPath should at least be able to handle legacy grapheme clusters. Are you really suggesting we should not even attempt to do this?

Yes, at least not without explicit opt-in via a separate (and possibly even configurable) interface. Comparison operators, length, reverse, and sort should remain straightforward handlers of opaque sequences of code points.

@springcomp
Contributor

@gibson042 I will trust your judgement on this one 😉.

Thank you for your contributions.

@springcomp springcomp merged commit 0be0a71 into jmespath-community:main Oct 19, 2022
Successfully merging this pull request may close these issues.

Unicode coverage is incomplete