Fix/unicode surrogate pairs #75

Merged
merged 6 commits into from
Oct 20, 2022

springcomp
Collaborator

@springcomp springcomp commented Oct 18, 2022

Fixes #74.

Internally, the .NET String class uses UTF-16, which encodes each character from the Basic Multilingual Plane (BMP) as a single 16-bit code unit.
Characters outside this plane are encoded as a sequence of two UTF-16 code units called a surrogate pair.
In that case, a single Unicode character, identified by its codepoint, takes up two UTF-16 code units,
so the .NET String.Length property returns 2 even though there is a single character.
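To make the difference concrete, here is a minimal Python sketch (an illustration only, not the .NET code from this PR). It counts UTF-16 code units, which is what `String.Length` reports in .NET, and compares that with the number of codepoints. The helper names `utf16_length` and `codepoint_length` are hypothetical.

```python
def utf16_length(text: str) -> int:
    """Count UTF-16 code units, i.e. what .NET's String.Length would report."""
    # "utf-16-le" emits exactly two bytes per code unit and no byte order mark.
    return len(text.encode("utf-16-le")) // 2


def codepoint_length(text: str) -> int:
    """Count Unicode codepoints (Python strings are sequences of codepoints)."""
    return len(text)


bmp_char = "\u00e9"        # U+00E9, inside the Basic Multilingual Plane
astral_char = "\U0001F600"  # U+1F600, outside the BMP, needs a surrogate pair

print(utf16_length(bmp_char), codepoint_length(bmp_char))        # 1 1
print(utf16_length(astral_char), codepoint_length(astral_char))  # 2 1
```

A codepoint-based `length()` is expected to return the second number of each pair.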

Another special case is composite characters. For instance, the character é can be encoded as two different Unicode sequences:

  • U+00E9 LATIN SMALL LETTER E WITH ACUTE, i.e. a single Unicode codepoint.
  • U+0065 LATIN SMALL LETTER E, U+0301 COMBINING ACUTE ACCENT, i.e. a sequence of two Unicode codepoints.
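The two encodings of é can be inspected with a short Python sketch (again an illustration, not code from the library). It uses the standard `unicodedata` module; the helper name `codepoint_names` is hypothetical.

```python
import unicodedata


def codepoint_names(text: str) -> list[str]:
    """Return the Unicode name of every codepoint in the string."""
    return [unicodedata.name(ch) for ch in text]


precomposed = "\u00e9"   # single codepoint
decomposed = "e\u0301"   # base letter followed by a combining accent

print(codepoint_names(precomposed))  # ['LATIN SMALL LETTER E WITH ACUTE']
print(codepoint_names(decomposed))   # ['LATIN SMALL LETTER E', 'COMBINING ACUTE ACCENT']

# Both render as the same character but compare unequal and differ in length.
print(precomposed == decomposed)          # False
print(len(precomposed), len(decomposed))  # 1 2

# Normalizing to NFC composes the pair into the single codepoint.
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
```

Normalization is shown only to relate the two forms; this PR does not state that the library normalizes its input.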

As a result, the built-in functions length(), reverse(), sort() and sort_by() did not handle those cases correctly.

This PR fixes #74 by introducing proper support for strings as a sequence of codepoints.
However, this still leaves some issues open with regard to composite characters.
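The following Python sketch (an illustration under stated assumptions, not the implementation in this PR) shows why working on codepoints fixes surrogate pairs for something like reverse(), and why composite characters remain a problem. The helper names `utf16_units` and `reverse_codepoints` are hypothetical.

```python
def utf16_units(text: str) -> list[int]:
    """Return the UTF-16 code units of a string as integers."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


def reverse_codepoints(text: str) -> str:
    """Reverse a string codepoint by codepoint."""
    return text[::-1]


sample = "a\U0001F600"  # 'a' followed by U+1F600

# Reversing code units puts the low surrogate (0xDE00) before the high
# surrogate (0xD83D), which is not a valid UTF-16 sequence.
units = utf16_units(sample)
print([hex(u) for u in units])            # ['0x61', '0xd83d', '0xde00']
print([hex(u) for u in reversed(units)])  # ['0xde00', '0xd83d', '0x61']

# Reversing codepoints keeps the character intact.
print(reverse_codepoints(sample) == "\U0001F600a")  # True

# Composite characters are still split: the combining accent that followed
# 'e' now follows 'x' instead.
print(reverse_codepoints("e\u0301x") == "x\u0301e")  # True
```

Handling the last case would require grouping codepoints into user-perceived characters before reversing, which is outside what this PR describes.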

@springcomp springcomp force-pushed the fix/unicode-surrogate-pairs branch from bd64a60 to f23a6f7 Compare October 19, 2022 06:16
@springcomp
Collaborator Author

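For the record, this PR fixes the handling of surrogate pairs.
However, all other processing operations still happen at the codepoint level, so a composite character made of several codepoints is not treated as a single character.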

@jdevillard jdevillard marked this pull request as ready for review October 20, 2022 15:52
@jdevillard jdevillard merged commit 2b9fbf5 into jdevillard:master Oct 20, 2022
@springcomp springcomp deleted the fix/unicode-surrogate-pairs branch October 20, 2022 15:58
Development

Successfully merging this pull request may close these issues.

JMESPath.Net does not handle Unicode surrogate pair characters correctly.