Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle surrogate pairs during scanning #159

Merged
merged 3 commits into from
Jul 20, 2024
Merged

Conversation

tamcy
Copy link
Contributor

@tamcy tamcy commented Mar 26, 2024

Fixes #53.

The problem

Issue #53 is about adding support to Emoji characters, and the root cause of this issue is that the current YAML package doesn't handle surrogate pairs when parsing.

Why?

In the YAML spec, a character is a Unicode code point (quote Section 5.2: "All characters mentioned in this specification are Unicode code points. Each such code point is written as one or more bytes depending on the character encoding used. Note that in UTF-16, characters above #xFFFF are written as four bytes, using a surrogate pair."). And in Section 5.1, the spec defines a list of acceptable Unicode characters, with UTF-16 surrogates explicitly excluded.

On the other hand, this package utilizes the string_scanner package to do string parsing or scanning, especially its peekChar() and readChar() methods. And the consumed "character" (in int) will be passed to the _isStandardCharacter() method in scanner.dart to test whether this Unicode character is an allowed one:

yaml/lib/src/scanner.dart

Lines 1594 to 1598 in e598443

bool _isStandardCharacter(int char) =>
(char >= 0x00020 && char <= 0x00007E) ||
(char >= 0x000A0 && char <= 0x00D7FF) ||
(char >= 0x0E000 && char <= 0x00FFFD) ||
(char >= 0x10000 && char <= 0x10FFFF);

The problem is that a character (or char) returned by peekChar() or readChar() in string_scanner is actually a "code unit", and the "character" expected by _isStandardCharacter() is a "code point". As a result, peekChar() and readChar() can return a surrogate (usually the higher one) that represents a paired code units, and the value will be immediately rejected by _isStandardCharacter().

The impact is that the package will reject all code points that requires a surrogate pairs if the string is not quoted. This doesn't only affect Emojis.

A note on my previous attempt

I had submitted a pull request at dart-lang/string_scanner#69, hoping that it can ease fixing the issue of the yaml package. But later I feel increasing uncomfortable about changing a codepoint-centric method to accept an offset argument that act upon code units. So I asked to make that pull request on hold, and decided to present my idea here for feedback first.

What is in this pull request

This pull request doesn’t need any modifications to the string_scanner package as I’ve copied the required utility functions from it. Basically, I’ve implemented the following changes:

  1. I have replaced all instances of _scanner.readChar() with _scanner.readCodePoint(). Although _scanner.readChar() could still be used in some areas, I’ve chosen to maintain consistency to prevent potential confusion or unintentional bugs in the future.
  2. Instances that invoke _isStandardCharacter() have been modified to call the new method _isStandardCharacterAt(). This new method is responsible for checking and merging surrogate pairs, and then passing the actual codepoint/character to _isStandardCharacter().

Misc.

benchmark.dart output before and after the fix (in JIT mode):

Before

Run #1: 0.124ms ============
Run #3: 0.116ms ===========
Run #4: 0.111ms ===========
Run #5: 0.110ms ===========
Run #8: 0.107ms ==========
Run #9: 0.106ms ==========
Run #10: 0.105ms ==========
Run #11: 0.099ms =========
Run #12: 0.092ms =========
Run #15: 0.091ms =========
Run #17: 0.089ms ========
Run #23: 0.077ms =======
Run #29: 0.065ms ======
Run #30: 0.064ms ======
Run #32: 0.059ms =====
Run #34: 0.057ms =====
Run #40: 0.056ms =====
Best   : 0.056ms =====

After

Run #1: 0.110ms ===========
Run #3: 0.100ms ==========
Run #5: 0.090ms =========
Run #9: 0.082ms ========
Run #10: 0.072ms =======
Run #11: 0.070ms =======
Run #12: 0.064ms ======
Run #14: 0.062ms ======
Run #18: 0.061ms ======
Run #21: 0.059ms =====
Best   : 0.059ms =====

@kevmoo
Copy link
Member

kevmoo commented Mar 26, 2024

@devoncarew ? @natebosch ?

@natebosch natebosch requested a review from lrhn March 27, 2024 15:54
@natebosch
Copy link
Member

@lrhn - should we be using package:characters or does this look sufficient?

@lrhn
Copy link
Member

lrhn commented Mar 27, 2024

The YAML spec seems to be defined in terms of valid code points, which are almost all code points except specifically excluded ranges.

Grapheme clusters shouldn't be necessary to validate that. The excluded code points don't generally combine, so they're unlikely to be part of a larger grapheme cluster.

Code points should be enough.

lib/src/scanner.dart Outdated Show resolved Hide resolved
Copy link
Member

@lrhn lrhn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally, I'd change all the .readCodePoint()s that know they're reading ASCII characters back to just .readChar(). It should be more efficient. (If it isn't, we should make it more efficient,)

The only reason not to would be that you can't mix readChar and readCodePoint calls.
(If so, we should also fix that.)

lib/src/scanner.dart Outdated Show resolved Hide resolved
lib/src/scanner.dart Outdated Show resolved Hide resolved
lib/src/scanner.dart Outdated Show resolved Hide resolved
lib/src/scanner.dart Show resolved Hide resolved
lib/src/utils.dart Outdated Show resolved Hide resolved
lib/src/utils.dart Outdated Show resolved Hide resolved
lib/src/utils.dart Outdated Show resolved Hide resolved
lib/src/utils.dart Outdated Show resolved Hide resolved
lib/src/utils.dart Outdated Show resolved Hide resolved
lib/src/utils.dart Outdated Show resolved Hide resolved
@tamcy
Copy link
Contributor Author

tamcy commented Apr 4, 2024

In the latest revision I'd changed back to use readChar() whenever possible. Also I have removed the need to decode the surrogate for further checking because a valid surrogate pair itself should be of the range between 0x10000 to 0x10FFFF, so just checking for the validaity of the high and low surrogate is enough.

@kevmoo
Copy link
Member

kevmoo commented May 28, 2024

Any interest in moving this forward?

@jonasfj jonasfj removed their request for review May 29, 2024 08:53
@kevmoo kevmoo merged commit a645c39 into dart-lang:master Jul 20, 2024
6 checks passed
@kevmoo
Copy link
Member

kevmoo commented Jul 20, 2024

Thanks so much, @tamcy !

copybara-service bot pushed a commit to dart-lang/sdk that referenced this pull request Jul 22, 2024
Revisions updated by `dart tools/rev_sdk_deps.dart`.

http (https://github.com/dart-lang/http/compare/a0781c5..73fce77):
  73fce77  2024-07-19  Nate Bosch  Prepare to publish ok_http (dart-lang/http#1274)

yaml (https://github.com/dart-lang/yaml/compare/30fd9e0..a645c39):
  a645c39  2024-07-21  Tamcy  Handle surrogate pairs during scanning (dart-lang/yaml#159)

Change-Id: I32d27892bdb07978ee712c8446142addafec61a1
Reviewed-on: https://dart-review.googlesource.com/c/sdk/+/377040
Reviewed-by: Konstantin Shcheglov <[email protected]>
Commit-Queue: Devon Carew <[email protected]>
@tamcy tamcy mentioned this pull request Sep 22, 2024
mosuem pushed a commit to dart-lang/tools that referenced this pull request Dec 11, 2024
Change back to readChar() whenever possible; remove the need to decode the surrogate for further checking
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Emoji support
4 participants