Handle surrogate pairs during scanning #159

tamcy · 2024-03-26T03:40:40Z

Fixes #53.

The problem

Issue #53 is about adding support to Emoji characters, and the root cause of this issue is that the current YAML package doesn't handle surrogate pairs when parsing.

Why?

In the YAML spec, a character is a Unicode code point (quote Section 5.2: "All characters mentioned in this specification are Unicode code points. Each such code point is written as one or more bytes depending on the character encoding used. Note that in UTF-16, characters above #xFFFF are written as four bytes, using a surrogate pair."). And in Section 5.1, the spec defines a list of acceptable Unicode characters, with UTF-16 surrogates explicitly excluded.

On the other hand, this package utilizes the string_scanner package to do string parsing or scanning, especially its peekChar() and readChar() methods. And the consumed "character" (in int) will be passed to the _isStandardCharacter() method in scanner.dart to test whether this Unicode character is an allowed one:

yaml/lib/src/scanner.dart

Lines 1594 to 1598 in e598443

    
           bool _isStandardCharacter(int char) => 
        
               (char >= 0x00020 && char <= 0x00007E) || 
        
               (char >= 0x000A0 && char <= 0x00D7FF) || 
        
               (char >= 0x0E000 && char <= 0x00FFFD) || 
        
               (char >= 0x10000 && char <= 0x10FFFF);

The problem is that a character (or char) returned by peekChar() or readChar() in string_scanner is actually a "code unit", and the "character" expected by _isStandardCharacter() is a "code point". As a result, peekChar() and readChar() can return a surrogate (usually the higher one) that represents a paired code units, and the value will be immediately rejected by _isStandardCharacter().

The impact is that the package will reject all code points that requires a surrogate pairs if the string is not quoted. This doesn't only affect Emojis.

A note on my previous attempt

I had submitted a pull request at dart-lang/string_scanner#69, hoping that it can ease fixing the issue of the yaml package. But later I feel increasing uncomfortable about changing a codepoint-centric method to accept an offset argument that act upon code units. So I asked to make that pull request on hold, and decided to present my idea here for feedback first.

What is in this pull request

This pull request doesn’t need any modifications to the string_scanner package as I’ve copied the required utility functions from it. Basically, I’ve implemented the following changes:

I have replaced all instances of _scanner.readChar() with _scanner.readCodePoint(). Although _scanner.readChar() could still be used in some areas, I’ve chosen to maintain consistency to prevent potential confusion or unintentional bugs in the future.
Instances that invoke _isStandardCharacter() have been modified to call the new method _isStandardCharacterAt(). This new method is responsible for checking and merging surrogate pairs, and then passing the actual codepoint/character to _isStandardCharacter().

Misc.

benchmark.dart output before and after the fix (in JIT mode):

Before

Run #1: 0.124ms ============
Run #3: 0.116ms ===========
Run #4: 0.111ms ===========
Run #5: 0.110ms ===========
Run #8: 0.107ms ==========
Run #9: 0.106ms ==========
Run #10: 0.105ms ==========
Run #11: 0.099ms =========
Run #12: 0.092ms =========
Run #15: 0.091ms =========
Run #17: 0.089ms ========
Run #23: 0.077ms =======
Run #29: 0.065ms ======
Run #30: 0.064ms ======
Run #32: 0.059ms =====
Run #34: 0.057ms =====
Run #40: 0.056ms =====
Best   : 0.056ms =====

After

Run #1: 0.110ms ===========
Run #3: 0.100ms ==========
Run #5: 0.090ms =========
Run #9: 0.082ms ========
Run #10: 0.072ms =======
Run #11: 0.070ms =======
Run #12: 0.064ms ======
Run #14: 0.062ms ======
Run #18: 0.061ms ======
Run #21: 0.059ms =====
Best   : 0.059ms =====

kevmoo · 2024-03-26T15:41:23Z

@devoncarew ? @natebosch ?

natebosch · 2024-03-27T15:54:48Z

@lrhn - should we be using package:characters or does this look sufficient?

lrhn · 2024-03-27T16:13:13Z

The YAML spec seems to be defined in terms of valid code points, which are almost all code points except specifically excluded ranges.

Grapheme clusters shouldn't be necessary to validate that. The excluded code points don't generally combine, so they're unlikely to be part of a larger grapheme cluster.

Code points should be enough.

lib/src/scanner.dart

lrhn

Generally, I'd change all the .readCodePoint()s that know they're reading ASCII characters back to just .readChar(). It should be more efficient. (If it isn't, we should make it more efficient,)

The only reason not to would be that you can't mix readChar and readCodePoint calls.
(If so, we should also fix that.)

lib/src/scanner.dart

lib/src/utils.dart

…e the surrogate for further checking

tamcy · 2024-04-04T12:26:32Z

In the latest revision I'd changed back to use readChar() whenever possible. Also I have removed the need to decode the surrogate for further checking because a valid surrogate pair itself should be of the range between 0x10000 to 0x10FFFF, so just checking for the validaity of the high and low surrogate is enough.

kevmoo · 2024-05-28T23:07:45Z

Any interest in moving this forward?

kevmoo · 2024-07-20T22:20:06Z

Thanks so much, @tamcy !

Revisions updated by `dart tools/rev_sdk_deps.dart`. http (https://github.com/dart-lang/http/compare/a0781c5..73fce77): 73fce77 2024-07-19 Nate Bosch Prepare to publish ok_http (dart-lang/http#1274) yaml (https://github.com/dart-lang/yaml/compare/30fd9e0..a645c39): a645c39 2024-07-21 Tamcy Handle surrogate pairs during scanning (dart-lang/yaml#159) Change-Id: I32d27892bdb07978ee712c8446142addafec61a1 Reviewed-on: https://dart-review.googlesource.com/c/sdk/+/377040 Reviewed-by: Konstantin Shcheglov <[email protected]> Commit-Queue: Devon Carew <[email protected]>

Change back to readChar() whenever possible; remove the need to decode the surrogate for further checking

tamcy force-pushed the surrogate-fix branch from 0b3a680 to e4d4f14 Compare March 27, 2024 02:50

kevmoo requested review from devoncarew, jonasfj and natebosch March 27, 2024 15:41

natebosch requested a review from lrhn March 27, 2024 15:54

lrhn reviewed Mar 27, 2024

View reviewed changes

lib/src/scanner.dart Outdated Show resolved Hide resolved

lrhn approved these changes Mar 27, 2024

View reviewed changes

Handle surrogate pairs during scanning

ca8d2dc

tamcy force-pushed the surrogate-fix branch from 1638da2 to 342c053 Compare April 4, 2024 11:39

Change back to readChar() whenever possible; remove the need to decod…

153fe9b

…e the surrogate for further checking

tamcy force-pushed the surrogate-fix branch from 342c053 to 153fe9b Compare April 4, 2024 12:23

jonasfj removed their request for review May 29, 2024 08:53

Merge branch 'master' into surrogate-fix

f039440

kevmoo merged commit a645c39 into dart-lang:master Jul 20, 2024
6 checks passed

tamcy mentioned this pull request Sep 22, 2024

fix: Emoji support #53 #125

Closed

mosuem pushed a commit to dart-lang/tools that referenced this pull request Dec 11, 2024

Handle surrogate pairs during scanning (dart-lang/yaml#159)

a0f9e57

Change back to readChar() whenever possible; remove the need to decode the surrogate for further checking

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Handle surrogate pairs during scanning #159

Handle surrogate pairs during scanning #159

tamcy commented Mar 26, 2024 •

edited by kevmoo

Loading

kevmoo commented Mar 26, 2024

natebosch commented Mar 27, 2024

lrhn commented Mar 27, 2024 •

edited

Loading

lrhn left a comment

tamcy commented Apr 4, 2024 •

edited

Loading

kevmoo commented May 28, 2024

kevmoo commented Jul 20, 2024

	bool _isStandardCharacter(int char) =>
	(char >= 0x00020 && char <= 0x00007E) \|\|
	(char >= 0x000A0 && char <= 0x00D7FF) \|\|
	(char >= 0x0E000 && char <= 0x00FFFD) \|\|
	(char >= 0x10000 && char <= 0x10FFFF);

Handle surrogate pairs during scanning #159

Handle surrogate pairs during scanning #159

Conversation

tamcy commented Mar 26, 2024 • edited by kevmoo Loading

The problem

Why?

A note on my previous attempt

What is in this pull request

Misc.

Before

After

kevmoo commented Mar 26, 2024

natebosch commented Mar 27, 2024

lrhn commented Mar 27, 2024 • edited Loading

lrhn left a comment

Choose a reason for hiding this comment

tamcy commented Apr 4, 2024 • edited Loading

kevmoo commented May 28, 2024

kevmoo commented Jul 20, 2024

tamcy commented Mar 26, 2024 •

edited by kevmoo

Loading

lrhn commented Mar 27, 2024 •

edited

Loading

tamcy commented Apr 4, 2024 •

edited

Loading