Handle surrogate pairs during scanning #159
@lrhn - should we be using |
The YAML spec seems to be defined in terms of valid code points, which are almost all code points except specifically excluded ranges. Grapheme clusters shouldn't be necessary to validate that. The excluded code points don't generally combine, so they're unlikely to be part of a larger grapheme cluster. Code points should be enough.
Generally, I'd change all the `.readCodePoint()`s that know they're reading ASCII characters back to just `.readChar()`. It should be more efficient. (If it isn't, we should make it more efficient.)
The only reason not to would be that you can't mix `readChar` and `readCodePoint` calls.
(If so, we should also fix that.)
…e the surrogate for further checking
In the latest revision I'd changed back to use |
Any interest in moving this forward?
Thanks so much, @tamcy!
Revisions updated by `dart tools/rev_sdk_deps.dart`. http (https://github.com/dart-lang/http/compare/a0781c5..73fce77): 73fce77 2024-07-19 Nate Bosch Prepare to publish ok_http (dart-lang/http#1274) yaml (https://github.com/dart-lang/yaml/compare/30fd9e0..a645c39): a645c39 2024-07-21 Tamcy Handle surrogate pairs during scanning (dart-lang/yaml#159) Change-Id: I32d27892bdb07978ee712c8446142addafec61a1 Reviewed-on: https://dart-review.googlesource.com/c/sdk/+/377040 Reviewed-by: Konstantin Shcheglov <[email protected]> Commit-Queue: Devon Carew <[email protected]>
Change back to readChar() whenever possible; remove the need to decode the surrogate for further checking
Fixes #53.
The problem
Issue #53 is about adding support for emoji characters. The root cause is that the current YAML package doesn't handle surrogate pairs when parsing.
Why?
In the YAML spec, a character is a Unicode code point (quoting Section 5.2: "All characters mentioned in this specification are Unicode code points. Each such code point is written as one or more bytes depending on the character encoding used. Note that in UTF-16, characters above #xFFFF are written as four bytes, using a surrogate pair."). And in Section 5.1, the spec defines the set of acceptable Unicode characters, with the UTF-16 surrogate block explicitly excluded.
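To make the Section 5.1 character set concrete, here is a minimal sketch of the spec's `c-printable` production written as a predicate over code points. The name `isPrintableCodePoint` is hypothetical and chosen for illustration only; this is a reading of the spec, not the package's `_isStandardCharacter()`.

```dart
/// Sketch of the YAML 1.2 `c-printable` production (Section 5.1),
/// expressed over Unicode code points. `isPrintableCodePoint` is a
/// hypothetical helper, not part of the yaml package.
bool isPrintableCodePoint(int codePoint) =>
    codePoint == 0x09 || // tab
    codePoint == 0x0A || // line feed
    codePoint == 0x0D || // carriage return
    (codePoint >= 0x20 && codePoint <= 0x7E) || // printable ASCII
    codePoint == 0x85 || // next line (NEL)
    (codePoint >= 0xA0 && codePoint <= 0xD7FF) || // BMP below the surrogate block
    (codePoint >= 0xE000 && codePoint <= 0xFFFD) || // BMP above the surrogate block
    (codePoint >= 0x10000 && codePoint <= 0x10FFFF); // supplementary planes

void main() {
  print(isPrintableCodePoint(0x1F600)); // true: a supplementary-plane code point
  print(isPrintableCodePoint(0xD83D)); // false: a lone high surrogate is excluded
}
```

Note that the gap between `0xD7FF` and `0xE000` is exactly the surrogate block, so any check that is handed a single UTF-16 code unit from a surrogate pair will fail.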
On the other hand, this package utilizes the `string_scanner` package to do string parsing or scanning, especially its `peekChar()` and `readChar()` methods. The consumed "character" (an `int`) is then passed to the `_isStandardCharacter()` method in `scanner.dart` to test whether this Unicode character is an allowed one:
yaml/lib/src/scanner.dart
Lines 1594 to 1598 in e598443
The problem is that a character (or char) returned by `peekChar()` or `readChar()` in `string_scanner` is actually a UTF-16 "code unit", whereas the "character" expected by `_isStandardCharacter()` is a "code point". As a result, `peekChar()` and `readChar()` can return a surrogate (usually the high one) that is only one half of a surrogate pair, and that value is immediately rejected by `_isStandardCharacter()`.

The impact is that the package rejects every code point that requires a surrogate pair if the string is not quoted. This doesn't only affect emoji.
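The mismatch between code units and code points can be illustrated with plain Dart strings, without `string_scanner`. The following is a minimal sketch: the helper names (`isHighSurrogate`, `isLowSurrogate`, `combineSurrogates`, `codePointAt`) are hypothetical and are not the functions used in this pull request.

```dart
// Hypothetical helpers for illustration; not the package's own code.

/// True if [codeUnit] is in the high (leading) surrogate range D800..DBFF.
bool isHighSurrogate(int codeUnit) => (codeUnit & 0xFC00) == 0xD800;

/// True if [codeUnit] is in the low (trailing) surrogate range DC00..DFFF.
bool isLowSurrogate(int codeUnit) => (codeUnit & 0xFC00) == 0xDC00;

/// Merges a high/low surrogate pair into the code point it encodes.
int combineSurrogates(int high, int low) =>
    0x10000 + ((high & 0x3FF) << 10) + (low & 0x3FF);

/// Returns the code point that starts at [index] in [text]. An unpaired
/// surrogate is returned unchanged, so a later validity check can reject it.
int codePointAt(String text, int index) {
  final first = text.codeUnitAt(index);
  if (isHighSurrogate(first) && index + 1 < text.length) {
    final second = text.codeUnitAt(index + 1);
    if (isLowSurrogate(second)) return combineSurrogates(first, second);
  }
  return first;
}

void main() {
  const text = '\u{1F600}'; // one code point, two UTF-16 code units
  print(text.length); // 2
  print(text.codeUnitAt(0).toRadixString(16)); // d83d (high surrogate)
  print(text.codeUnitAt(1).toRadixString(16)); // de00 (low surrogate)
  print(codePointAt(text, 0).toRadixString(16)); // 1f600
  print(text.runes.first == codePointAt(text, 0)); // true
}
```

Reading one code unit at a time yields `0xD83D`, which falls inside the surrogate block that the spec excludes. Only the merged value `0x1F600` is a code point the spec accepts.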
A note on my previous attempt
I had submitted a pull request at dart-lang/string_scanner#69, hoping that it would make it easier to fix this issue in the yaml package. But later I felt increasingly uncomfortable about changing a code-point-centric method to accept an `offset` argument that acts upon code units. So I asked to put that pull request on hold, and decided to present my idea here for feedback first.

What is in this pull request
This pull request doesn’t need any modifications to the `string_scanner` package, as I’ve copied the required utility functions from it. Basically, I’ve implemented the following changes:
- Replaced `_scanner.readChar()` with `_scanner.readCodePoint()`. Although `_scanner.readChar()` could still be used in some areas, I’ve chosen to maintain consistency to prevent potential confusion or unintentional bugs in the future.
- Calls to `_isStandardCharacter()` have been modified to call the new method `_isStandardCharacterAt()`. This new method is responsible for checking and merging surrogate pairs, and then passing the actual code point to `_isStandardCharacter()`.

Misc.
benchmark.dart output before and after the fix (in JIT mode):
Before
After