Normative: Fully specify legal escape sequences in RegExp capture group names #1869
Thanks for fixing the spec bug that accidentally made As explained in #1861, I'd prefer making |
Yup, I'm happy either way. Mostly I just wanted to have a PR to discuss next meeting, and this behavior was easier to implement. |
Marking as "request changes" per my earlier comment. Let's discuss this in plenary!
Thanks for this fix to my incoherent spec text!
To echo my other comment, I'm happy with the change in grammar here, as it seems most consistent with how RegExps deal with literals in general. |
Goals
Guiding example
An ASCIIfier gets the following source code as input:
'a'.match(/(?<𐊧>🎈{2})/).groups.𐊧;
'b'.match(/(?<𐊧>🎈{2})/u).groups.𐊧;
'c'.match(RegExp('(?<𐊧>🎈{2})', 'u')).groups.𐊧;
const 𐊧 = 42;
Note the following:
Proposal 1
The current proposal is to allow escaping astral identifiers as individually-escaped surrogate halves in group names.
This proposal allows the ASCIIfier to output:
'a'.match(/(?<\uD800\uDEA7>\uD83C\uDF88{2})/).groups.\u{102A7};
'b'.match(/(?<\uD800\uDEA7>\uD83C\uDF88{2})/u).groups.\u{102A7};
'c'.match(RegExp('(?<\uD800\uDEA7>\uD83C\uDF88{2})', 'u')).groups.\u{102A7};
const \u{102A7} = 42;
// …or, if the ASCIIfier is slightly more advanced…
'a'.match(/(?<\uD800\uDEA7>\uD83C\uDF88{2})/).groups['\u{102A7}'];
'b'.match(/(?<\uD800\uDEA7>\uD83C\uDF88{2})/u).groups['\u{102A7}'];
'c'.match(RegExp('(?<\uD800\uDEA7>\uD83C\uDF88{2})', 'u')).groups['\u{102A7}'];
const \u{102A7} = 42;
// …or…
'a'.match(/(?<\uD800\uDEA7>\uD83C\uDF88{2})/).groups['\uD800\uDEA7'];
'b'.match(/(?<\uD800\uDEA7>\uD83C\uDF88{2})/u).groups['\uD800\uDEA7'];
'c'.match(RegExp('(?<\uD800\uDEA7>\uD83C\uDF88{2})', 'u')).groups['\uD800\uDEA7'];
const \u{102A7} = 42;
Proposal 2
Continue disallowing individually-escaped surrogates in group names, and instead allow
This proposal allows the ASCIIfier to output:
'a'.match(/(?<\u{102A7}>\uD83C\uDF88{2})/).groups.\u{102A7};
'b'.match(/(?<\u{102A7}>\uD83C\uDF88{2})/u).groups.\u{102A7};
'c'.match(RegExp('(?<\u{102A7}>\u{1F388}{2})', 'u')).groups.\u{102A7};
const \u{102A7} = 42;
Proposal 3
Ban astral group names in non-
With this change, the first line of the source program remains invalid:
'a'.match(/(?<𐊧>🎈{2})/).groups.𐊧; // throws early SyntaxError
'b'.match(/(?<𐊧>🎈{2})/u).groups.𐊧;
'c'.match(RegExp('(?<𐊧>🎈{2})', 'u')).groups.𐊧;
const 𐊧 = 42;
This proposal allows an ASCIIfier to output:
'a'.match(/(?<\uD800\uDEA7>\uD83C\uDF88{2})/).groups.\u{102A7}; // still throws early SyntaxError
'b'.match(/(?<\u{102A7}>\u{1F388}{2})/u).groups.\u{102A7};
'c'.match(RegExp('(?<\u{102A7}>\u{1F388}{2})', 'u')).groups.\u{102A7};
const \u{102A7} = 42;
The only “downside” to this proposal is that it introduces one more difference between non-
TL;DR
I believe proposal 3 is the way to go. It addresses the concerns that were raised: it’s tooling-friendly, preserves symmetry, and gates the changes on the presence of the |
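For readers following the escape spellings in the proposals above, here is a minimal sketch of how an ASCIIfier could derive both forms from a code point. The helper names (`hex4`, `toSurrogateEscapes`, `toCodePointEscape`) are hypothetical and do not come from this thread or from any existing tool; the surrogate arithmetic is the standard UTF-16 encoding.

```javascript
// Hypothetical helpers (not from the thread): spell a code point as ASCII-only escapes.
function hex4(unit) {
  return unit.toString(16).toUpperCase().padStart(4, '0');
}

// Individually escaped surrogate halves, the spelling discussed in Proposal 1.
function toSurrogateEscapes(codePoint) {
  if (codePoint <= 0xFFFF) return '\\u' + hex4(codePoint);
  const offset = codePoint - 0x10000;
  const lead = 0xD800 + (offset >> 10);    // high (lead) surrogate
  const trail = 0xDC00 + (offset & 0x3FF); // low (trail) surrogate
  return '\\u' + hex4(lead) + '\\u' + hex4(trail);
}

// Single code point escape, the spelling discussed in Proposals 2 and 3.
function toCodePointEscape(codePoint) {
  return '\\u{' + codePoint.toString(16).toUpperCase() + '}';
}

const groupName = '𐊧'.codePointAt(0); // U+102A7
console.log(toSurrogateEscapes(groupName)); // \uD800\uDEA7
console.log(toCodePointEscape(groupName));  // \u{102A7}
```

Which spelling is usable inside a group name is exactly what the proposals disagree on; the sketch only shows that both are mechanical to produce.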
I believe there is a 4th way that was missed for discussion: allow surrogate pair syntax outside of RegExp grammar as an Identifier and inside of named capture group Ids:
This would satisfy the thinking #1869 (comment) |
I think you're overvaluing this particular narrow symmetry... for the sake of argument (and echoing @bmeck), what about the symmetry of
Proposal 4
Interpret surrogate pairs as single code points in IdentifierName, just as they are in string, template, and regular expression literals (including in capture group names). Apply code point semantics to
This proposal allows an ASCIIfier to output:
'a'.match(/(?<\uD800\uDEA7>\uD83C\uDF88{2})/).groups.\uD800\uDEA7;
'b'.match(/(?<\uD800\uDEA7>\uD83C\uDF88{2})/u).groups.\uD800\uDEA7;
'c'.match(RegExp('(?<\uD800\uDEA7>\uD83C\uDF88{2})', 'u')).groups.\uD800\uDEA7;
const \uD800\uDEA7 = 42; |
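To see what Proposal 4 would change outside of regular expressions, the following sketch probes whether a given escape spelling parses as a variable name in the engine running it. `parsesAsIdentifier` is a hypothetical helper written for this illustration. As far as I understand the identifier grammar without Proposal 4, the code point form is accepted and the surrogate-pair form is rejected, but the sketch prints whatever the engine actually does rather than assuming it.

```javascript
// Hypothetical probe: does `source` parse as a variable name in this engine?
function parsesAsIdentifier(source) {
  try {
    // The Function constructor parses its body eagerly, so a bad name throws here.
    new Function('var ' + source + ' = 42; return ' + source + ';');
    return true;
  } catch (err) {
    if (err instanceof SyntaxError) return false;
    throw err;
  }
}

// String.raw keeps the backslashes, so the escapes reach the parser unevaluated.
const codePointForm = String.raw`\u{102A7}`;
const surrogateForm = String.raw`\uD800\uDEA7`;

console.log(codePointForm, parsesAsIdentifier(codePointForm));
console.log(surrogateForm, parsesAsIdentifier(surrogateForm));
```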
Proposal 4 is another option, indeed. I would still prefer proposal 3 since a) it doesn't introduce the unfortunate concept of surrogates in more places in the language and b) it doesn't introduce a new mental model, but rather fits within a pre-existing one. Proposal 4 seems like a larger change. |
Here’s why I think expecting astral symbols to work within capture group names in non-
This is a list of places where astral symbols don’t work as expected in non-
The ES2015
Given the above, I assert that there is no reasonable expectation that astral symbols are supported in any context within a regular expression pattern, unless the
The same thing then applies for the |
@mathiasbynens Another downside of proposal three is that Chinese-language users will be unable to use certain nouns in their language as group names in certain regexes, for reasons which will appear totally opaque. (Is anyone really going to already know that "tungsten" (钨) is a BMP character but "seaborgium" (𨭎) is astral?) That seems like it is clearly unacceptable. |
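The BMP versus astral distinction in the comment above is invisible when reading the characters, but easy to check in code. A small sketch, with `isAstral` as a hypothetical helper name:

```javascript
// Hypothetical helper: true when the first code point lies outside the BMP.
function isAstral(ch) {
  return ch.codePointAt(0) > 0xFFFF;
}

for (const noun of ['钨', '𨭎']) {
  const hex = noun.codePointAt(0).toString(16).toUpperCase();
  console.log(noun, 'U+' + hex, isAstral(noun) ? 'astral' : 'BMP', 'UTF-16 length: ' + noun.length);
}
```

The astral character occupies two UTF-16 code units, which is what a non-Unicode pattern sees.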
This is already the case for everything else in non- |
In that non- |
Strange limitations are inherent to non- |
Yes, but that doesn't mean that creating more of them is costless. When weighed against the (to my mind) quite small benefits of ensuring the symmetry and mental model you would like to have are unbroken even in this edge case, which will be exposed to vanishingly few users, I think the cost of propagating this particular strange limitation (which will be exposed to far more users) dominates. |
Per consensus today, all of
/(?<\ud835\udc9c>.)/
/(?<\ud835\udc9c>.)/u
/(?<\u{1d49c}>.)/
/(?<\u{1d49c}>.)/u
/(?<𝒜>.)/
/(?<𝒜>.)/u
will be legal. I have updated this PR to implement those semantics. |
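A quick way to check how a given engine treats these six forms is to compile each one at runtime and catch the early error. This is only a sketch: `tryCompile` is a hypothetical helper, and the results depend on whether the engine has implemented the semantics agreed here, so no particular output is assumed.

```javascript
// Hypothetical helper: report whether the current engine accepts a pattern.
function tryCompile(source, flags) {
  try {
    const re = new RegExp(source, flags);
    const match = re.exec('x');
    return { source, flags, legal: true, names: Object.keys(match.groups) };
  } catch (err) {
    if (!(err instanceof SyntaxError)) throw err;
    return { source, flags, legal: false, names: [] };
  }
}

// String.raw keeps the backslashes so the escapes reach the RegExp parser intact.
const sources = [
  String.raw`(?<\ud835\udc9c>.)`,
  String.raw`(?<\u{1d49c}>.)`,
  '(?<𝒜>.)',
];

const results = [];
for (const source of sources) {
  for (const flags of ['', 'u']) {
    results.push(tryCompile(source, flags));
  }
}

for (const r of results) {
  console.log('/' + r.source + '/' + r.flags, r.legal ? 'legal, groups: ' + r.names.join(', ') : 'SyntaxError');
}
```

In an engine that follows the consensus, every row should report the same single group name, since all three spellings denote U+1D49C.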
…up names (#1869)
This commit makes the Early Errors for RegExpIdentifierStart and RegExpIdentifierPart fully specified, with the semantics that Unicode escape sequences of the form `\u LeadSurrogate \u TrailSurrogate` as well as `\u { CodePoint }` are legal in named capture group names for both Unicode and non-Unicode regular expressions.
This commit thus makes legal all of the following:
- `/(?<\ud835\udc9c>.)/`
- `/(?<\ud835\udc9c>.)/u`
- `/(?<\u{1d49c}>.)/`
- `/(?<\u{1d49c}>.)/u`
- `/(?<𝒜>)/`
- `/(?<𝒜>)/u`
Fixes #1861
…up names (#1869)
This commit makes the Early Errors for RegExpIdentifierStart and RegExpIdentifierPart fully specified, with the semantics that Unicode escape sequences of the form `\u LeadSurrogate \u TrailSurrogate` as well as `\u { CodePoint }` are legal in named capture group names for both Unicode and non-Unicode regular expressions.
Fixes #1861
PR tc39#1869 removed/changed the [U] parameter in certain defining productions, but missed some collateral changes in referring productions.
Hi. I guess that this change should update the second and third paragraphs of 21.2.2 Pattern Semantics.
This means, in |
@mysticatea Nice catch! #1932 has a fix. |
This is a followup to #1869. Per consensus, `/(?<𝒜>.)/` should be legal - but the rules for parsing non-u patterns are such that the source text is parsed by treating each half of a surrogate pair as an individual code point, so the rules for RegExpIdentifierStart and RegExpIdentifierPart need to be tweaked to allow surrogate pairs for that case.
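The parsing rule described above mirrors how non-u patterns treat astral characters when matching: each half of the surrogate pair counts as its own character. A short sketch of that long-standing behavior (`matchesDots` is a hypothetical helper name):

```javascript
// Hypothetical helper: does `str` consist of exactly `count` pattern characters?
function matchesDots(str, count, flags) {
  return new RegExp('^' + '.'.repeat(count) + '$', flags).test(str);
}

const scriptA = '𝒜'; // U+1D49C, stored as two UTF-16 code units

console.log(scriptA.length);               // 2
console.log(matchesDots(scriptA, 1, ''));  // false: non-u mode sees two code units
console.log(matchesDots(scriptA, 2, ''));  // true
console.log(matchesDots(scriptA, 1, 'u')); // true: u mode sees one code point
```

The followup has to special-case group names because, without the tweak, the same per-code-unit view would apply to a literal `𝒜` inside `(?<…>`.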
…de regular expressions
https://bugs.webkit.org/show_bug.cgi?id=210309

Reviewed by Ross Kirsling.

JSTests:

* stress/regexp-named-capture-groups.js: New test added.
(shouldBe):
(shouldThrowInvalidGroupSpecifierName):
* test262/expectations.yaml: Updated for now failing tests.
When test262 gets updated for this change, this can be reverted.

Source/JavaScriptCore:

Update YARR pattern processing to allow for non-BMP unicode identifier characters in named capture groups.

This change was discussed and approved at the March/April 2020 TC-39 meeting.
See tc39/ecma262#1869 for the discussion and change.

Updated tryConsumeUnicodeEscape() to allow for unicode escapes in non-unicode flagged regex's.
Added the same support to consumePossibleSurrogatePair().

* yarr/YarrParser.h:
(JSC::Yarr::Parser::consumePossibleSurrogatePair):
(JSC::Yarr::Parser::parseCharacterClass):
(JSC::Yarr::Parser::parseTokens):
(JSC::Yarr::Parser::tryConsumeUnicodeEscape):
(JSC::Yarr::Parser::tryConsumeIdentifierCharacter):

Canonical link: https://commits.webkit.org/223334@main
git-svn-id: https://svn.webkit.org/repository/webkit/trunk@260033 268f45cc-cd09-0410-ab3c-d52691b4dbfc
Fixes #1861. See that issue for context. Marked as normative despite fixing a spec bug because there are two sensible behaviors possible here, either of which is compatible with the current (incomplete) specification.
This PR takes the approach of making /(?<\ud835\udc9c>.)/u legal (in addition to making /(?<\u{1d49c}>.)/u legal, which is more obviously the intent of the current incomplete spec text).