Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Editorial: Eliminate order-disambiguation from Annex B Pattern-grammar #2445

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

jmdyck
Copy link
Collaborator

@jmdyck jmdyck commented Jun 27, 2021

B.1.4 Regular Expressions Patterns says:

The syntax of 22.2.1 is modified and extended as follows. These changes introduce ambiguities that are broken by the ordering of grammar productions and by contextual information. When parsing using the following grammar, each alternative is considered only if previous production alternatives do not match.

This PR eliminates these order-dependencies (mostly by inserting equivalent lookahead-constraints).

(I made this a Draft PR because it isn't in a final (mergeable) state. On the other hand, it is ready for review, at least to the extent of deciding whether to pursue it.)


The basic idea is that an order-disambiguated production such as:

lhs ::
    alt1
    alt2
    alt3

can be transformed into an equivalent "normal" production:

lhs ::
    alt1
    [lookahead != alt1] alt2
    [lookahead != alt1] [lookahead != alt2] alt3

Of course, applied naively, this would be verbose and hard to read. (Some productions have 9 alternatives!) So instead, we only insert lookahead-constraints (or other exclusions) where an ambiguity actually exists.

Also, there's the risk that an alt might be grammatically more complex than we want to have in a lookahead-constraint. In practice, it looks like some are more complex than we currently allow, but perhaps not unreasonably so.

@jmdyck
Copy link
Collaborator Author

jmdyck commented Jun 27, 2021

Analysis

[I wrote this before the grammatical parameters U and N were renamed to UnicodeMode and NamedCaptureGroups. For brevity, I'll continue to use the old names.]


B.1.4 says:

This alternative pattern grammar and semantics only changes the syntax and semantics of BMP patterns. The following grammar extensions include productions parameterized with the [U] parameter. However, none of these extensions change the syntax of Unicode patterns recognized when parsing with the [U] parameter present on the goal symbol.

That is, if we consider the B.1.4 grammar under [+U], it should duplicate the 22.2.1 grammar under [+U], which we can assume does not have ambiguities. Thus, we only need to look for ambiguities in the B.1.4 grammar under [~U].

This comment will go to each production in B.1.4, and:

  • look at the sentential forms generated (under [~U]) by each alternative (using an ad-hoc notation that combines BNF and regex),
  • identify the ambiguities (if any), and
  • suggest how they can be eliminated.

I'll consider the productions bottom-up (roughly the reverse of their order in the spec), so that our knowledge builds up as we go.


ClassControlLetter ::
  DecimalDigit
  `_`

forms:

  • alt1: [0-9]
  • alt2: _

There's no overlap, so no ambiguities.


SourceCharacterIdentityEscape[N] ::
  [~N] SourceCharacter but not `c`
  [+N] SourceCharacter but not one of `c` or `k`

For any given setting of [N], there's only one alternative, so there can't be an ambiguity.

Note that SourceCharacter is any Unicode code point, so below, I'll paraphrase SourceCharacterIdentityEscape as "any char except c and except k under [+N]"


IdentityEscape[U, N] ::
  [+U] SyntaxCharacter
  [+U] `/`
  [~U] SourceCharacterIdentityEscape[?N]

forms:

  • alt1: ignore
  • alt2: ignore
  • alt3: any char except c and except k under [+N]

Under [~U]:

  • there's only one alternative, so no ambiguities.
  • IdentityEscape has the same expansion as SourceCharacterIdentityEscape

CharacterEscape[U, N] ::
  ControlEscape
  `c` AsciiLetter
  `0` [lookahead ∉ DecimalDigit]
  HexEscapeSequence
  RegExpUnicodeEscapeSequence[?U]
  [~U] LegacyOctalEscapeSequence
  IdentityEscape[?U, ?N]

forms:

  • alt1: [fnrtv]
  • alt2: c [a-zA-Z]
  • alt3: 0 [...]
  • alt4: x ...
  • alt5: u ...
  • alt6: [0-8] ...
  • alt7: any char except c and except k under [+N]

It looks like there might be an ambiguity between alt3 and alt6 on 0, but in alt3, a 0 cannot be followed by a DecimalDigit, whereas in alt6, a 0 must be followed by a DecimalDigit (in effect). So, no ambiguity there.

The remaining ambiguities are between alt7 and everything else (except alt2).

Consider alt1 (ControlEscape). Clearly, if the current input begins with [fnrtv], this alternative will match, and alt7 will never get a chance. So we can resolve this ambiguity simply by excluding [fnrtv] from IdentityEscape. We could do this by appending "but not one of f or n or ..." to alt7, but because this is the only place (in the Annex B grammar) where IdentityEscape appears on the right-hand side, we can instead incorporate the exclusion into the definition of IdentityEscape (under [~U]). And from there, we can likewise push the exclusion down into the definition of SourceCharacterIdentityEscape.

Similarly, if the current input begins with [0-8], it's always the case that either alt3 or alt6 will match (though this isn't obvious), so we can resolve this ambiguity by excluding [0-8] from IdentityEscape.

However, this approach doesn't work for alt4 and alt5. If the current input begins with x, it might not match alt4 (HexEscapeSequence). And if it begins with u, it might not match alt5 (RegExpUnicodeEscapeSequence). In either case, under Annex B rules, alt7 will then be considered, and it will consume the x or u.

So IdentityEscape must still be free to recognize x and u, but only if the current input doesn't match HexEscapeSequence or RegExpUnicodeEscapeSequence.

  [lookahead ∉ HexEscapeSequence] [lookahead ∉ RegExpUnicodeEscapeSequence] IdentityEscape[?U, ?N]

Note that HexEscapeSequence derives a finite set of three-character sequences, and RegExpUnicodeEscapeSequence derives a rather large but still finite set of sequences of various lengths, so this is still within the definition of lookahead-constraints in 5.1.5 Grammar Notation.

Note also that, under [+U], these lookahead-constraints are satisfied automatically.

A quick summary for below: CharacterEscape derives forms that start with any char except k under [+N], and then maybe have more chars.


ClassEscape[U, N] ::
  `b`
  [+U] `-`
  [~U] `c` ClassControlLetter
  CharacterClassEscape[?U]
  CharacterEscape[?U, ?N]

forms:

  • alt1: b
  • alt2: ignore
  • alt3: c [0-9_]
  • alt4: [dswDSW]
  • alt5: (any char except k under [+N]) & maybe more chars

(Note that, for alt4, CharacterClassEscape only includes UnicodeProperty stuff under [+U], which we can ignore.)

alt1, alt3, alt4 are disjoint, so the ambiguities all involve alt5 (CharacterEscape)

alt3 and alt5 are disjoint, because alt3 only derives c [0-9_], and of the forms that alt5 derives that begin with c, there's only c [a-zA-Z].

alt5 has ambiguities with alt1 and alt4, which we could resolve with:

  [lookahead ∉ {`b`, `d`, `s`, `w`, `D`, `S`, `W`}] CharacterEscape[?U, ?N]

CharacterEscape appears in RHS of both ClassEscape and AtomEscape, so we can't push this exclusion into the definition of CharacterEscape unless the two uses agree. But looking ahead, we see that AtomEscape also has a CharacterClassEscape alt followed by a CharacterEscape alt, so they do agree on that exclusion. That is, alt5 here will be:

  [lookahead != `b`] CharacterEscape[?U, ?N]

and we can push the CharacterClassEscape [dswDSW] exclusion down into CharacterEscape, and thence into IdentityEscape, and thence into SourceCharacterIdentityEscape.

A quick summary for below: ClassEscape derives forms that start with any char except k under [+N], and then maybe have more chars.


ClassAtomNoDash[U, N] ::
  SourceCharacter but not one of `\` or `]` or `-`
  `\` ClassEscape[?U, ?N]
  `\` [lookahead == `c`]

forms:

  • alt1: any char except \ or ] or -
  • alt2: \ (any char except k under [+N]) (maybe more chars)
  • alt3: \ [lookahead == c]

alt1 can't be \, so can't conflict with alt2 or alt3.

But alt2 and alt3 conflict if the current input starts with \c. alt3 should only be considered if alt2 fails to match, i.e. if the input doesn't match c ClassControlLetter or c AsciiLetter:

  `\` [lookahead == `c`] [lookahead ∉ `c` ClassControlLetter] [lookahead ∉ `c` AsciiLetter]

This goes slightly outside what 5.1.5 Grammar Notation allows, in that c ClassControlLetter (likewise c AsciiLetter) is neither an explicit set of token sequences, nor a single nonterminal. But the extension seems minor, as the phrase still expands to a finite set of token sequences.

(Alternatively, one could say:

  `\` [lookahead == `c`] [lookahead ∉ ClassEscape[?U, ?N]]

but I think that might be worse.)


AtomEscape[U, N] ::
  [+U] DecimalEscape
  [~U] DecimalEscape [> but only if the CapturingGroupNumber of |DecimalEscape| is ≤ _NcapturingParens_]
  CharacterClassEscape[?U]
  CharacterEscape[?U, ?N]
  [+N] `k` GroupName[?U]

forms:

  • alt1: ignore
  • alt2: [1-9][0-9]* [> but only ...]
  • alt3: [dswDSW]
  • alt4: (any char except k under [+N]) && maybe more chars
  • alt5: [+N] k ...

alt2, alt3, alt5 are disjoint, so alt4 is the source of all ambiguities.

alt4 and alt5 are disjoint: the "except k under [+N]" that we've been carrying along since nearly the start finally pays off.

alt3 conflicts with alt4, but that's easy to deal with, and in fact the change to CharacterEscape proposed above under ClassEscape already took care of it.

alt2 also conflicts with alt4, when the current input starts with [1-9]. The resolution is (roughly) to prepend "[lookahead isnt alt2]" to alt4, but there are various ways it could be expressed. I wound up defining

ConstrainedDecimalEscape ::
  DecimalEscape [> but only if the CapturingGroupNumber of |DecimalEscape| is ≤ _NcapturingParens_]

and then tweaking AtomEscape to

AtomEscape[U, N] ::
  [+U] DecimalEscape
  [~U] ConstrainedDecimalEscape
  CharacterClassEscape[?U]
  [+U] CharacterEscape[?U, ?N]
  [~U] [lookahead ∉ ConstrainedDecimalEscape] CharacterEscape[?U, ?N]
  [+N] `k` GroupName[?U]

(I split the CharacterEscape alternative into [+U] and [~U] versions because the reference to ConstrainedDecimalEscape wouldn't have made sense under [+U].)

Note that, while DecimalEscape derives an infinite set of terminal-sequences, the "but only if" limits it to a finite set, so [lookahead ∉ ConstrainedDecimalEscape] seems to be a valid lookahead-constraint.

A quick summary for below: AtomEscape derives forms that start with any char, and then maybe have more chars.


ExtendedPatternCharacter ::
  SourceCharacter but not one of `^` `$` `\` `.` `*` `+` `?` `(` `)` `[` `|`

Only one alternative, so no ambiguities.


InvalidBracedQuantifier ::
  `{` DecimalDigits[~Sep] `}`
  `{` DecimalDigits[~Sep] `,` `}`
  `{` DecimalDigits[~Sep] `,` DecimalDigits[~Sep] `}`

The three alternatives are disjoint, so no ambiguities.


ExtendedAtom[N] ::
  `.`
  `\` AtomEscape[~U, ?N]
  `\` [lookahead == `c`]
  CharacterClass[~U]
  `(` Disjunction[~U, ?N] `)`
  `(` `?` `:` Disjunction[~U, ?N] `)`
  InvalidBracedQuantifier
  ExtendedPatternCharacter

forms:

  • alt1: .
  • alt2: \ (any char) && (maybe more chars)
  • alt3: \ [lookahead == c]
  • alt4: [ ...
  • alt5: ( [^?] ...
  • alt6: (?: ...
  • alt7: { ...
  • alt8: any char except ^ $ \ . * + ? ( ) [ |

alt2 conflicts with alt3 on \c. AtomEscape derives forms starting with c via

CharacterEscape :: `c` AsciiLetter

so we can resolve this by changing alt3 to:

  `\` [lookahead == `c`] [lookahead ∉ `c` AsciiLetter]

(similar to the change in ClassAtomNoDash).

alt7 conflicts with alt8 on {. To resolve, we could change alt8:

  [lookahead ∉ InvalidBracedQuantifier] ExtendedPatternCharacter

The problem with this is that InvalidBracedQuantifier derives an infinite set of terminal-sequences (because its DecimalDigits can be arbitrarily long). However, checking whether the lookahead matches InvalidBracedQuantifier doesn't seem unreasonable, so perhaps we can relax 5.1.5's restriction on the constraint-set from finite set to regular set.

A quick summary for below: ExtendedAtom derives forms that start with (any char except ^ $ * + ? ) |) && maybe have more


QuantifiableAssertion[N] ::
  `(` `?` `=` Disjunction[~U, ?N] `)`
  `(` `?` `!` Disjunction[~U, ?N] `)`

Alternatives are disjoint, so no ambiguities.


Assertion[U, N] ::
  `^`
  `$`
  `\` `b`
  `\` `B`
  [+U] `(` `?` `=` Disjunction[+U, ?N] `)`
  [+U] `(` `?` `!` Disjunction[+U, ?N] `)`
  [~U] QuantifiableAssertion[?N]
  `(` `?` `<=` Disjunction[?U, ?N] `)`
  `(` `?` `<!` Disjunction[?U, ?N] `)`

forms:

  • alt1: ^
  • alt2: $
  • alt3: \b
  • alt4: \B
  • alt5: ignore
  • alt6: ignore
  • alt7: (?[=!] ...
  • alt8: (?<= ...
  • alt9: (?<! ...

Alternatives are disjoint, so no ambiguities.


Term[U, N] ::
  [+U] Assertion[+U, ?N]
  [+U] Atom[+U, ?N] Quantifier
  [+U] Atom[+U, ?N]
  [~U] QuantifiableAssertion[?N] Quantifier
  [~U] Assertion[~U, ?N]
  [~U] ExtendedAtom[?N] Quantifier
  [~U] ExtendedAtom[?N]

forms:

  • alt1: ignore
  • alt2: ignore
  • alt3: ignore
  • alt4: (?[=!] ...
  • alt5: ^ | $ | \[bB] | (?[=!] ... | (?<[=!] ...
  • alt6: (any char except ^ $ * + ? ) |) ...
  • alt7: (any char except ^ $ * + ? ) |) ...

Consider alt6 and alt7:

  [~U] ExtendedAtom[?N] Quantifier
  [~U] ExtendedAtom[?N]

There's clearly a shift-reduce conflict after ExtendedAtom, but is there an ambiguity? If the text matching Quantifier is (or starts with) [*+?], there's no ambiguity, because (in this context) that can only be a Quantifier. But if the text matching Quantifier starts with '{', then formally there's an ambiguity, e.g. x{2} can be parsed as

Alternative --- Term -+- ExtendedAtom - x
                      |
                      +- Quantifier - { 2 }

or

Alternative -+- Term --- ExtendedAtom - x
             |
             +- Term --- ExtendedAtom --- InvalidBracedQuantifier - { 2 }

But any occurrence of InvalidBracedQuantifier is defined (by Early Error rule) to be a Syntax Error, so I'm guessing that this doesn't count as an ambiguity as far as the spec is concerned. Therefore, we don't need to disambiguate between alt6 and alt7.

Simlarly, alt4 conflicts with alt5 on QuantifiableAssertion followed by Quantifier (or not), but I'm assuming the spec doesn't consider it an ambiguity.

It looks like alt4 might conflict with alt6/alt7, but no, ExtendedAtom can't start with (?[=!].

Lastly, alt5 conflicts with alt6/alt7 on \[bB]. Since ExtendedAtom's only RHS appearances are here in alt6+alt7, we can resolve the ambiguity by modifying ExtendedAtom's definition to exclude \[bB]. That is, we can change its alt2:

  `\` AtomEscape[~U, ?N]

to

  `\` [lookahead &notin; {`b`, `B`}] AtomEscape[~U, ?N]

(We could push the exclusion down into AtomEscape[~U]:

  [~U] ... [lookahead &notin; {`b`, `B`}] CharacterEscape[?U, ?N]

but I'm not sure that would be an improvement.)

@ljharb

This comment has been minimized.

@jmdyck

This comment has been minimized.

@jmdyck jmdyck force-pushed the annex_B_pattern_ambig branch 3 times, most recently from a86233c to f773ee8 Compare June 28, 2021 14:05
@ljharb ljharb force-pushed the master branch 3 times, most recently from 3d0c24c to 7a79833 Compare June 29, 2021 02:21
@jmdyck jmdyck force-pushed the annex_B_pattern_ambig branch from f773ee8 to 32d59c4 Compare August 17, 2021 01:19
@jmdyck
Copy link
Collaborator Author

jmdyck commented Aug 17, 2021

(force-pushed to rebase to master + resolve merge conflicts from #2411)

@jmdyck jmdyck force-pushed the annex_B_pattern_ambig branch from 32d59c4 to 628be4a Compare September 14, 2021 13:24
@jmdyck jmdyck force-pushed the annex_B_pattern_ambig branch from 628be4a to 6425763 Compare February 28, 2022 18:55
@jmdyck jmdyck force-pushed the annex_B_pattern_ambig branch from 6425763 to 05c15a5 Compare April 21, 2022 14:48
@jmdyck jmdyck force-pushed the annex_B_pattern_ambig branch from 05c15a5 to 4403239 Compare August 6, 2022 03:39
@jmdyck jmdyck force-pushed the annex_B_pattern_ambig branch 2 times, most recently from 7d4b5e0 to 972db12 Compare September 29, 2022 23:33
@jmdyck jmdyck force-pushed the annex_B_pattern_ambig branch from 972db12 to 3d0639d Compare November 4, 2022 21:11
@jmdyck jmdyck force-pushed the annex_B_pattern_ambig branch from 3d0639d to a5ae275 Compare August 17, 2023 21:27
@jmdyck jmdyck force-pushed the annex_B_pattern_ambig branch from a5ae275 to 3a90b56 Compare August 13, 2024 02:25
@jmdyck
Copy link
Collaborator Author

jmdyck commented Aug 13, 2024

Okay, I've finally made the changes necessary to take this out of Draft. The difficulty was that disambiguating the annex B regexp grammar requires extending the syntax + semantics of lookahead restrictions (LRs), and that was tough to do given the way that they're currently defined in the "Lookahead Restrictions" section.

So, in commit 1, I started by rewriting some of the LRs section. It says basically the same thing as the status quo, but fixes some problems, and refactors things to make it easier to extend. (See notes below.) It's independent of the main point of this PR, so I could pull it out as a separate PR if you want.

Then, in commit 2 (the main commit, with all the disambiguating), I modified the LRs section to allow for the LRs I added to Annex B, e.g.,
[lookahead &notin; `c` ClassControlLetter]
(which is not allowed by current LR syntax because the lookahead-pattern has a terminal and a nonterminal), and
[lookahead &notin; InvalidBracedQuantifier]
(which is okay syntactically, but is not allowed by LR 'semantics' because InvalidBracedQuantifier expands to an infinite set).

However, the resulting LRs section feels very ad hoc and brittle, so in commit 3, I restructured lookahead restrictions to make the syntax + semantics more uniform. (See notes below.) This commit is also independent of this PR and could be pulled out. But it's more disruptive than the first commit, and isn't as strongly motivated unless you want the disambiguations of the main commit.

@jmdyck jmdyck marked this pull request as ready for review August 13, 2024 02:27
@jmdyck
Copy link
Collaborator Author

jmdyck commented Aug 13, 2024

Notes for commit 1:

The existing "Lookahead Restrictions" section has a couple problems.

(1)
It uses the phrase "token sequence" a lot. This makes sense in the context of syntactic productions, but not in the context of lexical productions, where lookahead restrictions also appear.

I replaced "the immediately following [input] token sequence" with "the lookahead" (defined somewhat more generally), and replaced other occurrences of "token sequence" with "terminal sequence" or "sequence of terminals".

(2a)
For the constituents of lookahead-patterns, it only admits terminals and nonterminals. But since the merge of PR #692, we've had lookahead restrictions involving "[no LineTerminator here]", which isn't a terminal or nonterminal.

(This is kind of difficult to fix with the way things are currently set up, but I'll address it in commit 3.)

(2b)
Since LineTerminators are not tokens, a lookahead restriction involving "[no LineTerminator here]" cannot be satisfied simply by looking at a stream of input tokens.

I've somewhat addressed this by saying that for syntactic productions, "the items of the lookahead are input elements (mainly tokens)".

(2c)
And in wording like "seq is a prefix of the lookahead", it's problematic to use the verb "is", because seq is not simply 'equal to' some portion of the input, rather there's something more involved going on.

I replaced such uses of "is" with "matches", since the intended semantics are basically the same as how a production 'matches' some input. This is brought out more in commit 3.


I also:

  • factored out "the [containing] production may only be used if"; and
  • brought together the 4 forms of lookahead restriction into a list, to make them easier to see, and easier to compare & contrast.

@jmdyck
Copy link
Collaborator Author

jmdyck commented Aug 13, 2024

Notes for commit 3:

Instead of basing the semantics of lookahead restrictions on terminal sequences and sets of terminal sequences, I base it on parsing. You need this to properly support a lookahead pattern that contains "[no LineTerminator here]", because its meaning can't be expressed as a terminal sequence. But mainly I wanted to make the syntax and semantics of lookahead restrictions more uniform, less ad hoc.

This also allows to collapse the == and &isin; operators into a single 'match' (or technically, 'is matched by') operator. For this, I'm suggesting ~, to echo the match operator in Awk. (Another possibility might be =~ as in Perl.)

Similarly, != and &notin; can be collapsed into !~.

In a few cases, we can merge adjacent lookahead restrictions. E.g.,

[lookahead &notin; HexDigit] [lookahead != `}`]

can become

[lookahead !~ HexDigit | `}`]

spec.html Outdated
Comment on lines 825 to 833
<li>“[lookahead = _seq_]”: _seq_ matches a prefix of the lookahead</li>
<li>“[lookahead ≠ _seq_]”: _seq_ does <em>not</em> match any prefix of the lookahead</li>
<li>“[lookahead ∈ _set_]”: some element of _set_ matches a prefix of the lookahead</li>
<li>“[lookahead ∉ _set_]”: <em>no</em> element of _set_ matches any prefix of the lookahead</li>
</ul>
<p>In the above:</p>
<ul>
<li>_seq_ is a sequence of terminal symbols from the production's grammar; and</li>
<li>_set_ is a finite non-empty set of terminal sequences. For convenience, _set_ can also be written as a nonterminal from the production's grammar, in which case it represents the set of all terminal sequences to which that nonterminal could expand. It is considered an editorial error if the nonterminal could expand to infinitely many distinct terminal sequences.</li>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seq and set look too similar. I'd rename seq to sequence.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The third commit merges them into _pattern_.


IdentityEscape[UnicodeMode, NamedCaptureGroups] ::
[+UnicodeMode] SyntaxCharacter
[+UnicodeMode] `/`
[~UnicodeMode] SourceCharacterIdentityEscape[?NamedCaptureGroups]

SourceCharacterIdentityEscape[NamedCaptureGroups] ::
[~NamedCaptureGroups] SourceCharacter but not `c`
[+NamedCaptureGroups] SourceCharacter but not one of `c` or `k`
[~NamedCaptureGroups] SourceCharacter but not one of `0` `1` `2` `3` `4` `5` `6` `7` `c` `f` `n` `r` `t` `v` `d` `s` `w` `D` `S` `W`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't something like this be better?

Suggested change
[~NamedCaptureGroups] SourceCharacter but not one of `0` `1` `2` `3` `4` `5` `6` `7` `c` `f` `n` `r` `t` `v` `d` `s` `w` `D` `S` `W`
[~NamedCaptureGroups] [lookahead &notin; OctalDigit] [lookahead &notin; ControlEscape] [lookahead &notin; CharacterClassEscape[?UnicodeMode]] SourceCharacter

It'd be less repetitive, more robust to change, and more self-explanatory.

Come to think of it, most of the "but not"s in the grammar can just use lookaheads instead.

Copy link
Collaborator Author

@jmdyck jmdyck Aug 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't something like this be better?

(Note that your suggestion is missing the c exclusion.)

It'd be less repetitive, more robust to change, and more self-explanatory.

I didn't make this very clear, but at that point in the commit, I'm actually suggesting two different ways of expressing the right-hand sides of the SourceCharacterIdentityEscape production. You're commenting on a line from the first way, but the corresponding line from the second way is

[~NamedCaptureGroups] SourceCharacter but not one of OctalDigit or ControlEscape or CharacterClassEscape or `c`

which has all the benefits of your suggestion with even less repetition.

Mind you, in third-commit syntax, your suggestion could be reduced to

[~NamedCaptureGroups] [lookahead !~ OctalDigit | ControlEscape | CharacterClassEscape | `c`] SourceCharacter

which is on par with my second way.

So it comes down to a preference between "but not" vs "lookahead". Personally, I think it's a bit easier to get the general case and then the exceptions, rather than the other way round.

Come to think of it, most of the "but not"s in the grammar can just use lookaheads instead.

I think they all could. The "but not" phrase goes back to ES1, so when ES3 introduced the "lookahead" phrase, I think they could have converted all the "but not"s to "lookahead"s. Maybe they didn't realize they could, or maybe they wanted to minimize change, maybe they just preferred to leave the "but not"s as is, or maybe something else.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants