Editorial: Eliminate order-disambiguation from Annex B Pattern-grammar #2445
Analysis [I wrote this before the grammatical parameters B.1.4 says:
That is, if we consider the B.1.4 grammar under [+U], it should duplicate the 22.2.1 grammar under [+U], which we can assume does not have ambiguities. Thus, we only need to look for ambiguities in the B.1.4 grammar under [~U]. This comment will go to each production in B.1.4, and:
I'll consider the productions bottom-up (roughly the reverse of their order in the spec), so that our knowledge builds up as we go.
forms:
There's no overlap, so no ambiguities.
For any given setting of [N], there's only one alternative, so there can't be an ambiguity. Note that SourceCharacter is any Unicode code point, so below, I'll paraphrase SourceCharacterIdentityEscape as "any char except
forms:
Under [~U]:
forms:
It looks like there might be an ambiguity between alt3 and alt6 on
The remaining ambiguities are between alt7 and everything else (except alt2).
Consider alt1 (ControlEscape). Clearly, if the current input begins with [fnrtv], this alternative will match, and alt7 will never get a chance. So we can resolve this ambiguity simply by excluding [fnrtv] from IdentityEscape. We could do this by appending "but not one of
Similarly, if the current input begins with [0-7], it's always the case that either alt3 or alt6 will match (though this isn't obvious), so we can resolve this ambiguity by excluding [0-7] from IdentityEscape.
However, this approach doesn't work for alt4 and alt5. If the current input begins with
So IdentityEscape must still be free to recognize
Note that HexEscapeSequence derives a finite set of three-character sequences, and RegExpUnicodeEscapeSequence derives a rather large but still finite set of sequences of various lengths, so this is still within the definition of lookahead-constraints in 5.1.5 Grammar Notation. Note also that, under [+U], these lookahead-constraints are satisfied automatically. A quick summary for below: CharacterEscape derives forms that start with any char except
forms:
(Note that, for alt4, CharacterClassEscape only includes UnicodeProperty stuff under [+U], which we can ignore.)
alt1, alt3, alt4 are disjoint, so the ambiguities all involve alt5 (CharacterEscape).
alt3 and alt5 are disjoint, because alt3 only derives
alt5 has ambiguities with alt1 and alt4, which we could resolve with:
CharacterEscape appears in RHS of both ClassEscape and AtomEscape, so we can't push this exclusion into the definition of CharacterEscape unless the two uses agree. But looking ahead, we see that AtomEscape also has a CharacterClassEscape alt followed by a CharacterEscape alt, so they do agree on that exclusion. That is, alt5 here will be:
and we can push the CharacterClassEscape [dswDSW] exclusion down into CharacterEscape, and thence into IdentityEscape, and thence into SourceCharacterIdentityEscape. A quick summary for below: ClassEscape derives forms that start with any char except
forms:
alt1 can't be But alt2 and alt3 conflict if the current input starts with
This goes slightly outside what 5.1.5 Grammar Notation allows, in that (Alternatively, one could say:
but I think that might be worse.)
forms:
alt2, alt3, alt5 are disjoint, so alt4 is the source of all ambiguities.
alt4 and alt5 are disjoint: the "except
alt3 conflicts with alt4, but that's easy to deal with, and in fact the change to CharacterEscape proposed above under ClassEscape already took care of it.
alt2 also conflicts with alt4, when the current input starts with [1-9]. The resolution is (roughly) to prepend "[lookahead isnt alt2]" to alt4, but there are various ways it could be expressed. I wound up defining
and then tweaking AtomEscape to
(I split the CharacterEscape alternative into [+U] and [~U] versions because the reference to ConstrainedDecimalEscape wouldn't have made sense under [+U].) Note that, while DecimalEscape derives an infinite set of terminal-sequences, the "but only if" limits it to a finite set, so A quick summary for below: AtomEscape derives forms that start with any char, and then maybe have more chars.
Only one alternative, so no ambiguities.
The three alternatives are disjoint, so no ambiguities.
forms:
alt2 conflicts with alt3 on
so we can resolve this by changing alt3 to:
(similar to the change in ClassAtomNoDash). alt7 conflicts with alt8 on
The problem with this is that InvalidBracedQuantifier derives an infinite set of terminal-sequences (because its DecimalDigits can be arbitrarily long). However, checking whether the lookahead matches InvalidBracedQuantifier doesn't seem unreasonable, so perhaps we can relax 5.1.5's restriction on the constraint-set from finite set to regular set. A quick summary for below: ExtendedAtom derives forms that start with (any char except
Alternatives are disjoint, so no ambiguities.
forms:
Alternatives are disjoint, so no ambiguities.
forms:
Consider alt6 and alt7:
There's clearly a shift-reduce conflict after ExtendedAtom, but is there an ambiguity? If the text matching Quantifier is (or starts with) [*+?], there's no ambiguity, because (in this context) that can only be a Quantifier. But if the text matching Quantifier starts with '{', then formally there's an ambiguity, e.g.
or
But any occurrence of InvalidBracedQuantifier is defined (by Early Error rule) to be a Syntax Error, so I'm guessing that this doesn't count as an ambiguity as far as the spec is concerned. Therefore, we don't need to disambiguate between alt6 and alt7.
Similarly, alt4 conflicts with alt5 on QuantifiableAssertion followed by Quantifier (or not), but I'm assuming the spec doesn't consider it an ambiguity.
It looks like alt4 might conflict with alt6/alt7, but no, ExtendedAtom can't start with
Lastly, alt5 conflicts with alt6/alt7 on
to
(We could push the exclusion down into AtomEscape[~U]:
but I'm not sure that would be an improvement.)
(force-pushed to rebase to master + resolve merge conflicts from #2411)
See comments in PR tc39#2445.
... from terminal sequences and sets of terminal sequences to parsing.
Okay, I've finally made the changes necessary to take this out of Draft. The difficulty was that disambiguating the annex B regexp grammar requires extending the syntax + semantics of lookahead restrictions (LRs), and that was tough to do given the way that they're currently defined in the "Lookahead Restrictions" section.

So, in commit 1, I started by rewriting some of the LRs section. It says basically the same thing as the status quo, but fixes some problems, and refactors things to make it easier to extend. (See notes below.) It's independent of the main point of this PR, so I could pull it out as a separate PR if you want.

Then, in commit 2 (the main commit, with all the disambiguating), I modified the LRs section to allow for the LRs I added to Annex B, e.g.,

However, the resulting LRs section feels very ad hoc and brittle, so in commit 3, I restructured lookahead restrictions to make the syntax + semantics more uniform. (See notes below.) This commit is also independent of this PR and could be pulled out. But it's more disruptive than the first commit, and isn't as strongly motivated unless you want the disambiguations of the main commit.
Notes for commit 1:
The existing "Lookahead Restrictions" section has a couple problems.
(1) I replaced "the immediately following [input] token sequence" with "the lookahead" (defined somewhat more generally), and replaced other occurrences of "token sequence" with "terminal sequence" or "sequence of terminals".
(2a) (This is kind of difficult to fix with the way things are currently set up, but I'll address it in commit 3.)
(2b) I've somewhat addressed this by saying that for syntactic productions, "the items of the lookahead are input elements (mainly tokens)".
(2c) I replaced such uses of "is" with "matches", since the intended semantics are basically the same as how a production 'matches' some input. This is brought out more in commit 3.
I also:
Notes for commit 3:
Instead of basing the semantics of lookahead restrictions on terminal sequences and sets of terminal sequences, I base them on parsing. You need this to properly support a lookahead pattern that contains "[no LineTerminator here]", because its meaning can't be expressed as a terminal sequence. But mainly I wanted to make the syntax and semantics of lookahead restrictions more uniform and less ad hoc.
This also allows us to collapse the
Similarly,
In a few cases, we can merge adjacent lookahead restrictions. E.g.,
can become
spec.html
Outdated
<li>“[lookahead = _seq_]”: _seq_ matches a prefix of the lookahead</li>
<li>“[lookahead ≠ _seq_]”: _seq_ does <em>not</em> match any prefix of the lookahead</li>
<li>“[lookahead ∈ _set_]”: some element of _set_ matches a prefix of the lookahead</li>
<li>“[lookahead ∉ _set_]”: <em>no</em> element of _set_ matches any prefix of the lookahead</li>
</ul>
<p>In the above:</p>
<ul>
<li>_seq_ is a sequence of terminal symbols from the production's grammar; and</li>
<li>_set_ is a finite non-empty set of terminal sequences. For convenience, _set_ can also be written as a nonterminal from the production's grammar, in which case it represents the set of all terminal sequences to which that nonterminal could expand. It is considered an editorial error if the nonterminal could expand to infinitely many distinct terminal sequences.</li>
`seq` and `set` look too similar. I'd rename `seq` to `sequence`.
The third commit merges them into `_pattern_`.
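As a reading aid for the four lookahead forms quoted in this thread, here is a minimal Python sketch. It assumes terminals are single characters and the lookahead is the remaining input as a string; the function names are hypothetical and do not come from the spec or the PR.

```python
from typing import Iterable


def lookahead_eq(seq: str, lookahead: str) -> bool:
    """[lookahead = seq]: seq matches a prefix of the lookahead."""
    return lookahead.startswith(seq)


def lookahead_ne(seq: str, lookahead: str) -> bool:
    """[lookahead != seq]: seq does not match any prefix of the lookahead."""
    return not lookahead.startswith(seq)


def lookahead_in(seqs: Iterable[str], lookahead: str) -> bool:
    """[lookahead in set]: some element of the set matches a prefix."""
    return any(lookahead.startswith(seq) for seq in seqs)


def lookahead_not_in(seqs: Iterable[str], lookahead: str) -> bool:
    """[lookahead not in set]: no element of the set matches a prefix."""
    return not lookahead_in(seqs, lookahead)


# A finite set of terminal sequences, here the ten decimal digits.
DECIMAL_DIGIT = set("0123456789")

assert lookahead_eq("{", "{1,2}")
assert lookahead_ne("{", "a{1,2}")
assert lookahead_in(DECIMAL_DIGIT, "7abc")
assert lookahead_not_in(DECIMAL_DIGIT, "abc7")
```

Because every check is a prefix test against a finite set, this only models the finite-set wording quoted above, not the parsing-based semantics of the third commit.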
IdentityEscape[UnicodeMode, NamedCaptureGroups] ::
  [+UnicodeMode] SyntaxCharacter
  [+UnicodeMode] `/`
  [~UnicodeMode] SourceCharacterIdentityEscape[?NamedCaptureGroups]

SourceCharacterIdentityEscape[NamedCaptureGroups] ::
  [~NamedCaptureGroups] SourceCharacter but not `c`
  [+NamedCaptureGroups] SourceCharacter but not one of `c` or `k`
  [~NamedCaptureGroups] SourceCharacter but not one of `0` `1` `2` `3` `4` `5` `6` `7` `c` `f` `n` `r` `t` `v` `d` `s` `w` `D` `S` `W`
Wouldn't something like this be better?
[~NamedCaptureGroups] SourceCharacter but not one of `0` `1` `2` `3` `4` `5` `6` `7` `c` `f` `n` `r` `t` `v` `d` `s` `w` `D` `S` `W`
[~NamedCaptureGroups] [lookahead ∉ OctalDigit] [lookahead ∉ ControlEscape] [lookahead ∉ CharacterClassEscape[?UnicodeMode]] SourceCharacter |
It'd be less repetitive, more robust to change, and more self-explanatory.
Come to think of it, most of the "but not"s in the grammar can just use lookaheads instead.
> Wouldn't something like this be better?

(Note that your suggestion is missing the `c` exclusion.)

> It'd be less repetitive, more robust to change, and more self-explanatory.

I didn't make this very clear, but at that point in the commit, I'm actually suggesting two different ways of expressing the right-hand sides of the `SourceCharacterIdentityEscape` production. You're commenting on a line from the first way, but the corresponding line from the second way is
[~NamedCaptureGroups] SourceCharacter but not one of OctalDigit or ControlEscape or CharacterClassEscape or `c`
which has all the benefits of your suggestion with even less repetition.
Mind you, in third-commit syntax, your suggestion could be reduced to
[~NamedCaptureGroups] [lookahead !~ OctalDigit | ControlEscape | CharacterClassEscape | `c`] SourceCharacter
which is on par with my second way.
So it comes down to a preference between "but not" vs "lookahead". Personally, I think it's a bit easier to get the general case and then the exceptions, rather than the other way round.
> Come to think of it, most of the "but not"s in the grammar can just use lookaheads instead.

I think they all could. The "but not" phrase goes back to ES1, so when ES3 introduced the "lookahead" phrase, I think they could have converted all the "but not"s to "lookahead"s. Maybe they didn't realize they could, maybe they wanted to minimize change, maybe they just preferred to leave the "but not"s as is, or maybe something else.
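To check that the explicit exclusion list and the nonterminal-based wording in this thread name the same characters, here is a small Python sketch. It assumes non-Unicode mode without named capture groups, treats each nonterminal as the set of single characters it derives in that mode, and uses hypothetical names throughout.

```python
# Single-character expansions under [~UnicodeMode] (assumed for this sketch).
OCTAL_DIGIT = set("01234567")
CONTROL_ESCAPE = set("fnrtv")
# \p and \P are only part of CharacterClassEscape under [+UnicodeMode].
CHARACTER_CLASS_ESCAPE = set("dswDSW")

# "but not one of OctalDigit or ControlEscape or CharacterClassEscape or `c`"
DERIVED_EXCLUSIONS = OCTAL_DIGIT | CONTROL_ESCAPE | CHARACTER_CLASS_ESCAPE | {"c"}

# The explicit list from the first way of writing the production.
EXPLICIT_EXCLUSIONS = set("01234567" "c" "fnrtv" "dswDSW")

# Both spellings exclude exactly the same characters.
assert DERIVED_EXCLUSIONS == EXPLICIT_EXCLUSIONS


def is_identity_escape_char(ch: str) -> bool:
    """True if ch could follow the backslash as a SourceCharacterIdentityEscape
    under [~NamedCaptureGroups], per the exclusion set above."""
    return len(ch) == 1 and ch not in DERIVED_EXCLUSIONS


assert is_identity_escape_char("8")      # not an OctalDigit
assert is_identity_escape_char("k")      # only excluded under [+NamedCaptureGroups]
assert not is_identity_escape_char("c")
assert not is_identity_escape_char("d")
```

The nonterminal-based form stays correct if one of the referenced productions changes, whereas the explicit list would have to be updated by hand.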
B.1.4 Regular Expressions Patterns says:
This PR eliminates these order-dependencies (mostly by inserting equivalent lookahead-constraints).
(I made this a Draft PR because it isn't in a final (mergeable) state. On the other hand, it is ready for review, at least to the extent of deciding whether to pursue it.)

The basic idea is that an order-disambiguated production such as:
can be transformed into an equivalent "normal" production:
Of course, applied naively, this would be verbose and hard to read. (Some productions have 9 alternatives!) So instead, we only insert lookahead-constraints (or other exclusions) where an ambiguity actually exists.
Also, there's the risk that an alt might be grammatically more complex than we want to have in a lookahead-constraint. In practice, it looks like some are more complex than we currently allow, but perhaps not unreasonably so.
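The naive transformation described above can be sketched in Python. This is a toy model under stated assumptions, not the PR's grammar: the three alternatives are hypothetical, each is a regex tried against the start of the remaining input, and "matches" means a prefix match only.

```python
import re

# Hypothetical toy alternatives; the last one deliberately overlaps the others.
ALTERNATIVES = [
    ("ControlEscape", re.compile(r"[fnrtv]")),
    ("OctalStart", re.compile(r"[0-7]")),
    ("IdentityEscape", re.compile(r".", re.DOTALL)),
]


def ordered_choice(text):
    """Order-disambiguated reading: the first alternative that matches wins."""
    for name, pattern in ALTERNATIVES:
        if pattern.match(text):
            return name
    return None


def guarded_choices(text):
    """Order-free reading: alternative i carries a negative lookahead for
    every earlier alternative, so at most one alternative can apply."""
    applicable = []
    for index, (name, pattern) in enumerate(ALTERNATIVES):
        earlier = ALTERNATIVES[:index]
        blocked = any(p.match(text) for _, p in earlier)
        if pattern.match(text) and not blocked:
            applicable.append(name)
    return applicable


for sample in ["n", "7", "8", "x"]:
    # Both readings agree, and the guarded one is never ambiguous.
    assert guarded_choices(sample) == [ordered_choice(sample)]
```

In this toy, only `IdentityEscape` actually needs its guards, since the first two alternatives are already disjoint. That mirrors the point about inserting constraints only where an ambiguity exists.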