-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Normative] Add RegExp lookbehind assertions #1029
Conversation
db464a7
to
bdd31fd
Compare
Thank you so much for writing this PR. To give credit where credit is due, the spec text here was actually written by @claudepache and @IgnoredAmbience ; I just did the presentations in the committee . Review coming soon (maybe those folks will want to take a look as well). |
spec.html
Outdated
1. Set _cap_[_parenIndex_+1] to _s_. | ||
1. Let _z_ be the State (_ye_, _cap_). | ||
1. Call _c_(_z_) and return its result. | ||
1. Call _m_(_x_, _d_) and return its result. | ||
</emu-alg> | ||
<p>The production <emu-grammar>Atom :: `(` `?` `:` Disjunction `)`</emu-grammar> evaluates as follows:</p> | ||
<emu-alg> | ||
1. Return the Matcher that is the result of evaluating |Disjunction|. | ||
1. Return the Matcher that is the result of evaluating |Disjunction| with argument _direction_. | ||
</emu-alg> | ||
|
||
<!-- es6num="21.2.2.8.1" --> | ||
<emu-clause id="sec-runtime-semantics-charactersetmatcher-abstract-operation" aoid="CharacterSetMatcher"> | ||
<h1>Runtime Semantics: CharacterSetMatcher ( _A_, _invert_ )</h1> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like the third argument should be added here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ack
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just these couple nits, generally LGTM
spec.html
Outdated
1. Return an internal Matcher closure that takes two arguments, a State _x_ and a Continuation _c_, and performs the following steps when evaluated: | ||
1. Let _d_ be a Continuation that takes a State argument _y_ and returns the result of calling _m2_(_y_, _c_). | ||
1. Call _m1_(_x_, _d_) and return its result. | ||
1. Else, _direction_ is equal to -1. | ||
1. Return an internal Matcher closure that takes two arguments, a State _x_ and a Continuation _c_, and performs the following steps when evaluated: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Indent this line and the following two lines.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ack
bdd31fd
to
ef8cdd3
Compare
Thanks for the thorough review, @littledan! Feedback addressed.
Hmm… Should I change the commit author? Or does it make more sense to mention the authors in the commit message? |
I don't have a good idea of sharing credit here; these things are always the work of a lot of people. If I were you, I'd just leave Mathias as the commit author, and give callouts to @hashseed, @claudepache, @bterlson and everyone else who contributed when you get a chance, like in a blog post or talk. (This is a lot of work though!) |
ef8cdd3
to
5c0b518
Compare
spec.html
Outdated
1. Let _s_ be a new List whose characters are the characters of _Input_ at indices _xe_ (inclusive) through _ye_ (exclusive). | ||
1. If _direction_ is equal +1, then | ||
1. Let _s_ be a fresh List whose characters are the characters of _Input_ at indices _xe_ (inclusive) through _ye_ (exclusive). | ||
1. Else, let _direction_ is equal to -1. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This wording is somewhat unusual. Maybe better wording would be, "
- If direction is equal to +1, then
- Let s be a fresh List whose characters are the characters of Input at indices xe (inclusive) through ye (exclusive).
- Else,
- Assert: direction is equal to -1
- Let s be a fresh List whose characters are the characters of Input at indices ye (inclusive) through xe (exclusive).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This wording is somewhat unusual.
I would even say, it’s plain typos: “If direction is equal to +1” and “Else, let direction is equal to -1”.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a few typos.
spec.html
Outdated
1. Call _m1_(_x_, _d_) and return its result. | ||
1. Evaluate |Alternative| with argument _direction_ to obtain a Matcher _m1_. | ||
1. Evaluate |Term| with argument _direction_ to obtain a Matcher _m2_. | ||
1. If _direction_ is equal to +1, then, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no comma after “then”
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@claudepache Fixed!
spec.html
Outdated
1. Return an internal Matcher closure that takes two arguments, a State _x_ and a Continuation _c_, and performs the following steps when evaluated: | ||
1. Let _d_ be a Continuation that takes a State argument _y_ and returns the result of calling _m2_(_y_, _c_). | ||
1. Call _m1_(_x_, _d_) and return its result. | ||
1. Else, _direction_ is equal to -1. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
“Else direction is equal to -1,” (no comma after “else”, a comma at the end of the line).
Alternatively:
- Else,
- Assert: direction is equal to -1.
spec.html
Outdated
1. Let _parenIndex_ be the number of left-capturing parentheses in the entire regular expression that occur to the left of this |Atom|. This is the total number of <emu-grammar>Atom :: `(` Disjunction `)`</emu-grammar> Parse Nodes prior to or enclosing this |Atom|. | ||
1. Return an internal Matcher closure that takes two arguments, a State _x_ and a Continuation _c_, and performs the following steps: | ||
1. Let _d_ be an internal Continuation closure that takes one State argument _y_ and performs the following steps: | ||
1. Let _cap_ be a copy of _y_'s _captures_ List. | ||
1. Let _xe_ be _x_'s _endIndex_. | ||
1. Let _ye_ be _y_'s _endIndex_. | ||
1. Let _s_ be a new List whose characters are the characters of _Input_ at indices _xe_ (inclusive) through _ye_ (exclusive). | ||
1. If _direction_ is equal +1, then |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
“is equal to +1,”
spec.html
Outdated
1. Let _s_ be a new List whose characters are the characters of _Input_ at indices _xe_ (inclusive) through _ye_ (exclusive). | ||
1. If _direction_ is equal +1, then | ||
1. Let _s_ be a fresh List whose characters are the characters of _Input_ at indices _xe_ (inclusive) through _ye_ (exclusive). | ||
1. Else, let _direction_ is equal to -1. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
“Else direction is equal to -1,” (no comma after “else”, no extraneous “let”, a comma at the end of the line).
Alternatively:
- Else,
- Assert: direction is equal to -1.
6c43916
to
ec63b15
Compare
I also expect this feature very much!!! |
@claudepache I’ve addressed your feedback. Please take another look and update your review status if everything seems okay now. |
LGTM |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Except for one remaining correction, LGTM.
spec.html
Outdated
1. Let _s_ be a new List whose characters are the characters of _Input_ at indices _xe_ (inclusive) through _ye_ (exclusive). | ||
1. If _direction_ is equal to +1, then | ||
1. Let _s_ be a fresh List whose characters are the characters of _Input_ at indices _xe_ (inclusive) through _ye_ (exclusive). | ||
1. Else, let _direction_ is equal to -1. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Has to be:
1. Else,
1. Assert: _direction_ is equal to -1.
Optionally: It is implicit in the formulation of these steps that the numerical order of xe and ye is bound to the value of direction (i.e., xe ≤ ye if direction is +1, etc.) I leave to your judgment to decide whether it adds value to make that relation explicit, i.e.
1. If _direction_ is equal to +1, then
1. Assert: _xe_ ≤ _ye_.
1. Let _s_ be a fresh List whose characters are the characters of _Input_ at indices _xe_ (inclusive) through _ye_ (exclusive).
1. Else,
1. Assert: _direction_ is equal to -1.
1. Assert: _ye_ ≤ _xe_.
1. Let _s_ be a fresh List whose characters are the characters of _Input_ at indices _ye_ (inclusive) through _xe_ (exclusive).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, and done. I agree being explicit makes it easier to follow.
</emu-alg> | ||
<emu-note> | ||
<p>Consecutive |Term|s try to simultaneously match consecutive portions of _Input_. If the left |Alternative|, the right |Term|, and the sequel of the regular expression all have choice points, all choices in the sequel are tried before moving on to the next choice in the right |Term|, and all choices in the right |Term| are tried before moving on to the next choice in the left |Alternative|.</p> | ||
<p>Consecutive |Term|s try to simultaneously match consecutive portions of _Input_. When _direction_ is equal to +1, if the left |Alternative|, the right |Term|, and the sequel of the regular expression all have choice points, all choices in the sequel are tried before moving on to the next choice in the right |Term|, and all choices in the right |Term| are tried before moving on to the next choice in the left |Alternative|. When _direction_ is equal to -1, the evaluation order of |Alternative| and |Term| are reversed.</p> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it clear here to use "+1" and "-1", and then also to use "left" and "right", which might not be correct in an RTL context?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMHO it’s clear that this refers to the grammar productions, which are always in a left-to-right order.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of -1/+1, what would you think about using the strings "left" and "right"? (obv they're merely spec artifacts either way)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of -1/+1, what would you think about using the strings "left" and "right"?
No, it is not “left” and “right”, it is “in the direction of the beginning of the string” (-1) and “in the direction of the end of the string” (+1), regardless of the directionality of its characters.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think that would be more clear? I’m in favor of making things more readable but personally don’t see how this change does that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, ~forwards~
and ~backwards~
(in other parts of the spec, we have ~failure~
, ~empty~
) is probably better than +1
and -1
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand what's unclear about +1/-1. In an RTL context, it's still +1 in terms of indexing the string.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@littledan if a regex is in unicode mode, it might advance +2 when it finds a surrogate pair, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(It'll only be advancing +-1 code point...) I don't feel strongly about this. I'm fine with switching to ~forwards~
and ~backwards~
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if a regex is in unicode mode, it might advance +2 when it finds a surrogate pair, no?
In the context of RegExp matching, “one step forth/back” is always literally +1/-1 in terms of “indexing” the string. The difference is that, in the presence of the u-flag, the RegExp algorithm use an alternative way to “index” the string.
If I had to rewrite the algorithms now, I would probably use “forwards/backwards” instead of “+1/-1”, because it is more descriptive. But I don’t think it adds much clarity, because +1/-1 couldn’t mean anything else... unless you ask yourself inappropriate questions regarding RTL and/or astral characters, in which case you’ll have trouble to understand the algorithms anyway.
8d3ac1e
to
9e29ec4
Compare
Note: my effort in this PR was to port the changes from an older version of the spec to the most recent version for the proposal. I've no in-depth views on the content, unfortunately - hence removing myself from the review list. |
a0b1ee8
to
f46f101
Compare
f46f101
to
920022c
Compare
Rebased against |
I have a question here. This PR didn't look to update B.1.4 Regular Expressions Patterns section. Is there a plan to update the section? |
…rn evaluate semantics in annex b Bugfix: When speccing lookbehind assertions (PR tc39#1029), a `direction` parameter were added, in particular to the `evaluate` semantics of Atom productions and to the CharacterSetMatcher abstract operation. The ExtendedAtom productions found in Annex B were forgotten to be amended in the same way.
…rn evaluate semantics in Annex B (tc39#1675) Bugfix: When speccing lookbehind assertions (PR tc39#1029), a `direction` parameter were added, in particular to the `evaluate` semantics of Atom productions and to the CharacterSetMatcher abstract operation. The ExtendedAtom productions found in Annex B were forgotten to be amended in the same way. Fixes tc39#1674.
@littledan asked me to prepare the PR for the proposal adding lookbehind assertions to ECMAScript regular expressions.
Proposal repo: https://github.com/tc39/proposal-regexp-lookbehind by Gorkem Yakin, Nozomu Katō, and @littledan.
@littledan, I’ve added you as the commit author — I hope I’m not misrepresenting you :)