Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Normative] Add RegExp lookbehind assertions #1029

Merged
merged 1 commit into from
Jan 26, 2018

Conversation

mathiasbynens
Copy link
Member

@mathiasbynens mathiasbynens commented Nov 12, 2017

@littledan asked me to prepare the PR for the proposal adding lookbehind assertions to ECMAScript regular expressions.

Proposal repo: https://github.com/tc39/proposal-regexp-lookbehind by Gorkem Yakin, Nozomu Katō, and @littledan.

@littledan, I’ve added you as the commit author — I hope I’m not misrepresenting you :)

@mathiasbynens mathiasbynens added the pending stage 4 This proposal has not yet achieved stage 4, but may otherwise be ready to merge. label Nov 12, 2017
@littledan
Copy link
Member

littledan commented Nov 16, 2017

Thank you so much for writing this PR. To give credit where credit is due, the spec text here was actually written by @claudepache and @IgnoredAmbience ; I just did the presentations in the committee . Review coming soon (maybe those folks will want to take a look as well).

spec.html Outdated
1. Set _cap_[_parenIndex_+1] to _s_.
1. Let _z_ be the State (_ye_, _cap_).
1. Call _c_(_z_) and return its result.
1. Call _m_(_x_, _d_) and return its result.
</emu-alg>
<p>The production <emu-grammar>Atom :: `(` `?` `:` Disjunction `)`</emu-grammar> evaluates as follows:</p>
<emu-alg>
1. Return the Matcher that is the result of evaluating |Disjunction|.
1. Return the Matcher that is the result of evaluating |Disjunction| with argument _direction_.
</emu-alg>

<!-- es6num="21.2.2.8.1" -->
<emu-clause id="sec-runtime-semantics-charactersetmatcher-abstract-operation" aoid="CharacterSetMatcher">
<h1>Runtime Semantics: CharacterSetMatcher ( _A_, _invert_ )</h1>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like the third argument should be added here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack

Copy link
Member

@littledan littledan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just these couple nits, generally LGTM

spec.html Outdated
1. Return an internal Matcher closure that takes two arguments, a State _x_ and a Continuation _c_, and performs the following steps when evaluated:
1. Let _d_ be a Continuation that takes a State argument _y_ and returns the result of calling _m2_(_y_, _c_).
1. Call _m1_(_x_, _d_) and return its result.
1. Else, _direction_ is equal to -1.
1. Return an internal Matcher closure that takes two arguments, a State _x_ and a Continuation _c_, and performs the following steps when evaluated:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Indent this line and the following two lines.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack

@mathiasbynens
Copy link
Member Author

Thanks for the thorough review, @littledan! Feedback addressed.

To give credit where credit is due, the spec text here was actually written by @claudepache and @IgnoredAmbience

Hmm… Should I change the commit author? Or does it make more sense to mention the authors in the commit message?

@littledan
Copy link
Member

I don't have a good idea of sharing credit here; these things are always the work of a lot of people. If I were you, I'd just leave Mathias as the commit author, and give callouts to @hashseed, @claudepache, @bterlson and everyone else who contributed when you get a chance, like in a blog post or talk. (This is a lot of work though!)

spec.html Outdated
1. Let _s_ be a new List whose characters are the characters of _Input_ at indices _xe_ (inclusive) through _ye_ (exclusive).
1. If _direction_ is equal +1, then
1. Let _s_ be a fresh List whose characters are the characters of _Input_ at indices _xe_ (inclusive) through _ye_ (exclusive).
1. Else, let _direction_ is equal to -1.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This wording is somewhat unusual. Maybe better wording would be, "

  1. If direction is equal to +1, then
    1. Let s be a fresh List whose characters are the characters of Input at indices xe (inclusive) through ye (exclusive).
  2. Else,
    1. Assert: direction is equal to -1
    2. Let s be a fresh List whose characters are the characters of Input at indices ye (inclusive) through xe (exclusive).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This wording is somewhat unusual.

I would even say, it’s plain typos: “If direction is equal to +1” and “Else, let direction is equal to -1”.

Copy link
Contributor

@claudepache claudepache left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few typos.

spec.html Outdated
1. Call _m1_(_x_, _d_) and return its result.
1. Evaluate |Alternative| with argument _direction_ to obtain a Matcher _m1_.
1. Evaluate |Term| with argument _direction_ to obtain a Matcher _m2_.
1. If _direction_ is equal to +1, then,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no comma after “then”

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@claudepache Fixed!

spec.html Outdated
1. Return an internal Matcher closure that takes two arguments, a State _x_ and a Continuation _c_, and performs the following steps when evaluated:
1. Let _d_ be a Continuation that takes a State argument _y_ and returns the result of calling _m2_(_y_, _c_).
1. Call _m1_(_x_, _d_) and return its result.
1. Else, _direction_ is equal to -1.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

“Else direction is equal to -1,” (no comma after “else”, a comma at the end of the line).
Alternatively:

  1. Else,
    1. Assert: direction is equal to -1.

spec.html Outdated
1. Let _parenIndex_ be the number of left-capturing parentheses in the entire regular expression that occur to the left of this |Atom|. This is the total number of <emu-grammar>Atom :: `(` Disjunction `)`</emu-grammar> Parse Nodes prior to or enclosing this |Atom|.
1. Return an internal Matcher closure that takes two arguments, a State _x_ and a Continuation _c_, and performs the following steps:
1. Let _d_ be an internal Continuation closure that takes one State argument _y_ and performs the following steps:
1. Let _cap_ be a copy of _y_'s _captures_ List.
1. Let _xe_ be _x_'s _endIndex_.
1. Let _ye_ be _y_'s _endIndex_.
1. Let _s_ be a new List whose characters are the characters of _Input_ at indices _xe_ (inclusive) through _ye_ (exclusive).
1. If _direction_ is equal +1, then
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

“is equal to +1,”

spec.html Outdated
1. Let _s_ be a new List whose characters are the characters of _Input_ at indices _xe_ (inclusive) through _ye_ (exclusive).
1. If _direction_ is equal +1, then
1. Let _s_ be a fresh List whose characters are the characters of _Input_ at indices _xe_ (inclusive) through _ye_ (exclusive).
1. Else, let _direction_ is equal to -1.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

“Else direction is equal to -1,” (no comma after “else”, no extraneous “let”, a comma at the end of the line).
Alternatively:

  1. Else,
    1. Assert: direction is equal to -1.

@mathiasbynens mathiasbynens force-pushed the lookbehind branch 2 times, most recently from 6c43916 to ec63b15 Compare November 17, 2017 16:17
@wangyi7099
Copy link

I also expect this feature very much!!!

@mathiasbynens
Copy link
Member Author

@claudepache I’ve addressed your feedback. Please take another look and update your review status if everything seems okay now.

@littledan
Copy link
Member

LGTM

Copy link
Contributor

@claudepache claudepache left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Except for one remaining correction, LGTM.

spec.html Outdated
1. Let _s_ be a new List whose characters are the characters of _Input_ at indices _xe_ (inclusive) through _ye_ (exclusive).
1. If _direction_ is equal to +1, then
1. Let _s_ be a fresh List whose characters are the characters of _Input_ at indices _xe_ (inclusive) through _ye_ (exclusive).
1. Else, let _direction_ is equal to -1.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Has to be:

1. Else,
    1. Assert: _direction_ is equal to -1.

Optionally: It is implicit in the formulation of these steps that the numerical order of xe and ye is bound to the value of direction (i.e., xeye if direction is +1, etc.) I leave to your judgment to decide whether it adds value to make that relation explicit, i.e.

1. If _direction_ is equal to +1, then
    1. Assert: _xe_ ≤ _ye_.
    1. Let _s_ be a fresh List whose characters are the characters of _Input_ at indices _xe_ (inclusive) through _ye_ (exclusive).
1. Else,
    1. Assert: _direction_ is equal to -1.
    1. Assert: _ye_ ≤ _xe_.
    1. Let _s_ be a fresh List whose characters are the characters of _Input_ at indices _ye_ (inclusive) through _xe_ (exclusive).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, and done. I agree being explicit makes it easier to follow.

</emu-alg>
<emu-note>
<p>Consecutive |Term|s try to simultaneously match consecutive portions of _Input_. If the left |Alternative|, the right |Term|, and the sequel of the regular expression all have choice points, all choices in the sequel are tried before moving on to the next choice in the right |Term|, and all choices in the right |Term| are tried before moving on to the next choice in the left |Alternative|.</p>
<p>Consecutive |Term|s try to simultaneously match consecutive portions of _Input_. When _direction_ is equal to +1, if the left |Alternative|, the right |Term|, and the sequel of the regular expression all have choice points, all choices in the sequel are tried before moving on to the next choice in the right |Term|, and all choices in the right |Term| are tried before moving on to the next choice in the left |Alternative|. When _direction_ is equal to -1, the evaluation order of |Alternative| and |Term| are reversed.</p>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it clear here to use "+1" and "-1", and then also to use "left" and "right", which might not be correct in an RTL context?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO it’s clear that this refers to the grammar productions, which are always in a left-to-right order.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of -1/+1, what would you think about using the strings "left" and "right"? (obv they're merely spec artifacts either way)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of -1/+1, what would you think about using the strings "left" and "right"?

No, it is not “left” and “right”, it is “in the direction of the beginning of the string” (-1) and “in the direction of the end of the string” (+1), regardless of the directionality of its characters.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think that would be more clear? I’m in favor of making things more readable but personally don’t see how this change does that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, ~forwards~ and ~backwards~ (in other parts of the spec, we have ~failure~, ~empty~) is probably better than +1 and -1.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand what's unclear about +1/-1. In an RTL context, it's still +1 in terms of indexing the string.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@littledan if a regex is in unicode mode, it might advance +2 when it finds a surrogate pair, no?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(It'll only be advancing +-1 code point...) I don't feel strongly about this. I'm fine with switching to ~forwards~ and ~backwards~.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if a regex is in unicode mode, it might advance +2 when it finds a surrogate pair, no?

In the context of RegExp matching, “one step forth/back” is always literally +1/-1 in terms of “indexing” the string. The difference is that, in the presence of the u-flag, the RegExp algorithm use an alternative way to “index” the string.

If I had to rewrite the algorithms now, I would probably use “forwards/backwards” instead of “+1/-1”, because it is more descriptive. But I don’t think it adds much clarity, because +1/-1 couldn’t mean anything else... unless you ask yourself inappropriate questions regarding RTL and/or astral characters, in which case you’ll have trouble to understand the algorithms anyway.

@IgnoredAmbience
Copy link
Contributor

Note: my effort in this PR was to port the changes from an older version of the spec to the most recent version for the proposal. I've no in-depth views on the content, unfortunately - hence removing myself from the review list.

@mathiasbynens mathiasbynens force-pushed the lookbehind branch 2 times, most recently from a0b1ee8 to f46f101 Compare January 26, 2018 05:04
@mathiasbynens
Copy link
Member Author

Rebased against master; this involved rewriting the patch to change the (new) BackreferenceMatcher.

@bterlson bterlson merged commit bf8a9be into tc39:master Jan 26, 2018
@mysticatea
Copy link
Contributor

I have a question here.

This PR didn't look to update B.1.4 Regular Expressions Patterns section. Is there a plan to update the section?

claudepache added a commit to claudepache/ecma262 that referenced this pull request Aug 21, 2019
…rn evaluate semantics in annex b

Bugfix: When speccing lookbehind assertions (PR tc39#1029), a `direction` parameter were added, in particular to the `evaluate` semantics of Atom productions and to the  CharacterSetMatcher abstract operation. The ExtendedAtom productions found in Annex B were forgotten to be amended in the same way.
ljharb pushed a commit to claudepache/ecma262 that referenced this pull request Jan 2, 2020
…rn evaluate semantics in Annex B (tc39#1675)

Bugfix: When speccing lookbehind assertions (PR tc39#1029), a `direction` parameter were added, in particular to the `evaluate` semantics of Atom productions and to the  CharacterSetMatcher abstract operation. The ExtendedAtom productions found in Annex B were forgotten to be amended in the same way.

Fixes tc39#1674.
@mathiasbynens mathiasbynens deleted the lookbehind branch August 25, 2022 09:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
pending stage 4 This proposal has not yet achieved stage 4, but may otherwise be ready to merge.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants