[Normative] Add RegExp lookbehind assertions #1029

mathiasbynens · 2017-11-12T10:45:15Z

@littledan asked me to prepare the PR for the proposal adding lookbehind assertions to ECMAScript regular expressions.

Proposal repo: https://github.com/tc39/proposal-regexp-lookbehind by Gorkem Yakin, Nozomu Katō, and @littledan.

@littledan, I’ve added you as the commit author — I hope I’m not misrepresenting you :)
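
For readers new to the feature, here is a minimal sketch of what lookbehind assertions allow. The sample string and the helper name `firstMatch` are illustrative assumptions, not part of the PR; the patterns go through the `RegExp` constructor only so that the snippet does not depend on how a given compiler checks regex literals.

```typescript
// Positive lookbehind: match digits only when they are preceded by "$".
const afterDollar: RegExp = new RegExp("(?<=\\$)\\d+");
// Negative lookbehind: match digits NOT preceded by "$" or by another digit.
const notAfterDollar: RegExp = new RegExp("(?<![$\\d])\\d+");

// Hypothetical helper: returns the first matched substring, or null.
function firstMatch(re: RegExp, text: string): string | null {
  const m = re.exec(text);
  return m === null ? null : m[0];
}

const priced = firstMatch(afterDollar, "cost: $42 for 7 items");       // "42"
const unpriced = firstMatch(notAfterDollar, "cost: $42 for 7 items");  // "7"
```

In both cases the lookbehind itself consumes nothing: only the digits end up in the match.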

littledan · 2017-11-16T21:04:41Z

Thank you so much for writing this PR. To give credit where credit is due, the spec text here was actually written by @claudepache and @IgnoredAmbience; I just did the presentations in the committee. Review coming soon (maybe those folks will want to take a look as well).

littledan · 2017-11-16T21:08:24Z

spec.html

              1. Set _cap_[_parenIndex_+1] to _s_.
              1. Let _z_ be the State (_ye_, _cap_).
              1. Call _c_(_z_) and return its result.
            1. Call _m_(_x_, _d_) and return its result.
        </emu-alg>
        <p>The production <emu-grammar>Atom :: `(` `?` `:` Disjunction `)`</emu-grammar> evaluates as follows:</p>
        <emu-alg>
-          1. Return the Matcher that is the result of evaluating |Disjunction|.
+          1. Return the Matcher that is the result of evaluating |Disjunction| with argument _direction_.
        </emu-alg>

        <!-- es6num="21.2.2.8.1" -->
        <emu-clause id="sec-runtime-semantics-charactersetmatcher-abstract-operation" aoid="CharacterSetMatcher">
          <h1>Runtime Semantics: CharacterSetMatcher ( _A_, _invert_ )</h1>


Seems like the third argument should be added here.

littledan

Just these couple nits, generally LGTM

littledan · 2017-11-16T21:12:47Z

spec.html

+            1. Return an internal Matcher closure that takes two arguments, a State _x_ and a Continuation _c_, and performs the following steps when evaluated:
+              1. Let _d_ be a Continuation that takes a State argument _y_ and returns the result of calling _m2_(_y_, _c_).
+              1. Call _m1_(_x_, _d_) and return its result.
+          1. Else, _direction_ is equal to -1.
          1. Return an internal Matcher closure that takes two arguments, a State _x_ and a Continuation _c_, and performs the following steps when evaluated:


Nit: Indent this line and the following two lines.
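
For orientation, the sequencing steps in the hunk above can be modeled roughly as follows. This is a toy sketch and not the spec text: `State` is reduced to a bare end index, `charMatcher` and `sequence` are hypothetical helper names, and the backward branch assumes (per the note touched by this PR) that |Term| is evaluated before |Alternative| when _direction_ is -1.

```typescript
type Direction = 1 | -1;
interface State { endIndex: number; }
type MatchResult = State | null;               // null stands in for ~failure~
type Continuation = (x: State) => MatchResult;
type Matcher = (x: State, c: Continuation) => MatchResult;

// Matches one literal character, consuming it in the given direction.
function charMatcher(text: string, ch: string, direction: Direction): Matcher {
  return (x, c) => {
    const e = x.endIndex;
    const f = e + direction;
    if (f < 0 || f > text.length) return null;
    const index = Math.min(e, f);              // the character lying between e and f
    if (text.charAt(index) !== ch) return null;
    return c({ endIndex: f });
  };
}

// Alternative :: Alternative Term, with m1 from |Alternative| and m2 from |Term|.
function sequence(m1: Matcher, m2: Matcher, direction: Direction): Matcher {
  if (direction === 1) {
    // Forward: run m1 first, then m2 on what follows.
    return (x, c) => m1(x, (y) => m2(y, c));
  }
  // Backward: run m2 first, then m1 on what precedes.
  return (x, c) => m2(x, (y) => m1(y, c));
}

const sampleText = "ab";
const accept: Continuation = (s) => s;
const forwardAB = sequence(charMatcher(sampleText, "a", 1), charMatcher(sampleText, "b", 1), 1);
const backwardAB = sequence(charMatcher(sampleText, "a", -1), charMatcher(sampleText, "b", -1), -1);
const forwardEnd = forwardAB({ endIndex: 0 }, accept);    // walks from index 0 to 2
const backwardEnd = backwardAB({ endIndex: 2 }, accept);  // walks from index 2 to 0
```

The same pattern "ab" is matched in both cases; only the starting point and the order in which the two sub-matchers run differ.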

mathiasbynens · 2017-11-16T21:47:04Z

Thanks for the thorough review, @littledan! Feedback addressed.

> To give credit where credit is due, the spec text here was actually written by @claudepache and @IgnoredAmbience

Hmm… Should I change the commit author? Or does it make more sense to mention the authors in the commit message?

littledan · 2017-11-16T21:49:41Z

I don't have a good idea for how to share credit here; these things are always the work of a lot of people. If I were you, I'd just leave Mathias as the commit author, and give callouts to @hashseed, @claudepache, @bterlson and everyone else who contributed when you get a chance, like in a blog post or talk. (This is a lot of work though!)

littledan · 2017-11-16T23:31:37Z

spec.html

-              1. Let _s_ be a new List whose characters are the characters of _Input_ at indices _xe_ (inclusive) through _ye_ (exclusive).
+              1. If _direction_ is equal +1, then
+                1. Let _s_ be a fresh List whose characters are the characters of _Input_ at indices _xe_ (inclusive) through _ye_ (exclusive).
+              1. Else, let _direction_ is equal to -1.


This wording is somewhat unusual. Maybe better wording would be:

1. If _direction_ is equal to +1, then
  1. Let _s_ be a fresh List whose characters are the characters of _Input_ at indices _xe_ (inclusive) through _ye_ (exclusive).
1. Else,
  1. Assert: _direction_ is equal to -1.
  1. Let _s_ be a fresh List whose characters are the characters of _Input_ at indices _ye_ (inclusive) through _xe_ (exclusive).

> This wording is somewhat unusual.

I would even say these are plain typos: “If direction is equal to +1” and “Else, ~~let~~ direction is equal to -1”.

claudepache

Just a few typos.

claudepache · 2017-11-17T11:11:09Z

spec.html

-            1. Call _m1_(_x_, _d_) and return its result.
+          1. Evaluate |Alternative| with argument _direction_ to obtain a Matcher _m1_.
+          1. Evaluate |Term| with argument _direction_ to obtain a Matcher _m2_.
+          1. If _direction_ is equal to +1, then,


no comma after “then”

@claudepache Fixed!

claudepache · 2017-11-17T11:13:56Z

spec.html

+            1. Return an internal Matcher closure that takes two arguments, a State _x_ and a Continuation _c_, and performs the following steps when evaluated:
+              1. Let _d_ be a Continuation that takes a State argument _y_ and returns the result of calling _m2_(_y_, _c_).
+              1. Call _m1_(_x_, _d_) and return its result.
+          1. Else, _direction_ is equal to -1.


“Else direction is equal to -1,” (no comma after “else”, a comma at the end of the line).
Alternatively:

1. Else,
  1. Assert: _direction_ is equal to -1.

claudepache · 2017-11-17T11:16:00Z

spec.html

          1. Let _parenIndex_ be the number of left-capturing parentheses in the entire regular expression that occur to the left of this |Atom|. This is the total number of <emu-grammar>Atom :: `(` Disjunction `)`</emu-grammar> Parse Nodes prior to or enclosing this |Atom|.
          1. Return an internal Matcher closure that takes two arguments, a State _x_ and a Continuation _c_, and performs the following steps:
            1. Let _d_ be an internal Continuation closure that takes one State argument _y_ and performs the following steps:
              1. Let _cap_ be a copy of _y_'s _captures_ List.
              1. Let _xe_ be _x_'s _endIndex_.
              1. Let _ye_ be _y_'s _endIndex_.
-              1. Let _s_ be a new List whose characters are the characters of _Input_ at indices _xe_ (inclusive) through _ye_ (exclusive).
+              1. If _direction_ is equal +1, then


“is equal to +1,”

claudepache · 2017-11-17T11:17:12Z

spec.html

-              1. Let _s_ be a new List whose characters are the characters of _Input_ at indices _xe_ (inclusive) through _ye_ (exclusive).
+              1. If _direction_ is equal +1, then
+                1. Let _s_ be a fresh List whose characters are the characters of _Input_ at indices _xe_ (inclusive) through _ye_ (exclusive).
+              1. Else, let _direction_ is equal to -1.


“Else direction is equal to -1,” (no comma after “else”, no extraneous “let”, a comma at the end of the line).
Alternatively:

1. Else,
  1. Assert: _direction_ is equal to -1.

wangyi7099 · 2017-12-04T01:51:04Z

I'm also really looking forward to this feature!

mathiasbynens · 2017-12-05T00:36:39Z

@claudepache I’ve addressed your feedback. Please take another look and update your review status if everything seems okay now.

littledan · 2017-12-05T05:20:18Z

LGTM

claudepache

Except for one remaining correction, LGTM.

claudepache · 2017-12-13T17:50:56Z

spec.html

-              1. Let _s_ be a new List whose characters are the characters of _Input_ at indices _xe_ (inclusive) through _ye_ (exclusive).
+              1. If _direction_ is equal to +1, then
+                1. Let _s_ be a fresh List whose characters are the characters of _Input_ at indices _xe_ (inclusive) through _ye_ (exclusive).
+              1. Else, let _direction_ is equal to -1.


Has to be:

1. Else,
  1. Assert: _direction_ is equal to -1.

Optionally: It is implicit in the formulation of these steps that the numerical order of xe and ye is bound to the value of direction (i.e., xe ≤ ye if direction is +1, etc.). I leave it to your judgment to decide whether it adds value to make that relation explicit, i.e.

1. If _direction_ is equal to +1, then
  1. Assert: _xe_ ≤ _ye_.
  1. Let _s_ be a fresh List whose characters are the characters of _Input_ at indices _xe_ (inclusive) through _ye_ (exclusive).
1. Else,
  1. Assert: _direction_ is equal to -1.
  1. Assert: _ye_ ≤ _xe_.
  1. Let _s_ be a fresh List whose characters are the characters of _Input_ at indices _ye_ (inclusive) through _xe_ (exclusive).

Done, and done. I agree being explicit makes it easier to follow.
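
The agreed steps can be sketched as a small function. This is an illustrative model only: `capturedSlice` and `MatchDirection` are hypothetical names, the input is modeled as an array of characters, and the spec's Assert steps are approximated with thrown errors.

```typescript
type MatchDirection = 1 | -1;

// Returns the characters captured between two end indices.
// Forward captures run from xe to ye; backward captures run from ye to xe.
function capturedSlice(
  input: string[],
  xe: number,
  ye: number,
  direction: MatchDirection
): string[] {
  if (direction === 1) {
    if (!(xe <= ye)) throw new Error("Assert: xe <= ye");
    return input.slice(xe, ye);   // xe inclusive, ye exclusive
  }
  // Assert: direction is equal to -1 (guaranteed here by the MatchDirection type).
  if (!(ye <= xe)) throw new Error("Assert: ye <= xe");
  return input.slice(ye, xe);     // ye inclusive, xe exclusive
}

const chars = "hodor".split("");
const forwardCapture = capturedSlice(chars, 1, 3, 1).join("");    // "od"
const backwardCapture = capturedSlice(chars, 3, 1, -1).join("");  // "od"
```

Either way the captured characters come out in string order; the direction only decides which of the two indices is the lower bound.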

ljharb · 2017-12-13T18:22:22Z

spec.html

        </emu-alg>
        <emu-note>
-          <p>Consecutive |Term|s try to simultaneously match consecutive portions of _Input_. If the left |Alternative|, the right |Term|, and the sequel of the regular expression all have choice points, all choices in the sequel are tried before moving on to the next choice in the right |Term|, and all choices in the right |Term| are tried before moving on to the next choice in the left |Alternative|.</p>
+          <p>Consecutive |Term|s try to simultaneously match consecutive portions of _Input_. When _direction_ is equal to +1, if the left |Alternative|, the right |Term|, and the sequel of the regular expression all have choice points, all choices in the sequel are tried before moving on to the next choice in the right |Term|, and all choices in the right |Term| are tried before moving on to the next choice in the left |Alternative|. When _direction_ is equal to -1, the evaluation order of |Alternative| and |Term| are reversed.</p>


is it clear here to use "+1" and "-1", and then also to use "left" and "right", which might not be correct in an RTL context?

IMHO it’s clear that this refers to the grammar productions, which are always in a left-to-right order.
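
To make the note under review concrete, here is a small sketch of how the reversed evaluation order becomes observable through greedy capture groups. "Left" and "right" below refer to positions in the pattern source; the variable names are illustrative.

```typescript
// Forward matching: the left group is greedy first and takes as much as it can.
const forwardGroups = new RegExp("^(\\d+)(\\d+)").exec("1053");
// Inside a lookbehind the terms are evaluated right to left, so the right
// group is greedy first.
const backwardGroups = new RegExp("(?<=(\\d+)(\\d+))$").exec("1053");

const forwardPair =
  forwardGroups === null ? null : [forwardGroups[1], forwardGroups[2]];
const backwardPair =
  backwardGroups === null ? null : [backwardGroups[1], backwardGroups[2]];
// forwardPair is ["105", "3"]; backwardPair is ["1", "053"]
```

The capture groups keep their left-to-right numbering in both cases; only the order in which they get to consume input changes.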

Instead of -1/+1, what would you think about using the strings "left" and "right"? (obv they're merely spec artifacts either way)

> Instead of -1/+1, what would you think about using the strings "left" and "right"?

No, it is not “left” and “right”, it is “in the direction of the beginning of the string” (-1) and “in the direction of the end of the string” (+1), regardless of the directionality of its characters.

Do you think that would be more clear? I’m in favor of making things more readable but personally don’t see how this change does that.

Yes, ~forwards~ and ~backwards~ (in other parts of the spec, we have ~failure~, ~empty~) is probably better than +1 and -1.

I don't understand what's unclear about +1/-1. In an RTL context, it's still +1 in terms of indexing the string.

@littledan if a regex is in unicode mode, it might advance +2 when it finds a surrogate pair, no?

(It'll only be advancing +-1 code point...) I don't feel strongly about this. I'm fine with switching to ~forwards~ and ~backwards~.

> if a regex is in unicode mode, it might advance +2 when it finds a surrogate pair, no?

In the context of RegExp matching, “one step forth/back” is always literally +1/-1 in terms of “indexing” the string. The difference is that, in the presence of the u-flag, the RegExp algorithm uses an alternative way to “index” the string.
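
A rough illustration of that point, assuming a sample string with one astral character. `toCodePoints` is a hypothetical helper that approximates how a u-flag match views its input (as a list of code points instead of code units); it is not how any engine is implemented.

```typescript
// Splits a string into code points, pairing up surrogates by hand.
function toCodePoints(s: string): string[] {
  const out: string[] = [];
  for (let i = 0; i < s.length; i++) {
    const hi = s.charCodeAt(i);
    const lo = i + 1 < s.length ? s.charCodeAt(i + 1) : 0;
    if (hi >= 0xd800 && hi <= 0xdbff && lo >= 0xdc00 && lo <= 0xdfff) {
      out.push(s.slice(i, i + 2));  // surrogate pair: one element, two code units
      i++;
    } else {
      out.push(s.charAt(i));
    }
  }
  return out;
}

const sample = "a\uD83D\uDE00b";              // "a", U+1F600, "b"
const codeUnitCount = sample.length;          // 4 code units
const codePointList = toCodePoints(sample);   // 3 code points

// With the u flag, "." inside the lookbehind steps back over the whole astral
// character in one step; without it, "." steps back over a single code unit,
// so "a" is then compared against the high surrogate and the match fails.
const withU = new RegExp("(?<=^a.)b", "u").test(sample);   // true
const withoutU = new RegExp("(?<=^a.)b").test(sample);     // false
```

In both modes each step moves by exactly one element of the input list; what changes is what counts as an element.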

If I had to rewrite the algorithms now, I would probably use “forwards/backwards” instead of “+1/-1”, because it is more descriptive. But I don’t think it adds much clarity, because +1/-1 couldn’t mean anything else... unless you ask yourself inappropriate questions regarding RTL and/or astral characters, in which case you’ll have trouble understanding the algorithms anyway.

IgnoredAmbience · 2017-12-13T20:10:18Z

Note: my effort in this PR was to port the changes from an older version of the spec to the most recent version for the proposal. I've no in-depth views on the content, unfortunately - hence removing myself from the review list.

mathiasbynens · 2018-01-26T05:39:19Z

Rebased against master; this involved rewriting the patch to change the (new) BackreferenceMatcher.

mysticatea · 2018-02-08T05:00:35Z

I have a question here.

This PR doesn't appear to update the B.1.4 Regular Expressions Patterns section. Is there a plan to update that section?

…rn evaluate semantics in Annex B (tc39#1675) Bugfix: When speccing lookbehind assertions (PR tc39#1029), a `direction` parameter was added, in particular to the `evaluate` semantics of Atom productions and to the CharacterSetMatcher abstract operation. The ExtendedAtom productions found in Annex B were not amended in the same way. Fixes tc39#1674.

mathiasbynens added the "pending stage 4" label (this proposal has not yet achieved stage 4, but may otherwise be ready to merge) Nov 12, 2017

mathiasbynens force-pushed the lookbehind branch from db464a7 to bdd31fd Compare November 12, 2017 10:46

littledan reviewed Nov 16, 2017

ljharb requested a review from IgnoredAmbience November 16, 2017 21:20

littledan approved these changes Nov 16, 2017

mathiasbynens force-pushed the lookbehind branch from bdd31fd to ef8cdd3 Compare November 16, 2017 21:45

littledan approved these changes Nov 16, 2017

mathiasbynens force-pushed the lookbehind branch from ef8cdd3 to 5c0b518 Compare November 16, 2017 21:52

littledan reviewed Nov 16, 2017

claudepache suggested changes Nov 17, 2017

mathiasbynens force-pushed the lookbehind branch 2 times, most recently from 6c43916 to ec63b15 Compare November 17, 2017 16:17

mathiasbynens requested a review from bterlson December 5, 2017 00:35

bterlson force-pushed the lookbehind branch from ec63b15 to 8d3ac1e Compare December 8, 2017 21:31

claudepache suggested changes Dec 13, 2017

ljharb reviewed Dec 13, 2017

mathiasbynens force-pushed the lookbehind branch from 8d3ac1e to 9e29ec4 Compare December 13, 2017 18:35

IgnoredAmbience removed their request for review December 13, 2017 20:09

mathiasbynens force-pushed the lookbehind branch 2 times, most recently from a0b1ee8 to f46f101 Compare January 26, 2018 05:04

[Normative] Add RegExp lookbehind assertions

920022c

Proposal repo: https://github.com/tc39/proposal-regexp-lookbehind

mathiasbynens force-pushed the lookbehind branch from f46f101 to 920022c Compare January 26, 2018 05:37

mathiasbynens mentioned this pull request Jan 26, 2018

Missing [N] in RegExp grammar? #1081

Open

bterlson merged commit bf8a9be into tc39:master Jan 26, 2018

mysticatea mentioned this pull request Feb 16, 2018

Normative: add RegExp lookbehind to annex-B #1102

Merged

jmdyck mentioned this pull request Aug 21, 2019

bug? re lookbehind assertions in Annex B #1674

Closed

claudepache mentioned this pull request Aug 21, 2019

Editorial: Add missing _direction_ parameter in extended regexp pattern evaluate semantics in annex b #1675

Merged

mathiasbynens deleted the lookbehind branch August 25, 2022 09:36
