Skip to content
This repository has been archived by the owner on Jan 26, 2022. It is now read-only.

Normative: when regexp arg is a string, minimize observable operations #33

Merged
merged 7 commits into from
Mar 20, 2018

Conversation

ljharb
Copy link
Member

@ljharb ljharb commented Feb 2, 2018

Per #32.

cc @anba

spec.emu Outdated
1. Else,
1. Let _flags_ be `"g"`.
1. Let _matcher_ be ? RegExpCreate(_regexp_, _flags_).
1. If ? IsRegExp(_matcher_) is not *true*, throw a *TypeError* exception.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why would you want to check IsRegExp after creating a regular expression object through RegExpCreate? I don't see any reason to care about programs which have deleted RegExp.prototype[@@match].

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do; anything that's possible in an engine is something I care about.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there any other places in the specification with similar logic? There are many weird things that you can do with RegExp subclassing, but they don't tend to all have guards against them.

cc @allenwb

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely! Here's all the RegExp-related places with fallback logic when the well-known symbol isn't callable:

etc.

In other words, every single one of these methods has an identical fallback.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see how those fallbacks are like this one. Those are used to determine whether to take the string path, not whether to throw a TypeError. The places I see IsRegExp used are more about throwing a TypeError when a RegExp is passed into a method intended for a string (and that's throwing if IsRegExp is true, not false),

spec.emu Outdated
1. Let _global_ be *true*.
1. Let _fullUnicode_ be *false*.
1. Let _lastIndex_ be *0*.
1. Assert: ! Get(_matcher_, `"lastIndex"`) is *0*.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can't assert no abrupt completion occurs here, unless you remove IsRegExp. 😄

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you elaborate on why not?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

var RegExpPrototypeMatch = RegExp.prototype[Symbol.match];
Object.defineProperty(RegExp.prototype, Symbol.match, {
  // This getter function is called when the abstract operation IsRegExp is called.
  get() {
    this.lastIndex++;
    return RegExpPrototypeMatch;
  }
});

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In what way does that produce an abrupt completion? Or are you pointing out that at this point, a throwing getter could be defined on this for lastIndex?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, sorry I'm dumb. Pretend that I've written "You can't assert lastIndex is 0".

(Because if RegExp.prototype[@@match] has been redefined into a getter, it could have modified lastIndex.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ahh, right. that's fair.

Do you think this should throw if it's not zero? or do you instead think that the check should just be removed, and replaced with a runtime check (every time lastIndex is retrieved) to ensure that lastIndex is a valid index?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think this should throw if it's not zero?

I think IsRegExp should be removed.

spec.emu Outdated
1. Let _R_ be _regexp_.
1. Let _C_ be ? SpeciesConstructor(_R_, %RegExp%).
1. Let _flags_ be ? ToString(? Get(_R_, `"flags"`)).
1. Let _matcher_ be ? Construct(_C_, « _R_, _flags_ »).
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As mentioned in #32, I'd prefer Let matcher be ? Construct(C, « R »). here.

Copy link
Member Author

@ljharb ljharb Feb 7, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as noted here, the fast path is explicitly and intentionally prohibited, so a more important fast path can be taken (avoiding observable creation of this regex entirely in the common case).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand the justification w.r.t lastIndex. Where does lastIndex come into play when 21.2.3.1, step 4 is taken?

Copy link
Collaborator

@anba anba Feb 8, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Let C be ? SpeciesConstructor(R, %RegExp%).
  2. Let flags be ? ToString(? Get(R, "flags")).
  3. Let matcher be ? Construct(C, « R, flags »).
  4. Let global be ? ToBoolean(? Get(matcher, "global")).
  5. Let fullUnicode be ? ToBoolean(? Get(matcher, "unicode").

Needs the following guards to avoid creating a separate RegExp object:

  • C is the built-in RegExp constructor
  • R is a built-in RegExp object
  • R doesn't redefine built-in RegExp.prototype functionality
  • R.[[Prototype]] is the built-in RegExp.prototype object (Note: Ignoring subclassing for now.)
  • RegExp.prototype.flags is the built-in getter
  • Implied by RegExp.prototype.flags:
    • RegExp.prototype.ignoreCase is the built-in getter
    • RegExp.prototype.multiline is the built-in getter
    • RegExp.prototype.dotAll is the built-in getter
    • RegExp.prototype.sticky is the built-in getter
  • RegExp.prototype.global is the built-in getter
  • RegExp.prototype.unicode is the built-in getter
  • RegExp.prototype.source is the built-in getter (Edited)
  • RegExp.prototype[@@ match] is the built-in function

If all those guards hold, we can directly access R.[[OriginalSource]] and R.[[OriginalFlags]] and save them in the RegExp String Iterator object. Then we also store R in the RegExp String Iterator object and for %RegExpStringIteratorPrototype%.next() we can reuse R for matching as long as the current R.[[OriginalSource]] and R.[[OriginalFlags]] match the original properties (that's why we need to store them in the RegExp String Iterator object!). When we reuse R in this case, the internal matching obviously needs to ensure that it doesn't modify the lastIndex property of R.

If we use the alternative:

  1. Let C be ? SpeciesConstructor(R, %RegExp%).
  2. Let matcher be ? Construct(C, « R »).
  3. Let global be ? ToBoolean(? Get(matcher, "global")).
  4. Let fullUnicode be ? ToBoolean(? Get(matcher, "unicode").

we need the following guards:

  • C is the built-in RegExp constructor
  • R is a built-in RegExp object
  • R doesn't redefine built-in RegExp.prototype functionality
  • R.[[Prototype]] is the built-in RegExp.prototype object (Note: Ignoring subclassing for now.)
  • RegExp.prototype.global is the built-in getter
  • RegExp.prototype.unicode is the built-in getter
  • RegExp.prototype[@@ match] is the built-in function

So the alternative avoids the guards for RegExp.prototype.{global,ignoreCase,multiline,dotAll,source (Edited)}. Does this make any difference for implementors? Well, it depends on the actual implementation. For example for SpiderMonkey we will only need to add a guard for RegExp.prototype.source to [1], (Edited) the other getters are already checked. But I have no idea if it'll make any difference for other implementors, we'd need to ask them...

[1] https://searchfox.org/mozilla-central/rev/a539f8f7fe288d0b4d8d6a32510be20f52bb7a0f/js/src/builtin/RegExp.cpp#1614

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In your second example, however, if after the iterator is created, any of those guards are violated, wouldn't you have to start down a much slower path than in the first example?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the comment above to remove RegExp.prototype.source from the list of properties which need to guarded.

Copy link
Collaborator

@anba anba Feb 9, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For whatever reason I've always read 21.2.3.1, step 4 to require flags to be undefined... 😒

So... 21.2.3.1, step 4 will basically always be taken in the fast path (where fast path means that R is a built-in RegExp object).

In your second example, however, if after the iterator is created, any of those guards are violated, wouldn't you have to start down a much slower path than in the first example?

No, I don't think so, because in both cases we create a fresh RegExp object (at least in the spec).

The important part for me was if we can take 21.2.3.1, step 4, but given that I've misread if-condition in 21.2.3.1, my original concern was moot and we can probably leave it as is for now. As soon as we get some implementation experience, we may want to look at it again and see how good it can be optimized.

I may also propose some other changes for spec consistency reasons, but I haven't yet decided which ones make more sense. For example RegExp.prototype [ @@split ] derives its unicodeMatching variable from the retrieved flags instead of getting the unicode property. So we may want to do the same MatchAllIterator. Or RegExp.prototype [ @@match ] and RegExp.prototype [ @@replace ] only retrieve unicode when global is true, whereas MatchAllIterator always gets the unicode property, which is another slight inconsistency.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The retrieved flags are derived from checking each of the relevant properties: https://tc39.github.io/ecma262/#sec-get-regexp.prototype.flags

spec.emu Outdated
1. Else,
1. Let _R_ be ? RegExpCreate(_regexp_, `"g"`).
1. Let _matcher_ be ? GetMethod(_R_, @@matchAll).
1. Let _matcher_ be ? GetMethod(_regexp_, @@matchAll).
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

String.prototype methods, including String.prototype.match, generally allow null/undefined as parameters and implicitly coerce them to the expected type. Why should String.prototype.matchAll differ here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch; this is missing a "neither undefined nor null, then" check.

spec.emu Outdated

<p>The abstract operation _MatchAllIterator_ performs the following steps:</p>
<emu-alg>
1. If ? IsRegExp(_R_) is not *true*, throw a *TypeError* exception.
1. Perform ! RequireObjectCoercible(_regexp_).
1. If ? IsRegExp(_regexp_) is `true`, then
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RegExp.prototype methods, including RegExp.prototype[@@match], generally throw a TypeError if the this-value is not an object. Why should RegExp.prototype[@@matchAll] differ here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't? Perform ! RequireObjectCoercible(_regexp_). throws a TypeError if regexp is not an object.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't cover RegExp.prototype[@@matchAll].call(true, "") which should throw a TypeError, because true is not an object.

spec.emu Outdated
1. Let _global_ be *true*.
1. Let _fullUnicode_ be *false*.
1. Let _lastIndex_ be *0*.
1. Assert: ! Get(_matcher_, `"lastIndex"`) is *0*.
1. Let _S_ be ? ToString(_O_).
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be moved into String.prototype.matchAll resp. RegExp.prototype[@@matchAll] for consistency with other String.prototype resp. RegExp.prototype methods.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'm not sure what you mean by "resp." here, can you elaborate?

@ljharb
Copy link
Member Author

ljharb commented Feb 10, 2018

It sounds like #33 (comment) is the only remaining item on this PR; once that's resolved, it'd be great to get an approval so this can be merged :-)

Absolutely I'd love future issues to be filed if implementation experience indicates anything.

@anba
Copy link
Collaborator

anba commented Feb 12, 2018

"null".match(null) returns ["null"], but "null".matchAll(null) throws a TypeError with the current patch.

@ljharb
Copy link
Member Author

ljharb commented Feb 26, 2018

@anba with the current state of the PR, 'null'.matchAll(null) should be identical to 'null'.matchAll('null'), no?

@anba
Copy link
Collaborator

anba commented Feb 27, 2018

String.prototype.matchAll(regexp)

  1. Let O be ? RequireObjectCoercible(this value).
  2. If regexp is neither undefined nor null, then
    1. // not executed for regexp = null
  3. Return ? MatchAllIterator(regexp, O).

MatchAllIterator ( regexp, O )

  1. Perform ! RequireObjectCoercible(regexp). // <- Throws for regexp = null

@ljharb
Copy link
Member Author

ljharb commented Feb 27, 2018

Ah, thanks, you're right. In this case that's a bug, because the ! is an assertion. Will fix.

@anba
Copy link
Collaborator

anba commented Feb 27, 2018

And RegExp.prototype [ @@matchAll ] ( string ) should start with:

  1. Let rx be the this value.
  2. If Type(rx) is not Object, throw a TypeError exception.
  3. Return ? MatchAllIterator(rx, string).

@ljharb
Copy link
Member Author

ljharb commented Feb 27, 2018

aha, thanks :-) that's why i'd added it originally. fixing that too.

@ljharb ljharb force-pushed the less_observability branch from 3547095 to 2197b6c Compare February 27, 2018 20:01
@ljharb
Copy link
Member Author

ljharb commented Feb 27, 2018

k; the PR is now updated; since the thread is quite long, can anyone (@littledan, @anba) post any remaining blockers for this PR - I would greatly appreciate further changes to be filed as separate issues, so I can consider those with a clean slate - as normal comments? (Linking to an existing review thread is totally great too)

@anba
Copy link
Collaborator

anba commented Mar 5, 2018

There are two nits in MatchAllIterator:

  • Let _R_ be _regexp_. could be removed and instead of R the regexp parameter can be used directly.
  • Let _lastIndex_ be *0*. can be removed, because lastIndex is unused in the else-branch.

Apart from that, I'd still like to see the IsRegExp call removed in the else-branch of MatchAllIterator.

And I'm not sure about the first IsRegExp call in MatchAllIterator and I wonder if it makes more sense to always take the first branch in MatchAllIterator if called from RegExp.prototype[@@matchAll] for consistency with RegExp.prototype[@@split]. But that should definitely be discussed in a different issue.

And finally, probably also for a different issue, the iterator calls CreateIterResultObject(null, true) instead of the normal CreateIterResultObject(undefined, true) when it is finished. I don't know if this is an intentional design decision or a bug?

@anba
Copy link
Collaborator

anba commented Mar 5, 2018

I've added some initial patches for SpiderMonkey at https://bugzilla.mozilla.org/show_bug.cgi?id=1435829.

Edit:
Performance-wise it looks promising, for example when I was running this µ-benchmark:

function f() {
    var s = "acbcbcab";
    // Simple RegExp so we don't measure the internal regular expression engine.
    var r = /a/g;
    var q = 0;
    var t = dateNow();
    for (var i = 0; i < 500000; ++i) {
        var m = s.match(r);
        if (m === null)
            continue;
        for (var x of m) {
            q += x.length;
        }
    }
    return [dateNow() - t, q];
}
for (var i = 0; i < 10; ++i) print(f());

Against the matchAll version:

function f() {
    var s = "acbcbcab";
    var r = /a/g;
    var q = 0;
    var t = dateNow();
    for (var i = 0; i < 500000; ++i) {
        for (var m of s.matchAll(r)) {
            q += m[0].length;
        }
    }
    return [dateNow() - t, q];
}
for (var i = 0; i < 10; ++i) print(f());

matchAll was only ~12.5% slower (in current SpiderMonkey tip with --nursery-strings=on). When more matches are present, e.g. with s = "acbcbcab".repeat(4), matchAll was even slightly faster (~5%).

The native matchAll implementation is also on a par with what I'd call a typical user-land implementation of String.prototype.matchAll. (This version intentionally doesn't follow the spec steps at all. It only provides a comparable interface, so we can compare it against the native matchAll implementation.):

class RegExpStringIter {
    constructor(regexp, string) {
        this.regexp = regexp;
        this.string = string;
        this.done = false;
    }
    [Symbol.iterator]() {
        return this;
    }
    next() {
        var value = null;
        if (!this.done) {
            value = this.regexp.exec(this.string);
            this.done = value === null || !this.regexp.global;
        }
        return {value, done: value === null};
    }
}

String.prototype.matchAll = function(regexp) {
    "use strict";
    return new RegExpStringIter(regexp, this);
};

@ljharb
Copy link
Member Author

ljharb commented Mar 5, 2018

There are two nits in MatchAllIterator:

Done.

And I'm not sure about the first IsRegExp call in MatchAllIterator

#34

And finally, probably also for a different issue, the iterator calls CreateIterResultObject(null, true) instead of the normal CreateIterResultObject(undefined, true) when it is finished. I don't know if this is an intentional design decision or a bug?

Unintentional; I've fixed that in master.

Thanks! If that takes care of all the issues for this PR, I'd appreciate a PR review approval :-D

@anba
Copy link
Collaborator

anba commented Mar 5, 2018

Thanks! If that takes care of all the issues for this PR, I'd appreciate a PR review approval :-D

There's still that issue with the IsRegExp call in the else-branch of MatchAllIterator. Do we also want to move the discussion for that one to a different PR?

@ljharb
Copy link
Member Author

ljharb commented Mar 5, 2018

@anba I filed #34 to discuss that; are you comfortable merging this as-is, and discussing that change there?

@anba
Copy link
Collaborator

anba commented Mar 6, 2018

#34 is about the first IsRegExp call, but the other IsRegExp call is still problematic, as discussed in #33 (comment) and #33 (comment).

@ljharb
Copy link
Member Author

ljharb commented Mar 6, 2018

I've changed the incorrect lastIndex assertion per #33 (comment); could we move all "call IsRegExp less" discussion to #34? I'm happy to update its description.

peterwmwong added a commit to peterwmwong/test262 that referenced this pull request Mar 18, 2018
…egExpStringIteratorPrototype%

Tests were updated and assuming tc39/proposal-string-matchall#33 will be merged.
peterwmwong added a commit to peterwmwong/test262 that referenced this pull request Mar 18, 2018
…matchall], and %RegExpStringIteratorPrototype%

Tests were updated and assuming tc39/proposal-string-matchall#33 will be merged.
peterwmwong added a commit to peterwmwong/test262 that referenced this pull request Mar 18, 2018
…matchall], and %RegExpStringIteratorPrototype%

Tests were updated and assuming tc39/proposal-string-matchall#33 will be merged.
peterwmwong added a commit to peterwmwong/test262 that referenced this pull request Mar 19, 2018
…matchall], and %RegExpStringIteratorPrototype%

Tests were updated and assuming tc39/proposal-string-matchall#33 will be merged.
@ljharb
Copy link
Member Author

ljharb commented Mar 20, 2018

Thanks for all the feedback so far! I'm going to merge this; let's continue further review in separate issues/PRs.

@ljharb ljharb merged commit 9c13a85 into master Mar 20, 2018
@ljharb ljharb deleted the less_observability branch March 20, 2018 11:45
ljharb pushed a commit to peterwmwong/test262 that referenced this pull request Mar 20, 2018
…matchall], and %RegExpStringIteratorPrototype%

Tests were updated and assuming tc39/proposal-string-matchall#33 will be merged.
ljharb pushed a commit to ljharb/test262 that referenced this pull request Mar 21, 2018
…matchall], and %RegExpStringIteratorPrototype%

Tests were updated and assuming tc39/proposal-string-matchall#33 will be merged.
zloirock added a commit to zloirock/core-js that referenced this pull request Mar 21, 2018
zloirock added a commit to zloirock/core-js that referenced this pull request Mar 27, 2018
zloirock added a commit to zloirock/core-js that referenced this pull request Mar 27, 2018
zloirock added a commit to zloirock/core-js that referenced this pull request Apr 3, 2018
zloirock added a commit to zloirock/core-js that referenced this pull request Apr 9, 2018
ljharb pushed a commit to ljharb/test262 that referenced this pull request Apr 10, 2018
…matchall], and %RegExpStringIteratorPrototype%

Tests were updated and assuming tc39/proposal-string-matchall#33 will be merged.
zloirock added a commit to zloirock/core-js that referenced this pull request Apr 21, 2018
zloirock added a commit to zloirock/core-js that referenced this pull request May 4, 2018
zloirock added a commit to zloirock/core-js that referenced this pull request May 5, 2018
zloirock added a commit to zloirock/core-js that referenced this pull request May 7, 2018
zloirock added a commit to zloirock/core-js that referenced this pull request May 14, 2018
zloirock added a commit to zloirock/core-js that referenced this pull request May 18, 2018
zloirock added a commit to zloirock/core-js that referenced this pull request May 26, 2018
ljharb added a commit to es-shims/String.prototype.matchAll that referenced this pull request May 31, 2018
ljharb added a commit to es-shims/String.prototype.matchAll that referenced this pull request May 31, 2018
chicoxyzzy pushed a commit to chicoxyzzy/test262 that referenced this pull request May 14, 2019
…matchall], and %RegExpStringIteratorPrototype%

Tests were updated and assuming tc39/proposal-string-matchall#33 will be merged.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants