Normative: when `regexp` arg is a string, minimize observable operations #33
spec.emu
Outdated
1. Else,
  1. Let _flags_ be `"g"`.
  1. Let _matcher_ be ? RegExpCreate(_regexp_, _flags_).
  1. If ? IsRegExp(_matcher_) is not *true*, throw a *TypeError* exception.
Why would you want to check `IsRegExp` after creating a regular expression object through `RegExpCreate`? I don't see any reason to care about programs which have deleted `RegExp.prototype[@@match]`.
I do; anything that's possible in an engine is something I care about.
Are there any other places in the specification with similar logic? There are many weird things that you can do with RegExp subclassing, but they don't tend to all have guards against them.
cc @allenwb
Definitely! Here's all the RegExp-related places with fallback logic when the well-known symbol isn't callable:
- https://tc39.github.io/ecma262/#sec-string.prototype.match
- https://tc39.github.io/ecma262/#sec-string.prototype.split
- https://tc39.github.io/ecma262/#sec-string.prototype.search
- https://tc39.github.io/ecma262/#sec-isregexp
- https://tc39.github.io/ecma262/#sec-string.prototype.replace
etc.
In other words, every single one of these methods has an identical fallback.
I don't see how those fallbacks are like this one. Those are used to determine whether to take the string path, not whether to throw a TypeError. The places I see IsRegExp used are more about throwing a TypeError when a RegExp is passed into a method intended for a string (and that's throwing if IsRegExp is true, not false).
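For readers following along, here is a small sketch of the existing behavior the last comment refers to: `String.prototype.includes` throws when `IsRegExp` returns true, and the answer can be flipped from user code by giving the object a falsy, non-undefined `Symbol.match`. The helper `throwsTypeError` and the variable names are illustrative only, not part of the proposal.

```javascript
// Illustrative helper (not from the PR): reports whether fn throws a TypeError.
function throwsTypeError(fn) {
  try {
    fn();
    return false;
  } catch (e) {
    return e instanceof TypeError;
  }
}

const re = /a/;

// includes() is a string-only method: it throws when IsRegExp(re) is true.
const rejectedAsRegExp = throwsTypeError(() => "abc".includes(re));

// A falsy, non-undefined @@match makes IsRegExp(re) return false,
// so the same object is now coerced to the string "/a/".
re[Symbol.match] = false;
const coercedToString = "x/a/y".includes(re);

console.log(rejectedAsRegExp, coercedToString); // true true
```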
spec.emu
Outdated
1. Let _global_ be *true*.
1. Let _fullUnicode_ be *false*.
1. Let _lastIndex_ be *0*.
1. Assert: ! Get(_matcher_, `"lastIndex"`) is *0*.
You can't assert no abrupt completion occurs here, unless you remove `IsRegExp`. 😄
can you elaborate on why not?
var RegExpPrototypeMatch = RegExp.prototype[Symbol.match];
Object.defineProperty(RegExp.prototype, Symbol.match, {
  // This getter function is called when the abstract operation IsRegExp is called.
  get() {
    this.lastIndex++;
    return RegExpPrototypeMatch;
  }
});
In what way does that produce an abrupt completion? Or are you pointing out that at this point, a throwing getter could be defined on `this` for `lastIndex`?
Oh, sorry I'm dumb. Pretend that I've written "You can't assert `lastIndex` is 0".
(Because if `RegExp.prototype[@@match]` has been redefined into a getter, it could have modified `lastIndex`.)
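A self-contained way to observe this effect, as a sketch: the accessor is installed on a single instance rather than on `RegExp.prototype` (an assumption made only to keep the demo from affecting other code), and `String.prototype.includes` is used as an existing method that calls `IsRegExp` on its argument.

```javascript
const re = /a/g;
const builtinMatch = RegExp.prototype[Symbol.match];

// Own accessor on one instance; the thread's snippet patches RegExp.prototype instead.
Object.defineProperty(re, Symbol.match, {
  get() {
    this.lastIndex++; // side effect triggered by IsRegExp's Get(re, @@match)
    return builtinMatch;
  },
});

const before = re.lastIndex;
try {
  "abc".includes(re); // calls IsRegExp(re), then throws because the result is true
} catch (e) {
  // Expected: includes() rejects regular expressions with a TypeError.
}
const after = re.lastIndex;

console.log(before, after); // 0 1
```

So any step that calls `IsRegExp` on the matcher can leave `lastIndex` non-zero before the assertion is reached.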
ahh, right. that's fair.
Do you think this should throw if it's not zero? or do you instead think that the check should just be removed, and replaced with a runtime check (every time lastIndex is retrieved) to ensure that lastIndex is a valid index?
> Do you think this should throw if it's not zero?
I think IsRegExp should be removed.
spec.emu
Outdated
1. Let _R_ be _regexp_.
1. Let _C_ be ? SpeciesConstructor(_R_, %RegExp%).
1. Let _flags_ be ? ToString(? Get(_R_, `"flags"`)).
1. Let _matcher_ be ? Construct(_C_, « _R_, _flags_ »).
As mentioned in #32, I'd prefer `Let matcher be ? Construct(C, « R »).` here.
as noted here, the fast path is explicitly and intentionally prohibited, so a more important fast path can be taken (avoiding observable creation of this regex entirely in the common case).
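As a rough userland approximation of the quoted steps (SpeciesConstructor, read `flags`, then Construct), the sketch below builds a separate matcher and shows that matching on it leaves the original's `lastIndex` untouched. `createMatcher` is a hypothetical helper, and its species lookup is a simplification of the real SpeciesConstructor operation.

```javascript
// Hypothetical helper mirroring the quoted steps; not the spec text itself.
function createMatcher(R) {
  // Simplified stand-in for SpeciesConstructor(R, %RegExp%).
  const ctor = R.constructor;
  const C = (ctor && ctor[Symbol.species]) || RegExp;
  // Let flags be ToString(Get(R, "flags")).
  const flags = String(R.flags);
  // Let matcher be Construct(C, « R, flags »).
  return new C(R, flags);
}

const original = /a/g;
original.lastIndex = 2;

const matcher = createMatcher(original);
matcher.exec("aaa"); // advances only the clone's lastIndex

console.log(matcher !== original, original.lastIndex, matcher.lastIndex); // true 2 1
```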
I don't understand the justification w.r.t. `lastIndex`. Where does `lastIndex` come into play when 21.2.3.1, step 4 is taken?
- Let C be ? SpeciesConstructor(R, %RegExp%).
- Let flags be ? ToString(? Get(R, "flags")).
- Let matcher be ? Construct(C, « R, flags »).
- Let global be ? ToBoolean(? Get(matcher, "global")).
- Let fullUnicode be ? ToBoolean(? Get(matcher, "unicode")).
Needs the following guards to avoid creating a separate RegExp object:
- C is the built-in RegExp constructor
- R is a built-in RegExp object
- R doesn't redefine built-in RegExp.prototype functionality
- R.[[Prototype]] is the built-in RegExp.prototype object (Note: Ignoring subclassing for now.)
- RegExp.prototype.flags is the built-in getter
- Implied by RegExp.prototype.flags:
  - RegExp.prototype.ignoreCase is the built-in getter
  - RegExp.prototype.multiline is the built-in getter
  - RegExp.prototype.dotAll is the built-in getter
  - RegExp.prototype.sticky is the built-in getter
  - RegExp.prototype.global is the built-in getter
  - RegExp.prototype.unicode is the built-in getter
- ~~RegExp.prototype.source is the built-in getter~~ (Edited)
- RegExp.prototype[@@match] is the built-in function
If all those guards hold, we can directly access R.[[OriginalSource]] and R.[[OriginalFlags]] and save them in the RegExp String Iterator object. Then we also store R in the RegExp String Iterator object and for %RegExpStringIteratorPrototype%.next() we can reuse R for matching as long as the current R.[[OriginalSource]] and R.[[OriginalFlags]] match the original properties (that's why we need to store them in the RegExp String Iterator object!). When we reuse R in this case, the internal matching obviously needs to ensure that it doesn't modify the `lastIndex` property of R.
If we use the alternative:
- Let C be ? SpeciesConstructor(R, %RegExp%).
- Let matcher be ? Construct(C, « R »).
- Let global be ? ToBoolean(? Get(matcher, "global")).
- Let fullUnicode be ? ToBoolean(? Get(matcher, "unicode")).
we need the following guards:
- C is the built-in RegExp constructor
- R is a built-in RegExp object
- R doesn't redefine built-in RegExp.prototype functionality
- R.[[Prototype]] is the built-in RegExp.prototype object (Note: Ignoring subclassing for now.)
- RegExp.prototype.global is the built-in getter
- RegExp.prototype.unicode is the built-in getter
- RegExp.prototype[@@match] is the built-in function
So the alternative avoids the guards for RegExp.prototype.{global,ignoreCase,multiline,dotAll,source (Edited)}. Does this make any difference for implementors? Well, it depends on the actual implementation. For example for SpiderMonkey we will only need to add a guard for RegExp.prototype.source to [1], (Edited) the other getters are already checked. But I have no idea if it'll make any difference for other implementors, we'd need to ask them...
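To make the shorter guard list concrete, here is a userland sketch of what such a check might look like. An engine would test internal slots and cached shapes rather than run code like this; `passesGuards`, `getter`, and `builtins` are hypothetical names, and the own-keys test is a coarse stand-in for "R doesn't redefine built-in RegExp.prototype functionality".

```javascript
// Read the current accessor for a RegExp.prototype property (undefined if replaced or deleted).
const getter = (name) =>
  Object.getOwnPropertyDescriptor(RegExp.prototype, name)?.get;

// Snapshot the built-ins up front; an engine would do the equivalent internally.
const builtins = {
  global: getter("global"),
  unicode: getter("unicode"),
  match: RegExp.prototype[Symbol.match],
};

// Hypothetical check for the guard list of the `Construct(C, « R »)` variant.
function passesGuards(R) {
  try {
    // The built-in getter throws a TypeError for objects that are not RegExp objects.
    builtins.global.call(R);
  } catch {
    return false;
  }
  const ownKeys = Reflect.ownKeys(R);
  return (
    Object.getPrototypeOf(R) === RegExp.prototype &&
    // A pristine RegExp instance has "lastIndex" as its only own property.
    ownKeys.length === 1 && ownKeys[0] === "lastIndex" &&
    // C resolves to the built-in RegExp constructor.
    RegExp.prototype.constructor === RegExp &&
    RegExp[Symbol.species] === RegExp &&
    getter("global") === builtins.global &&
    getter("unicode") === builtins.unicode &&
    RegExp.prototype[Symbol.match] === builtins.match
  );
}

const plain = /a/g;
const tampered = /a/g;
tampered.exec = () => null; // own property shadows built-in functionality
class MyRegExp extends RegExp {}

console.log(
  passesGuards(plain),
  passesGuards(tampered),
  passesGuards(new MyRegExp("a")),
  passesGuards({})
); // true false false false
```

The longer list from the first variant would add the `flags`, `ignoreCase`, `multiline`, `dotAll`, and `sticky` getters to the same comparison.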
In your second example, however, if after the iterator is created, any of those guards are violated, wouldn't you have to start down a much slower path than in the first example?
Updated the comment above to remove `RegExp.prototype.source` from the list of properties which need to be guarded.
For whatever reason I've always read 21.2.3.1, step 4 to require `flags` to be undefined... 😒
So... 21.2.3.1, step 4 will basically always be taken in the fast path (where fast path means that `R` is a built-in RegExp object).
> In your second example, however, if after the iterator is created, any of those guards are violated, wouldn't you have to start down a much slower path than in the first example?
No, I don't think so, because in both cases we create a fresh RegExp object (at least in the spec).
The important part for me was whether we can take 21.2.3.1, step 4, but given that I've misread the if-condition in 21.2.3.1, my original concern was moot and we can probably leave it as is for now. As soon as we get some implementation experience, we may want to look at it again and see how well it can be optimized.
I may also propose some other changes for spec consistency reasons, but I haven't yet decided which ones make more sense. For example `RegExp.prototype [ @@split ]` derives its `unicodeMatching` variable from the retrieved `flags` instead of getting the `unicode` property. So we may want to do the same in `MatchAllIterator`. Or `RegExp.prototype [ @@match ]` and `RegExp.prototype [ @@replace ]` only retrieve `unicode` when `global` is true, whereas `MatchAllIterator` always gets the `unicode` property, which is another slight inconsistency.
The retrieved flags are derived from checking each of the relevant properties: https://tc39.github.io/ecma262/#sec-get-regexp.prototype.flags
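A short sketch of that point: because the `flags` getter performs an observable `Get` for each flag property, own accessors on an instance both show up in the result and can record that they were read. The `reads` array and the overriding accessors are illustrative assumptions.

```javascript
const re = /a/; // created without any flags
const reads = [];

// Shadow two of the per-flag properties with own accessors on this instance.
for (const name of ["global", "unicode"]) {
  Object.defineProperty(re, name, {
    get() {
      reads.push(name); // record the observable property read
      return name === "global"; // report global as set, unicode as unset
    },
  });
}

const flags = re.flags; // the built-in getter consults each flag property

console.log(flags); // "g"
console.log(reads.includes("global"), reads.includes("unicode")); // true true
```

So deriving `unicode` from the retrieved flags string, as `@@split` does, still goes through the individual property reads.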
spec.emu
Outdated
1. Else,
1. Let _R_ be ? RegExpCreate(_regexp_, `"g"`).
1. Let _matcher_ be ? GetMethod(_R_, @@matchAll).
1. Let _matcher_ be ? GetMethod(_regexp_, @@matchAll).
`String.prototype` methods, including `String.prototype.match`, generally allow `null`/`undefined` as parameters and implicitly coerce them to the expected type. Why should `String.prototype.matchAll` differ here?
Good catch; this is missing a "neither undefined nor null, then" check.
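For reference, this is the coercion behavior being asked for, shown first with `String.prototype.match`. The `matchAll` line reflects how engines that ship the method behave today, not necessarily the PR revision under review, and assumes a runtime where `matchAll` is available.

```javascript
// undefined is treated like an empty pattern, null like the pattern "null".
const viaUndefined = "abc".match(undefined);
const viaNull = "a null b".match(null);

console.log(viaUndefined[0] === "", viaNull[0], viaNull.index); // true "null" 2

// Shipped matchAll coerces the same way (with the "g" flag applied).
const nullIndices = [..."null and null".matchAll(null)].map((m) => m.index);
console.log(nullIndices); // [ 0, 9 ]
```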
spec.emu
Outdated
<p>The abstract operation _MatchAllIterator_ performs the following steps:</p>
<emu-alg>
1. If ? IsRegExp(_R_) is not *true*, throw a *TypeError* exception.
1. Perform ! RequireObjectCoercible(_regexp_).
1. If ? IsRegExp(_regexp_) is `true`, then
`RegExp.prototype` methods, including `RegExp.prototype[@@match]`, generally throw a TypeError if the this-value is not an object. Why should `RegExp.prototype[@@matchAll]` differ here?
It doesn't? `Perform ! RequireObjectCoercible(_regexp_).` throws a TypeError if `regexp` is not an object.
It doesn't cover `RegExp.prototype[@@matchAll].call(true, "")` which should throw a TypeError, because `true` is not an object.
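The distinction can be seen with existing built-ins, as a sketch: `RequireObjectCoercible` (used by `String.prototype` methods) only rejects `undefined` and `null`, while `RegExp.prototype[@@match]` rejects every non-object this-value. `throwsTypeError` is an illustrative helper, not part of the proposal.

```javascript
// Illustrative helper: reports whether fn throws a TypeError.
function throwsTypeError(fn) {
  try {
    fn();
    return false;
  } catch (e) {
    return e instanceof TypeError;
  }
}

// RequireObjectCoercible lets primitives such as `true` through...
const coerced = String.prototype.trim.call(true);
// ...and only rejects undefined and null.
const rejectsUndefined = throwsTypeError(() => String.prototype.trim.call(undefined));

// RegExp.prototype methods require the this-value to be an object.
const rejectsPrimitive = throwsTypeError(() =>
  RegExp.prototype[Symbol.match].call(true, "")
);

console.log(coerced, rejectsUndefined, rejectsPrimitive); // "true" true true
```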
spec.emu
Outdated
1. Let _global_ be *true*.
1. Let _fullUnicode_ be *false*.
1. Let _lastIndex_ be *0*.
1. Assert: ! Get(_matcher_, `"lastIndex"`) is *0*.
1. Let _S_ be ? ToString(_O_).
This should be moved into `String.prototype.matchAll` resp. `RegExp.prototype[@@matchAll]` for consistency with other `String.prototype` resp. `RegExp.prototype` methods.
i'm not sure what you mean by "resp." here, can you elaborate?
It sounds like #33 (comment) is the only remaining item on this PR; once that's resolved, it'd be great to get an approval so this can be merged :-) Absolutely I'd love future issues to be filed if implementation experience indicates anything.
@anba with the current state of the PR,
String.prototype.matchAll(regexp)
MatchAllIterator ( regexp, O )
Ah, thanks, you're right. In this case that's a bug, because the
And
aha, thanks :-) that's why i'd added it originally. fixing that too.
3547095 to 2197b6c
k; the PR is now updated; since the thread is quite long, can anyone (@littledan, @anba) post any remaining blockers for this PR - I would greatly appreciate further changes to be filed as separate issues, so I can consider those with a clean slate - as normal comments? (Linking to an existing review thread is totally great too)
There are two nits in MatchAllIterator:
Apart from that, I'd still like to see the And I'm not sure about the first And finally, probably also for a different issue, the iterator calls
I've added some initial patches for SpiderMonkey at https://bugzilla.mozilla.org/show_bug.cgi?id=1435829.
Edit:
    function f() {
      var s = "acbcbcab";
      // Simple RegExp so we don't measure the internal regular expression engine.
      var r = /a/g;
      var q = 0;
      var t = dateNow();
      for (var i = 0; i < 500000; ++i) {
        var m = s.match(r);
        if (m === null)
          continue;
        for (var x of m) {
          q += x.length;
        }
      }
      return [dateNow() - t, q];
    }
    for (var i = 0; i < 10; ++i) print(f());
Against the
    function f() {
      var s = "acbcbcab";
      var r = /a/g;
      var q = 0;
      var t = dateNow();
      for (var i = 0; i < 500000; ++i) {
        for (var m of s.matchAll(r)) {
          q += m[0].length;
        }
      }
      return [dateNow() - t, q];
    }
    for (var i = 0; i < 10; ++i) print(f());
The native
    class RegExpStringIter {
      constructor(regexp, string) {
        this.regexp = regexp;
        this.string = string;
        this.done = false;
      }
      [Symbol.iterator]() {
        return this;
      }
      next() {
        var value = null;
        if (!this.done) {
          value = this.regexp.exec(this.string);
          this.done = value === null || !this.regexp.global;
        }
        return {value, done: value === null};
      }
    }
    String.prototype.matchAll = function(regexp) {
      "use strict";
      return new RegExpStringIter(regexp, this);
    };
2197b6c to 81ab6e7
Done.
Unintentional; I've fixed that in master. Thanks! If that takes care of all the issues for this PR, I'd appreciate a PR review approval :-D
There's still that issue with the IsRegExp call in the else-branch of MatchAllIterator. Do we also want to move the discussion for that one to a different PR?
#34 is about the first IsRegExp call, but the other IsRegExp call is still problematic, as discussed in #33 (comment) and #33 (comment).
I've changed the incorrect lastIndex assertion per #33 (comment); could we move all "call IsRegExp less" discussion to #34? I'm happy to update its description.
…egExpStringIteratorPrototype% Tests were updated and assuming tc39/proposal-string-matchall#33 will be merged.
…matchall], and %RegExpStringIteratorPrototype% Tests were updated and assuming tc39/proposal-string-matchall#33 will be merged.
Thanks for all the feedback so far! I'm going to merge this; let's continue further review in separate issues/PRs.
Per #32.
cc @anba