Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Operations to avoid .lastIndex hacks? #5

Closed
rauschma opened this issue Jul 4, 2024 · 10 comments
Closed

Operations to avoid .lastIndex hacks? #5

rauschma opened this issue Jul 4, 2024 · 10 comments

Comments

@rauschma
Copy link
Contributor

rauschma commented Jul 4, 2024

Are there any ideas for operations so that we don’t need .lastIndex hacks anymore (I use them to tokenize efficiently)?

Maybe (with better names...):

  • matchAt(string, regex, pos): {match, newPos}
    • Eventually: String.prototype.matchAt()
  • testAt(regex, string, pos): {isMatch, newPos}
    • Eventually: RegExp.prototype.testAt()
@slevithan
Copy link
Owner

This library's implementation also liberally uses manual lastIndex adjustments. 😊

Do you envision these methods being top-level exports? I'm very open to discussion, but right now regex is exclusively focused on being a better regex constructor, and doesn't introduce utility functions.

The XRegExp library (which regex is a more focused and lightweight spiritual successor to) includes optional pos and sticky arguments for the XRegExp.test and XRegExp.exec methods, for just this purpose. (Its sticky argument predates ES6 adding flag /y, but uses /y when available.) And the hackiness of lastIndex is something I've extensively ranted about, e.g. here (from 2010). So I'm very interested in your ideas, but I'm not sure if it's the right fit for this library.

@slevithan
Copy link
Owner

slevithan commented Jul 5, 2024

Your implied proposal for new JS String.prototype.matchAt and RegExp.prototype.testAt methods is a cool idea, since a pos argument probably can't be added to the existing match and test methods due to its overlap with lastIndex handling. By creating new methods, they could always ignore lastIndex, like XRegExp's corresponding methods do. A couple thoughts though:

  • IMO RegExp.prototype.execAt would be preferable to String.prototype.matchAt, since match without /g is just an alias for exec, and it avoids (1) the need to consider how to change handling with /g, and/or (2) the restriction on not using it with /g without changing the result type.
  • Returning a newPos property as you proposed might not be needed or preferable, since the match object's index property + the length of match[0] gives you what you need. Maybe an end property could be added to match objects returned by these methods. Or a nextPos property which automatically adds + 1 for zero-length matches.
  • Potentially String.prototype.matchAllAt could return an iterator like matchAll.

@rauschma
Copy link
Contributor Author

rauschma commented Jul 5, 2024

Good points!

  • I’m curious how many people would actually use such methods. My poll: https://fosstodon.org/@rauschma/112735659502970884
  • If such functionality were to be part of regex, I’d add another entry to "exports", so that it only shows up in bundles if it is actually used.
  • .matchAllAt() may be too niche (I can’t think of a use case).

Below is my preliminary implementation. execAt() adds a property .nextIndex (as you suggested, but a different name). I’ll update it if anyone comes up with improvements.

// Updated: 2024-07-11
function execAt(regExp, str, index) {
  setLastIndex(regExp, index);
  return regExp.exec(str);
}
function testAt(regExp, str, index) {
  setLastIndex(regExp, index);
  return regExp.test(str);
}

function setLastIndex(regExp, lastIndex) {
  if (!regExp.sticky && !regExp.global) {
    throw new TypeError(
      `Either flag /g or flag /y must be set for matching at (or after) a particular index: ${regExp}`
    );
  }
  regExp.lastIndex = lastIndex;
}

@slevithan
Copy link
Owner

  • I like your nextIndex name -- nice symmetry.
  • Your implementation does not add + 1 to nextIndex after zero-length matches, so people will need to check the length of the last match if using it in a loop. My personal preference would be to auto-increment nextIndex (since it doesn't take anything away -- if they want the non-incremented version there is already a way to compute it), and to set it to zero after zero-length matches that occur at the end of the target string (is there another better approach?)
  • I'm not understanding the purpose of regExp.lastIndex = savedLastIndex at the end. IMO lastIndex should be set the same as if you were using standard exec/test, including by not modifying it if the regex doesn't use /g.
  • There are significant downsides and no user benefits I can think of from throwing if flag /g or /y is not set.
    • Suggestion for accomplishing this when flag /g or /y is not set: Copy the regex with new RegExp(regExp, regExp.flags + 'g'), then use the copy to search.

@rauschma
Copy link
Contributor Author

rauschma commented Jul 7, 2024

Your implementation does not add + 1 to nextIndex after zero-length matches

I defer to your expertise:

  • I personally would be OK with checking matchObj.index === matchObj.nextIndex.
  • But you have seen more people use regular expressions, so you know better what most of them need/want.
  • I made nextIndexAfterLength an option. Then some people could use (e.g.) null. Not sure if that is useful though.

I'm not understanding the purpose of regExp.lastIndex = savedLastIndex at the end.

I’d like to pretend that RegExp doesn’t store any state. I’d expect an actual built-in method to completely bypass .lastIndex and to be usable even with frozen RegExps.

There are two predecents:

  • .matchAll() considers .lastIndex but doesn’t change it. (I’d rather it didn’t consider .lastIndex).
  • .replaceAll() ignores .lastIndex and resets it to 0.

My approach feels the cleanest to me but maybe we should follow the precedent of .replaceAll().

There are significant downsides and no user benefits I can think of from throwing if flag /g or /y is not set.

I see two options:

  1. Let /y dictate if matching is sticky or not. Then flags matter and, IMO, we should complain if they are not right.
  • That is similar to how .matchAll() and .replaceAll() work.
  • It makes the polyfill a tiny bit more efficient. But that shouldn’t really influence the design – if we expect this operation to eventually be built in.
  1. Provide non-sticky and sticky versions – e.g.:
  • execAt and execAtSticky (etc.)
  • execAtOrLater and execAt (etc.)

In case # 2, we can ignore /g and /y and use your approach.

(I have updated the code, but not yet switched to # 2.)

@slevithan
Copy link
Owner

slevithan commented Jul 8, 2024

I made nextIndexAfterLength an option. Then some people could use (e.g.) null. Not sure if that is useful though.

IMO it's probably overly complicated. One way to remove the complexity would be to go further than removing this and just not provide any kind of nextIndex property. It's not needed since its derivable, and it creates questions without obvious answers, like this one.

I’d like to pretend that RegExp doesn’t store any state. I’d expect an actual built-in method to completely bypass .lastIndex and to be usable even with frozen RegExps.

IMO it's better to maintain the state (if not frozen) and all the expectations from precedent, for the sake of all the legacy methods that rely on lastIndex (most of which are not actually legacy), but then abstract away the need to interact with these bits, via new methods and arguments. So bypass lastIndex for handling, but still set it based on precedent whenever possible (with the obvious precedents being the behavior of exec and test).

There are two predecents:

  • .matchAll() considers .lastIndex but doesn’t change it. (I’d rather it didn’t consider .lastIndex).
  • .replaceAll() ignores .lastIndex and resets it to 0.

I'd be curious to hear what TC39 thinks, but I think these might not be the right precedents.

  • All of the regex/string methods except matchAll reset lastIndex to 0 when they've reached the end of the string and the regex uses /g, so replaceAll doing this is standard behavior.
  • All of the string methods except matchAll (and except match without /g, but this is just an alias for exec which is not a string method) ignore lastIndex for their search start position.
  • matchAll not updating lastIndex feels like it's related to matchAll returning an iterator, and not wanting it to be possible for lastIndex to be manually adjusted between next() calls. The same is true for replace/replaceAll with /g not wanting lastIndex to be modified between matches (from within callbacks used as replacements).

I agree with you on wishing that matchAll didn't consider lastIndex, since the precedent from other string methods (as opposed to RegExp methods, and I consider non-global match to effectively also not be a string method since it is an alias for the canonical RegExp method exec) is that lastIndex doesn't set the search start position. But, perhaps, it can be justified, since matchAll essentially just repeatedly calls the RegExp method exec, and RegExp methods do consider lastIndex.

[...] Then flags matter and, IMO, we should complain if they are not right.

matchAll and replaceAll throw for regexes without /g, but that is because of the "All" in their names. I don't see anything incorrect about allowing both /g and non-/g with execAt/testAt, since exec/test already allow /g and non-/g (as do string methods search/split). I think the only impact of /g with execAt/testAt should be on whether lastIndex is updated. And I don't think separate sticky versions are needed. IMO /y should be respected with execAt/testAt, just like it's respected (or ignored if irrelevant, as with split) by every other method.


Aside: My high-level view is that /g, /y, and lastIndex were all API design mistakes. Flag /g would have been better handled as an argument (and now, after the introduction of replaceAll, even such an argument is not needed anywhere, if match didn't have a non-/g mode that was an alias for exec, which again I think was a mistake). lastIndex would have been better as a pos argument for exec and test, or handled via separate execAt/testAt methods that we're now discussing. And instead of /y, it would have been preferable to follow the prior art of the more flexible \G assertion from many other regex flavors, rather than standardizing Firefox v3's (at the time) non-standard flag.

ES3's flag /g copied Perl, but didn't copy Perl's system for tracking position on the search target rather than on the regex. Some other flavors don't use /g, and instead accept an argument for pos. JS's features for this can be worked with/around, but it's needlessly complicated, and (as you obviously know!) stateful regexes are a frequent source of bugs and surprise.

@slevithan
Copy link
Owner

slevithan commented Jul 11, 2024

Another aside:

There are a few different approaches that could be used to simplify, clean up, and improve JS's regex APIs. And I'll encourage and be happy to help anyone who wants to build their own version, based on whatever they think is best. Especially if it's someone like the awesome @rauschma who is well informed by all the existing complexity and legacy! 😊 Unfortunately, I don't think many people will use it if it's not native (added through new TC39 proposals), since JS's native APIs are good enough, and most people's usage of regexes is very simple.

The idea that there's not much appetite for non-native solutions to this stems from my experience with XRegExp. XRegExp already provided the APIs to totally disregard lastIndex, /g, and /y (making regexes act as if they were stateless), via its own XRegExp.exec / test / forEach / match / replace / replaceEach / split. But from what I can tell, almost no one used their pos, sticky, and scope arguments. So I didn't have much incentive to further improve them.

@rauschma
Copy link
Contributor Author

rauschma commented Jul 11, 2024

IMO it's probably overly complicated. One way to remove the complexity would be to go further than removing this and just not provide any kind of nextIndex property. It's not needed since its derivable, and it creates questions without obvious answers, like this one.

Good point.

IMO it's better to maintain the state (if not frozen) and all the expectations from precedent, for the sake of all the legacy methods that rely on lastIndex (most of which are not actually legacy), but then abstract away the need to interact with these bits, via new methods and arguments. So bypass lastIndex for handling, but still set it based on precedent whenever possible (with the obvious precedents being the behavior of exec and test).

Makes sense.

I don't see anything incorrect about allowing both /g and non-/g with execAt/testAt

I would argue:

  • If we stay close to the existing API (as we did with .lastIndex) then it makes more sense not to work around the constraints imposed by /g and /y.
  • The code of a polyfill is simpler if we don’t make those flags optional.

Aside: My high-level view is that /g, /y, and lastIndex were all API design mistakes.

I very much agree! I used \G via Rust’s fancy-regex and it was lovely.

I’m wondering if the upcoming flag /x would give us the opportunity to overhaul RegExp. My thoughts (based on your ideas) are here: https://gist.github.com/rauschma/8db338a96f9ead8df0714d233e81fed4

@slevithan
Copy link
Owner

slevithan commented Jul 11, 2024

I would argue:

  • If we stay close to the existing API (as we did with .lastIndex) then it makes more sense not to work around the constraints imposed by /g and /y.

This is a good explanation for your PoV--thanks! I think I still favor the flag-agnostic approach I described, but this helps me understand where you're coming from, and it's a totally reasonable perspective.

I’m wondering if the upcoming flag /x would give us the opportunity to overhaul RegExp. My thoughts (based on your ideas) are here: https://gist.github.com/rauschma/8db338a96f9ead8df0714d233e81fed4

You're very generous to credit me for inspiring this. This is a very ambitious proposal that essentially deprecates a whole swath of things (including recently added features like /d and matchAll) and introduces a lot of features simultaneously. I think I love it, but I'm curious how receptive TC39 would be. I think you'd have to make your case as strong as can be, and I think they'd want you to split it up regardless (I'll add some thoughts about this on this on the gist). If you get traction, I'd be happy to co-champion the proposal if you want help. 😊

A couple things that might be relevant/helpful:

@slevithan
Copy link
Owner

Gonna go ahead and close since this proposal now has a home and it seems like the proposed methods might be more appropriate as their own polyfill rather than as new exports of regex. But I'm happy to reopen if more discussion is helpful here!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants