Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reference for stage 3 regex-escaping #36928

Open
wants to merge 8 commits into
base: main
Choose a base branch
from

Conversation

Josh-Cena
Copy link
Member

Part of #36915

@Josh-Cena Josh-Cena requested a review from a team as a code owner November 22, 2024 19:05
@Josh-Cena Josh-Cena requested review from chrisdavidmills and hamishwillee and removed request for a team and chrisdavidmills November 22, 2024 19:05
@github-actions github-actions bot added Content:JS JavaScript docs size/m [PR only] 51-500 LoC changed labels Nov 22, 2024
Copy link
Contributor

github-actions bot commented Nov 22, 2024

Preview URLs

Flaws (1)

Note! 4 documents with no flaws that don't need to be listed. 🎉

URL: /en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp/escape
Title: RegExp.escape()
Flaw count: 1

  • bad_bcd_queries:
    • No BCD data for query: javascript.builtins.RegExp.escape
External URLs (1)

URL: /en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp/escape
Title: RegExp.escape()

(comment last updated: 2024-11-27 06:19:49)

@hamishwillee
Copy link
Collaborator

hamishwillee commented Nov 25, 2024

@Josh-Cena Looks great. I don't fully understand this gist linked from the proposal. But I take it to mean there might be risky places to insert the literal, even escaped. I assume this is to an issue because you would have mentioned it.

As an aside, you link to the escape sequence https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Regular_expressions#escape_sequences
The proposal escapes space in its example - I don't see space mentioned in the list. Might not be important - just noting in case it is an omission.

@@ -41,7 +41,7 @@ A new string, with all matches of a pattern replaced by a replacement.

This method does not mutate the string value it's called on. It returns a new string.

Unlike [`replace()`](/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/replace), this method would replace all occurrences of a string, not just the first one. This is especially useful if the string is not statically known, as calling the [`RegExp()`](/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp/RegExp) constructor without escaping special characters may unintentionally change its semantics.
Unlike [`replace()`](/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/replace), this method would replace all occurrences of a string, not just the first one. This is especially useful if the string is not statically known, as calling the [`RegExp()`](/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp/RegExp) constructor without escaping special characters may unintentionally change its semantics. (You can also use {{jsxref("RegExp.escape()")}} to make the replacement string a literal pattern, but that is more indirection than just calling `replaceAll()`.)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what you're trying to say with ", but that is more indirection than just calling replaceAll().)"

It reads as though you are not recommending using RegExp.escape() for this case of the string for the regexp constructor.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aye, I'm not recommending RegExp.escape. I'm just preempting someone saying, "Buuuut you can actually use replace() + RegExp(), you just need to escape it!!"

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps it is actually the "This is especially useful if" part that is confusing me - why is being able to replace all occurrences of a string especially useful if the string is not statically known?

Or to put it another way, don't I have to escape the pattern I use in this method too?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is being able to replace all occurrences of a string especially useful if the string is not statically known?

I mean, that doesn't sound like an obscure use case at all, is it? Maybe the replacement string comes from user configuration, maybe it's retrieved from a database...

don't I have to escape the pattern I use in this method too?

In replaceAll? No, since the idea is that you can just do replaceAll(oldString, newString). The problem with replace(oldString, newString) is that it only does the replacement once, so if you want to replace all occurrences of oldString, you have to first convert it to a global regex, but the conversion is non-trivial.

Copy link
Collaborator

@hamishwillee hamishwillee Nov 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand why you wouldn't use replace() and I understand why you would use replaceAll for a static string.

But replaceAll takes a pattern that can be a regexp,

Copy link
Member Author

@Josh-Cena Josh-Cena Nov 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume by "static string" you just mean "string"? Since it doesn't matter where the string comes from. For regexp there's a difference between "static regexp" and "dynamically assembled regexp" though since the latter is subject to regex syntax injection. String replacement is safe. Regex replacement should preferably only be used with static regexes. It happens that replace() and replaceAll(), when taking a regex, have exactly the same behavior (and replaceAll() throws if the regex is non-global), so yeah you could say that using replaceAll with a regex is useless since it's always equivalent to a replace anyway.

The point is, there are three ways to globally replace a string, in the order when they became available:

replace(new RegExp(oldS, "g"), newS); // Unsafe
replaceAll(oldS, newS); // Safe
replace(new RegExp(RegExp.escape(oldS), "g"), newS); // Safe, too long

@Josh-Cena
Copy link
Member Author

Oh I forgot to ask but @ljharb @bakkot your reviews will be appreciated too. Thanks!


## See also

- [Polyfill of `RegExp.escape` in `core-js`](https://github.com/zloirock/core-js#regexp-escaping)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are you still only linking to core-js polyfills? if not, it'd be great to include https://www.npmjs.com/package/regexp.escape

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I personally would not object, but I would need to see what others say. In the meantime let's hold the same policy that other maintainers have found acceptable, which is to only consistently include core-js polyfills. Hope you would understand.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. Where can I go to revisit this discussion? Last time I tried it never went anywhere, and another polyfill maintainer tried recently and it seemed to get overly heated.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll discuss it internally.


A new string that can be safely used as a literal pattern for the {{jsxref("RegExp/RegExp", "RegExp()")}} constructor. Namely, the following things in the input string are replaced:

- The first character of the string, if it's either a decimal digit (0–9) or ASCII letter (a–z, A–Z), is escaped using the `\x` [character escape](/en-US/docs/Web/JavaScript/Reference/Regular_expressions/Character_escape) syntax. For example, `RegExp.escape("foo")` returns `"\\x66oo"` (here and after, the two backslashes in a string literal denote a single backslash character). This step ensures that if this escaped string is embedded into a bigger pattern where it's immediately preceded by `\0`, `\c`, `\x0`, `\u000`, etc., the leading character doesn't get interpreted as part of the escape sequence.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- The first character of the string, if it's either a decimal digit (0–9) or ASCII letter (a–z, A–Z), is escaped using the `\x` [character escape](/en-US/docs/Web/JavaScript/Reference/Regular_expressions/Character_escape) syntax. For example, `RegExp.escape("foo")` returns `"\\x66oo"` (here and after, the two backslashes in a string literal denote a single backslash character). This step ensures that if this escaped string is embedded into a bigger pattern where it's immediately preceded by `\0`, `\c`, `\x0`, `\u000`, etc., the leading character doesn't get interpreted as part of the escape sequence.
- The first character of the string, if it's either a decimal digit (0–9) or ASCII letter (a–z, A–Z), is escaped using the `\x` [character escape](/en-US/docs/Web/JavaScript/Reference/Regular_expressions/Character_escape) syntax. For example, `RegExp.escape("foo")` returns `"\\x66oo"` (here and after, the two backslashes in a string literal denote a single backslash character). This step ensures that if this escaped string is embedded into a bigger pattern where it's immediately preceded by `\1`, `\x0`, `\u000`, etc., the leading character doesn't get interpreted as part of the escape sequence.

It's true that the escapes are also necessary for the contexts after \0 and \c, but understanding why requires knowledge of the arcana of the regexp grammar which no one should have. Whereas \1 is something readers actually should be familiar with.

In other words, I don't want people to read this sentence and discover that \cA or \06 are things they could put in their regexes, even though it is technically accurate that these contexts are among those we considered when designing this feature.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait, is \cA deprecated? It's still mentioned in https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Regular_expressions/Character_escape. Also for \0, my intention was that "this \0 was actually supposed to be a null character by itself, but now it's inadvertently joined with something else". OTOH, \1 is mentioned in https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Regular_expressions#escape_sequences as already-deprecated:

\ followed by any other digit character becomes a legacy octal escape sequence, which is forbidden in Unicode-aware mode.

So my hope is that \1 would never appear as a valid regex sequence anyway.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait, is \cA deprecated?

Not in the sense of being in annex B, but it's still not something people should use (because no one will know what it means).

now it's inadvertently joined with something else

Yeah but the only way it can be inadvertently joined with something else is to form a legacy octal escape sequence, which people usually shouldn't have to know about.

So my hope is that \1 would never appear as a valid regex sequence anyway.

That wording could use updating. \1 is a perfectly normal thing to appear in a regex as a reference to the first capturing group: /(.)\1/ is a regex which matches any character appearing twice in a row. Similarly \12 in a regex with 12 capturing groups. So you don't need to know about legacy octal escape sequences to understand why you need to escape leading digits.

It only falls back to being a legacy octal escape in the case that there are fewer capturing groups than the number it would represent.

Copy link
Member Author

@Josh-Cena Josh-Cena Nov 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It only falls back to being a legacy octal escape in the case that there are fewer capturing groups than the number it would represent.

Oh. I get what you mean. Forgot about backreferences. Thanks!

Yeah but the only way it can be inadvertently joined with something else is to form a legacy octal escape sequence, which people usually shouldn't have to know about.

No, I meant this:

const pattern = new RegExp(`\\0${userString}`, "gu");
// Intention: search for null character followed by some other user-provided text

If userString starts with a digit, this will be a syntax error (because u mode), or be something else. There's no "intention" to cause bad things to happen, it just doesn't do what you expect it to do.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Content:JS JavaScript docs size/m [PR only] 51-500 LoC changed
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants