-
Notifications
You must be signed in to change notification settings - Fork 22.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reference for stage 3 regex-escaping #36928
base: main
Are you sure you want to change the base?
Conversation
Preview URLs
Flaws (1)Note! 4 documents with no flaws that don't need to be listed. 🎉 URL:
External URLs (1)URL:
(comment last updated: 2024-11-27 06:19:49) |
files/en-us/web/javascript/reference/global_objects/regexp/escape/index.md
Show resolved
Hide resolved
files/en-us/web/javascript/reference/global_objects/regexp/escape/index.md
Outdated
Show resolved
Hide resolved
files/en-us/web/javascript/reference/global_objects/regexp/escape/index.md
Show resolved
Hide resolved
files/en-us/web/javascript/reference/global_objects/regexp/escape/index.md
Outdated
Show resolved
Hide resolved
@Josh-Cena Looks great. I don't fully understand this gist linked from the proposal. But I take it to mean there might be risky places to insert the literal, even escaped. I assume this is to an issue because you would have mentioned it. As an aside, you link to the escape sequence https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Regular_expressions#escape_sequences |
@@ -41,7 +41,7 @@ A new string, with all matches of a pattern replaced by a replacement. | |||
|
|||
This method does not mutate the string value it's called on. It returns a new string. | |||
|
|||
Unlike [`replace()`](/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/replace), this method would replace all occurrences of a string, not just the first one. This is especially useful if the string is not statically known, as calling the [`RegExp()`](/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp/RegExp) constructor without escaping special characters may unintentionally change its semantics. | |||
Unlike [`replace()`](/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/replace), this method would replace all occurrences of a string, not just the first one. This is especially useful if the string is not statically known, as calling the [`RegExp()`](/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp/RegExp) constructor without escaping special characters may unintentionally change its semantics. (You can also use {{jsxref("RegExp.escape()")}} to make the replacement string a literal pattern, but that is more indirection than just calling `replaceAll()`.) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure what you're trying to say with ", but that is more indirection than just calling replaceAll()
.)"
It reads as though you are not recommending using RegExp.escape()
for this case of the string for the regexp constructor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aye, I'm not recommending RegExp.escape
. I'm just preempting someone saying, "Buuuut you can actually use replace() + RegExp()
, you just need to escape it!!"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps it is actually the "This is especially useful if" part that is confusing me - why is being able to replace all occurrences of a string especially useful if the string is not statically known?
Or to put it another way, don't I have to escape the pattern I use in this method too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is being able to replace all occurrences of a string especially useful if the string is not statically known?
I mean, that doesn't sound like an obscure use case at all, is it? Maybe the replacement string comes from user configuration, maybe it's retrieved from a database...
don't I have to escape the pattern I use in this method too?
In replaceAll
? No, since the idea is that you can just do replaceAll(oldString, newString)
. The problem with replace(oldString, newString)
is that it only does the replacement once, so if you want to replace all occurrences of oldString
, you have to first convert it to a global regex, but the conversion is non-trivial.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand why you wouldn't use replace()
and I understand why you would use replaceAll
for a static string.
But replaceAll
takes a pattern that can be a regexp,
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume by "static string" you just mean "string"? Since it doesn't matter where the string comes from. For regexp there's a difference between "static regexp" and "dynamically assembled regexp" though since the latter is subject to regex syntax injection. String replacement is safe. Regex replacement should preferably only be used with static regexes. It happens that replace()
and replaceAll()
, when taking a regex, have exactly the same behavior (and replaceAll()
throws if the regex is non-global), so yeah you could say that using replaceAll
with a regex is useless since it's always equivalent to a replace
anyway.
The point is, there are three ways to globally replace a string, in the order when they became available:
replace(new RegExp(oldS, "g"), newS); // Unsafe
replaceAll(oldS, newS); // Safe
replace(new RegExp(RegExp.escape(oldS), "g"), newS); // Safe, too long
Co-authored-by: Hamish Willee <[email protected]>
files/en-us/web/javascript/reference/global_objects/regexp/escape/index.md
Outdated
Show resolved
Hide resolved
files/en-us/web/javascript/reference/global_objects/regexp/escape/index.md
Outdated
Show resolved
Hide resolved
files/en-us/web/javascript/reference/global_objects/regexp/escape/index.md
Outdated
Show resolved
Hide resolved
files/en-us/web/javascript/reference/global_objects/regexp/escape/index.md
Outdated
Show resolved
Hide resolved
|
||
## See also | ||
|
||
- [Polyfill of `RegExp.escape` in `core-js`](https://github.com/zloirock/core-js#regexp-escaping) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are you still only linking to core-js polyfills? if not, it'd be great to include https://www.npmjs.com/package/regexp.escape
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I personally would not object, but I would need to see what others say. In the meantime let's hold the same policy that other maintainers have found acceptable, which is to only consistently include core-js polyfills. Hope you would understand.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. Where can I go to revisit this discussion? Last time I tried it never went anywhere, and another polyfill maintainer tried recently and it seemed to get overly heated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll discuss it internally.
…ape/index.md Co-authored-by: Kevin Gibbons <[email protected]>
|
||
A new string that can be safely used as a literal pattern for the {{jsxref("RegExp/RegExp", "RegExp()")}} constructor. Namely, the following things in the input string are replaced: | ||
|
||
- The first character of the string, if it's either a decimal digit (0–9) or ASCII letter (a–z, A–Z), is escaped using the `\x` [character escape](/en-US/docs/Web/JavaScript/Reference/Regular_expressions/Character_escape) syntax. For example, `RegExp.escape("foo")` returns `"\\x66oo"` (here and after, the two backslashes in a string literal denote a single backslash character). This step ensures that if this escaped string is embedded into a bigger pattern where it's immediately preceded by `\0`, `\c`, `\x0`, `\u000`, etc., the leading character doesn't get interpreted as part of the escape sequence. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- The first character of the string, if it's either a decimal digit (0–9) or ASCII letter (a–z, A–Z), is escaped using the `\x` [character escape](/en-US/docs/Web/JavaScript/Reference/Regular_expressions/Character_escape) syntax. For example, `RegExp.escape("foo")` returns `"\\x66oo"` (here and after, the two backslashes in a string literal denote a single backslash character). This step ensures that if this escaped string is embedded into a bigger pattern where it's immediately preceded by `\0`, `\c`, `\x0`, `\u000`, etc., the leading character doesn't get interpreted as part of the escape sequence. | |
- The first character of the string, if it's either a decimal digit (0–9) or ASCII letter (a–z, A–Z), is escaped using the `\x` [character escape](/en-US/docs/Web/JavaScript/Reference/Regular_expressions/Character_escape) syntax. For example, `RegExp.escape("foo")` returns `"\\x66oo"` (here and after, the two backslashes in a string literal denote a single backslash character). This step ensures that if this escaped string is embedded into a bigger pattern where it's immediately preceded by `\1`, `\x0`, `\u000`, etc., the leading character doesn't get interpreted as part of the escape sequence. |
It's true that the escapes are also necessary for the contexts after \0
and \c
, but understanding why requires knowledge of the arcana of the regexp grammar which no one should have. Whereas \1
is something readers actually should be familiar with.
In other words, I don't want people to read this sentence and discover that \cA
or \06
are things they could put in their regexes, even though it is technically accurate that these contexts are among those we considered when designing this feature.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wait, is \cA
deprecated? It's still mentioned in https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Regular_expressions/Character_escape. Also for \0
, my intention was that "this \0
was actually supposed to be a null character by itself, but now it's inadvertently joined with something else". OTOH, \1
is mentioned in https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Regular_expressions#escape_sequences as already-deprecated:
\
followed by any other digit character becomes a legacy octal escape sequence, which is forbidden in Unicode-aware mode.
So my hope is that \1
would never appear as a valid regex sequence anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wait, is
\cA
deprecated?
Not in the sense of being in annex B, but it's still not something people should use (because no one will know what it means).
now it's inadvertently joined with something else
Yeah but the only way it can be inadvertently joined with something else is to form a legacy octal escape sequence, which people usually shouldn't have to know about.
So my hope is that
\1
would never appear as a valid regex sequence anyway.
That wording could use updating. \1
is a perfectly normal thing to appear in a regex as a reference to the first capturing group: /(.)\1/
is a regex which matches any character appearing twice in a row. Similarly \12
in a regex with 12 capturing groups. So you don't need to know about legacy octal escape sequences to understand why you need to escape leading digits.
It only falls back to being a legacy octal escape in the case that there are fewer capturing groups than the number it would represent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It only falls back to being a legacy octal escape in the case that there are fewer capturing groups than the number it would represent.
Oh. I get what you mean. Forgot about backreferences. Thanks!
Yeah but the only way it can be inadvertently joined with something else is to form a legacy octal escape sequence, which people usually shouldn't have to know about.
No, I meant this:
const pattern = new RegExp(`\\0${userString}`, "gu");
// Intention: search for null character followed by some other user-provided text
If userString
starts with a digit, this will be a syntax error (because u
mode), or be something else. There's no "intention" to cause bad things to happen, it just doesn't do what you expect it to do.
Part of #36915