introduce unescape module #60261
r? @eddyb (rust_highfive has picked a reviewer for you, use r? to override)
Meta: for reviewing convenience it's better to update UI test outputs and satisfy tidy to make CI green, even if the changes in test results are temporarily wrong / intended to disappear.
Apparently, running |
Force-pushed from a39c555 to f91fbdc.
Question: what happens if a literal is lexed, but never "parsed properly"? (P.S. I haven't reviewed everything yet, will continue tomorrow.)
Good question! Given that The following compiles, while it shouldn't (the literal with six `F` digits is out of range for `char`):

    macro_rules! erase {
        ($($tt:tt)*) => {}
    }
    fn main() {
        erase! {
            '\u{FFFFFF}'
        }
    }

If we pursue the approach in this PR, then we should run
Hm, or is the above example expected behavior? We don't check ranges of integer literals, for example:

    macro_rules! erase {
        ($($tt:tt)*) => {}
    }
    fn main() {
        erase!(999u8);
    }

For chars, we do check that there are at most six hex digits in the lexer, but we only do a precise check for range and surrogates in the parser, which seems somewhat arbitrary.
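The split between the coarse lexer check and the precise parser check mentioned above can be illustrated with a small sketch. This is not the compiler's code: `UnicodeEscapeError`, `lexer_check`, and `precise_check` are hypothetical names invented for the example, and the sketch ignores the `_` separators that real `\u{...}` escapes allow.

```rust
/// Hypothetical error type for this sketch; not the compiler's actual enum.
#[derive(Debug, PartialEq)]
enum UnicodeEscapeError {
    Empty,
    TooManyDigits,
    NotHex,
    OutOfRange,
    Surrogate,
}

/// Coarse, lexer-style check on the text between `\u{` and `}`:
/// one to six hex digits, nothing more.
fn lexer_check(digits: &str) -> Result<u32, UnicodeEscapeError> {
    if digits.is_empty() {
        return Err(UnicodeEscapeError::Empty);
    }
    if !digits.chars().all(|c| c.is_ascii_hexdigit()) {
        return Err(UnicodeEscapeError::NotHex);
    }
    if digits.len() > 6 {
        return Err(UnicodeEscapeError::TooManyDigits);
    }
    // Cannot fail: at most six ASCII hex digits always fit in a u32.
    u32::from_str_radix(digits, 16).map_err(|_| UnicodeEscapeError::NotHex)
}

/// Precise, parser-style check: the value must be a Unicode scalar value,
/// i.e. at most 0x10FFFF and not a surrogate.
fn precise_check(value: u32) -> Result<char, UnicodeEscapeError> {
    if (0xD800..=0xDFFF).contains(&value) {
        return Err(UnicodeEscapeError::Surrogate);
    }
    char::from_u32(value).ok_or(UnicodeEscapeError::OutOfRange)
}
```

With these definitions, `lexer_check("FFFFFF")` succeeds because the text has exactly six hex digits, while `precise_check(0xFFFFFF)` reports `OutOfRange`. A literal that is lexed but never reaches the second check would therefore slip through, which is the situation the `erase!` example demonstrates.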
@matklad I think for out-of-range integer literals, we warn and truncate later in the compilation, but there's not much we can do for

That said, maybe literals should be checked by AST -> HIR lowering, i.e. only unescaped once, to store both versions, or just the unescaped one, in HIR?

One notable exception to "AST literals can be opaque, only HIR needs unescaping" is string literals used in attributes (such as
I think we should do validation before macro expansion, for two reasons at least:
As for actual unescaping, I agree that, abstractly, it makes sense to do it as late as possible. Ideally (not necessarily practically), running So my current plan is:
introduce unescape module

A WIP PR to gauge early feedback

Currently, we deal with escape sequences twice: once when we [lex](https://github.com/rust-lang/rust/blob/112f7e9ac564e2cfcfc13d599c8376a219fde1bc/src/libsyntax/parse/lexer/mod.rs#L928-L1065) a string, and a second time when we [unescape](https://github.com/rust-lang/rust/blob/112f7e9ac564e2cfcfc13d599c8376a219fde1bc/src/libsyntax/parse/mod.rs#L313-L366) literals. Note that we also produce different sets of diagnostics in these two cases.

This PR aims to remove this duplication, by introducing a new `unescape` module as a single source of truth for character escaping rules. I think this would be a useful cleanup by itself, but I also need this for #59706.

In the current state, the PR has `unescape` module which fully (modulo bugs) deals with string and char literals. I am quite happy about the state of this module.

What this PR doesn't have yet are:

* [x] handling of byte and byte string literals (should be simple to add)
* [x] good diagnostics
* [x] actual removal of code from lexer (giant `scan_char_or_byte` should go away completely)
* [ ] performance check
* [x] general cleanup of the new code

Diagnostics will be the most labor-consuming bit here, but they are mostly a question of just correctly adjusting spans to sub-tokens. The current setup for diagnostics is that `unescape` produces a plain old `enum` with various problems, and they are rendered into `Handler` separately. This bit is not actually required (it is possible to just pass the `Handler` in), but I like the separation between diagnostics and logic this approach imposes, and such separation should again be useful for #59706.

cc @eddyb, @petrochenkov
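To make the described separation concrete, here is a minimal sketch of the shape such a module could take: the unescaping logic reports plain `enum` errors with byte ranges, and a separate function turns them into messages. All names here (`EscapeError`, `unescape_str_sketch`, `render`, `unescape_collect`) are hypothetical, only a handful of simple escapes are handled, and `render` merely formats a string where the compiler would go through its `Handler`.

```rust
use std::ops::Range;

/// Hypothetical error enum for this sketch; the real module's variants may differ.
#[derive(Debug, PartialEq, Clone, Copy)]
enum EscapeError {
    LoneSlash,
    InvalidEscape,
}

/// Walks the body of a string literal (without the quotes) and reports each
/// character, or each problem, together with its byte range in `src`.
fn unescape_str_sketch<F>(src: &str, mut callback: F)
where
    F: FnMut(Range<usize>, Result<char, EscapeError>),
{
    let mut chars = src.char_indices();
    while let Some((start, c)) = chars.next() {
        if c != '\\' {
            callback(start..start + c.len_utf8(), Ok(c));
            continue;
        }
        // A backslash: look at the character that follows it.
        let (res, end) = match chars.next() {
            None => (Err(EscapeError::LoneSlash), start + 1),
            Some((i, e)) => {
                let res = match e {
                    'n' => Ok('\n'),
                    't' => Ok('\t'),
                    'r' => Ok('\r'),
                    '0' => Ok('\0'),
                    '\\' => Ok('\\'),
                    '"' => Ok('"'),
                    '\'' => Ok('\''),
                    _ => Err(EscapeError::InvalidEscape),
                };
                (res, i + e.len_utf8())
            }
        };
        callback(start..end, res);
    }
}

/// Rendering lives apart from the logic. Here it only formats a message.
fn render(src: &str, range: Range<usize>, err: EscapeError) -> String {
    format!(
        "{:?} at {}..{}: `{}`",
        err,
        range.start,
        range.end,
        &src[range.start..range.end]
    )
}

/// Convenience wrapper: collect the unescaped text and the rendered errors.
fn unescape_collect(src: &str) -> (String, Vec<String>) {
    let mut out = String::new();
    let mut errors = Vec::new();
    unescape_str_sketch(src, |range, res| match res {
        Ok(c) => out.push(c),
        Err(e) => errors.push(render(src, range, e)),
    });
    (out, errors)
}
```

Because the core function only hands `(range, result)` pairs to a callback, the same logic could plausibly serve callers that render errors differently, which appears to be the reuse the description has in mind.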
☀️ Try build successful - checks-travis
This probably should be tagged with Breaking Change and Waiting on Crater presumably?
@craterbot run mode=check-only
👌 Experiment ℹ️ Crater is a tool to run experiments across parts of the Rust ecosystem.
🚧 Experiment
@rust-timer build bfdcf6d
Success: Queued bfdcf6d with parent 1891bfa, comparison URL.
Finished benchmarking try commit bfdcf6d
Looks like there are no significant perf differences, let's wait and see what crater says
🎉 Experiment
@bors r+
📌 Commit 1835cbe has been approved by
☀️ Test successful - checks-travis, status-appveyor
FWIW, this is now used by rust-analyzer: rust-lang/rust-analyzer#1253