-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Track casing of r-string prefixes in the tokenizer and AST #10314
Conversation
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes sense. Thank you for writing all the documentation. It helps in understanding the code :)
There are a few instances where the documentation is in the form of a question for functions which returns a bool
. I think my preference would be to just write it in a simple way like "Returns true if ..." or "Checks if ...". We could even top it up with an example like Returns true if ... e.g., r"foo"
There are a few nits which I leave at your discretion to apply or not.
crates/ruff_python_ast/src/nodes.rs
Outdated
/// A "raw" format-string, that has an `r` or `R` prefix | ||
RawFormat { uppercase_r: bool }, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should just have two separate variants instead of a boolean value: Raw
and RawUpper
. It's easier to type (FStringPrefix::RawUpper
vs FStringPrefix::RawFormat { uppercase: true }
) and it clearly separates the two variants w.r.t. an enum. What do you think?
This would apply to other *Prefix
enums as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm somewhat torn here. I agree with you that it's easier to type if they're two separate variants. But semantically, they're the same kind of string in Python, just represented in two different ways; having them be two flavours of the same variant feels more expressive of the semantics here. I also think that if I were matching on the prefix in a lint rule (for example), it would be more natural to do this:
match string_expr.flags.prefix() {
StringLiteralPrefix::Regular => ...,
StringLiteralPrefix::Raw { .. } => ...,
}
rather than
match string_expr.flags.prefix() {
StringLiteralPrefix::Regular => ...,
StringLiteralPrefix::RawLower | StringLiteralPrefix::RawUpper => ...,
}
(In most lint rules, I doubt you'll care much whether it's an R
or an r
, since they have the same semantics at runtime for Python.)
Some(StringPrefix::Bytes | StringPrefix::RawBytes { .. }) => { | ||
panic!("Attempting to convert a bytestring into a non-bytestring!") | ||
} | ||
Some(StringPrefix::Format | StringPrefix::RawFormat { .. }) => { | ||
panic!("Attempting to convert an f-string into a non-fstring!") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's unfortunate that we need to use panic
here but I understand the reasoning behind it which is that we're going from a wide type containing all possible string prefix to a narrow one for a specific type of string (for example StringPrefix
to FStringPrefix
).
I think this should actually be an unreachable
instead as our lexer will raise an error for invalid combination of prefixes used.
I'm sure you must have thought of this but if possible, I would try to find any type-safe way to do this conversion to avoid a potential panic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree this is somewhat unfortunate. It might be worth noting, though, that this isn't really any less type-safe than previously -- the panic!
s in this function are really just doing the work for us that the debug_assert!
s were previously.
The only other way I can see of doing this would be to implement TryFrom
instead of From
here -- return an Err
variant instead of panicking. But that feels incorrect to me. If we hit the panicking branch, I think we should be immediately bailing from this function, since that indicates a logical error in our code somewhere. If we've written the tokenizer and parser correctly, it should be impossible for us to get here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the
panic!
s in this function are really just doing the work for us that thedebug_assert!
s were previously.
You're correct although they differ from a user perspective. debug_assert!
will only panic on debug builds while panic
will bail out even on the release version.
I think we should be immediately bailing from this function, since that indicates a logical error in our code somewhere. If we've written the tokenizer and parser correctly, it should be impossible for us to get here.
Yes, this is the reason I suggest using the unreachable
macro instead. As a developer it's hard to reason about a panic
usage unless there's an indicator like a comment. However, if unreachable
is used, one can say that this is indeed a logical error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense — I'll switch to unreachable
, thanks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense — I'll switch to
unreachable
, thanks
Done in 3ee6454
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One way in which we could make this type-safe would be to emit different tokens for string literals and bytes literals. Then we would be able to have the same three-way f-string/bytestring/everything-else
distinction at the token level as we do at the AST level, which would mean we would be able to have three different bitflags at the token level. We wouldn't need to do the type narrowing here when creating an AST node from a token because the type would already be narrowed.
But that would be quite a large change to make.
Hmm I'm not sure if we should track this. It goes a bit beyond what an AST tends to capture. But it's also not clear to me what the downsides would be. What does the formatter use the lower prefix for? I thought we always normalize to lower case |
The formatter normalises all prefixes to lowercase, except for raw strings, where it retains the casing in line with what black does for raw strings: https://black.readthedocs.io/en/stable/the_black_code_style/current_style.html#r-strings-and-r-strings
|
I see and I guess it's fine adding it now that we already track prefixes The reason I'm hesitant to add such fine grained data to the ast are
|
Hmm, but internally we're just storing the new information on the same bitflag, no? The enum variants are only created from the bitflag "on demand" when the |
93bfa37
to
7a8a214
Compare
Thanks @dhruvmanila! I've implemented many of your suggestions, in particular most of the ones relating to naming and documentation |
This was my initial reaction too, but...
|
3ee6454
to
21e2911
Compare
It depends on the CST design. For example, Roslyn, rust-analyzer, Biome, and libswift all use a uniform representation for all nodes that mainly consists of the node's kind, its children and possibly some flags (single flags structure shared between all nodes). That means that node specific attributes or flags aren't possible. You can see this in Biome's playground when opening the syntax tab and looking at the CST content. For each level, it only shows the kind followed by its children. A uniform representation has the advantage of traversing the tree without casting to the specific AST nodes. This makes it easy to implement traversal methods like The way we would need to model this with such a CST design is to look at the tokens directly (similar to what we used to do with Locator)
That's correct and what I mentioned in
|
crates/ruff_python_ast/src/nodes.rs
Outdated
match prefix { | ||
FStringPrefix::Regular => {} | ||
FStringPrefix::Raw { uppercase_r: true } => { | ||
self.0 |= FStringFlagsInner::R_PREFIX_UPPER; | ||
} | ||
FStringPrefix::Raw { uppercase_r: false } => { | ||
self.0 |= FStringFlagsInner::R_PREFIX_LOWER; | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The following will create an invalid prefix:
FStringFlags::default().with_prefix(FStringPrefix::Raw { uppercase_r: true }).with_prefix(FStringPrefix::Raw { uppercase_r: false })
Or passing regular after uppercase won't have the desired outcome:
FStringFlags::default().with_prefix(FStringPrefix::Raw { uppercase_r: true }).with_prefix(FStringPrefix::Regular)
.prefix().is_raw() // still returns true
You'll need to unset the Raw flags before setting them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I addressed this in bc3e406
But I also wondered about simply prohibiting calling .with_prefix()
if a prefix flag had already been set (using debug_assert!(!self.0.intersects(FStringFlagsInner::R_PREFIX_UPPER | FStringFlagsInner::R_PREFIX_LOWER), "Cannot call .with_prefix() if a prefix has already been set")
. This way is more type-safe, though :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would find it surprising if I'm not allowed to call a method multiple times. An alternative is to have a from_prefix
method that constructs Self
. This way, the compiler guarantees that the method can only be called once during construction.
Co-authored-by: Micha Reiser <[email protected]>
…s already been set
Summary
Following #10298, I wanted to look for places in ruff where we could simplify and/or speedup our code by using the information newly available to us in the AST nodes. One obvious place was the formatter, but when I looked at the formatter's code, I realised it was important to the formatter whether or not an uppercase or lowercase
r
prefix was used for a raw string:ruff/crates/ruff_python_formatter/src/string/mod.rs
Lines 165 to 171 in 0c84fbb
ruff/crates/ruff_python_formatter/src/string/mod.rs
Lines 107 to 118 in 0c84fbb
That information isn't currently captured in the tokens or AST, so we wouldn't, right now, be able to simplify any of that formatter code. This PR makes it so the tokens and AST do capture that information.
The PR should be easiest to review commit-by-commit.
Test Plan
cargo test