Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Track casing of r-string prefixes in the tokenizer and AST #10314

Merged
merged 11 commits into from
Mar 18, 2024

Conversation

AlexWaygood
Copy link
Member

Summary

Following #10298, I wanted to look for places in ruff where we could simplify and/or speedup our code by using the information newly available to us in the AST nodes. One obvious place was the formatter, but when I looked at the formatter's code, I realised it was important to the formatter whether or not an uppercase or lowercase r prefix was used for a raw string:

// Retain the casing for the raw prefix:
// https://black.readthedocs.io/en/stable/the_black_code_style/current_style.html#r-strings-and-r-strings
if self.contains(StringPrefix::RAW) {
token("r").fmt(f)?;
} else if self.contains(StringPrefix::RAW_UPPER) {
token("R").fmt(f)?;
}

bitflags! {
#[derive(Copy, Clone, Debug, PartialEq, Eq)]
pub(crate) struct StringPrefix: u8 {
const UNICODE = 0b0000_0001;
/// `r"test"`
const RAW = 0b0000_0010;
/// `R"test"
const RAW_UPPER = 0b0000_0100;
const BYTE = 0b0000_1000;
const F_STRING = 0b0001_0000;
}
}

That information isn't currently captured in the tokens or AST, so we wouldn't, right now, be able to simplify any of that formatter code. This PR makes it so the tokens and AST do capture that information.

The PR should be easiest to review commit-by-commit.

Test Plan

cargo test

@AlexWaygood AlexWaygood requested a review from dhruvmanila March 9, 2024 14:51
@AlexWaygood AlexWaygood requested a review from MichaReiser as a code owner March 9, 2024 14:51
@AlexWaygood AlexWaygood changed the title Track casing of r string prefixes in the tokenizer and AST Track casing of r-string prefixes in the tokenizer and AST Mar 9, 2024
Copy link
Contributor

github-actions bot commented Mar 9, 2024

ruff-ecosystem results

Linter (stable)

✅ ecosystem check detected no linter changes.

Linter (preview)

✅ ecosystem check detected no linter changes.

Formatter (stable)

✅ ecosystem check detected no format changes.

Formatter (preview)

✅ ecosystem check detected no format changes.

Copy link
Member

@dhruvmanila dhruvmanila left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sense. Thank you for writing all the documentation. It helps in understanding the code :)

There are a few instances where the documentation is in the form of a question for functions which returns a bool. I think my preference would be to just write it in a simple way like "Returns true if ..." or "Checks if ...". We could even top it up with an example like Returns true if ... e.g., r"foo"

There are a few nits which I leave at your discretion to apply or not.

Comment on lines 1202 to 1203
/// A "raw" format-string, that has an `r` or `R` prefix
RawFormat { uppercase_r: bool },
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should just have two separate variants instead of a boolean value: Raw and RawUpper. It's easier to type (FStringPrefix::RawUpper vs FStringPrefix::RawFormat { uppercase: true }) and it clearly separates the two variants w.r.t. an enum. What do you think?

This would apply to other *Prefix enums as well.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm somewhat torn here. I agree with you that it's easier to type if they're two separate variants. But semantically, they're the same kind of string in Python, just represented in two different ways; having them be two flavours of the same variant feels more expressive of the semantics here. I also think that if I were matching on the prefix in a lint rule (for example), it would be more natural to do this:

match string_expr.flags.prefix() {
    StringLiteralPrefix::Regular => ...,
    StringLiteralPrefix::Raw { .. } => ...,
}

rather than

match string_expr.flags.prefix() {
    StringLiteralPrefix::Regular => ...,
    StringLiteralPrefix::RawLower | StringLiteralPrefix::RawUpper => ...,
}

(In most lint rules, I doubt you'll care much whether it's an R or an r, since they have the same semantics at runtime for Python.)

crates/ruff_python_ast/src/nodes.rs Show resolved Hide resolved
crates/ruff_python_ast/src/nodes.rs Outdated Show resolved Hide resolved
crates/ruff_python_ast/src/nodes.rs Outdated Show resolved Hide resolved
crates/ruff_python_parser/src/lexer.rs Outdated Show resolved Hide resolved
Comment on lines 363 to 367
Some(StringPrefix::Bytes | StringPrefix::RawBytes { .. }) => {
panic!("Attempting to convert a bytestring into a non-bytestring!")
}
Some(StringPrefix::Format | StringPrefix::RawFormat { .. }) => {
panic!("Attempting to convert an f-string into a non-fstring!")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's unfortunate that we need to use panic here but I understand the reasoning behind it which is that we're going from a wide type containing all possible string prefix to a narrow one for a specific type of string (for example StringPrefix to FStringPrefix).

I think this should actually be an unreachable instead as our lexer will raise an error for invalid combination of prefixes used.

I'm sure you must have thought of this but if possible, I would try to find any type-safe way to do this conversion to avoid a potential panic.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree this is somewhat unfortunate. It might be worth noting, though, that this isn't really any less type-safe than previously -- the panic!s in this function are really just doing the work for us that the debug_assert!s were previously.

The only other way I can see of doing this would be to implement TryFrom instead of From here -- return an Err variant instead of panicking. But that feels incorrect to me. If we hit the panicking branch, I think we should be immediately bailing from this function, since that indicates a logical error in our code somewhere. If we've written the tokenizer and parser correctly, it should be impossible for us to get here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the panic!s in this function are really just doing the work for us that the debug_assert!s were previously.

You're correct although they differ from a user perspective. debug_assert! will only panic on debug builds while panic will bail out even on the release version.

I think we should be immediately bailing from this function, since that indicates a logical error in our code somewhere. If we've written the tokenizer and parser correctly, it should be impossible for us to get here.

Yes, this is the reason I suggest using the unreachable macro instead. As a developer it's hard to reason about a panic usage unless there's an indicator like a comment. However, if unreachable is used, one can say that this is indeed a logical error.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense — I'll switch to unreachable, thanks

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense — I'll switch to unreachable, thanks

Done in 3ee6454

Copy link
Member Author

@AlexWaygood AlexWaygood Mar 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One way in which we could make this type-safe would be to emit different tokens for string literals and bytes literals. Then we would be able to have the same three-way f-string/bytestring/everything-else distinction at the token level as we do at the AST level, which would mean we would be able to have three different bitflags at the token level. We wouldn't need to do the type narrowing here when creating an AST node from a token because the type would already be narrowed.

But that would be quite a large change to make.

@MichaReiser
Copy link
Member

Hmm I'm not sure if we should track this. It goes a bit beyond what an AST tends to capture. But it's also not clear to me what the downsides would be.

What does the formatter use the lower prefix for? I thought we always normalize to lower case

@AlexWaygood
Copy link
Member Author

What does the formatter use the lower prefix for? I thought we always normalize to lower case

The formatter normalises all prefixes to lowercase, except for raw strings, where it retains the casing in line with what black does for raw strings: https://black.readthedocs.io/en/stable/the_black_code_style/current_style.html#r-strings-and-r-strings

Black normalizes string quotes as well as string prefixes, making them lowercase. One exception to this rule is r-strings. It turns out that the very popular MagicPython syntax highlighter, used by default by (among others) GitHub and Visual Studio Code, differentiates between r-strings and R-strings. The former are syntax highlighted as regular expressions while the latter are treated as true raw strings with no special semantics.

@MichaReiser
Copy link
Member

I see and I guess it's fine adding it now that we already track prefixes

The reason I'm hesitant to add such fine grained data to the ast are

  • This kind of Informationen could probably not be represented if we ever migrate to a CST (or would limit us in the design options for the CST)
  • Tracking the information increases the struct sizes which has a cost in memory consumption, data to write and to read (both slow operations). It can often times be faster (when considering the entire compilation pipeline) to reparse information that is needed rarely than paying the cost everywhere

@AlexWaygood
Copy link
Member Author

  • Tracking the information increases the struct sizes which has a cost in memory consumption, data to write and to read (both slow operations). It can often times be faster (when considering the entire compilation pipeline) to reparse information that is needed rarely than paying the cost everywhere

Hmm, but internally we're just storing the new information on the same bitflag, no? The enum variants are only created from the bitflag "on demand" when the prefix() method is called. Possibly I'm misunderstanding, but I don't think this should increase memory consumption significantly

@AlexWaygood
Copy link
Member Author

AlexWaygood commented Mar 10, 2024

Also, I think the arguments for preserving the casing of r-strings in the formatter apply equally well to the expression generator. Being able to losslessly round-trip these kinds of details in the expression generator was the original motivation for this series of changes (#7799, #9663, etc)

@AlexWaygood AlexWaygood force-pushed the r-prefix-casing branch 2 times, most recently from 93bfa37 to 7a8a214 Compare March 10, 2024 12:54
@AlexWaygood
Copy link
Member Author

Thanks @dhruvmanila! I've implemented many of your suggestions, in particular most of the ones relating to naming and documentation

@AlexWaygood AlexWaygood requested a review from dhruvmanila March 10, 2024 13:05
@charliermarsh
Copy link
Member

The reason I'm hesitant to add such fine grained data to the ast are

This kind of Informationen could probably not be represented if we ever migrate to a CST (or would limit us in the design options for the CST)
Tracking the information increases the struct sizes which has a cost in memory consumption, data to write and to read (both slow operations). It can often times be faster (when considering the entire compilation pipeline) to reparse information that is needed rarely than paying the cost everywhere

This was my initial reaction too, but...

  1. Can you explain more about why this would be a problem for a CST? I don't understand that part.
  2. I don't think this has a cost, since we're using a bitflag of the same size either way, though I may be mistaken.

@AlexWaygood AlexWaygood added the internal An internal refactor or improvement label Mar 11, 2024
@MichaReiser
Copy link
Member

Can you explain more about why this would be a problem for a CST? I don't understand that part.

It depends on the CST design. For example, Roslyn, rust-analyzer, Biome, and libswift all use a uniform representation for all nodes that mainly consists of the node's kind, its children and possibly some flags (single flags structure shared between all nodes). That means that node specific attributes or flags aren't possible. You can see this in Biome's playground when opening the syntax tab and looking at the CST content. For each level, it only shows the kind followed by its children.

A uniform representation has the advantage of traversing the tree without casting to the specific AST nodes. This makes it easy to implement traversal methods like descendants or storing a reference to a node.

The way we would need to model this with such a CST design is to look at the tokens directly (similar to what we used to do with Locator)

I don't think this has a cost, since we're using a bitflag of the same size either way, though I may be mistaken.

That's correct and what I mentioned in

I see and I guess it's fine adding it now that we already track prefixes

crates/ruff_python_ast/src/nodes.rs Outdated Show resolved Hide resolved
Comment on lines 1260 to 1268
match prefix {
FStringPrefix::Regular => {}
FStringPrefix::Raw { uppercase_r: true } => {
self.0 |= FStringFlagsInner::R_PREFIX_UPPER;
}
FStringPrefix::Raw { uppercase_r: false } => {
self.0 |= FStringFlagsInner::R_PREFIX_LOWER;
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The following will create an invalid prefix:

FStringFlags::default().with_prefix(FStringPrefix::Raw { uppercase_r: true }).with_prefix(FStringPrefix::Raw { uppercase_r: false })

Or passing regular after uppercase won't have the desired outcome:

FStringFlags::default().with_prefix(FStringPrefix::Raw { uppercase_r: true }).with_prefix(FStringPrefix::Regular)
    .prefix().is_raw() // still returns true

You'll need to unset the Raw flags before setting them.

Copy link
Member Author

@AlexWaygood AlexWaygood Mar 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I addressed this in bc3e406

But I also wondered about simply prohibiting calling .with_prefix() if a prefix flag had already been set (using debug_assert!(!self.0.intersects(FStringFlagsInner::R_PREFIX_UPPER | FStringFlagsInner::R_PREFIX_LOWER), "Cannot call .with_prefix() if a prefix has already been set"). This way is more type-safe, though :-)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would find it surprising if I'm not allowed to call a method multiple times. An alternative is to have a from_prefix method that constructs Self. This way, the compiler guarantees that the method can only be called once during construction.

crates/ruff_python_ast/src/nodes.rs Show resolved Hide resolved
crates/ruff_python_ast/src/nodes.rs Outdated Show resolved Hide resolved
crates/ruff_python_ast/src/nodes.rs Outdated Show resolved Hide resolved
crates/ruff_python_ast/src/nodes.rs Outdated Show resolved Hide resolved
crates/ruff_python_parser/src/lexer.rs Show resolved Hide resolved
crates/ruff_python_parser/src/string_token_flags.rs Outdated Show resolved Hide resolved
crates/ruff_python_parser/src/string_token_flags.rs Outdated Show resolved Hide resolved
crates/ruff_python_parser/src/string_token_flags.rs Outdated Show resolved Hide resolved
@AlexWaygood AlexWaygood enabled auto-merge (squash) March 18, 2024 17:11
@AlexWaygood AlexWaygood merged commit 162d2eb into astral-sh:main Mar 18, 2024
17 checks passed
@AlexWaygood AlexWaygood deleted the r-prefix-casing branch March 18, 2024 17:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
internal An internal refactor or improvement
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants