Track casing of r-string prefixes in the tokenizer and AST #10314

AlexWaygood · 2024-03-09T14:51:07Z

Summary

Following #10298, I wanted to look for places in ruff where we could simplify and/or speedup our code by using the information newly available to us in the AST nodes. One obvious place was the formatter, but when I looked at the formatter's code, I realised it was important to the formatter whether or not an uppercase or lowercase r prefix was used for a raw string:

ruff/crates/ruff_python_formatter/src/string/mod.rs

Lines 165 to 171 in 0c84fbb

    
           // Retain the casing for the raw prefix: 
        
           // https://black.readthedocs.io/en/stable/the_black_code_style/current_style.html#r-strings-and-r-strings 
        
           if self.contains(StringPrefix::RAW) { 
        
               token("r").fmt(f)?; 
        
           } else if self.contains(StringPrefix::RAW_UPPER) { 
        
               token("R").fmt(f)?; 
        
           }

ruff/crates/ruff_python_formatter/src/string/mod.rs

Lines 107 to 118 in 0c84fbb

    
           bitflags! { 
        
               #[derive(Copy, Clone, Debug, PartialEq, Eq)] 
        
               pub(crate) struct StringPrefix: u8 { 
        
                   const UNICODE   = 0b0000_0001; 
        
                   /// `r"test"` 
        
                   const RAW       = 0b0000_0010; 
        
                   /// `R"test" 
        
                   const RAW_UPPER = 0b0000_0100; 
        
                   const BYTE      = 0b0000_1000; 
        
                   const F_STRING  = 0b0001_0000; 
        
               } 
        
           }

That information isn't currently captured in the tokens or AST, so we wouldn't, right now, be able to simplify any of that formatter code. This PR makes it so the tokens and AST do capture that information.

The PR should be easiest to review commit-by-commit.

Test Plan

cargo test

github-actions · 2024-03-09T15:09:05Z

`ruff-ecosystem` results

Linter (stable)

✅ ecosystem check detected no linter changes.

Linter (preview)

✅ ecosystem check detected no linter changes.

Formatter (stable)

✅ ecosystem check detected no format changes.

Formatter (preview)

✅ ecosystem check detected no format changes.

dhruvmanila

This makes sense. Thank you for writing all the documentation. It helps in understanding the code :)

There are a few instances where the documentation is in the form of a question for functions which returns a bool. I think my preference would be to just write it in a simple way like "Returns true if ..." or "Checks if ...". We could even top it up with an example like Returns true if ... e.g., r"foo"

There are a few nits which I leave at your discretion to apply or not.

dhruvmanila · 2024-03-10T02:11:48Z

crates/ruff_python_ast/src/nodes.rs

+    /// A "raw" format-string, that has an `r` or `R` prefix
+    RawFormat { uppercase_r: bool },


I think we should just have two separate variants instead of a boolean value: Raw and RawUpper. It's easier to type (FStringPrefix::RawUpper vs FStringPrefix::RawFormat { uppercase: true }) and it clearly separates the two variants w.r.t. an enum. What do you think?

This would apply to other *Prefix enums as well.

I'm somewhat torn here. I agree with you that it's easier to type if they're two separate variants. But semantically, they're the same kind of string in Python, just represented in two different ways; having them be two flavours of the same variant feels more expressive of the semantics here. I also think that if I were matching on the prefix in a lint rule (for example), it would be more natural to do this:

match string_expr.flags.prefix() { StringLiteralPrefix::Regular => ..., StringLiteralPrefix::Raw { .. } => ..., }

rather than

match string_expr.flags.prefix() { StringLiteralPrefix::Regular => ..., StringLiteralPrefix::RawLower | StringLiteralPrefix::RawUpper => ..., }

(In most lint rules, I doubt you'll care much whether it's an R or an r, since they have the same semantics at runtime for Python.)

crates/ruff_python_ast/src/nodes.rs

crates/ruff_python_parser/src/lexer.rs

dhruvmanila · 2024-03-10T03:08:53Z

crates/ruff_python_parser/src/string_token_flags.rs

+            Some(StringPrefix::Bytes | StringPrefix::RawBytes { .. }) => {
+                panic!("Attempting to convert a bytestring into a non-bytestring!")
+            }
+            Some(StringPrefix::Format | StringPrefix::RawFormat { .. }) => {
+                panic!("Attempting to convert an f-string into a non-fstring!")


It's unfortunate that we need to use panic here but I understand the reasoning behind it which is that we're going from a wide type containing all possible string prefix to a narrow one for a specific type of string (for example StringPrefix to FStringPrefix).

I think this should actually be an unreachable instead as our lexer will raise an error for invalid combination of prefixes used.

I'm sure you must have thought of this but if possible, I would try to find any type-safe way to do this conversion to avoid a potential panic.

I agree this is somewhat unfortunate. It might be worth noting, though, that this isn't really any less type-safe than previously -- the panic!s in this function are really just doing the work for us that the debug_assert!s were previously.

The only other way I can see of doing this would be to implement TryFrom instead of From here -- return an Err variant instead of panicking. But that feels incorrect to me. If we hit the panicking branch, I think we should be immediately bailing from this function, since that indicates a logical error in our code somewhere. If we've written the tokenizer and parser correctly, it should be impossible for us to get here.

the panic!s in this function are really just doing the work for us that the debug_assert!s were previously.

You're correct although they differ from a user perspective. debug_assert! will only panic on debug builds while panic will bail out even on the release version.

I think we should be immediately bailing from this function, since that indicates a logical error in our code somewhere. If we've written the tokenizer and parser correctly, it should be impossible for us to get here.

Yes, this is the reason I suggest using the unreachable macro instead. As a developer it's hard to reason about a panic usage unless there's an indicator like a comment. However, if unreachable is used, one can say that this is indeed a logical error.

Makes sense — I'll switch to unreachable, thanks

Makes sense — I'll switch to unreachable, thanks

Done in 3ee6454

One way in which we could make this type-safe would be to emit different tokens for string literals and bytes literals. Then we would be able to have the same three-way f-string/bytestring/everything-else distinction at the token level as we do at the AST level, which would mean we would be able to have three different bitflags at the token level. We wouldn't need to do the type narrowing here when creating an AST node from a token because the type would already be narrowed.

But that would be quite a large change to make.

MichaReiser · 2024-03-10T08:50:14Z

Hmm I'm not sure if we should track this. It goes a bit beyond what an AST tends to capture. But it's also not clear to me what the downsides would be.

What does the formatter use the lower prefix for? I thought we always normalize to lower case

AlexWaygood · 2024-03-10T08:53:59Z

What does the formatter use the lower prefix for? I thought we always normalize to lower case

The formatter normalises all prefixes to lowercase, except for raw strings, where it retains the casing in line with what black does for raw strings: https://black.readthedocs.io/en/stable/the_black_code_style/current_style.html#r-strings-and-r-strings

Black normalizes string quotes as well as string prefixes, making them lowercase. One exception to this rule is r-strings. It turns out that the very popular MagicPython syntax highlighter, used by default by (among others) GitHub and Visual Studio Code, differentiates between r-strings and R-strings. The former are syntax highlighted as regular expressions while the latter are treated as true raw strings with no special semantics.

MichaReiser · 2024-03-10T09:46:37Z

I see and I guess it's fine adding it now that we already track prefixes

The reason I'm hesitant to add such fine grained data to the ast are

This kind of Informationen could probably not be represented if we ever migrate to a CST (or would limit us in the design options for the CST)
Tracking the information increases the struct sizes which has a cost in memory consumption, data to write and to read (both slow operations). It can often times be faster (when considering the entire compilation pipeline) to reparse information that is needed rarely than paying the cost everywhere

AlexWaygood · 2024-03-10T10:00:00Z

Tracking the information increases the struct sizes which has a cost in memory consumption, data to write and to read (both slow operations). It can often times be faster (when considering the entire compilation pipeline) to reparse information that is needed rarely than paying the cost everywhere

Hmm, but internally we're just storing the new information on the same bitflag, no? The enum variants are only created from the bitflag "on demand" when the prefix() method is called. Possibly I'm misunderstanding, but I don't think this should increase memory consumption significantly

AlexWaygood · 2024-03-10T10:12:23Z

Also, I think the arguments for preserving the casing of r-strings in the formatter apply equally well to the expression generator. Being able to losslessly round-trip these kinds of details in the expression generator was the original motivation for this series of changes (#7799, #9663, etc)

AlexWaygood · 2024-03-10T13:05:46Z

Thanks @dhruvmanila! I've implemented many of your suggestions, in particular most of the ones relating to naming and documentation

charliermarsh · 2024-03-10T16:13:57Z

The reason I'm hesitant to add such fine grained data to the ast are

This kind of Informationen could probably not be represented if we ever migrate to a CST (or would limit us in the design options for the CST)
Tracking the information increases the struct sizes which has a cost in memory consumption, data to write and to read (both slow operations). It can often times be faster (when considering the entire compilation pipeline) to reparse information that is needed rarely than paying the cost everywhere

This was my initial reaction too, but...

Can you explain more about why this would be a problem for a CST? I don't understand that part.
I don't think this has a cost, since we're using a bitflag of the same size either way, though I may be mistaken.

MichaReiser · 2024-03-18T12:41:50Z

Can you explain more about why this would be a problem for a CST? I don't understand that part.

It depends on the CST design. For example, Roslyn, rust-analyzer, Biome, and libswift all use a uniform representation for all nodes that mainly consists of the node's kind, its children and possibly some flags (single flags structure shared between all nodes). That means that node specific attributes or flags aren't possible. You can see this in Biome's playground when opening the syntax tab and looking at the CST content. For each level, it only shows the kind followed by its children.

A uniform representation has the advantage of traversing the tree without casting to the specific AST nodes. This makes it easy to implement traversal methods like descendants or storing a reference to a node.

The way we would need to model this with such a CST design is to look at the tokens directly (similar to what we used to do with Locator)

I don't think this has a cost, since we're using a bitflag of the same size either way, though I may be mistaken.

That's correct and what I mentioned in

I see and I guess it's fine adding it now that we already track prefixes

crates/ruff_python_ast/src/nodes.rs

MichaReiser · 2024-03-18T12:48:50Z

crates/ruff_python_ast/src/nodes.rs

+        match prefix {
+            FStringPrefix::Regular => {}
+            FStringPrefix::Raw { uppercase_r: true } => {
+                self.0 |= FStringFlagsInner::R_PREFIX_UPPER;
+            }
+            FStringPrefix::Raw { uppercase_r: false } => {
+                self.0 |= FStringFlagsInner::R_PREFIX_LOWER;
+            }
+        }


The following will create an invalid prefix:

FStringFlags::default().with_prefix(FStringPrefix::Raw { uppercase_r: true }).with_prefix(FStringPrefix::Raw { uppercase_r: false })

Or passing regular after uppercase won't have the desired outcome:

FStringFlags::default().with_prefix(FStringPrefix::Raw { uppercase_r: true }).with_prefix(FStringPrefix::Regular) .prefix().is_raw() // still returns true

You'll need to unset the Raw flags before setting them.

I addressed this in bc3e406

But I also wondered about simply prohibiting calling .with_prefix() if a prefix flag had already been set (using debug_assert!(!self.0.intersects(FStringFlagsInner::R_PREFIX_UPPER | FStringFlagsInner::R_PREFIX_LOWER), "Cannot call .with_prefix() if a prefix has already been set"). This way is more type-safe, though :-)

I would find it surprising if I'm not allowed to call a method multiple times. An alternative is to have a from_prefix method that constructs Self. This way, the compiler guarantees that the method can only be called once during construction.

crates/ruff_python_ast/src/nodes.rs

crates/ruff_python_parser/src/lexer.rs

crates/ruff_python_parser/src/string_token_flags.rs

Co-authored-by: Micha Reiser <[email protected]>

…s already been set

AlexWaygood requested a review from dhruvmanila March 9, 2024 14:51

AlexWaygood requested a review from MichaReiser as a code owner March 9, 2024 14:51

AlexWaygood changed the title ~~Track casing of r string prefixes in the tokenizer and AST~~ Track casing of r-string prefixes in the tokenizer and AST Mar 9, 2024

dhruvmanila reviewed Mar 10, 2024

View reviewed changes

AlexWaygood force-pushed the r-prefix-casing branch 2 times, most recently from 93bfa37 to 7a8a214 Compare March 10, 2024 12:54

AlexWaygood requested a review from dhruvmanila March 10, 2024 13:05

AlexWaygood added the internal An internal refactor or improvement label Mar 11, 2024

AlexWaygood added 6 commits March 14, 2024 17:57

Track casing of r string prefixes in the tokenizer

5b432d2

Track casing of r prefixes in f-string AST nodes

03eb789

Track casing of r prefixes in StringLiteral AST nodes

355aeb1

Track casing of r prefixes in BytesLiteral AST nodes

0fae040

Reduce duplication -- use an enum of enums

ee6e076

Switch to unreachable! over panic!

21e2911

AlexWaygood force-pushed the r-prefix-casing branch from 3ee6454 to 21e2911 Compare March 14, 2024 18:02

MichaReiser approved these changes Mar 18, 2024

View reviewed changes

AlexWaygood and others added 5 commits March 18, 2024 13:10

Address review: avoid write!() macro where it's unnecessary

ea8dcda

Co-authored-by: Micha Reiser <[email protected]>

Address review: make with_prefix() methods work even if a prefix ha…

bc3e406

…s already been set

Move prefix() logic to the StringPrefix enum

ca832a1

Merge branch 'main' into r-prefix-casing

eef975e

Debug all the way down

6529776

AlexWaygood enabled auto-merge (squash) March 18, 2024 17:11

AlexWaygood merged commit 162d2eb into astral-sh:main Mar 18, 2024
17 checks passed

AlexWaygood deleted the r-prefix-casing branch March 18, 2024 17:20

AlexWaygood mentioned this pull request Mar 20, 2024

Simplify formatting of strings by using flags from the AST nodes #10489

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Track casing of r-string prefixes in the tokenizer and AST #10314

Track casing of r-string prefixes in the tokenizer and AST #10314

AlexWaygood commented Mar 9, 2024

github-actions bot commented Mar 9, 2024 •

edited

Loading

dhruvmanila left a comment

dhruvmanila Mar 10, 2024

AlexWaygood Mar 10, 2024

dhruvmanila Mar 10, 2024

AlexWaygood Mar 10, 2024

dhruvmanila Mar 11, 2024

AlexWaygood Mar 11, 2024

AlexWaygood Mar 11, 2024

AlexWaygood Mar 11, 2024 •

edited

Loading

MichaReiser commented Mar 10, 2024

AlexWaygood commented Mar 10, 2024

MichaReiser commented Mar 10, 2024

AlexWaygood commented Mar 10, 2024

AlexWaygood commented Mar 10, 2024 •

edited

Loading

AlexWaygood commented Mar 10, 2024

charliermarsh commented Mar 10, 2024

MichaReiser commented Mar 18, 2024

MichaReiser Mar 18, 2024

AlexWaygood Mar 18, 2024 •

edited

Loading

MichaReiser Mar 18, 2024

	// Retain the casing for the raw prefix:
	// https://black.readthedocs.io/en/stable/the_black_code_style/current_style.html#r-strings-and-r-strings
	if self.contains(StringPrefix::RAW) {
	token("r").fmt(f)?;
	} else if self.contains(StringPrefix::RAW_UPPER) {
	token("R").fmt(f)?;
	}

	bitflags! {
	#[derive(Copy, Clone, Debug, PartialEq, Eq)]
	pub(crate) struct StringPrefix: u8 {
	const UNICODE = 0b0000_0001;
	/// `r"test"`
	const RAW = 0b0000_0010;
	/// `R"test"
	const RAW_UPPER = 0b0000_0100;
	const BYTE = 0b0000_1000;
	const F_STRING = 0b0001_0000;
	}
	}

		/// A "raw" format-string, that has an `r` or `R` prefix
		RawFormat { uppercase_r: bool },

Track casing of r-string prefixes in the tokenizer and AST #10314

Track casing of r-string prefixes in the tokenizer and AST #10314

Conversation

AlexWaygood commented Mar 9, 2024

Summary

Test Plan

github-actions bot commented Mar 9, 2024 • edited Loading

ruff-ecosystem results

Linter (stable)

Linter (preview)

Formatter (stable)

Formatter (preview)

dhruvmanila left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

AlexWaygood Mar 11, 2024 • edited Loading

Choose a reason for hiding this comment

MichaReiser commented Mar 10, 2024

AlexWaygood commented Mar 10, 2024

MichaReiser commented Mar 10, 2024

AlexWaygood commented Mar 10, 2024

AlexWaygood commented Mar 10, 2024 • edited Loading

AlexWaygood commented Mar 10, 2024

charliermarsh commented Mar 10, 2024

MichaReiser commented Mar 18, 2024

Choose a reason for hiding this comment

AlexWaygood Mar 18, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

github-actions bot commented Mar 9, 2024 •

edited

Loading

`ruff-ecosystem` results

AlexWaygood Mar 11, 2024 •

edited

Loading

AlexWaygood commented Mar 10, 2024 •

edited

Loading

AlexWaygood Mar 18, 2024 •

edited

Loading