Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ISC001 fix can modify octal escape sequences #12936

Closed
dscorbett opened this issue Aug 16, 2024 · 9 comments · Fixed by #13118
Closed

ISC001 fix can modify octal escape sequences #12936

dscorbett opened this issue Aug 16, 2024 · 9 comments · Fixed by #13118
Labels
bug Something isn't working fixes Related to suggested fixes for violations help wanted Contributions especially welcome

Comments

@dscorbett
Copy link

The fix for ISC001 changes behavior when the first string ends with a one- or two-digit octal escape sequence and the second begins with an octal digit.

$ ruff --version
ruff 0.6.0
$ cat isc001.py
print("\12""0")
$ python isc001.py

0
$ ruff check --isolated --select ISC001 isc001.py --fix
Found 1 error (1 fixed, 0 remaining).
$ python isc001.py
P

The least disruptive fix would be to detect this specific situation and pad the escape sequence to three digits. In the above example, that would result in "\0120".

@AlexWaygood
Copy link
Member

The reason for this seems to be that CPython seems to process the octal escape before it concatenates the two strings when it parses the source to create the AST:

>>> import ast
>>> print(ast.dump(ast.parse(r'"\12" "0"')))
Module(body=[Expr(value=Constant(value='\n0'))], type_ignores=[])
>>> print(ast.dump(ast.parse(r'"\120"')))
Module(body=[Expr(value=Constant(value='P'))], type_ignores=[])

Our AST also understands this, but it seems like the fix for this rule doesn't.

@AlexWaygood AlexWaygood added bug Something isn't working fixes Related to suggested fixes for violations help wanted Contributions especially welcome labels Aug 17, 2024
@dhruvmanila
Copy link
Member

Related #12753

@dylwil3
Copy link
Collaborator

dylwil3 commented Aug 22, 2024

Related #12753

As you suggest there, a quick fix would be to move this rule and UP012 to an ast-based check instead of a token-based one. However, since the parser evaluates some escape sequences, this would result in replacing

"\12" "0"

with

"\n0"

for example.

If we instead want to preserve the escape sequence and suggest the fix "\0120", then we may have to re-implement 'half' of the parsing logic inside this rule (which is fine).

Is there a preference?

@MichaReiser
Copy link
Member

I'm slightly leaning towards omitting the fix if there's an octal escape (or other escapes). Or is detecting the octal escape equally hard to emitting the right fix?

@dylwil3
Copy link
Collaborator

dylwil3 commented Aug 23, 2024

My guess is that all three options

  1. detecting octals at the end of the first string and skipping the fix
  2. having the fix be ast-based (and therefore evaluating the octals prior to concatenating)
  3. detecting octals at the end of the first string and normalizing the octal before concatenating

are about the same level of difficulty - none seem too difficult - with option (3) being a tinge fussier than (1) and (2).

@dhruvmanila
Copy link
Member

At least for #12753, I think we should avoid omitting the fix if there's an escape sequence.

I remember talking with @MichaReiser about adding a token flag for strings which suggests that this string / f-string token contains an escape sequence. This flag would be set by the lexer as it's already looking at each character. But, I think it might be a bit difficult because the lexer would directly skip to the ending quote character using memchr

let Some(index) =
memchr::memchr3(quote_byte, b'\r', b'\n', self.cursor.rest().as_bytes())

Another option would be to set the flag only on the AST node which should be easier to do because of the StringParser:

bitflags! {
#[derive(Debug, Default, Copy, Clone, PartialEq, Eq, Hash)]
struct StringLiteralFlagsInner: u8 {

But, I would prefer if the flag could be added to the TokenFlags instead.

Another solution might be to move the rule to use the AST where we can access the concatenated value which would make it easier to detect the escape sequence. Once we detect the escape sequence, we can avoid emitting a fix for that string expression.

/// Returns an iterator over the [`char`]s of each string literal part.
pub fn chars(&self) -> impl Iterator<Item = char> + Clone + '_ {
self.iter().flat_map(|part| part.value.chars())
}

Curious to hear @MichaReiser's opinion.

@MichaReiser
Copy link
Member

MichaReiser commented Aug 24, 2024

I'm leaning towards a local fix in ISC001. It seems simple enough to look at the individual literal elements and search backward. Flagging whether the string contains any escape seems overly aggressive and would remove the fix even when the escape isn't at the end of the string. Migrating to AST rules seems fine to me, but only if we are okay with the fix replacing other escapes (which I think it should not).

We only need to be careful about that \\01 is not an octal escape but \\\01 is.

@dhruvmanila
Copy link
Member

Yeah, I think implementing a local fix seems reasonable to me. What do you think? @dylwil3

@dylwil3
Copy link
Collaborator

dylwil3 commented Aug 26, 2024

Sounds good, I'll give it a shot!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working fixes Related to suggested fixes for violations help wanted Contributions especially welcome
Projects
None yet
5 participants