Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consider line continuation character for re-lexing #12008

Merged
merged 4 commits into from
Jun 25, 2024

Conversation

dhruvmanila
Copy link
Member

Summary

This PR fixes a bug where the re-lexing logic didn't consider the line continuation character being present before the newline character. This meant that the lexer was being moved back to the newline character which is actually ignored via \.

Considering the following code:

f'middle {'string':\
        'format spec'}

The old token stream is:

...
Colon 18..19
FStringMiddle 19..29 (flags = F_STRING)
Newline 20..21
Indent 21..29
String 29..42
Rbrace 42..43
...

Notice how the ranges are overlapping between the FStringMiddle token and the tokens emitted after moving the lexer backwards.

After this fix, the new token stream which is without moving the lexer backwards in this scenario:

FStringStart 0..2 (flags = F_STRING)
FStringMiddle 2..9 (flags = F_STRING)
Lbrace 9..10
String 10..18
Colon 18..19
FStringMiddle 19..29 (flags = F_STRING)
FStringEnd 29..30 (flags = F_STRING)
Name 30..36
Name 37..41
Unknown 41..44
Newline 44..45

fixes: #12004

Test Plan

Add a test case and update the snapshots.

@dhruvmanila dhruvmanila added bug Something isn't working parser Related to the parser labels Jun 24, 2024
@dhruvmanila dhruvmanila requested a review from MichaReiser as a code owner June 24, 2024 07:24
@dhruvmanila dhruvmanila changed the title Consider line continuation char for re-lexing Consider line continuation character for re-lexing Jun 24, 2024
Copy link
Contributor

github-actions bot commented Jun 24, 2024

ruff-ecosystem results

Linter (stable)

✅ ecosystem check detected no linter changes.

Linter (preview)

✅ ecosystem check detected no linter changes.

Formatter (stable)

✅ ecosystem check detected no format changes.

Formatter (preview)

✅ ecosystem check detected no format changes.

@dhruvmanila dhruvmanila requested a review from MichaReiser June 24, 2024 11:54
Comment on lines 1388 to 1393
if let Some(second_slash) = reverse_chars.next_if_eq(&'\\') {
// Line continuation character has been escaped: `\\\n`
newline_position = Some(current_position);
// Set the newline position before updating the current position.
current_position -= first_slash.text_len() - second_slash.text_len();
} else {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's still more complicate than this. What about \\\ Here, we have an escaped backslash followed by a continuation :(

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. I guess we'd need to count the number of backslashes and make a decision based on whether it's odd or even.

@dhruvmanila dhruvmanila requested a review from MichaReiser June 24, 2024 12:43
Comment on lines +1388 to +1391
let mut backslash_count = 0;
while reverse_chars.next_if_eq(&'\\').is_some() {
backslash_count += 1;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Last comment :) Do we need to restrict the escape handling to cases where we know we're inside a string? Or wouldn't that work in case of an unterminated string literal?

I'm not sure if it matters because the parser is already in an error recovery state when encountering an escaped \ outside of a string, but it might be worth to add a test for it

test + a \\\
more

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I follow here. Do you mean to ask whether this logic needs to be restricted to only recovering within a string or not? I don't think so that is necessary, I'll add a test case for line continuation character encountered while re-lexing outside of a string.

@dhruvmanila dhruvmanila enabled auto-merge (squash) June 25, 2024 02:10
@dhruvmanila dhruvmanila merged commit 68a8978 into main Jun 25, 2024
17 checks passed
@dhruvmanila dhruvmanila deleted the dhruv/re-lexing-newline-escape branch June 25, 2024 02:13
dhruvmanila added a commit that referenced this pull request Jun 26, 2024
## Summary

This PR fixes a bug introduced in
#12008 which didn't consider the
two character newline after the line continuation character.

For example, consider the following code highlighted with whitespaces:
```py
call(foo # comment \\r\n
\r\n
def bar():\r\n
....pass\r\n
```
The lexer is at `def` when it's running the re-lexing logic and trying
to move back to a newline character. It encounters `\n` and it's being
escaped (incorrect) but `\r` is being escaped, so it moves the lexer to
`\n` character. This creates an overlap in token ranges which causes the
panic.

```
Name 0..4
Lpar 4..5
Name 5..8
Comment 9..20
NonLogicalNewline 20..22 <-- overlap between
Newline 21..22           <-- these two tokens
NonLogicalNewline 22..23
Def 23..26
...
```

fixes: #12028 

## Test Plan

Add a test case with line continuation and windows style newline
character.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working parser Related to the parser
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Rule B027 cause panic
2 participants