Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid consuming newline for unterminated string #12067

Merged
merged 1 commit into from
Jun 27, 2024

Conversation

dhruvmanila
Copy link
Member

@dhruvmanila dhruvmanila commented Jun 27, 2024

Summary

This PR fixes the lexer logic to not consume the newline character for an unterminated string literal.

Currently, the lexer would consume it to be part of the string itself but that would be bad for recovery because then the lexer wouldn't emit the newline token ever. This PR fixes that to avoid consuming the newline character in that case.

This was discovered during #12060.

Test Plan

Update the snapshots and validate them.

@dhruvmanila dhruvmanila added the parser Related to the parser label Jun 27, 2024
@dhruvmanila dhruvmanila marked this pull request as ready for review June 27, 2024 10:31
@dhruvmanila dhruvmanila requested a review from MichaReiser as a code owner June 27, 2024 10:31
Comment on lines +163 to +171
|
1 | 'hello' 'world
| ^ Syntax Error: Expected a statement
2 | 1 + 1
3 | 'hello' f'world {x}
4 | 2 + 2
|


Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason for this "Expected a statement" error is because the current token is Newline which doesn't start a module statement nor ends a list of module statements.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this might get fixed if we start emitting String tokens instead of Unknown token because currently the parser thinks that the statement ended after 'hello' but then it encountered an Unknown token (for unterminated string) and a Newline token both of which doesn't start or end a statement.

Copy link
Contributor

ruff-ecosystem results

Linter (stable)

✅ ecosystem check detected no linter changes.

Linter (preview)

✅ ecosystem check detected no linter changes.

Formatter (stable)

✅ ecosystem check detected no format changes.

Formatter (preview)

✅ ecosystem check detected no format changes.

Copy link
Member

@MichaReiser MichaReiser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense

@dhruvmanila dhruvmanila merged commit e137c82 into main Jun 27, 2024
20 checks passed
@dhruvmanila dhruvmanila deleted the dhruv/unterminated-string branch June 27, 2024 11:32
dhruvmanila added a commit that referenced this pull request Jun 27, 2024
## Summary

This PR splits the re-lexing logic into two parts:
1. `TokenSource`: The token source will be responsible to find the
position the lexer needs to be moved to
2. `Lexer`: The lexer will be responsible to reduce the nesting level
and move itself to the new position if recovered from a parenthesized
context

This split makes it easy to find the new lexer position without needing
to implement the backwards lexing logic again which would need to handle
cases involving:
* Different kinds of newlines
* Line continuation character(s)
* Comments
* Whitespaces

### F-strings

This change did reveal one thing about re-lexing f-strings. Consider the
following example:
```py
f'{'
#  ^
f'foo'
```

Here, the quote as highlighted by the caret (`^`) is the start of a
string inside an f-string expression. This is unterminated string which
means the token emitted is actually `Unknown`. The parser tries to
recover from it but there's no newline token in the vector so the new
logic doesn't recover from it. The previous logic does recover because
it's looking at the raw characters instead.

The parser would be at `FStringStart` (the one for the second line) when
it calls into the re-lexing logic to recover from an unterminated
f-string on the first line. So, moving backwards the first character
encountered is a newline character but the first token encountered is an
`Unknown` token.

This is improved with #12067 

fixes: #12046 
fixes: #12036

## Test Plan

Update the snapshot and validate the changes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parser Related to the parser
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants