-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Avoid consuming newline for unterminated string #12067
Conversation
| | ||
1 | 'hello' 'world | ||
| ^ Syntax Error: Expected a statement | ||
2 | 1 + 1 | ||
3 | 'hello' f'world {x} | ||
4 | 2 + 2 | ||
| | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason for this "Expected a statement" error is because the current token is Newline
which doesn't start a module statement nor ends a list of module statements.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this might get fixed if we start emitting String
tokens instead of Unknown
token because currently the parser thinks that the statement ended after 'hello'
but then it encountered an Unknown
token (for unterminated string) and a Newline
token both of which doesn't start or end a statement.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense
## Summary This PR splits the re-lexing logic into two parts: 1. `TokenSource`: The token source will be responsible to find the position the lexer needs to be moved to 2. `Lexer`: The lexer will be responsible to reduce the nesting level and move itself to the new position if recovered from a parenthesized context This split makes it easy to find the new lexer position without needing to implement the backwards lexing logic again which would need to handle cases involving: * Different kinds of newlines * Line continuation character(s) * Comments * Whitespaces ### F-strings This change did reveal one thing about re-lexing f-strings. Consider the following example: ```py f'{' # ^ f'foo' ``` Here, the quote as highlighted by the caret (`^`) is the start of a string inside an f-string expression. This is unterminated string which means the token emitted is actually `Unknown`. The parser tries to recover from it but there's no newline token in the vector so the new logic doesn't recover from it. The previous logic does recover because it's looking at the raw characters instead. The parser would be at `FStringStart` (the one for the second line) when it calls into the re-lexing logic to recover from an unterminated f-string on the first line. So, moving backwards the first character encountered is a newline character but the first token encountered is an `Unknown` token. This is improved with #12067 fixes: #12046 fixes: #12036 ## Test Plan Update the snapshot and validate the changes.
Summary
This PR fixes the lexer logic to not consume the newline character for an unterminated string literal.
Currently, the lexer would consume it to be part of the string itself but that would be bad for recovery because then the lexer wouldn't emit the newline token ever. This PR fixes that to avoid consuming the newline character in that case.
This was discovered during #12060.
Test Plan
Update the snapshots and validate them.