
Implement re-lexing logic for better error recovery #11845

Merged · 5 commits · Jun 17, 2024

Conversation

dhruvmanila
Member

@dhruvmanila dhruvmanila commented Jun 12, 2024

Summary

This PR implements the re-lexing logic in the parser.

This logic is only applied when recovering from an error during list parsing. The logic is as follows:

  1. During list parsing, if an unexpected token is encountered and the parser detects that an outer context can understand it and thus recover from it, it invokes the re-lexing logic in the lexer.
  2. This logic first checks whether the lexer is in a parenthesized context and returns if it isn't. Thus, the logic is a no-op if the lexer isn't in a parenthesized context.
  3. It then reduces the nesting level by 1. It shouldn't reset it to 0 because otherwise the recovery from nested list parsing will be incorrect.
  4. Then, it tries to find the last newline character, going backwards from the current position of the lexer. It skips any whitespace, but if it encounters any character other than a newline or whitespace, it aborts.
  5. Now, if there's a newline character, it needs to be re-lexed in a logical context, which means the lexer needs to emit it as a Newline token instead of NonLogicalNewline.
  6. If the re-lexing gives a different token than the current one, the token source needs to update its token collection to remove all the tokens that come after the new current position.

It turns out that the list parsing isn't that happy with the results, so it requires some re-arranging such that the following two errors are raised correctly:

  1. Expected comma
  2. Recovery context error

For (1), the following scenarios need to be considered:

  • Missing comma between two elements
  • Half-parsed element because the grammar doesn't allow it (for example, named expressions)

For (2), the following scenarios need to be considered:

  1. If the parser is at a comma, it means there's a missing element; otherwise, the comma would've been consumed by the first eat call above. And the parser doesn't take the re-lexing route on a comma token.
  2. If it's the first element and the current token is not a comma, it means the element itself is invalid.
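Both classes of input can be confirmed as syntax errors against CPython's own parser. These are my own minimal examples, not ones from the PR's test suite:

```python
# Minimal examples (mine, not from the PR's test suite) of the error
# scenarios above, checked against CPython's parser, which rejects them too.
def is_syntax_error(code: str) -> bool:
    try:
        compile(code, "<example>", "exec")
        return False
    except SyntaxError:
        return True

assert is_syntax_error("[1 2]")     # (1) missing comma between two elements
assert is_syntax_error("[1, , 2]")  # (2) parser at a comma: missing element
assert not is_syntax_error("[1, 2]")
```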

resolves: #11640

Test Plan

  • Update existing test snapshots and validate them
  • Add additional test cases specific to the re-lexing logic and validate the snapshots
  • Run the fuzzer on 3000+ valid inputs
  • Run the fuzzer on invalid inputs
  • Run the parser on various open source projects
  • Make sure the ecosystem changes are none

@dhruvmanila dhruvmanila added the parser (Related to the parser) label Jun 12, 2024
3 | from x import (a, b,
| ^ Syntax Error: Expected ')', found newline
Member

Woah, this is so much better. Nice!

@dhruvmanila dhruvmanila force-pushed the dhruv/list-terminator-kind branch from c8a0997 to 55163c7 Compare June 12, 2024 08:29
Base automatically changed from dhruv/list-terminator-kind to main June 12, 2024 08:33
Contributor

github-actions bot commented Jun 12, 2024

ruff-ecosystem results

Linter (stable)

✅ ecosystem check detected no linter changes.

Linter (preview)

✅ ecosystem check detected no linter changes.

Formatter (stable)

✅ ecosystem check detected no format changes.

Formatter (preview)

✅ ecosystem check detected no format changes.

@dhruvmanila dhruvmanila force-pushed the dhruv/re-lexing branch 2 times, most recently from 2294170 to d550b3b Compare June 12, 2024 17:07
@dhruvmanila dhruvmanila marked this pull request as ready for review June 12, 2024 17:42
@MichaReiser
Member

It then reduces the nesting level by 1. It shouldn't reset it to 0 because otherwise the recovery from nested list parsing will be incorrect

What happens if we have something like

c
)

In this case, the parser should recover from the unclosed [, but the parser and lexer should remain in a parenthesized context because of the (. Or does that work out of the box already, because ) is a list terminator and the parser won't recover past it?

@dhruvmanila
Member Author

In this case, the parser should recover from the unclosed [, but the parser and lexer should remain in a parenthesized context because of the (. Or does that work out of the box already, because ) is a list terminator and the parser won't recover past it?

Is the code snippet up-to-date? You've mentioned [ and (, which aren't present in the snippet :)

Member

@MichaReiser MichaReiser left a comment

I love this. It's so exciting to see how the hard work is paying off and we now get much better error messages.

I've left some open questions before approving, but this is definitely looking good!

Do you know how our messages compare with CPython or Pyright? Do you see any significant difference?

crates/ruff_python_parser/src/lexer.rs
/// and the caller is responsible for updating its state accordingly.
///
/// This method is a no-op if the lexer isn't in a parenthesized context.
pub(crate) fn re_lex_logical_token(&mut self) -> bool {
Member

What's your reasoning for returning a bool over the re-lexed kind of the token?

Member Author

I think I initially kept it as Option<TokenKind>, but the current usage doesn't really require the token itself. I don't think there's any other particular reason to use bool apart from that.

Member

What I meant was to return the re-lexed token kind, or the current token kind if no re-lexing happens, so that it's just a TokenKind (and in line with lex_token). But if all we need is a bool, then that's fine.

/// previous current token. This also means that the current position of the lexer has changed
/// and the caller is responsible for updating its state accordingly.
///
/// This method is a no-op if the lexer isn't in a parenthesized context.
Member

I would expand the comment with some explanation for which tokens this is relevant. Like which tokens may change in this situation or what kind of different tokens could be emitted.

crates/ruff_python_parser/src/parser/mod.rs
crates/ruff_python_parser/src/token_source.rs
@@ -108,19 +108,5 @@ Module(
1 | class Foo[T1, *T2(a, b):
| ^ Syntax Error: Expected ']', found '('
2 | pass
3 | x = 10
Member

this is pretty cool! Nice to see how all the work accumulated to now having a single, accurate error message.



|
3 | def foo():
Member

Yes, no more EOF parse errors!

Comment on lines +215 to +216
2 | def foo():
| ^^^ Syntax Error: Expected an indented block after function definition
Member

Awesome!

|
28 | # The lexer is nested with multiple levels of parentheses
29 | if call(foo, [a, b
| ^ Syntax Error: Expected ']', found NonLogicalNewline
Member

Should we map NonLogicalNewline to newline? It seems an odd error message to show to users.

Member Author

Yeah, it is odd. I could map it to "newline" in the Display implementation.

@MichaReiser
Member

Is the code snippet up-to-date? You've mentioned [ and (, which aren't present in the snippet :)

Uhm, not sure what happened here

(a, [b, 
	c
)

@dhruvmanila
Member Author

Uhm, not sure what happened here

(a, [b, 
	c
)

Ok, so this is a bit problematic: when the lexer emitted the ) token, it reduced the nesting level by 1, and now the re-lexing also reduces the nesting level by 1, making it 0, which means there'll be a Newline token after c, which is incorrect. We could check whether the current token would reduce the nesting level in the re-lexing logic and avoid reducing it again. I'll need to do some more testing for these kinds of scenarios.
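A toy model of the bookkeeping (my own sketch in Python; the real lexer is in Rust and tracks the level incrementally as it emits tokens):

```python
# Toy nesting-level tracker for the snippet above. When the lexer emits
# `)`, the level has already dropped from 2 to 1 (the `(` is considered
# closed even though `[` is still open). If re-lexing unconditionally
# subtracts 1 more, the level hits 0 and the newline after `c` would
# wrongly be emitted as a logical Newline token.
def nesting_level(source: str) -> int:
    level = 0
    for char in source:
        if char in "([{":
            level += 1
        elif char in ")]}":
            level -= 1
    return level

assert nesting_level("(a, [b,\n\tc\n)") == 1  # still inside the unclosed `[`
```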

@dhruvmanila
Member Author

Ok, I've fixed this bug and expanded the documentation and inline comments.

@dhruvmanila dhruvmanila requested a review from MichaReiser June 13, 2024 10:21
@dhruvmanila
Member Author

Do you know how our messages compare with CPython or Pyright? Do you see any significant difference?

The CPython parser highlights two kinds of errors, which is probably done by the lexer:

  1. Highlighting the opening parenthesis which doesn't have a closing parenthesis:
    if call(foo, [a, b]
           ^
SyntaxError: '(' was never closed
  2. Highlighting the closing parenthesis to consider it a mismatch with the opening one:
    if call(foo, [a, b)
                      ^
SyntaxError: closing parenthesis ')' does not match opening parenthesis '['
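The quoted messages can be reproduced directly against CPython's `compile` built-in (on CPython 3.10 and later, where these tokenizer errors exist):

```python
# Reproduce the two CPython errors quoted above (requires CPython 3.10+).
import sys

def error_message(code: str) -> str:
    try:
        compile(code, "<example>", "exec")
        return ""
    except SyntaxError as exc:
        return exc.msg or ""

if sys.version_info >= (3, 10):
    assert "'(' was never closed" in error_message("if call(foo, [a, b]")
    assert "does not match opening parenthesis '['" in error_message(
        "if call(foo, [a, b)"
    )
```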

While Pyright gets really confused:


codspeed-hq bot commented Jun 14, 2024

CodSpeed Performance Report

Merging #11845 will not alter performance

Comparing dhruv/re-lexing (1989946) with main (1f654ee)

Summary

✅ 30 untouched benchmarks

@dhruvmanila dhruvmanila enabled auto-merge (squash) June 17, 2024 06:44
@dhruvmanila dhruvmanila merged commit 8499abf into main Jun 17, 2024
15 of 16 checks passed
@dhruvmanila dhruvmanila deleted the dhruv/re-lexing branch June 17, 2024 06:47
dhruvmanila added a commit that referenced this pull request Jun 18, 2024
## Summary

This PR is a follow-up to #11845 to add the re-lexing logic for normal
list parsing.

Normal list parsing is basically parsing elements without any
separator in between, i.e., there can only be trivia tokens between
two elements. Currently, this is only being used for parsing
**assignment statements** and **f-string elements**. Assignment
statements cannot be in a parenthesized context, but f-strings can have
curly braces, so this PR is specifically for them.

I don't think this is an ideal recovery, but the problem is that both the
lexer and the parser could add an error for f-strings. If the lexer adds
the error, it'll emit an `Unknown` token instead, while the parser adds
the error directly. I think we'd need to move all f-string errors to be
emitted by the parser instead. This way, the parser can correctly inform
the lexer that it's out of an f-string, and the lexer can pop the
current f-string context off the stack.
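As a concrete example (my own, not from this commit), the kind of broken f-string this targets is one whose replacement field is never closed, which CPython also rejects:

```python
# An unterminated replacement field: the `{` is never closed before the
# string ends. CPython rejects this too; what differs between parsers is
# the quality of the error message and recovery.
def is_syntax_error(code: str) -> bool:
    try:
        compile(code, "<example>", "exec")
        return False
    except SyntaxError:
        return True

assert is_syntax_error("f'{a'")      # `{` never closed before the quote
assert not is_syntax_error("f'{a}'")
```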

## Test Plan

Add test cases, update the snapshots, and run the fuzzer.
Successfully merging this pull request may close these issues:

  • Implement re-lexing logic for improved error recovery