-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Replace autogenerated parser with hand-written parser #9152
Conversation
These nodes are going to be used by the new parser to represent invalid syntax. These nodes are handled in the `ruff_python_formatter` by emitting the text as it is written in the source code. For the type inference in the `ruff_python_semantic`, the new nodes are resolved to `Unknown`. In the `ruff_python_codegen` they are emitted as it is written in the source code.
This commit replaces the autogenerated lalrpop parser in favor of a handwritten parser. Beyond the performance improvements, one of the main differences between the handwritten parser and the autogenerated parser is, the new parser is error persistent. That is, the parser can create an AST even if the source code contains invalid syntax. This also means that, the parser now collects all syntax errors found in the source code, while the old parser would stop parsing after finding the first invalid syntax. The new parser still preserves the old behavior of returning only the first syntax error it encounters. The Python syntax supported by the parser is of the version 3.12. There is a 14.33% (if my math is correct) overall performance improvement with the new parser. Refactor `ParseError` to use `TextRange` for the error location instead of `TextSize`. The following `LexicalErrorType` were moved to `ParseErrorType`: - `PositionalArgumentError` - `UnpackedArgumentError` - `DuplicateArgumentError` - `DuplicateKeywordArgumentError` - `AssignmentError` These errors arise during the parsing phase rather than the lexing phase. The `StringType` is a helper type used when parsing string/f-string literals. Since the parser now handles invalid syntax, we need a way to represent an invalid string/f-string when parsing string/f-string literals. So we add the `Invalid` type to deal with that. Here's how the parser handles invalid strings: For a single invalid string literal, e.g., a string with invalid escape sequence `"a \xxx"`, the resulting AST node for that will be an `Expr::Invalid`. The current implementation for string literal parsing has one limitation when it comes to implicit concatenated strings. We can't know yet, which specific string within the implicit concatenated strings are invalid. That is, the resulting AST node won't have an `Invalid` node for the invalid string literal. Additionally, the parser now parses the syntax mentioned in issue astral-sh#8914.
After updating the branch, new code was created using `ExprName` with the old `id` field type `String`.
The lexer handles a couple of errors that should be handled by the parser, e.g, checking if the `[`, `{` and `(` has its closing counterpart; checking in f-string if the `{` has its closing counterpart. These checks were moved to the parser.
…ts_at` with an offset bigger than zero
Before, errors encountered during the parsing of f-string were emitted as a `LexicalError`.
CodSpeed Performance ReportMerging #9152 will degrade performances by 6.04%Comparing Summary
Benchmarks breakdown
|
The fuzz code was still using the `ParserError` with the `TextSize` field instead of `TextRange`
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm really excited to see this. Thanks for all the work that went into it!
I haven't reviewed yet, but one clarifying question:
The new parser still preserves the old behavior of returning only the first syntax error it encounters. The Python syntax supported by the parser is of the version 3.12.
Do we attempt to lint after encountering a syntax error? Or do we bail (as on main
)?
I think we could lint after encountering syntax errors but I have no idea how to do that at the moment. |
This PR is fantastic! So many users will be surprised that Ruff will get even faster for them, and not just a little! And I can't wait to remove the complexity of the generated parser by some simple Rust code (where changing a single grammar rule doesn't change 10k lines 🤣 )
I'm confused by the results because they indicate that the new parser is slower, but codspeed suggests otherwise. Could you rerun them? It will take me a while to get through reviewing 60k changes 😆 and I have to finish some work this week to avoid losing context before my Christmas/New Year break. I plan to do an in-depth review in the week of the 8th of January. I hope that's okay with you. An important decision we have to make is how we plan to roll out this change because changing the parser changes the foundation of Ruff. Bugs can have far-reaching consequences, from Ruff panicking during formatting / linting to silently changing the program semantics and making Ruff useless in the worst case (because it consistently crashes). Our ecosystem checks and tests give us good coverage for valid programs. But we have little coverage for invalid programs, to which Ruff is exposed when used inside an Editor (and error recovery in parsers is hard). That's why it's worth considering how we can increase our test coverage before enabling this parser in production (not the same as merging PRs). Here are some options that I've been thinking about:
Implementing one or two requires we find a way to switch the parser at runtime. Changing the parser at runtime requires that both parsers are API compatible: they use the same AST, the same or at least similar error recovery, etc. We can either refact our current parser to We can figure out how to best test the parser together and how we can come up with a compatible API, but I'm happy to defer to you because you probably know it the best. Invalid syntax representationI haven't studied it in detail, but it seems that your AST design uses Linting / Formatting programs containing invalid syntaxFor now, I think it might be best to opt out of linting and formatting invalid ASTs to reduce the scope of this PR and get time to think through the implications thoroughly. For example, returning the unformatted code for an invalid expression may remove the parentheses of an expression or lead to invalid syntax if the parent formatting rule assumes how the child gets formatted. What worked well for Biome is to return an ThanksThanks again for working on this. I'm thrilled by the changes and want to see this happening. Let's figure out together how to roll this change out and then enjoy how our users will be stunned that Ruff just got significantely faster! |
Closing in favor of #10036. Big thanks to @LaBatata101 for their work on the new parser. |
(Supersedes #9152, authored by @LaBatata101) ## Summary This PR replaces the current parser generated from LALRPOP to a hand-written recursive descent parser. It also updates the grammar for [PEP 646](https://peps.python.org/pep-0646/) so that the parser outputs the correct AST. For example, in `data[*x]`, the index expression is now a tuple with a single starred expression instead of just a starred expression. Beyond the performance improvements, the parser is also error resilient and can provide better error messages. The behavior as seen by any downstream tools isn't changed. That is, the linter and formatter can still assume that the parser will _stop_ at the first syntax error. This will be updated in the following months. For more details about the change here, refer to the PR corresponding to the individual commits and the release blog post. ## Test Plan Write _lots_ and _lots_ of tests for both valid and invalid syntax and verify the output. ## Acknowledgements - @MichaReiser for reviewing 100+ parser PRs and continuously providing guidance throughout the project - @LaBatata101 for initiating the transition to a hand-written parser in #9152 - @addisoncrump for implementing the fuzzer which helped [catch](#10903) [a](#10910) [lot](#10966) [of](#10896) [bugs](#10877) --------- Co-authored-by: Victor Hugo Gomes <[email protected]> Co-authored-by: Micha Reiser <[email protected]>
This PR replaces the autogenerated lalrpop parser in favor of a handwritten parser. Beyond the performance improvements, one of the main differences between the handwritten parser and the autogenerated parser is, the new parser is error persistent. That is, the parser can create an AST even if the source code contains invalid syntax. This also means that, the parser now collects all syntax errors found in the source code, while the old parser would stop parsing after finding the first invalid syntax.
The new parser also enables us to create more helpful error messages, for example, in the old parser the following code would generate this error message:
Resulting error message "invalid syntax. Got unexpected token '*' ", in the new parser the error message is
starred expression "*a
is not allowed in ayield from
statement"The new parser still preserves the old behavior of returning only the first syntax error it encounters. The Python syntax supported by the parser is of the version 3.12.
Summary of Changes
Changes in the AST
Identifier
node to useSmolStr
instead ofString
Invalid
node toExpr
,Pattern
andFStringElement
Changes to error types
ParseError
andLexicalError
to useTextRange
for the error location instead ofTextSize
.FStringError
toParseError
, the lexer used to check if the{
in the f-string was unclosed, now this check happens in the parser.{
and}
LexicalErrorType
were moved toParseErrorType
:PositionalArgumentError
UnpackedArgumentError
DuplicateArgumentError
DuplicateKeywordArgumentError
AssignmentError
A couple of errors checks that are found during the parse phase were being checked in the lexing phase, now these error checks are in the parser code. For example, the lexer used to check if the
(
,[
or{
had its closing counterpart.Linter
The changes made in the linter code are mostly due to the type ofIdentifier
changing fromString
toSmolStr
.Formatter
The changes made in the formatter code are to handle the new
Invalid
nodes, the way I chose to handle these nodes in the fromatter was by just displaying the text as it was written in the source code.Misc
The
StringType
is a helper type used when parsing string/f-string literals. Since the parser now handles invalid syntax, we need a way to represent an invalid string/f-string when parsing string/f-string literals. So we add theInvalid
type to deal with that.Here's how the parser handles invalid strings:
For a single invalid string literal, e.g., a string with invalid escape sequence
"a \xxx"
, the resulting AST node for that will be anExpr::Invalid
. The current implementation for string literal parsing has one limitation when it comes to implicit concatenated strings. We can't know yet which specific string within the implicit concatenated strings are invalid. That is, the resulting AST node won't have anInvalid
node for the invalid string literal.Fixes #8914
Test Plan
Tested with
cargo test --lib
.The previous parser tests were moved to the
ruff_python_parser/src/parser/tests
.I also added more tests in
ruff_python_parser/src/parser/tests/parser.rs