Replace autogenerated parser with hand-written parser #9152

Closed
wants to merge 122 commits

Conversation

@LaBatata101 (Contributor) commented Dec 15, 2023

This PR replaces the autogenerated lalrpop parser with a handwritten parser. Beyond the performance improvements, one of the main differences between the handwritten parser and the autogenerated one is that the new parser is error resilient. That is, the parser can create an AST even if the source code contains invalid syntax. This also means that the parser now collects all syntax errors found in the source code, whereas the old parser would stop parsing after finding the first invalid syntax.

The new parser also enables us to create more helpful error messages. For example, with the old parser the following code:

yield from *a

would result in the error message "invalid syntax. Got unexpected token '*'", while with the new parser the error message is "starred expression *a is not allowed in a yield from statement".

The new parser still preserves the old behavior of returning only the first syntax error it encounters. The parser supports Python 3.12 syntax.

Summary of Changes

Changes in the AST

  • Change Identifier node to use SmolStr instead of String
  • Add Invalid node to Expr, Pattern and FStringElement (a rough sketch of the idea follows this list)
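
The sketch below is a minimal, self-contained illustration of the idea, not ruff's actual AST definitions: an Invalid variant records the source range that failed to parse, so consumers can always fall back to the original text. The names and field layout here are assumptions made for illustration only.

```rust
use std::ops::Range;

#[allow(dead_code)]
#[derive(Debug)]
enum Expr {
    Name { id: String, range: Range<usize> },
    // Hypothetical variant covering a span of source that could not be parsed.
    Invalid { range: Range<usize> },
}

// Recover the original text for any node; `Invalid` nodes simply echo the source.
fn node_text<'a>(expr: &Expr, source: &'a str) -> &'a str {
    match expr {
        Expr::Name { range, .. } | Expr::Invalid { range } => &source[range.clone()],
    }
}

fn main() {
    let source = "yield from *a";
    // Pretend the parser recovered from the error and produced an `Invalid`
    // node for the starred expression `*a` (byte offsets 11..13 in `source`).
    let node = Expr::Invalid { range: 11..13 };
    assert_eq!(node_text(&node, source), "*a");
}
```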

Changes to error types

  • Refactor ParseError and LexicalError to use TextRange for the error location instead of TextSize (see the sketch at the end of this section).
  • Add FStringError to ParseError. The lexer used to check whether a { in an f-string was unclosed; this check now happens in the parser.
  • Add a new error in the lexer that checks whether a Unicode escape sequence has its { and }
  • The following LexicalErrorType were moved to ParseErrorType:
    • PositionalArgumentError
    • UnpackedArgumentError
    • DuplicateArgumentError
    • DuplicateKeywordArgumentError
    • AssignmentError

A couple of error checks that belong to the parse phase were previously performed in the lexing phase; these checks now live in the parser code. For example, the lexer used to check whether every (, [ or { had its closing counterpart.
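
Here is a rough sketch of what the reworked error types can look like. It uses plain structs in place of ruff's TextSize/TextRange and made-up field names; it only illustrates the shift from a single offset to a full range, and the former LexicalErrorType variants migrating into ParseErrorType.

```rust
#[derive(Debug, Clone, Copy)]
struct TextRange {
    start: u32,
    end: u32,
}

#[allow(dead_code)]
#[derive(Debug)]
enum ParseErrorType {
    // Formerly `LexicalErrorType` variants; these checks now run while parsing.
    UnpackedArgumentError,
    DuplicateArgumentError(String),
    OtherError(String),
}

#[derive(Debug)]
struct ParseError {
    error: ParseErrorType,
    // Previously a single offset (`TextSize`); now the full range of the
    // offending syntax.
    location: TextRange,
}

fn main() {
    let err = ParseError {
        error: ParseErrorType::OtherError("starred expression is not allowed here".into()),
        location: TextRange { start: 11, end: 13 },
    };
    println!("{:?} at {}..{}", err.error, err.location.start, err.location.end);
}
```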

Linter

The changes made in the linter code are mostly due to the type of Identifier changing from String to SmolStr.

Formatter

The changes made in the formatter code handle the new Invalid nodes; the way I chose to handle these nodes in the formatter was to simply emit the text as it was written in the source code.

Misc

The StringType is a helper type used when parsing string/f-string literals. Since the parser now handles invalid syntax, we need a way to represent an invalid string/f-string when parsing string/f-string literals. So we add the Invalid type to deal with that.

Here's how the parser handles invalid strings:

For a single invalid string literal, e.g., a string with an invalid escape sequence such as "a \xxx", the resulting AST node will be an Expr::Invalid. The current implementation of string literal parsing has one limitation when it comes to implicitly concatenated strings: we can't yet know which specific string within the implicitly concatenated strings is invalid. That is, the resulting AST node won't have an Invalid node for the invalid string literal.
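
As a toy illustration of the idea (not the real string parsing code; the names and the escape check are invented), an invalid escape sequence turns the whole literal into an Invalid value:

```rust
#[allow(dead_code)]
enum StringType {
    Str(String),
    FString(String),
    // The literal failed to parse (e.g. it contains an invalid escape sequence);
    // keep the raw text around so it can be reported or echoed back.
    Invalid(String),
}

// Toy stand-in for real escape validation: a `\x` escape must be followed by
// two hex digits.
fn parse_string_literal(raw: &str) -> StringType {
    let has_bad_escape = raw
        .split("\\x")
        .skip(1)
        .any(|rest| rest.get(..2).map_or(true, |hex| !hex.chars().all(|c| c.is_ascii_hexdigit())));
    if has_bad_escape {
        StringType::Invalid(raw.to_string())
    } else {
        StringType::Str(raw.to_string())
    }
}

fn main() {
    // `"a \xxx"` has an invalid `\x` escape, so the whole literal becomes `Invalid`.
    // For implicitly concatenated strings like `"a \xxx" "b"`, the parser can't
    // yet point at which of the two parts is the invalid one.
    match parse_string_literal(r"a \xxx") {
        StringType::Invalid(text) => println!("invalid literal: {text}"),
        _ => println!("valid literal"),
    }
}
```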

Fixes #8914

Test Plan

Tested with cargo test --lib.
The previous parser tests were moved to the ruff_python_parser/src/parser/tests.
I also added more tests in ruff_python_parser/src/parser/tests/parser.rs

LaBatata101 and others added 14 commits December 15, 2023 17:26
These nodes are going to be used by the new parser to represent invalid syntax.

These nodes are handled in the `ruff_python_formatter` by emitting the text as
it is written in the source code.

For the type inference in the `ruff_python_semantic`, the new nodes are resolved
to `Unknown`.

In `ruff_python_codegen` they are emitted as they are written in the source code.
This commit replaces the autogenerated lalrpop parser with a handwritten parser.
Beyond the performance improvements, one of the main differences between the
handwritten parser and the autogenerated one is that the new parser is error
resilient. That is, the parser can create an AST even if the source code contains
invalid syntax. This also means that the parser now collects all syntax errors
found in the source code, whereas the old parser would stop parsing after finding
the first invalid syntax.

The new parser still preserves the old behavior of returning only the first syntax
error it encounters. The parser supports Python 3.12 syntax.

There is a 14.33% (if my math is correct) overall performance improvement with the new parser.

Refactor `ParseError` to use `TextRange` for the error location instead of `TextSize`.

The following `LexicalErrorType` were moved to `ParseErrorType`:
  - `PositionalArgumentError`
  - `UnpackedArgumentError`
  - `DuplicateArgumentError`
  - `DuplicateKeywordArgumentError`
  - `AssignmentError`
These errors arise during the parsing phase rather than the lexing phase.

The `StringType` is a helper type used when parsing string/f-string literals. Since the
parser now handles invalid syntax, we need a way to represent an invalid string/f-string
when parsing string/f-string literals. So we add the `Invalid` type to deal with that.

Here's how the parser handles invalid strings:

For a single invalid string literal, e.g., a string with invalid escape sequence `"a \xxx"`,
the resulting AST node for that will be an `Expr::Invalid`. The current implementation for
string literal parsing has one limitation when it comes to implicit concatenated strings.

We can't yet know which specific string within the implicitly concatenated strings is invalid.
That is, the resulting AST node won't have an `Invalid` node for the invalid string literal.

Additionally, the parser now parses the syntax mentioned in issue astral-sh#8914.
After updating the branch, new code was created using `ExprName` with the old `id`
field type `String`.
The lexer handled a couple of errors that should be handled by the parser, e.g.,
checking whether `[`, `{` and `(` have their closing counterparts, and checking whether
a `{` in an f-string has its closing counterpart. These checks were moved to the parser.
Previously, errors encountered during the parsing of f-strings were emitted as a
`LexicalError`.

codspeed-hq bot commented Dec 15, 2023

CodSpeed Performance Report

Merging #9152 will degrade performances by 6.04%

Comparing LaBatata101:new-parser (f8e577d) with main (fe79798)

Summary

❌ 3 regressions
✅ 27 untouched benchmarks

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Benchmarks breakdown

| Benchmark | main | LaBatata101:new-parser | Change |
| --- | --- | --- | --- |
| parser[large/dataset.py] | 63.6 ms | 67.5 ms | -5.81% |
| parser[numpy/ctypeslib.py] | 10.8 ms | 11.5 ms | -6.04% |
| parser[unicode/pypinyin.py] | 3.8 ms | 3.9 ms | -4.7% |

The fuzz code was still using the `ParserError` with the `TextSize` field instead of
`TextRange`.
@charliermarsh (Member) left a comment


I'm really excited to see this. Thanks for all the work that went into it!

I haven't reviewed yet, but one clarifying question:

The new parser still preserves the old behavior of returning only the first syntax error it encounters. The Python syntax supported by the parser is of the version 3.12.

Do we attempt to lint after encountering a syntax error? Or do we bail (as on main)?

@dhruvmanila added the parser (Related to the parser) and performance (Potential performance improvement) labels on Dec 16, 2023
@LaBatata101 (Contributor, Author)

I'm really excited to see this. Thanks for all the work that went into it!

I haven't reviewed yet, but one clarifying question:

The new parser still preserves the old behavior of returning only the first syntax error it encounters. The Python syntax supported by the parser is of the version 3.12.

Do we attempt to lint after encountering a syntax error? Or do we bail (as on main)?

I think we could lint after encountering syntax errors, but I have no idea how to do that at the moment.

@MichaReiser (Member) commented Dec 18, 2023

This PR is fantastic! So many users will be surprised that Ruff will get even faster for them, and not just a little! And I can't wait to replace the complexity of the generated parser with some simple Rust code (where changing a single grammar rule doesn't change 10k lines 🤣)

Here are the benchmark results, run on an Intel Core i5-8265U.

I'm confused by the results because they indicate that the new parser is slower, but codspeed suggests otherwise. Could you rerun them?

It will take me a while to get through reviewing 60k changes 😆 and I have to finish some work this week to avoid losing context before my Christmas/New Year break. I plan to do an in-depth review in the week of the 8th of January. I hope that's okay with you.

An important decision we have to make is how we plan to roll out this change because changing the parser changes the foundation of Ruff. Bugs can have far-reaching consequences, from Ruff panicking during formatting / linting to silently changing the program semantics and making Ruff useless in the worst case (because it consistently crashes).

Our ecosystem checks and tests give us good coverage for valid programs. But we have little coverage for invalid programs, to which Ruff is exposed when used inside an Editor (and error recovery in parsers is hard). That's why it's worth considering how we can increase our test coverage before enabling this parser in production (not the same as merging PRs). Here are some options that I've been thinking about:

  1. Enable the new parser when running Ruff with --preview before switching the default
  2. Use fuzzing and compare the parser/linter/formatter output between Ruff using the generated and the hand-written parser. @addisoncrump or @qarmin might be able to help us with setting up fuzzing. They have done excellent fuzzing work in the past.
  3. Run Ruff over a more extensive set of repositories (by extending the ecosystem check and running it locally). However, this will only help a little with invalid syntax because users rarely commit invalid files.

Implementing options one or two requires that we find a way to switch the parser at runtime. Changing the parser at runtime requires that both parsers are API compatible: they use the same AST, the same or at least similar error recovery, etc. We can either refactor our current parser to use SmolStr OR change your parser to emit Strings for now and later change it to SmolStr everywhere. Implementing error recovery in our existing parser doesn't make sense. But we can pretend that your new parser doesn't support error recovery by returning an Err if the parser emits any diagnostic.
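
A sketch of that compatibility shim, with hypothetical type and function names standing in for the real parser API, might look like this:

```rust
struct Program;                 // stand-in for the recovered AST
#[allow(dead_code)]
struct ParseError(String);      // stand-in for ruff's ParseError

struct Parsed {
    ast: Program,
    errors: Vec<ParseError>,    // all diagnostics collected during recovery
}

// Imaginary entry point of the new, error-recovering parser.
fn parse_with_recovery(_source: &str) -> Parsed {
    Parsed { ast: Program, errors: Vec::new() }
}

// Old-style API: pretend there is no error recovery.
fn parse(source: &str) -> Result<Program, ParseError> {
    let mut parsed = parse_with_recovery(source);
    if parsed.errors.is_empty() {
        Ok(parsed.ast)
    } else {
        // Keep the old behavior: surface only the first syntax error.
        Err(parsed.errors.remove(0))
    }
}

fn main() {
    assert!(parse("x = 1").is_ok());
}
```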

We can figure out how to best test the parser together and how we can come up with a compatible API, but I'm happy to defer to you because you probably know it the best.

Invalid syntax representation

I haven't studied it in detail, but it seems that your AST design uses Invalid nodes when encountering invalid syntax, similar to what Biome, Swift, and, to some extent, Roslyn do. I would like to learn more about which alternatives you considered and why you decided on this design (and the granularity of invalid nodes) vs. creating missing placeholder nodes or other designs.

Linting / Formatting programs containing invalid syntax

For now, I think it might be best to opt out of linting and formatting invalid ASTs to reduce the scope of this PR and get time to think through the implications thoroughly.

For example, returning the unformatted code for an invalid expression may remove the parentheses of an expression or lead to invalid syntax if the parent formatting rule assumes how the child gets formatted. What worked well for Biome is to return an Err(FormatError::InvalidSyntax) for invalid nodes that aren't statements.
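
A small sketch of that Biome-style strategy, again with invented types rather than the ruff_python_formatter API: formatting an invalid node that isn't a statement fails, and the caller decides how to degrade (e.g. keep the original text for the whole statement).

```rust
#[derive(Debug)]
enum FormatError {
    InvalidSyntax,
}

enum Node {
    Expr(String),
    InvalidExpr,
}

fn format_node(node: &Node) -> Result<String, FormatError> {
    match node {
        Node::Expr(text) => Ok(text.clone()),
        // Refuse to guess what the user meant; let the parent handle it.
        Node::InvalidExpr => Err(FormatError::InvalidSyntax),
    }
}

fn main() {
    assert!(matches!(format_node(&Node::InvalidExpr), Err(FormatError::InvalidSyntax)));
    assert_eq!(format_node(&Node::Expr("1 + 2".into())).unwrap(), "1 + 2");
}
```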

Thanks

Thanks again for working on this. I'm thrilled by the changes and want to see this happen. Let's figure out together how to roll this change out and then enjoy how our users will be stunned that Ruff just got significantly faster!

@MichaReiser self-assigned this on Dec 18, 2023
@dhruvmanila (Member)

Closing in favor of #10036. Big thanks to @LaBatata101 for their work on the new parser.

dhruvmanila added a commit that referenced this pull request Apr 18, 2024
(Supersedes #9152, authored by @LaBatata101)

## Summary

This PR replaces the current parser generated from LALRPOP with a
hand-written recursive descent parser.

It also updates the grammar for [PEP
646](https://peps.python.org/pep-0646/) so that the parser outputs the
correct AST. For example, in `data[*x]`, the index expression is now a
tuple with a single starred expression instead of just a starred
expression.

Beyond the performance improvements, the parser is also error resilient
and can provide better error messages. The behavior as seen by any
downstream tools isn't changed. That is, the linter and formatter can
still assume that the parser will _stop_ at the first syntax error. This
will be updated in the following months.

For more details about the change here, refer to the PR corresponding to
the individual commits and the release blog post.

## Test Plan

Write _lots_ and _lots_ of tests for both valid and invalid syntax and
verify the output.

## Acknowledgements

- @MichaReiser for reviewing 100+ parser PRs and continuously providing
guidance throughout the project
- @LaBatata101 for initiating the transition to a hand-written parser in
#9152
- @addisoncrump for implementing the fuzzer which helped
[catch](#10903)
[a](#10910)
[lot](#10966)
[of](#10896)
[bugs](#10877)

---------

Co-authored-by: Victor Hugo Gomes <[email protected]>
Co-authored-by: Micha Reiser <[email protected]>
Labels: parser (Related to the parser), performance (Potential performance improvement)
Projects: None yet
Development: Successfully merging this pull request may close these issues:
  • Ruff fails to parse WithItem with a starred expression as optional_vars
7 participants