Replace autogenerated parser with hand-written parser #9152

Closed
wants to merge 122 commits

Conversation

@LaBatata101 (Contributor) commented Dec 15, 2023

This PR replaces the autogenerated lalrpop parser with a handwritten parser. Beyond the performance improvements, one of the main differences between the handwritten parser and the autogenerated one is that the new parser is error resilient. That is, the parser can create an AST even if the source code contains invalid syntax. This also means that the parser now collects all syntax errors found in the source code, whereas the old parser would stop parsing after finding the first invalid syntax.

The new parser also enables us to create more helpful error messages. For example, with the old parser the following code:

yield from *a

would result in the error message "invalid syntax. Got unexpected token '*'", while with the new parser the error message is "starred expression *a is not allowed in a yield from statement".

The new parser still preserves the old behavior of returning only the first syntax error it encounters. The parser supports Python 3.12 syntax.

Summary of Changes

Changes in the AST

  • Change Identifier node to use SmolStr instead of String
  • Add Invalid node to Expr, Pattern and FStringElement (a rough sketch of the idea follows this list)
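
The sketch below is a minimal, self-contained illustration of the idea, not ruff's actual AST definitions: an Invalid variant records the source range that failed to parse, so consumers can always fall back to the original text. The names and field layout here are assumptions made for illustration only.

```rust
use std::ops::Range;

#[allow(dead_code)]
#[derive(Debug)]
enum Expr {
    Name { id: String, range: Range<usize> },
    // Hypothetical variant covering a span of source that could not be parsed.
    Invalid { range: Range<usize> },
}

// Recover the original text for any node; `Invalid` nodes simply echo the source.
fn node_text<'a>(expr: &Expr, source: &'a str) -> &'a str {
    match expr {
        Expr::Name { range, .. } | Expr::Invalid { range } => &source[range.clone()],
    }
}

fn main() {
    let source = "yield from *a";
    // Pretend the parser recovered from the error and produced an `Invalid`
    // node for the starred expression `*a` (byte offsets 11..13 in `source`).
    let node = Expr::Invalid { range: 11..13 };
    assert_eq!(node_text(&node, source), "*a");
}
```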

Changes to error types

  • Refactor ParseError and LexicalError to use TextRange for the error location instead of TextSize (see the sketch at the end of this section).
  • Add FStringError to ParseError. The lexer used to check whether a { in an f-string was unclosed; this check now happens in the parser.
  • Add a new error in the lexer that checks whether a Unicode escape sequence has its { and }
  • The following LexicalErrorType were moved to ParseErrorType:
    • PositionalArgumentError
    • UnpackedArgumentError
    • DuplicateArgumentError
    • DuplicateKeywordArgumentError
    • AssignmentError

A couple of error checks that belong to the parse phase were previously performed in the lexing phase; these checks now live in the parser code. For example, the lexer used to check whether every (, [ or { had its closing counterpart.
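
Here is a rough sketch of what the reworked error types can look like. It uses plain structs in place of ruff's TextSize/TextRange and made-up field names; it only illustrates the shift from a single offset to a full range, and the former LexicalErrorType variants migrating into ParseErrorType.

```rust
#[derive(Debug, Clone, Copy)]
struct TextRange {
    start: u32,
    end: u32,
}

#[allow(dead_code)]
#[derive(Debug)]
enum ParseErrorType {
    // Formerly `LexicalErrorType` variants; these checks now run while parsing.
    UnpackedArgumentError,
    DuplicateArgumentError(String),
    OtherError(String),
}

#[derive(Debug)]
struct ParseError {
    error: ParseErrorType,
    // Previously a single offset (`TextSize`); now the full range of the
    // offending syntax.
    location: TextRange,
}

fn main() {
    let err = ParseError {
        error: ParseErrorType::OtherError("starred expression is not allowed here".into()),
        location: TextRange { start: 11, end: 13 },
    };
    println!("{:?} at {}..{}", err.error, err.location.start, err.location.end);
}
```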

Linter

The changes made in the linter code are mostly due to the type of Identifier changing from String to SmolStr.

Formatter

The changes made in the formatter code handle the new Invalid nodes; the way I chose to handle these nodes in the formatter was to simply emit the text as it was written in the source code.

Misc

The StringType is a helper type used when parsing string/f-string literals. Since the parser now handles invalid syntax, we need a way to represent an invalid string/f-string when parsing string/f-string literals. So we add the Invalid type to deal with that.

Here's how the parser handles invalid strings:

For a single invalid string literal, e.g., a string with an invalid escape sequence such as "a \xxx", the resulting AST node will be an Expr::Invalid. The current implementation of string literal parsing has one limitation when it comes to implicitly concatenated strings: we can't yet know which specific string within the implicitly concatenated strings is invalid. That is, the resulting AST node won't have an Invalid node for the invalid string literal.
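
As a toy illustration of the idea (not the real string parsing code; the names and the escape check are invented), an invalid escape sequence turns the whole literal into an Invalid value:

```rust
#[allow(dead_code)]
enum StringType {
    Str(String),
    FString(String),
    // The literal failed to parse (e.g. it contains an invalid escape sequence);
    // keep the raw text around so it can be reported or echoed back.
    Invalid(String),
}

// Toy stand-in for real escape validation: a `\x` escape must be followed by
// two hex digits.
fn parse_string_literal(raw: &str) -> StringType {
    let has_bad_escape = raw
        .split("\\x")
        .skip(1)
        .any(|rest| rest.get(..2).map_or(true, |hex| !hex.chars().all(|c| c.is_ascii_hexdigit())));
    if has_bad_escape {
        StringType::Invalid(raw.to_string())
    } else {
        StringType::Str(raw.to_string())
    }
}

fn main() {
    // `"a \xxx"` has an invalid `\x` escape, so the whole literal becomes `Invalid`.
    // For implicitly concatenated strings like `"a \xxx" "b"`, the parser can't
    // yet point at which of the two parts is the invalid one.
    match parse_string_literal(r"a \xxx") {
        StringType::Invalid(text) => println!("invalid literal: {text}"),
        _ => println!("valid literal"),
    }
}
```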

Fixes #8914

Test Plan

Tested with cargo test --lib.
The previous parser tests were moved to the ruff_python_parser/src/parser/tests.
I also added more tests in ruff_python_parser/src/parser/tests/parser.rs

LaBatata101 and others added 14 commits December 15, 2023 17:26
These nodes are going to be used by the new parser to represent invalid syntax.

These nodes are handled in the `ruff_python_formatter` by emitting the text as
it is written in the source code.

For the type inference in the `ruff_python_semantic`, the new nodes are resolved
to `Unknown`.

In `ruff_python_codegen` they are emitted as they are written in the source code.
This commit replaces the autogenerated lalrpop parser with a handwritten parser.
Beyond the performance improvements, one of the main differences between the
handwritten parser and the autogenerated one is that the new parser is error
resilient. That is, the parser can create an AST even if the source code contains
invalid syntax. This also means that the parser now collects all syntax errors
found in the source code, whereas the old parser would stop parsing after finding
the first invalid syntax.

The new parser still preserves the old behavior of returning only the first syntax
error it encounters. The parser supports Python 3.12 syntax.

There is a 14.33% (if my math is correct) overall performance improvement with the new parser.

Refactor `ParseError` to use `TextRange` for the error location instead of `TextSize`.

The following `LexicalErrorType` were moved to `ParseErrorType`:
  - `PositionalArgumentError`
  - `UnpackedArgumentError`
  - `DuplicateArgumentError`
  - `DuplicateKeywordArgumentError`
  - `AssignmentError`
These errors arise during the parsing phase rather than the lexing phase.

The `StringType` is a helper type used when parsing string/f-string literals. Since the
parser now handles invalid syntax, we need a way to represent an invalid string/f-string
when parsing string/f-string literals. So we add the `Invalid` type to deal with that.

Here's how the parser handles invalid strings:

For a single invalid string literal, e.g., a string with invalid escape sequence `"a \xxx"`,
the resulting AST node for that will be an `Expr::Invalid`. The current implementation for
string literal parsing has one limitation when it comes to implicit concatenated strings.

We can't yet know which specific string within the implicitly concatenated strings is invalid.
That is, the resulting AST node won't have an `Invalid` node for the invalid string literal.

Additionally, the parser now parses the syntax mentioned in issue astral-sh#8914.
After updating the branch, new code was created using `ExprName` with the old `id`
field type `String`.
The lexer handled a couple of errors that should be handled by the parser, e.g.,
checking whether `[`, `{` and `(` have their closing counterparts, and checking whether
a `{` in an f-string has its closing counterpart. These checks were moved to the parser.
Previously, errors encountered during the parsing of f-strings were emitted as a
`LexicalError`.

codspeed-hq bot commented Dec 15, 2023

CodSpeed Performance Report

Merging #9152 will degrade performances by 6.04%

Comparing LaBatata101:new-parser (f8e577d) with main (fe79798)

Summary

❌ 3 regressions
✅ 27 untouched benchmarks

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Benchmarks breakdown

| Benchmark | main | LaBatata101:new-parser | Change |
| --- | --- | --- | --- |
| parser[large/dataset.py] | 63.6 ms | 67.5 ms | -5.81% |
| parser[numpy/ctypeslib.py] | 10.8 ms | 11.5 ms | -6.04% |
| parser[unicode/pypinyin.py] | 3.8 ms | 3.9 ms | -4.7% |

The fuzz code was still using the `ParserError` with the `TextSize` field instead of
`TextRange`.
@charliermarsh (Member) left a comment


I'm really excited to see this. Thanks for all the work that went into it!

I haven't reviewed yet, but one clarifying question:

The new parser still preserves the old behavior of returning only the first syntax error it encounters. The Python syntax supported by the parser is of the version 3.12.

Do we attempt to lint after encountering a syntax error? Or do we bail (as on main)?

@dhruvmanila added the parser (Related to the parser) and performance (Potential performance improvement) labels on Dec 16, 2023
@LaBatata101 (Contributor, Author)

I'm really excited to see this. Thanks for all the work that went into it!

I haven't reviewed yet, but one clarifying question:

The new parser still preserves the old behavior of returning only the first syntax error it encounters. The Python syntax supported by the parser is of the version 3.12.

Do we attempt to lint after encountering a syntax error? Or do we bail (as on main)?

I think we could lint after encountering syntax errors, but I have no idea how to do that at the moment.

@MichaReiser (Member) commented Dec 18, 2023

This PR is fantastic! So many users will be surprised that Ruff will get even faster for them, and not just a little! And I can't wait to replace the complexity of the generated parser with some simple Rust code (where changing a single grammar rule doesn't change 10k lines 🤣)

Here are the benchmark results, run on an Intel Core i5-8265U.

I'm confused by the results because they indicate that the new parser is slower, but codspeed suggests otherwise. Could you rerun them?

It will take me a while to get through reviewing 60k changes 😆 and I have to finish some work this week to avoid losing context before my Christmas/New Year break. I plan to do an in-depth review in the week of the 8th of January. I hope that's okay with you.

An important decision we have to make is how we plan to roll out this change because changing the parser changes the foundation of Ruff. Bugs can have far-reaching consequences, from Ruff panicking during formatting / linting to silently changing the program semantics and making Ruff useless in the worst case (because it consistently crashes).

Our ecosystem checks and tests give us good coverage for valid programs. But we have little coverage for invalid programs, to which Ruff is exposed when used inside an Editor (and error recovery in parsers is hard). That's why it's worth considering how we can increase our test coverage before enabling this parser in production (not the same as merging PRs). Here are some options that I've been thinking about:

  1. Enable the new parser when running Ruff with --preview before switching the default
  2. Use fuzzing and compare the parser/linter/formatter output between Ruff using the generated and the hand-written parser. @addisoncrump or @qarmin might be able to help us with setting up fuzzing. They have done excellent fuzzing work in the past.
  3. Run Ruff over a more extensive set of repositories (by extending the ecosystem check and running it locally). However, this will only help a little with invalid syntax because users rarely commit invalid files.

Implementing options one or two requires that we find a way to switch the parser at runtime. Changing the parser at runtime requires that both parsers are API compatible: they use the same AST, the same or at least similar error recovery, etc. We can either refactor our current parser to use SmolStr OR change your parser to emit Strings for now and later change it to SmolStr everywhere. Implementing error recovery in our existing parser doesn't make sense. But we can pretend that your new parser doesn't support error recovery by returning an Err if the parser emits any diagnostic.
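
A sketch of that compatibility shim, with hypothetical type and function names standing in for the real parser API, might look like this:

```rust
struct Program;                 // stand-in for the recovered AST
#[allow(dead_code)]
struct ParseError(String);      // stand-in for ruff's ParseError

struct Parsed {
    ast: Program,
    errors: Vec<ParseError>,    // all diagnostics collected during recovery
}

// Imaginary entry point of the new, error-recovering parser.
fn parse_with_recovery(_source: &str) -> Parsed {
    Parsed { ast: Program, errors: Vec::new() }
}

// Old-style API: pretend there is no error recovery.
fn parse(source: &str) -> Result<Program, ParseError> {
    let mut parsed = parse_with_recovery(source);
    if parsed.errors.is_empty() {
        Ok(parsed.ast)
    } else {
        // Keep the old behavior: surface only the first syntax error.
        Err(parsed.errors.remove(0))
    }
}

fn main() {
    assert!(parse("x = 1").is_ok());
}
```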

We can figure out how to best test the parser together and how we can come up with a compatible API, but I'm happy to defer to you because you probably know it the best.

Invalid syntax representation

I haven't studied it in detail, but it seems that your AST design uses Invalid nodes when encountering invalid syntax, similar to what Biome, Swift, and, to some extent, Roslyn do. I would like to learn more about which alternatives you considered and why you decided on this design (and the granularity of invalid nodes) vs. creating missing placeholder nodes or other designs.

Linting / Formatting programs containing invalid syntax

For now, I think it might be best to opt out of linting and formatting invalid ASTs to reduce the scope of this PR and get time to think through the implications thoroughly.

For example, returning the unformatted code for an invalid expression may remove the parentheses of an expression or lead to invalid syntax if the parent formatting rule assumes how the child gets formatted. What worked well for Biome is to return an Err(FormatError::InvalidSyntax) for invalid nodes that aren't statements.
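
A small sketch of that Biome-style strategy, again with invented types rather than the ruff_python_formatter API: formatting an invalid node that isn't a statement fails, and the caller decides how to degrade (e.g. keep the original text for the whole statement).

```rust
#[derive(Debug)]
enum FormatError {
    InvalidSyntax,
}

enum Node {
    Expr(String),
    InvalidExpr,
}

fn format_node(node: &Node) -> Result<String, FormatError> {
    match node {
        Node::Expr(text) => Ok(text.clone()),
        // Refuse to guess what the user meant; let the parent handle it.
        Node::InvalidExpr => Err(FormatError::InvalidSyntax),
    }
}

fn main() {
    assert!(matches!(format_node(&Node::InvalidExpr), Err(FormatError::InvalidSyntax)));
    assert_eq!(format_node(&Node::Expr("1 + 2".into())).unwrap(), "1 + 2");
}
```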

Thanks

Thanks again for working on this. I'm thrilled by the changes and want to see this happen. Let's figure out together how to roll this change out and then enjoy how our users will be stunned that Ruff just got significantly faster!

@MichaReiser self-assigned this on Dec 18, 2023
@dhruvmanila (Member)

Closing in favor of #10036. Big thanks to @LaBatata101 for their work on the new parser.

dhruvmanila added a commit that referenced this pull request Apr 18, 2024
(Supersedes #9152, authored by @LaBatata101)

## Summary

This PR replaces the current parser generated from LALRPOP with a
hand-written recursive descent parser.

It also updates the grammar for [PEP
646](https://peps.python.org/pep-0646/) so that the parser outputs the
correct AST. For example, in `data[*x]`, the index expression is now a
tuple with a single starred expression instead of just a starred
expression.

Beyond the performance improvements, the parser is also error resilient
and can provide better error messages. The behavior as seen by any
downstream tools isn't changed. That is, the linter and formatter can
still assume that the parser will _stop_ at the first syntax error. This
will be updated in the following months.

For more details about the change here, refer to the PR corresponding to
the individual commits and the release blog post.

## Test Plan

Write _lots_ and _lots_ of tests for both valid and invalid syntax and
verify the output.

## Acknowledgements

- @MichaReiser for reviewing 100+ parser PRs and continuously providing
guidance throughout the project
- @LaBatata101 for initiating the transition to a hand-written parser in
#9152
- @addisoncrump for implementing the fuzzer which helped
[catch](#10903)
[a](#10910)
[lot](#10966)
[of](#10896)
[bugs](#10877)

---------

Co-authored-by: Victor Hugo Gomes <[email protected]>
Co-authored-by: Micha Reiser <[email protected]>
Labels: parser (Related to the parser), performance (Potential performance improvement)
Projects: None yet
Development: Successfully merging this pull request may close these issues:
  • Ruff fails to parse WithItem with a starred expression as optional_vars
7 participants