Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove type parameter from parse_* methods #9466

Merged
merged 1 commit into from
Jan 11, 2024
Merged

Remove type parameter from parse_* methods #9466

merged 1 commit into from
Jan 11, 2024

Conversation

MichaReiser
Copy link
Member

@MichaReiser MichaReiser commented Jan 11, 2024

This PR removes the I (tokens iterator) type parameter from all parse_* methods. This is done in preparation for #9152 to avoid accidentally monomorphizing the parser more than once.

The Parser struct defined in #9152 is parametrized by the underlying tokens iterator. This is dangerous because Rust will monomorphize the entire parser for each distinct I type parameter.

This PR removes the I type parameters from all parse_* methods and introduces a new TokenSource type in the parser that:

  • Filters out trivia tokens
  • Allows lookahead without the need for MultiPeek (which has an overhead when reading tokens)
  • Is a single type used by our Parser to read tokens.

The downside is that all parse_ methods now take a Vec<LexResult>, which requires collecting the lexer result into a Vec before parsing. Our micro-benchmarks show that this is slower than streaming the tokens one by one.

However, it turns out that both the linter and formatter both collect the tokens before parsing them to an AST. That means the path tested by the micro-benchmark isn't used in the actual product and the introduced regression doesn't affect users.

You may wonder why this is only a problem now but hasn't been a problem with lalrpop. The problem existed with lalrpop as well but to a much smaller extent because lalrpop separates the parser into two parts:

  • A state machine parametrized by I that consumes the tokens and calls into the parser methods (only passing the tokens)
  • The actual parsing methods

Only the state machine gets monomorphized, which is fine because it is limited in size. Another way to think about it is that the lalrpop pushes the tokens into the parser (and, therefore, the parser is decoupled from I). In contrast, the handwritten parser pulls the tokens (and, therefore, is generic over I).

Alternatives

Alternatives that may be less affected by the performance regression are

  • A TokenSource enum with two variants: One storing the lexed tokens and the other storing the lexer to consume the tokens lazily.
  • Storing a Box<impl Iterator> in the hand written parser

I went with the above approach because we don't need the flexibility of either lexing lazily or eagerly in Ruff outside the parser benchmark. Thus, using a Vec<LexResult> seemed the easiest solution.

CC: @LaBatata101

Copy link

codspeed-hq bot commented Jan 11, 2024

CodSpeed Performance Report

Merging #9466 will degrade performances by 6.76%

Comparing token-source (9d17580) with main (14d3fe6)

Summary

❌ 5 (👁 5) regressions
✅ 25 untouched benchmarks

Benchmarks breakdown

Benchmark main token-source Change
👁 parser[unicode/pypinyin.py] 4 ms 4.3 ms -6.76%
👁 parser[pydantic/types.py] 25.7 ms 27.3 ms -5.9%
👁 parser[numpy/globals.py] 1.1 ms 1.1 ms -4.87%
👁 parser[numpy/ctypeslib.py] 11.6 ms 12.2 ms -5.49%
👁 parser[large/dataset.py] 67.4 ms 70.8 ms -4.82%

@MichaReiser MichaReiser force-pushed the token-source branch 2 times, most recently from 6552632 to 7403d2d Compare January 11, 2024 12:08
Copy link
Contributor

ruff-ecosystem results

Linter (stable)

✅ ecosystem check detected no linter changes.

Linter (preview)

✅ ecosystem check detected no linter changes.

Formatter (stable)

✅ ecosystem check detected no format changes.

Formatter (preview)

✅ ecosystem check detected no format changes.

@MichaReiser MichaReiser changed the title Add TokenSource Remove type parameter from parse_* methods Jan 11, 2024
@MichaReiser MichaReiser added internal An internal refactor or improvement parser Related to the parser labels Jan 11, 2024
@MichaReiser MichaReiser marked this pull request as ready for review January 11, 2024 13:52
Copy link
Member

@charliermarsh charliermarsh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems reasonable to me given the explanation (especially around the perceived regression) in the summary.

Copy link
Member

@BurntSushi BurntSushi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I buy what you're selling.

let marker_token = (Tok::start_marker(mode), TextRange::default());
let lexer = std::iter::once(Ok(marker_token)).chain(lxr);
let lexer = std::iter::once(Ok(marker_token)).chain(TokenSource::new(tokens));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was trying to figure out why this is a dedicated type instead of a tokens.into_iter().filter(...) but couldn't come up with a reason. (Usually it's because you want to name the type somewhere, but maybe I'm missing that.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's mainly to have a named type in #9152 and a place where we can implement lookahead without using PeekMany

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah gotya, don't mind me then. :)

@LaBatata101
Copy link
Contributor

LGTM. I was actually going to something like this later, glad you saved me the work 😊

@MichaReiser MichaReiser merged commit f192c72 into main Jan 11, 2024
17 checks passed
@MichaReiser MichaReiser deleted the token-source branch January 11, 2024 18:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
internal An internal refactor or improvement parser Related to the parser
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants