
Speed up type literal lexing and make it more strict. #4430

Merged: 4 commits into carbon-language:trunk on Oct 24, 2024

Conversation

chandlerc (Contributor)

This rejects type literals with more digits than we can lex without APInt's help, using a custom diagnostic. This is a fairly arbitrary implementation limit; I'm wide open to even stricter rules here.

Despite no special casing and a very simplistic approach, avoiding APInt completely eliminates the lexing overhead for `i32` in the generated compilation benchmark, where that specific type literal is very common. We see a roughly 10% improvement in lexing there:

```
BM_CompileAPIFileDenseDecls<Phase::Lex>/256        39.0µs ± 4%  34.8µs ± 2%  -10.86%  (p=0.000 n=19+20)
BM_CompileAPIFileDenseDecls<Phase::Lex>/1024        180µs ± 1%   158µs ± 2%  -12.22%  (p=0.000 n=18+20)
BM_CompileAPIFileDenseDecls<Phase::Lex>/4096        731µs ± 2%   641µs ± 1%  -12.31%  (p=0.000 n=18+20)
BM_CompileAPIFileDenseDecls<Phase::Lex>/16384      3.20ms ± 2%  2.86ms ± 2%  -10.47%  (p=0.000 n=18+19)
BM_CompileAPIFileDenseDecls<Phase::Lex>/65536      13.8ms ± 1%  12.4ms ± 2%   -9.78%  (p=0.000 n=18+19)
BM_CompileAPIFileDenseDecls<Phase::Lex>/262144     64.0ms ± 2%  58.4ms ± 2%   -8.70%  (p=0.000 n=19+18)
```

This starts to fix a TODO in the diagnostic for these by giving a reasonably good diagnostic for a very large type literal. However, in practice it regresses the diagnostics, because error tokens currently produce noisy extraneous diagnostics from parse and check. I'm leaving the TODO in place, and I have a follow-up PR to start improving the extraneous diagnostics.
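To make the fast path concrete, here is a minimal sketch of the idea, assuming a cap on the digit count so the accumulated value always fits a native integer. The function name and the 18-digit limit are illustrative stand-ins, not the PR's actual code:

```cpp
#include <cstddef>
#include <cstdint>

#include "llvm/ADT/StringRef.h"

// Illustrative only: checks whether the digits of a type literal suffix
// (e.g. the `32` in `i32`) are few enough to accumulate into a plain
// uint64_t, letting the lexer skip APInt entirely on the common path.
static auto TryLexTypeLiteralWidth(llvm::StringRef digits, uint64_t& value)
    -> bool {
  // 18 decimal digits can never overflow a uint64_t (max value 10^18 - 1).
  constexpr size_t MaxDigits = 18;
  if (digits.empty() || digits.size() > MaxDigits) {
    return false;
  }
  value = 0;
  for (char c : digits) {
    if (c < '0' || c > '9') {
      return false;
    }
    value = value * 10 + static_cast<uint64_t>(c - '0');
  }
  return true;
}
```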

```cpp
llvm::StringRef suffix = word.substr(1);
if (!CanLexInt(emitter_, suffix)) {
```
Contributor
Looking at other callers of CanLexInt, should string literals be sharing the logic? I.e., should CanLexInt be factored into just numeric literals?

Contributor Author
Maybe? I'm happy to look at those next?

Contributor
SG

@jonmeow (Contributor), Oct 24, 2024
Actually, thinking about this further... maybe you should look at it here, because you're taking a shared function and implementing a specialized version, rather than continuing to share?

Contributor Author
As discussed, happy to look at this, but I'd prefer to do it in a follow-up.

@chandlerc (Contributor Author) left a comment
Thanks, PTAL!


@chandlerc requested a review from jonmeow on October 24, 2024 at 07:17
@jonmeow (Contributor) left a comment
Approving, because I expect you're going to be of a different mind about the importance of sharing code.

@chandlerc (Contributor Author)
> Approving, because I expect you're going to be of a different mind about the importance of sharing code.

Thanks, will look at the other things in follow-ups!

@chandlerc (Contributor Author)
:sigh: Looks like `std::from_chars` isn't available in Clang 16's libc++ -- OK for me to go back to the loop?

@chandlerc (Contributor Author)
I've restored the for loop for now, but maybe there is another approach? Nothing really simple comes to mind, and the loop is at least somewhat simple...

FWIW, it appears to be a bit faster to use the loop on x86. (About the same on Arm for me.)
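For context, the two approaches being weighed look roughly like this. This is an illustrative, self-contained sketch (not the PR's code), and it assumes the caller has already bounded the digit count so the loop can't overflow:

```cpp
#include <charconv>
#include <cstdint>
#include <string_view>

// std::from_chars: concise, but per the comment above it wasn't usable with
// Clang 16's libc++ in this build.
auto ParseWithFromChars(std::string_view digits, uint64_t& value) -> bool {
  auto [ptr, ec] =
      std::from_chars(digits.data(), digits.data() + digits.size(), value);
  return ec == std::errc() && ptr == digits.data() + digits.size();
}

// Hand-written loop: portable, and per the measurements above at least as
// fast on x86 for the short digit strings the lexer sees here. Assumes the
// digit count was already capped so `value` can't overflow.
auto ParseWithLoop(std::string_view digits, uint64_t& value) -> bool {
  if (digits.empty()) {
    return false;
  }
  value = 0;
  for (char c : digits) {
    if (c < '0' || c > '9') {
      return false;
    }
    value = value * 10 + static_cast<uint64_t>(c - '0');
  }
  return true;
}
```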

@chandlerc requested a review from jonmeow on October 24, 2024 at 21:37
@chandlerc added this pull request to the merge queue on Oct 24, 2024
Merged via the queue into carbon-language:trunk with commit 577fda1 Oct 24, 2024
8 checks passed
@chandlerc deleted the fast-ints branch on October 24, 2024 at 22:04
github-merge-queue bot pushed a commit that referenced this pull request on Nov 2, 2024:
Diagnosing an invalid parse caused by an error token isn't likely to be helpful, since the lexer will already have diagnosed that token. A common case to start handling is when the parser encounters an invalid token while expecting an expression.

This removes a number of unhelpful diagnostics after the lexer has done
a good job diagnosing.

This also means that there may be parse tree errors that aren't
diagnosed when there are lexer-diagnosed errors, so track that.

Follow-up to #4430 that almost finishes addressing its diagnostic TODO.
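A self-contained sketch of that idea, with hypothetical types standing in for the actual Carbon parser and lexer structures (an illustration of the approach described above, not the follow-up's real code):

```cpp
#include <cstddef>
#include <vector>

// Hypothetical token model: the lexer marks tokens it already diagnosed.
enum class TokenKind { Identifier, IntTypeLiteral, Error };

struct Token {
  TokenKind kind;
};

struct Parser {
  std::vector<Token> tokens;
  size_t position = 0;
  // Tracks that the parse tree contains errors that were diagnosed during
  // lexing, even though the parser itself emitted no diagnostic for them.
  bool has_errors_diagnosed_in_lex = false;

  // Called where the grammar requires an expression.
  void ExpectExpression() {
    if (tokens[position].kind == TokenKind::Error) {
      // The lexer already diagnosed this token (e.g. an over-long type
      // literal); another "expected expression" error would just be noise.
      has_errors_diagnosed_in_lex = true;
      ++position;
      return;
    }
    // Otherwise the token is valid but unexpected, so diagnose as usual.
    EmitExpectedExpressionDiagnostic(tokens[position]);
    ++position;
  }

  void EmitExpectedExpressionDiagnostic(const Token& /*token*/) {
    // Diagnostic emission elided in this sketch.
  }
};
```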