
Speed up type literal lexing and make it more strict. #4430

Merged: 4 commits into carbon-language:trunk on Oct 24, 2024

Conversation

chandlerc (Contributor)

This rejects type literals with more digits than we can lex without APInt's help, using a custom diagnostic. This is a fairly arbitrary implementation limit; I'm wide open to even stricter rules here.

Despite no special casing and a very simplistic approach, avoiding APInt completely eliminates the lexing overhead for `i32` in the generated compilation benchmark, where that specific type literal is very common. We see a roughly 10% improvement in lexing there:

```
BM_CompileAPIFileDenseDecls<Phase::Lex>/256        39.0µs ± 4%  34.8µs ± 2%  -10.86%  (p=0.000 n=19+20)
BM_CompileAPIFileDenseDecls<Phase::Lex>/1024        180µs ± 1%   158µs ± 2%  -12.22%  (p=0.000 n=18+20)
BM_CompileAPIFileDenseDecls<Phase::Lex>/4096        731µs ± 2%   641µs ± 1%  -12.31%  (p=0.000 n=18+20)
BM_CompileAPIFileDenseDecls<Phase::Lex>/16384      3.20ms ± 2%  2.86ms ± 2%  -10.47%  (p=0.000 n=18+19)
BM_CompileAPIFileDenseDecls<Phase::Lex>/65536      13.8ms ± 1%  12.4ms ± 2%   -9.78%  (p=0.000 n=18+19)
BM_CompileAPIFileDenseDecls<Phase::Lex>/262144     64.0ms ± 2%  58.4ms ± 2%   -8.70%  (p=0.000 n=19+18)
```

This starts to fix a TODO in the diagnostic for these by giving a reasonably good diagnostic for a very large type literal. However, in practice it regresses the diagnostics, because error tokens currently produce noisy extraneous diagnostics from parse and check. I'm leaving the TODO in place, and I have a follow-up PR to start improving the extraneous diagnostics.
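To make the fast path concrete, here is a minimal sketch of the idea, assuming a cap on the digit count so the accumulated value always fits a native integer. The function name and the 18-digit limit are illustrative stand-ins, not the PR's actual code:

```cpp
#include <cstddef>
#include <cstdint>

#include "llvm/ADT/StringRef.h"

// Illustrative only: checks whether the digits of a type literal suffix
// (e.g. the `32` in `i32`) are few enough to accumulate into a plain
// uint64_t, letting the lexer skip APInt entirely on the common path.
static auto TryLexTypeLiteralWidth(llvm::StringRef digits, uint64_t& value)
    -> bool {
  // 18 decimal digits can never overflow a uint64_t (max value 10^18 - 1).
  constexpr size_t MaxDigits = 18;
  if (digits.empty() || digits.size() > MaxDigits) {
    return false;
  }
  value = 0;
  for (char c : digits) {
    if (c < '0' || c > '9') {
      return false;
    }
    value = value * 10 + static_cast<uint64_t>(c - '0');
  }
  return true;
}
```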

```cpp
llvm::StringRef suffix = word.substr(1);
if (!CanLexInt(emitter_, suffix)) {
```
Contributor
Looking at other callers of CanLexInt, should string literals be sharing the logic? I.e., should CanLexInt be factored into just numeric literals?

Contributor Author
Maybe? I'm happy to look at those next?

Contributor
SG

@jonmeow (Contributor), Oct 24, 2024
Actually, thinking about this further... maybe you should look at it here, because you're taking a shared function and implementing a specialized version, rather than continuing to share?

Contributor Author
As discussed, happy to look at this, but I'd prefer to do it in a follow-up.

@chandlerc (Contributor Author) left a comment
Thanks, PTAL!


@chandlerc requested a review from jonmeow on October 24, 2024 at 07:17
@jonmeow (Contributor) left a comment
Approving, because I expect you're going to be of a different mind about the importance of sharing code.

@chandlerc (Contributor Author)
> Approving, because I expect you're going to be of a different mind about the importance of sharing code.

Thanks, will look at the other things in follow-ups!

@chandlerc (Contributor Author)
:sigh: Looks like `std::from_chars` isn't available in Clang 16's libc++ -- OK for me to go back to the loop?

@chandlerc (Contributor Author)
I've restored the for loop for now, but maybe there is another approach? Nothing really simple comes to mind, and the loop is at least somewhat simple...

FWIW, it appears to be a bit faster to use the loop on x86. (About the same on Arm for me.)
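For context, the two approaches being weighed look roughly like this. This is an illustrative, self-contained sketch (not the PR's code), and it assumes the caller has already bounded the digit count so the loop can't overflow:

```cpp
#include <charconv>
#include <cstdint>
#include <string_view>

// std::from_chars: concise, but per the comment above it wasn't usable with
// Clang 16's libc++ in this build.
auto ParseWithFromChars(std::string_view digits, uint64_t& value) -> bool {
  auto [ptr, ec] =
      std::from_chars(digits.data(), digits.data() + digits.size(), value);
  return ec == std::errc() && ptr == digits.data() + digits.size();
}

// Hand-written loop: portable, and per the measurements above at least as
// fast on x86 for the short digit strings the lexer sees here. Assumes the
// digit count was already capped so `value` can't overflow.
auto ParseWithLoop(std::string_view digits, uint64_t& value) -> bool {
  if (digits.empty()) {
    return false;
  }
  value = 0;
  for (char c : digits) {
    if (c < '0' || c > '9') {
      return false;
    }
    value = value * 10 + static_cast<uint64_t>(c - '0');
  }
  return true;
}
```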

@chandlerc requested a review from jonmeow on October 24, 2024 at 21:37
@chandlerc added this pull request to the merge queue on Oct 24, 2024
Merged via the queue into carbon-language:trunk with commit 577fda1 Oct 24, 2024
8 checks passed
@chandlerc deleted the fast-ints branch on October 24, 2024 at 22:04
github-merge-queue bot pushed a commit that referenced this pull request on Nov 2, 2024:
Diagnosing an invalid parse caused by an error token isn't likely to be helpful, since the lexer will already have diagnosed that token. A common case to start handling is when the parser encounters an invalid token while expecting an expression.

This removes a number of unhelpful diagnostics after the lexer has done
a good job diagnosing.

This also means that there may be parse tree errors that aren't
diagnosed when there are lexer-diagnosed errors, so track that.

Follow-up to #4430 that almost finishes addressing its diagnostic TODO.
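A self-contained sketch of that idea, with hypothetical types standing in for the actual Carbon parser and lexer structures (an illustration of the approach described above, not the follow-up's real code):

```cpp
#include <cstddef>
#include <vector>

// Hypothetical token model: the lexer marks tokens it already diagnosed.
enum class TokenKind { Identifier, IntTypeLiteral, Error };

struct Token {
  TokenKind kind;
};

struct Parser {
  std::vector<Token> tokens;
  size_t position = 0;
  // Tracks that the parse tree contains errors that were diagnosed during
  // lexing, even though the parser itself emitted no diagnostic for them.
  bool has_errors_diagnosed_in_lex = false;

  // Called where the grammar requires an expression.
  void ExpectExpression() {
    if (tokens[position].kind == TokenKind::Error) {
      // The lexer already diagnosed this token (e.g. an over-long type
      // literal); another "expected expression" error would just be noise.
      has_errors_diagnosed_in_lex = true;
      ++position;
      return;
    }
    // Otherwise the token is valid but unexpected, so diagnose as usual.
    EmitExpectedExpressionDiagnostic(tokens[position]);
    ++position;
  }

  void EmitExpectedExpressionDiagnostic(const Token& /*token*/) {
    // Diagnostic emission elided in this sketch.
  }
};
```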