Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speed up identifier detection #89

Closed
wants to merge 1 commit into from
Closed

Conversation

jserv
Copy link
Contributor

@jserv jserv commented Nov 16, 2024

Both 'is_ident1' and 'is_ident2' are now macros instead of function calls, and they are tweaked for ASCII detection in advance with the fallback to table lookup for non-ASCII characters.

@fuhsnn
Copy link
Owner

fuhsnn commented Nov 16, 2024

Thank you for the patches. Honestly performance hasn't been a priority, I consider the lexer less significant in the total pipeline compared to cache-misses from leaking and the overhead of external assembler. I do have some ideas as well, so a branch for performance experiments should make sense.

In terms of performance I want to ensure a common basis to work on: which projects to benchmark and what kind of environment, is it self-hosted, another simple C compiler or full-on gcc/clang optimizations with custom allocator etc. Do you have a preference?

@fuhsnn
Copy link
Owner

fuhsnn commented Nov 16, 2024

On the pull request: I think the short-circuiting of ASCII characters can be done earlier, skipping decode_utf8() altogether.

@jserv
Copy link
Contributor Author

jserv commented Nov 16, 2024

In terms of performance I want to ensure a common basis to work on: which projects to benchmark and what kind of environment, is it self-hosted, another simple C compiler or full-on gcc/clang optimizations with custom allocator etc.

Using uftrace, I identified the performance bottlenecks in the compilation process.

Here's what I found:
First, apply the following changes:

--- a/GNUmakefile
+++ b/GNUmakefile
@@ -1,4 +1,4 @@
-CFLAGS=-std=c99 -g -fno-common -Wall -pedantic -Wno-switch
+CFLAGS=-std=c99 -g -pg -fno-common -Wall -pedantic -Wno-switch
 
 SRCS=$(wildcard *.c)
 OBJS=$(SRCS:.c=.o)

After running uftrace ./slimcc -c -o stage2/parse.o parse.c, the trace showed the following call stack:

Copy[6] is_ident2
[5] read_ident
[4] tokenize
[3] tokenize_file
[2] must_tokenize_file
[1] cc1
[0] main

Based on this analysis, I tweaked the is_ident1 and is_ident2 functions.

@jserv
Copy link
Contributor Author

jserv commented Nov 16, 2024

By the way, I am working with my students to enhance the shecc project, a self-compiling C compiler with some optimizations. Since shecc only implements a subset of C, we are planning to modify slimcc to output compatible C code that shecc can process, rather than having slimcc generate x86-64 assembly directly. This way, slimcc would serve as a full C language frontend for shecc, while allowing shecc to focus on optimization and backend support. To achieve this, we plan to contribute improvements to slimcc's parsing, tokenization, and preprocessing components.

@fuhsnn
Copy link
Owner

fuhsnn commented Nov 16, 2024

I'm thrilled to know that the project has some real world value and am happy to target different backends as long as the IR is stable, but I imagine memory usage will be a major pain point on embedded targets that shecc currently self-hosts on. I thought about re-architecting, but then the code wouldn't be as accessible and one might as well work on other more optimized C frontend like tcc, cproc, tyfkda/xcc, libfirm/cparser etc.

If C++ is allowed in your classroom, I do have an unfinished C++ port sitting around that use RAII for nearly identical code with much better memory consumption, a minimal C++ front-end bootstrappable with C99 is also something I've always wanted to work on.

@fuhsnn
Copy link
Owner

fuhsnn commented Nov 16, 2024

The C++ version is uploaded in partial_c++_port branch, it's less C++ than I remembered and still able to build as C with just a little macro and typedef's.

If that particular style is okay to you, I'd prefer basing performance optimizations on that version, since it feels a bit bike shedding to optimize parsing while compiling larger projects hit OOM on an embedded device.

@jserv
Copy link
Contributor Author

jserv commented Nov 22, 2024

I defer this pull request to @ChAoSUnItY, who is assigned to proceed with the above research prototype under my supervision.

@fuhsnn
Copy link
Owner

fuhsnn commented Nov 23, 2024

I believe identifier detection is of decent speed after efcbb85, feel free to open new issues regarding the C backend.

Btw thanks for mentioning uftrace, logging exact call counts is incredible. Can't believe it's not more well known.

@ChAoSUnItY
Copy link

I've adopted identifier reading logic from efcbb85 and further refine it with the macro functions introduced previously in this PR, and also applied fool-proofing logic to the function. This way, it improves readability.

@fuhsnn fuhsnn force-pushed the main branch 5 times, most recently from 7c97de8 to ed6cfec Compare November 27, 2024 09:35
@jserv
Copy link
Contributor Author

jserv commented Nov 28, 2024

Let's rebase the latest main branch and squash the commits.

@ChAoSUnItY ChAoSUnItY force-pushed the speedup-is-ident branch 2 times, most recently from 644be33 to 4b34032 Compare November 28, 2024 06:00
@fuhsnn fuhsnn force-pushed the main branch 10 times, most recently from f752182 to 6dae514 Compare November 29, 2024 04:33
@fuhsnn fuhsnn force-pushed the main branch 3 times, most recently from d4283bc to c10050f Compare November 29, 2024 08:02
@ChAoSUnItY ChAoSUnItY force-pushed the speedup-is-ident branch 2 times, most recently from 97719ae to 8e22d62 Compare November 30, 2024 06:23
Both 'is_ident1' and 'is_ident2' are now macros instead of function
calls, and they are tweaked for ASCII detection in advance with the
fallback to table lookup for non-ASCII characters.

Also discard unnecessary variable and add safe guard:
- Variable is_first can be replaced with boolean expression "p == start"
- Add safe guard for ascii character checking to ensure starting identifier
  character must not be numberic

Additionally, replace first ascii character check with macro
is_ident2_ascii to keep readability. The later is_ident2 function call
is replaced with is_ident2_non_ascii because the expanded macro
function will result in multiple decode_utf8 function call, also
it's redundant to check if it's an ascii character or not.

Co-authored-by:  Jim Huang <[email protected]>
@fuhsnn fuhsnn force-pushed the main branch 7 times, most recently from f093a17 to bfe2374 Compare December 6, 2024 00:44
@fuhsnn
Copy link
Owner

fuhsnn commented Dec 29, 2024

is_first is indeed redundant, I'll change that.

The rest is overwhelmingly similar to enh/chibicc@9473229

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants