Large compile time regression with new MatchTable combiner #66751

aemerson · 2023-09-19T08:48:23Z

It seems we forgot to measure the compile time impact of the new MT combiner. When it was enabled for AArch64 in July it resulted in a large -Os 9% regression on CTMark/sqlite3 (and others too but I'm just using sqlite3 as a test case). Some time soon afterwards that regression went down to 3-4% but I haven't identified the commit responsible for that improvement.

However, even 3% is a very large CT regression.

The measurements were taken using a release + noasserts build of clang, without LTO/PGO on trunk.

Unfortunately some quick profiling doesn't show much except that ExecuteMatchTable() is taking most of the time in the combiner, which isn't surprising.

The text was updated successfully, but these errors were encountered:

aemerson · 2023-09-19T08:49:43Z

@Pierre-vh Could you take a look? Unfortunately I know we've removed the old combiner now so comparing might be hard.

Pierre-vh · 2023-09-19T09:08:01Z

Sure, I can definitely take I look. I assume the regression improved after deleting the old combiner and refactoring the APIs, it made it so we don't re-instantiate the combiner every single inst.

Do you have some repro instructions? Something I can easily run to get the same numbers you have.
I'll compare against an older llvm release and work from there.

aemerson · 2023-09-19T09:14:53Z

Thanks. This command should work to compile sqlite3 from the llvm test suite:

/path/to/clang -DNDEBUG  -fno-color-diagnostics -Os -arch arm64 -isysroot `xcrun --show-sdk-path --sdk macosx` -mmacosx-version-min=13.5   -w -Werror=date-time -DSTDC_HEADERS=1 -DHAVE_SYS_TYPES_H=1 -DHAVE_SYS_STAT_H=1 -DHAVE_STDLIB_H=1 -DHAVE_STRING_H=1 -DHAVE_MEMORY_H=1 -DHAVE_STRINGS_H=1 -DHAVE_INTTYPES_H=1 -DHAVE_STDINT_H=1 -DHAVE_UNISTD_H=1 -DSQLITE_OMIT_LOAD_EXTENSION=1 -DSQLITE_THREADSAFE=0 -o /dev/null -c /path/to/llvm-test-suite/CTMark/sqlite3/sqlite3.c -fglobal-isel

Pierre-vh · 2023-09-19T09:24:23Z

Can I run this on linux without the isysroot/macos bits? Also was thing regression observed on a x86 CPU or Apple Silicon? Do you think I can also see it on x86 (still compiling for arm64 ofc)?

I'll first check which opcodes take the longest and look for optimizations there. I will also look for more general optimizations in the MatchTable machinery. I'm a bit surprised there was a regression at all -I assumed a state machine would be much faster, but it kind of makes sense as well because there's now quite a bit more indirection than before. e.g. predicates go through 2 function calls + a switch to be called.

aemerson · 2023-09-19T09:32:39Z

Unfortunately I only have a Mac to do testing. This magnitude of regression is very likely reproducible on Intel machines though. However you will need to find an aarch64 Linux sysroot to provide clang so it can find the C headers. I'm on mobile right now but IIRC they're available online somewhere. @davemgreen do you happen to know where Pierre can get one?

aemerson · 2023-09-19T09:35:35Z

Oh wait, I'll just give you an IR file you can run llc on...

aemerson · 2023-09-19T09:46:40Z

Attached IR (tarballed due to github requirements). For AArch64, we fall back on a couple of functions so you'll need to do: llc -global-isel -global-isel-abort=2

sqlite3.bc.tar.gz

Pierre-vh · 2023-09-19T14:57:54Z

I think some of this could be due to not having a GIM_SwitchOpcode. I have no idea why it's not being generated by the matchtable. Because of that it ends up running a bunch of try cases, running GIM_SwitchOpcode a hundred time

Pierre-vh · 2023-09-19T15:02:38Z

@aemerson Can you please do a quick test on your machine with a llvm main build, and you just add the following to GICombinerEmitter::run and check if it helps CT?

void GICombinerEmitter::run(raw_ostream &OS) {
  InstructionOpcodeMatcher::initOpcodeValuesMap(Target);
  LLTOperandMatcher::initTypeIDValuesMap();

This makes the MatchTable generate a SwitchOpcode. I'll prep a diff for that tomorrow, in the meantime let me know if that helps a bit already

Profiling this turns out to be a bit more difficult than I thought. I'm trying to use llvm's Timer groups to get per-opcode timing

aemerson · 2023-09-20T02:19:14Z

That made a huge difference! llc runtime for the test case improved by 15%. It's now faster than SDAG by ~9% too.

For clang on the initial sqlite3.c, total compile time improved by 6%, making it once again faster than SDAG.

Perhaps there's some more time to be found, the delta between SDAG and GISel seems very slightly larger before MT combiners than after, but that may be due to other changes that landed in clang in the mean time.

Pierre-vh · 2023-09-20T07:03:58Z

I'll make a patch but not close the issue, I'll try to find some time to look into this again in the coming days/weeks, maybe there's other easy optimizations that can be done. A quick look shows most of the time is now spent in opcodes that execute the match and apply functions which is to be expected and the only way to address that is to switch more rules to MIR patterns I think.

The call to `initOpcodeValuesMap` was missing, causing the MatchTable to never emit a `SwitchMatcher`. Also adds other code imported from `GlobalISelEmitter.cpp` to ensure rules are sorted by precedence as well. Overall this improves GlobalISel performance by a noticeable amount. See llvm#66751

aemerson · 2023-09-20T07:30:22Z

I think it's fine to close this as the main issue's been resolved now. If we identify anything we can file a new one.

The call to `initOpcodeValuesMap` was missing, causing the MatchTable to (unintentionally) not emit a `SwitchMatcher`. Also adds other code imported from `GlobalISelEmitter.cpp` to ensure rules are sorted by precedence as well. Overall this improves GlobalISel compile-time performance by a noticeable amount. See #66751

aemerson added the llvm:globalisel label Sep 19, 2023

aemerson assigned aemerson and Pierre-vh and unassigned aemerson Sep 19, 2023

Pierre-vh mentioned this issue Sep 20, 2023

[TableGen][GlobalISel] Use GIM_SwitchOpcode in Combiners #66864

Merged

aemerson closed this as completed Sep 20, 2023

aemerson linked a pull request Sep 20, 2023 that will close this issue

[TableGen][GlobalISel] Use GIM_SwitchOpcode in Combiners #66864

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Large compile time regression with new MatchTable combiner #66751

Large compile time regression with new MatchTable combiner #66751

aemerson commented Sep 19, 2023

aemerson commented Sep 19, 2023

Pierre-vh commented Sep 19, 2023

aemerson commented Sep 19, 2023

Pierre-vh commented Sep 19, 2023

aemerson commented Sep 19, 2023

aemerson commented Sep 19, 2023

aemerson commented Sep 19, 2023

Pierre-vh commented Sep 19, 2023

Pierre-vh commented Sep 19, 2023

aemerson commented Sep 20, 2023

Pierre-vh commented Sep 20, 2023

aemerson commented Sep 20, 2023

Large compile time regression with new MatchTable combiner #66751

Large compile time regression with new MatchTable combiner #66751

Comments

aemerson commented Sep 19, 2023

aemerson commented Sep 19, 2023

Pierre-vh commented Sep 19, 2023

aemerson commented Sep 19, 2023

Pierre-vh commented Sep 19, 2023

aemerson commented Sep 19, 2023

aemerson commented Sep 19, 2023

aemerson commented Sep 19, 2023

Pierre-vh commented Sep 19, 2023

Pierre-vh commented Sep 19, 2023

aemerson commented Sep 20, 2023

Pierre-vh commented Sep 20, 2023

aemerson commented Sep 20, 2023