Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Large compile time regression with new MatchTable combiner #66751

Closed
aemerson opened this issue Sep 19, 2023 · 12 comments · Fixed by #66864
Closed

Large compile time regression with new MatchTable combiner #66751

aemerson opened this issue Sep 19, 2023 · 12 comments · Fixed by #66864
Assignees

Comments

@aemerson
Copy link
Contributor

It seems we forgot to measure the compile time impact of the new MT combiner. When it was enabled for AArch64 in July it resulted in a large -Os 9% regression on CTMark/sqlite3 (and others too but I'm just using sqlite3 as a test case). Some time soon afterwards that regression went down to 3-4% but I haven't identified the commit responsible for that improvement.

However, even 3% is a very large CT regression.

The measurements were taken using a release + noasserts build of clang, without LTO/PGO on trunk.

Unfortunately some quick profiling doesn't show much except that ExecuteMatchTable() is taking most of the time in the combiner, which isn't surprising.

@aemerson
Copy link
Contributor Author

@Pierre-vh Could you take a look? Unfortunately I know we've removed the old combiner now so comparing might be hard.

@Pierre-vh
Copy link
Contributor

Sure, I can definitely take I look. I assume the regression improved after deleting the old combiner and refactoring the APIs, it made it so we don't re-instantiate the combiner every single inst.

Do you have some repro instructions? Something I can easily run to get the same numbers you have.
I'll compare against an older llvm release and work from there.

@aemerson
Copy link
Contributor Author

Thanks. This command should work to compile sqlite3 from the llvm test suite:

/path/to/clang -DNDEBUG  -fno-color-diagnostics -Os -arch arm64 -isysroot `xcrun --show-sdk-path --sdk macosx` -mmacosx-version-min=13.5   -w -Werror=date-time -DSTDC_HEADERS=1 -DHAVE_SYS_TYPES_H=1 -DHAVE_SYS_STAT_H=1 -DHAVE_STDLIB_H=1 -DHAVE_STRING_H=1 -DHAVE_MEMORY_H=1 -DHAVE_STRINGS_H=1 -DHAVE_INTTYPES_H=1 -DHAVE_STDINT_H=1 -DHAVE_UNISTD_H=1 -DSQLITE_OMIT_LOAD_EXTENSION=1 -DSQLITE_THREADSAFE=0 -o /dev/null -c /path/to/llvm-test-suite/CTMark/sqlite3/sqlite3.c -fglobal-isel

@Pierre-vh
Copy link
Contributor

Can I run this on linux without the isysroot/macos bits? Also was thing regression observed on a x86 CPU or Apple Silicon? Do you think I can also see it on x86 (still compiling for arm64 ofc)?

I'll first check which opcodes take the longest and look for optimizations there. I will also look for more general optimizations in the MatchTable machinery. I'm a bit surprised there was a regression at all -I assumed a state machine would be much faster, but it kind of makes sense as well because there's now quite a bit more indirection than before. e.g. predicates go through 2 function calls + a switch to be called.

@aemerson
Copy link
Contributor Author

Unfortunately I only have a Mac to do testing. This magnitude of regression is very likely reproducible on Intel machines though. However you will need to find an aarch64 Linux sysroot to provide clang so it can find the C headers. I'm on mobile right now but IIRC they're available online somewhere. @davemgreen do you happen to know where Pierre can get one?

@aemerson
Copy link
Contributor Author

Oh wait, I'll just give you an IR file you can run llc on...

@aemerson
Copy link
Contributor Author

Attached IR (tarballed due to github requirements). For AArch64, we fall back on a couple of functions so you'll need to do: llc -global-isel -global-isel-abort=2

sqlite3.bc.tar.gz

@Pierre-vh
Copy link
Contributor

I think some of this could be due to not having a GIM_SwitchOpcode. I have no idea why it's not being generated by the matchtable. Because of that it ends up running a bunch of try cases, running GIM_SwitchOpcode a hundred time

@Pierre-vh
Copy link
Contributor

@aemerson Can you please do a quick test on your machine with a llvm main build, and you just add the following to GICombinerEmitter::run and check if it helps CT?

void GICombinerEmitter::run(raw_ostream &OS) {
  InstructionOpcodeMatcher::initOpcodeValuesMap(Target);
  LLTOperandMatcher::initTypeIDValuesMap();

This makes the MatchTable generate a SwitchOpcode. I'll prep a diff for that tomorrow, in the meantime let me know if that helps a bit already

Profiling this turns out to be a bit more difficult than I thought. I'm trying to use llvm's Timer groups to get per-opcode timing

@aemerson
Copy link
Contributor Author

That made a huge difference! llc runtime for the test case improved by 15%. It's now faster than SDAG by ~9% too.

For clang on the initial sqlite3.c, total compile time improved by 6%, making it once again faster than SDAG.

Perhaps there's some more time to be found, the delta between SDAG and GISel seems very slightly larger before MT combiners than after, but that may be due to other changes that landed in clang in the mean time.

@Pierre-vh
Copy link
Contributor

I'll make a patch but not close the issue, I'll try to find some time to look into this again in the coming days/weeks, maybe there's other easy optimizations that can be done. A quick look shows most of the time is now spent in opcodes that execute the match and apply functions which is to be expected and the only way to address that is to switch more rules to MIR patterns I think.

Pierre-vh added a commit to Pierre-vh/llvm-project that referenced this issue Sep 20, 2023
The call to `initOpcodeValuesMap` was missing, causing the MatchTable to never emit a `SwitchMatcher`.
Also adds other code imported from `GlobalISelEmitter.cpp` to ensure rules are sorted by precedence as well.

Overall this improves GlobalISel performance by a noticeable amount.
See llvm#66751
@aemerson
Copy link
Contributor Author

I think it's fine to close this as the main issue's been resolved now. If we identify anything we can file a new one.

@aemerson aemerson linked a pull request Sep 20, 2023 that will close this issue
Pierre-vh added a commit that referenced this issue Sep 20, 2023
The call to `initOpcodeValuesMap` was missing, causing the MatchTable to
(unintentionally) not emit a `SwitchMatcher`. Also adds other code
imported from `GlobalISelEmitter.cpp` to ensure rules are sorted by
precedence as well.

Overall this improves GlobalISel compile-time performance by a
noticeable amount. See #66751
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants