Large compile time regression with new MatchTable combiner #66751
Comments
@Pierre-vh Could you take a look? Unfortunately I know we've removed the old combiner now so comparing might be hard. |
Sure, I can definitely take a look. I assume the regression improved after deleting the old combiner and refactoring the APIs, since that made it so we don't re-instantiate the combiner for every single instruction. Do you have some repro instructions? Something I can easily run to get the same numbers you have. |
Thanks. This command should work to compile sqlite3 from the llvm test suite:
|
Can I run this on Linux without the isysroot/macOS bits? Also, was this regression observed on an x86 CPU or Apple Silicon? Do you think I can also see it on x86 (still compiling for arm64 ofc)? I'll first check which opcodes take the longest and look for optimizations there. I will also look for more general optimizations in the MatchTable machinery. I'm a bit surprised there was a regression at all. I assumed a state machine would be much faster, but it kind of makes sense as well because there's now quite a bit more indirection than before, e.g. predicates go through 2 function calls + a switch to be called. |
Unfortunately I only have a Mac to do testing. This magnitude of regression is very likely reproducible on Intel machines though. However you will need to find an aarch64 Linux sysroot to provide clang so it can find the C headers. I'm on mobile right now but IIRC they're available online somewhere. @davemgreen do you happen to know where Pierre can get one? |
Oh wait, I'll just give you an IR file you can run llc on... |
Attached IR (tarballed due to github requirements). For AArch64, we fall back on a couple of functions so you'll need to do: |
I think some of this could be due to not having a GIM_SwitchOpcode. I have no idea why it's not being generated by the matchtable. Because of that it ends up running a bunch of try cases, running GIM_CheckOpcode a hundred times. |
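To make the cost described above concrete, here is a toy model in Python, not LLVM's actual MatchTable interpreter. The rule names and the `RULES` list are hypothetical, and the "check count" only stands in for the work an interpreter does when every rule is wrapped in its own try block with its own opcode check, compared with a single switch on the root opcode.

```python
from collections import defaultdict

# Hypothetical rules as (root opcode, rule name); these are NOT real LLVM combines.
RULES = [
    ("G_ADD", "add_zero"),
    ("G_MUL", "mul_one"),
    ("G_SHL", "shl_zero"),
    ("G_ADD", "add_reassoc"),
]


def match_linear(opcode, rules):
    """Try every rule in order; each attempt pays for its own opcode check."""
    checks = 0
    matched = []
    for root, name in rules:
        checks += 1  # one opcode comparison per rule, like a chain of try cases
        if root == opcode:
            matched.append(name)
    return matched, checks


def build_switch(rules):
    """Group rules by root opcode once, when the table is built."""
    table = defaultdict(list)
    for root, name in rules:
        table[root].append(name)
    return dict(table)


def match_switch(opcode, table):
    """Dispatch once on the opcode, then visit only the rules that can apply."""
    return table.get(opcode, []), 1


if __name__ == "__main__":
    print(match_linear("G_ADD", RULES))
    print(match_switch("G_ADD", build_switch(RULES)))
```

In this sketch the linear version performs one check per rule for every instruction, while the switch version performs one dispatch regardless of how many rules exist. With on the order of a hundred rules, that difference is paid on every instruction the combiner visits.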
@aemerson Can you please do a quick test on your machine with an LLVM main build, where you just add the following to
This makes the MatchTable generate a SwitchOpcode. I'll prep a diff for that tomorrow; in the meantime, let me know if that helps a bit already. Profiling this turns out to be a bit more difficult than I thought. I'm trying to use LLVM's Timer groups to get per-opcode timing. |
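As a rough illustration of what per-opcode timing means, the sketch below accumulates wall-clock time per opcode around each handler call. It is a Python stand-in and does not use LLVM's Timer API; `OpcodeTimer` and the handler passed to it are hypothetical names introduced only for this example.

```python
import time
from collections import defaultdict


class OpcodeTimer:
    """Accumulates elapsed time and call counts per opcode name."""

    def __init__(self):
        self.total = defaultdict(float)  # seconds spent per opcode
        self.count = defaultdict(int)    # number of executions per opcode

    def run(self, opcode, handler, *args):
        """Run handler(*args) and charge the elapsed time to `opcode`."""
        start = time.perf_counter()
        try:
            return handler(*args)
        finally:
            self.total[opcode] += time.perf_counter() - start
            self.count[opcode] += 1

    def report(self):
        """Return (opcode, seconds) pairs, most expensive first."""
        return sorted(self.total.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    timer = OpcodeTimer()
    for value in range(1000):
        timer.run("GIM_CheckOpcode", lambda x: x == 42, value)
    for opcode, seconds in timer.report():
        print(opcode, timer.count[opcode], seconds)
```

One caveat with this kind of instrumentation is that the timer calls themselves can cost more than a cheap opcode, so the counts are often more trustworthy than the absolute times.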
That made a huge difference! For clang on the initial sqlite3.c, total compile time improved by 6%, making it once again faster than SDAG. Perhaps there's some more time to be found: the delta between SDAG and GISel seems very slightly larger before MT combiners than after, but that may be due to other changes that landed in clang in the meantime. |
I'll make a patch but not close the issue. I'll try to find some time to look into this again in the coming days/weeks; maybe there are other easy optimizations that can be done. A quick look shows most of the time is now spent in opcodes that execute the |
The call to `initOpcodeValuesMap` was missing, causing the MatchTable to never emit a `SwitchMatcher`. Also adds other code imported from `GlobalISelEmitter.cpp` to ensure rules are sorted by precedence as well. Overall this improves GlobalISel performance by a noticeable amount. See llvm#66751
I think it's fine to close this as the main issue's been resolved now. If we identify anything we can file a new one. |
The call to `initOpcodeValuesMap` was missing, causing the MatchTable to (unintentionally) not emit a `SwitchMatcher`. Also adds other code imported from `GlobalISelEmitter.cpp` to ensure rules are sorted by precedence as well. Overall this improves GlobalISel compile-time performance by a noticeable amount. See #66751
It seems we forgot to measure the compile time impact of the new MT combiner. When it was enabled for AArch64 in July it resulted in a large 9% regression at -Os on CTMark/sqlite3 (and others too, but I'm just using sqlite3 as a test case). Some time soon afterwards that regression went down to 3-4%, but I haven't identified the commit responsible for that improvement.
However, even 3% is a very large CT regression.
The measurements were taken using a release + noasserts build of clang, without LTO/PGO on trunk.
Unfortunately, some quick profiling doesn't show much except that `ExecuteMatchTable()` is taking most of the time in the combiner, which isn't surprising.