Evaluate using LTO, Profile-Guided Optimization (PGO) and Post Link Optimization (PLO) #1334
Comments
Wow, this is a very thorough analysis, thank you for doing this! It looks like you get about 15% for each of the steps, which is very impressive. I would be curious what the impact on build times is. Especially with PGO/PLO, you have to build once with instrumentation and then a second time, so build times are at least 2x. Then there is the question of getting some halfway complete and realistic test cases to run on instrumented builds, which should include:
Curating a good corpus of tests with edge cases would be challenging, and so would hooking that up to the build. A side note purely on benchmarking: There is […] But yes, in the end you would want to build, instrument, and optimize the final HTTP server, so you would want to use that. Again, huge thanks for doing all these measurements. At the very least, LTO should provide us with a good 10-15% win with reasonable effort. The wins for PGO/PLO look amazing as well, but the effort to get there is a lot higher, so I’m not sure yet if it would be worth doing.
Is it possible to build once with instrumentation to produce a profile, and then build the release with the profile from a previous commit (and just disable PGO if there's none)? It would allow you to parallelize.
Yep. This could be mitigated by building heavily-optimized Docker images (LTO + PGO + PLO) not for every commit, but let's say every night or something like that. Of course, some changes to the current CI pipelines will be needed in this case. That's just an idea of how it can be implemented.
I agree. If you decide to not invest much in enabling PGO/PLO in your build scripts, I can suggest at least writing a note somewhere in the Symbolicator documentation about possible opportunities to increase the Symbolicator performance with PGO/PLO. In this case, users who want to optimize Symbolicator as much as possible for their use cases and are not afraid of building their Symbolicator binaries from the source code will be aware of such optimizations.
Yes, it's possible to do. Just be aware that with PGO profile caching you will face a "profile skew" problem: the PGO profile could be a bit outdated and fail to optimize some new code (since the profile was created for older source code). In general, this is not critical if the source code is not changing much (so there is no huge difference between the commits). Another possible trap is changing compiler options. If you change a compiler option or upgrade the compiler version, it's highly recommended to regenerate the PGO profiles, since they will very likely be incompatible.
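To make the caching trade-off concrete, here is a minimal sketch of how a build script might decide whether a cached profile is still usable. Everything in it is an assumption for illustration: the `ProfileMeta` layout, the `decide_pgo` helper, and the idea of limiting skew by counting commits are hypothetical and not part of Symbolicator or cargo-pgo.

```rust
/// Metadata stored next to a cached PGO profile (hypothetical layout).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ProfileMeta {
    /// Compiler version that produced the profile.
    pub rustc_version: String,
    /// Compiler flags used for the instrumented build.
    pub rustflags: String,
    /// How many commits the profile lags behind the commit being built.
    pub commits_behind: u32,
}

#[derive(Debug, PartialEq, Eq)]
pub enum PgoDecision {
    /// The cached profile matches the toolchain and is recent enough.
    UseCached,
    /// No usable profile: build a plain release without PGO.
    BuildWithoutPgo,
}

/// Reuse a cached profile only if the compiler version and flags are
/// unchanged and the profile is not too far behind the current commit.
pub fn decide_pgo(
    cached: Option<&ProfileMeta>,
    rustc_version: &str,
    rustflags: &str,
    max_commits_behind: u32,
) -> PgoDecision {
    match cached {
        Some(meta)
            if meta.rustc_version == rustc_version
                && meta.rustflags == rustflags
                && meta.commits_behind <= max_commits_behind =>
        {
            PgoDecision::UseCached
        }
        // Missing, incompatible, or stale profile: fall back.
        _ => PgoDecision::BuildWithoutPgo,
    }
}
```

A commit-count threshold is only a rough proxy for skew. A nightly instrumented run that refreshes the profile would keep `commits_behind` small in practice.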
We now build Symbolicator Docker images with 1 CGU (a single codegen unit) + LTO. Unfortunately, we needed to throw a beefier cloudbuild machine at it, and build times suffered a little bit in general as well. The results on CPU usage in production were underwhelming, though: I expected to see at least a tiny dent in the graphs, but nothing is visible. So I will call this closed. Thank you very much for doing the initial assessment on this 🎉
Hi!
Recently I evaluated the improvements from optimizations such as Link-Time Optimization (LTO), Profile-Guided Optimization (PGO) and Post-Link Optimization (PLO) on multiple projects. The results are available here. According to the tests, these optimizations can improve performance for many applications. I think trying to enable them for Symbolicator could be a good idea.
I already did some benchmarks and want to share my results here. Hopefully, they will be helpful.
Test environment
`master` branch on commit `ac127975e4649dc2f40b177cb98556e307aca26e`
Benchmark
For benchmarking, I used this WRK-based scenario for Minidump. As a minidump, I used this Linux dump. The WRK command is the same for all benchmarks and PGO/PLO training phases:
`WRK_MINIDUMP="../tests/fixtures/linux.dmp" ./wrk --threads 10 --connections 50 --duration 30s --script ../tests/wrk/minidump.lua http://127.0.0.1:3021/minidump`
Before each WRK benchmark, I run `cargo run -p process-event -- ../tests/fixtures/linux.dmp` once, as recommended.
All PGO and PLO optimizations are done with cargo-pgo (I highly recommend using this tool). For the PLO phase I use the LLVM BOLT tool.
LTO is enabled by the following changes to the `Release` profile in the root `Cargo.toml` file:
For all benchmarks, binaries are stripped with the `strip` tool.
All tests are done on the same machine, multiple times, with the same background "noise" (as far as I can guarantee, of course). The results are reproducible at least on my machine.
Tricky moment with PGO dumps
For some unknown reason, Symbolicator does not dump the PGO profile to disk on Ctrl+C. I guess it is related to custom signal handling somewhere in the code. So I modified Symbolicator slightly to dump the PGO profile to disk manually. As a reference implementation, I used this piece of code from YugabyteDB. I made the following changes to `main.rs`:
I use the `signal_hook` dependency. Please note that the `__llvm_profile_write_file` symbol is linked into the program only when you build it with PGO instrumentation (rustc does this automatically). Because of this, you need to disable or comment out this code during the PGO optimization phase (otherwise you get a link error).
I think there should be a better way to implement this logic, but for testing purposes it's good enough.
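One possible way to avoid commenting the code in and out, shown here as a minimal sketch rather than the change used for these tests: gate the dump behind a Cargo feature so the symbol is only referenced in instrumented builds. The feature name `pgo-instrument` and the `dump_pgo_profile` helper are hypothetical; only `__llvm_profile_write_file` comes from LLVM's profiling runtime.

```rust
// Hypothetical feature `pgo-instrument`: enable it only for the
// instrumented build, so optimized builds never reference the symbol.
#[cfg(feature = "pgo-instrument")]
unsafe extern "C" {
    // Provided by LLVM's profiling runtime; it is linked in only when
    // the binary is built with PGO instrumentation. Returns 0 on success.
    fn __llvm_profile_write_file() -> std::ffi::c_int;
}

/// Writes the PGO profile to disk in instrumented builds.
///
/// Returns `Some(status)` from the profiling runtime, or `None` when
/// the binary was built without the (hypothetical) feature.
pub fn dump_pgo_profile() -> Option<i32> {
    #[cfg(feature = "pgo-instrument")]
    {
        // SAFETY: the function takes no arguments and is provided by
        // the profiling runtime in instrumented builds.
        return Some(unsafe { __llvm_profile_write_file() });
    }
    #[cfg(not(feature = "pgo-instrument"))]
    {
        None
    }
}
```

The existing signal handling (for example the `signal_hook`-based code mentioned above) could then call `dump_pgo_profile()` unconditionally before shutdown. In a non-instrumented build the call compiles to a no-op that returns `None`, so no link error occurs.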
Results
Here I post the benchmark results for the following Symbolicator configurations:
Release:
Release + LTO:
Release + LTO + PGO optimized:
Release + LTO + PGO optimized + PLO optimized:
According to the tests above, I see measurable improvements from enabling LTO, PGO and PLO with LLVM BOLT.
Additionally, below are the results for the PGO and PLO instrumentation phases, so you can estimate the Symbolicator slowdown during instrumentation.
Release + LTO + PGO instrumentation:
Release + LTO + PGO optimized + PLO instrumentation:
Further steps
I can suggest the following action points:
Here are some examples of how PGO optimization is integrated in other projects:
`configure` script
I have some examples of how PGO information looks in the documentation:
Regarding LLVM BOLT integration, I have the following examples: