Performance regression on long methods with big lookup switches #412
@thomaswue @axel22 I see that the regression is still the same with rc2... Should I attach ASM listings of the hottest code, captured with JMH's perfasm profiler? Would that help to move this forward?
Thanks for testing again. @axel22 What is your current assessment?
I am currently looking at the inliner correctness for Stream programs like the following, and trying to ensure that we get loop-equivalent performance:
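(The original snippet was not preserved in this thread; the sketch below is only an illustrative stand-in for the kind of Stream pipeline meant, using java.util.stream from Scala, together with the hand-written loop it should match in performance.)

import java.util.stream.IntStream

object StreamVsLoop {
  // Stream version: the goal is that this compiles down to code as fast as
  // the hand-written loop below (illustrative example, not the original one).
  def sumOfSquaresStream(n: Int): Long =
    IntStream.range(0, n).mapToLong(i => i.toLong * i).sum()

  // Equivalent hand-written loop: the performance baseline.
  def sumOfSquaresLoop(n: Int): Long = {
    var acc = 0L
    var i = 0
    while (i < n) { acc += i.toLong * i; i += 1 }
    acc
  }
}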
I am also trying to make progress on creating a new economy compilation tier for Truffle, which should ensure faster warmup. Because of that, I have so far been unable to make progress on this issue.
OK. It would be good to just check whether we can reproduce this and assess what type of issue it is.
I doubt that our performance problem is in how we compile that switch statement, but you might revisit the design of that code if you think it matters to your overall performance. Hashing the whole string first and then switching on that value seems expensive to me, particularly when you still have to check the string for equality after the match. Nested switches based on common prefixes seem like a more performant design. Given the expense of creating the actual tree of JSON objects, it seems likely this switch doesn't really show up in the final performance.
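(For illustration only; this is not code from the library, and the field names are assumed examples. A prefix-based dispatch in Scala might look like this:)

import scala.annotation.switch

object PrefixDispatch {
  // Dispatch on the first character, then compare only the few candidates that
  // share it, instead of hashing the whole key up front and comparing afterwards.
  def fieldIndexByPrefix(key: String): Int =
    if (key.isEmpty) -1
    else (key.charAt(0): @switch) match {
      case 'i' => if (key == "id") 0 else if (key == "id_str") 1 else -1
      case 'n' => if (key == "name") 2 else -1
      case 's' => if (key == "screen_name") 3 else -1
      case _   => -1
    }
}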
@thomaswue it is reproducible on the current master of the benchmark repo with the latest GraalVM releases. You just need to clone the repo and run the following commands for the different JVMs:
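(The exact command lines were not preserved here. Based on the steps to reproduce quoted at the bottom of this page, they were presumably of this form; the JVM installation paths and the perfnorm profiler flag are assumptions.)

sbt clean 'benchmark/jmh:run -jvm /usr/lib/jvm/graalvm-ce-19.0.0/bin/java -prof perfnorm .*TwitterAPIBenchmark.readJsoniterScala.*'
sbt clean 'benchmark/jmh:run -jvm /usr/lib/jvm/graalvm-ee-19.0.0/bin/java -prof perfnorm .*TwitterAPIBenchmark.readJsoniterScala.*'
sbt clean 'benchmark/jmh:run -jvm /usr/lib/jvm/java-8-oracle/bin/java -prof perfnorm .*TwitterAPIBenchmark.readJsoniterScala.*'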
The results of running those commands for GraalVM CE 19.0.0, GraalVM EE 19.0.0 and JDK 1.8.0 (the counter listings are not reproduced here) show that there are large differences in the numbers of instructions, branches, loads/stores and cache misses.
@thomaswue thanks for your patience and support! With this change in the code generator I have managed to recover part of the lost CPU cycles: up to ~40% with CE and up to ~10% with EE. The main idea of the change is to improve code layout by reducing the number of long jumps. Please see the current results for GraalVM CE 19.0.0, GraalVM EE 19.0.0 and JDK 1.8.0 (listings not reproduced here).
On other benchmarks this change gives more than a 2.5x speedup; please see the linked results.
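(A hypothetical sketch of the general idea, not the actual code-generator change: moving cold error handling out of the huge generated method keeps the hot dispatch cases close together, so jumps between them stay short. Field names are assumed examples.)

import scala.annotation.switch

object CompactDispatch {
  // Cold error path factored into a separate, never-inlined method.
  @noinline private def decodeError(key: String): Nothing =
    throw new IllegalArgumentException(s"unexpected field: $key")

  // Hot dispatch stays compact: a small switch plus cheap equality checks.
  def fieldIndexByLength(key: String): Int = (key.length: @switch) match {
    case 2 => if (key == "id") 0 else decodeError(key)
    case 4 => if (key == "name") 1 else decodeError(key)
    case 6 => if (key == "id_str") 2 else decodeError(key)
    case _ => decodeError(key)
  }
}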
Also, I found an inspiring post about code layout optimizations. Does GraalVM use some of these tricks?
There are also other limits of Intel and AMD CPUs. Are they taken into account by GraalVM?
We do perform code layout optimizations within one method. Specifically, we use block frequency information to lay out the blocks, and we also align block starts. We don't do interprocedural optimizations for that, and we don't do lower-level optimizations for individual instruction scheduling. This is an interesting blog post, thanks for sharing.
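(An illustrative Scala example of what frequency-based block layout means for a simple method; the placement described in the comments is the expected behavior, not verified compiler output.)

object LayoutExample {
  def sumNonNegative(xs: Array[Int]): Long = {
    var acc = 0L
    var i = 0
    while (i < xs.length) {   // hot blocks: laid out as straight-line fall-through code
      val x = xs(i)
      if (x < 0)              // cold block: rarely taken, expected to be moved out of the hot sequence
        throw new IllegalArgumentException(s"negative value at index $i")
      acc += x
      i += 1
    }
    acc
  }
}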
@thomaswue Please see how efficiently this small function was compiled by Rustc!
@plokhotnyuk were the reports and graphs above generated by some hand-made programs? Do you have them available in your git repo or elsewhere?
@neomatrix369 Hi, Mani! The flame graphs were generated by sbt-jmh, using a command that points to the directories of cloned async-profiler and FlameGraph repos (a sketch follows below). The benchmark results were plotted by JMH Visualizer.
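(The exact command was not preserved in this thread. A rough reconstruction, assuming the jmh.extras.Async profiler shipped with sbt-jmh; the option names and paths are assumptions and may differ between sbt-jmh versions.)

sbt 'benchmark/jmh:run -prof jmh.extras.Async:asyncProfilerDir=/path/to/async-profiler;flameGraphDir=/path/to/FlameGraph .*TwitterAPIBenchmark.readJsoniterScala.*'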
@thomaswue Thank you for your patience and support! I'm closing this issue because jsoniter-scala, GraalVM and some other libraries used in the test have evolved a lot since the reporting date. I have picked the historical graphs for the test (which wasn't changed) and re-run it in the same environment: the difference in throughput versus OpenJDK 8 has been reduced from ~70% to ~20%, and the absolute values have increased too. Some follow-up issues that can improve the parsing in this test are here: 2018-05-09, 2019-10-02
OK, thank you!
I have a benchmark which parses JSON data from UTF-8 bytes to a structure of nested classes with 20-40 fields.
Code for that is generated by Scala macros and compiled by scalac to long methods with big lookup switches (one for each field) like here: https://gist.github.com/plokhotnyuk/1e2ee19d3cc80c3644bc9e453c8aae77
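(A heavily simplified sketch of the shape of that generated code, for readers who don't open the gist: the real code hashes the raw UTF-8 bytes of the key rather than using String.hashCode, and the field names here are just examples.)

object FieldHashDispatch {
  // One case per field; with 20-40 fields per class this becomes a very long
  // method containing a big lookupswitch, and every hash match is still
  // followed by a string equality check.
  def fieldIndex(key: String): Int = key.hashCode match {
    case 3355    => if (key == "id") 0 else -1   // "id".hashCode
    case 3373707 => if (key == "name") 1 else -1 // "name".hashCode
    case _       => -1
  }
}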
With GraalVM CE/EE it shows a ~2x slowdown compared to running on Oracle JDK 8.
Steps to reproduce:
sbt clean 'benchmark/jmh:run -jvm /usr/lib/jvm/java-8-oracle/bin/java .*TwitterAPIBenchmark.readJsoniterScala.*'
sbt clean 'benchmark/jmh:run -jvm /usr/lib/jvm/graalvm-ee-1.0.0-rc1/bin/java .*TwitterAPIBenchmark.readJsoniterScala.*'
I have managed to run it with async-profiler through the sbt-jmh extras integration. Please see the flame graph reports for both JVMs, which were committed to the following directory: https://github.com/plokhotnyuk/jsoniter-scala/tree/graalvm-slowdown-on-big-method-2/profile-async