Improve jitdump functionality #9120
Comments
Adding new PR (eclipse-omr/omr#5135) to the list, which will address some newline issues seen when jitdumps are generated.
I've reopened #9227 as we'll need to avoid printing snippets after a crash, since we cannot reliably print them before binary encoding. This is further explained in eclipse-omr/omr#5111, which will be addressed at some point in the future. For now, we still want to avoid recursive crashes so that we get a proper jitdump out, so I'll be addressing that issue in the next few days.
Adding new issue (#9386) on a proposal to enable paranoid opt. check for jitdump recompilations.
Adding new PR (#9387) to address inconsistency in generation of jitdump vs. javacore and other dump triggers. That is, the messages reported and how they are reported are now consistent with javacore, Snap dump, heapdump, etc., and there are no redundant prefixes in the messages. In addition, we use the same function naming convention as javacore and snap dumps to remain consistent with other parts of the JVM.
Adding new issue (#9428) to improve programmatic setting of tracing options for jitdump compiles.
Adding new issue (#9479) to support specifying sub-options using the
Adding new issue (#9522) to avoid compilation interruptions, such as the JVM wanting to shut down, when generating jitdumps. This is often seen in JUnit-type tests where, for example, a crash in the JIT will happen, or an exception is thrown in a test which reaches
Just a quick update on where things stand. I currently have several PRs up which I'm waiting to get merged before forging on. I think the most important issue to work on once this bulk of PRs is merged is #9136.
Another update on #9136. I've gotten to the bottom of the major issue for one of the deadlocks. I still need to investigate the other, much less common and more artificial, deadlock described in the latest comment in #9136. I'd like to fix them both to close off that item, which is a major milestone in this work.
Back to trying to finish this off in the next month or so. Trying to knock off the easier items first, so I'm resuming #9428.
Another update from me. I do still have this on my radar but have been distracted by some machine migration that must be performed by end of September. I hope to get back to working on this in the next few weeks. I will post an update once I get back to doing something meaningful in this area.
The changes delivered here are already starting to show their benefit. For example, a 0/420 defect was able to produce a useful jitdump on first failure data capture over in #10630, which will aid in debugging the assert there.
I found one problem in a crash. The original crash is in AOT compilation, but the replay is for JIT compilation, which finishes without error.
It would be good if the trace log and jitdump told us whether it is an AOT compile. Another problem is that, if the crash is in ilgen, no trees will be printed out before replay. I guess if replay happens in the right context, that is not a problem, but it would still be good if we could print some information.
Opened #10852
Getting back to this work in the last few days as I'm trying to polish this off, given we are so close to completing everything. I started back looking at #9522 and that problem is mostly fixed, but during my stress testing around that area I discovered several issues which I've documented in #11765, #11770, and #11772. I have a firm understanding of the various problems now and I have solutions for each of them, which I will try to deliver in the next few days. We are much closer to having robust JitDump generation.
I've dug myself out of the hole and have emerged with a ton of goodies. I've opened up #11825, which addresses what I believe to be all issues revolving around generation of JitDumps from crashed compilations. It will also help in the case of application thread crashes. This is the area I am going to stress test next, to ensure every JIT compiled body on the stack of an application crash gets a JitDump recompilation. This will be the final step in this saga, after which I expect every single JIT defect to have a useful JitDump accompanying it.
All the issues I could find on compilation thread crashes have been resolved. Going to take a look at application thread crashes and see if there is anything to fix on that front. If not, I'll do another refactoring pass to clean everything up, add documentation and proper tracing, then close off the Epic.
We are almost done here. I've opened #12203 as a final refactoring PR. Once that PR is merged, my contribution to this Epic is complete. Thanks to all who followed along!
I wonder if you can add a change to turn on
Implemented in #12208.
Hoping that late is still better than never - Thank you for all this work!!
Background
The jitdump is a dump agent [1] that collects JIT trace logs, which can help in the investigation of OpenJ9 issues. This dump agent is enabled by default for general protection faults and aborts [2].
A jitdump can typically help under two scenarios:
For both of these scenarios we typically require a JIT trace log of the method in question for further investigation. Sometimes this is an iterative process, especially for case 2, as we may not know which area of the JIT compiler was responsible for generating the faulty logic in the JIT compiled method assembly. The iterative process may require us to learn more about the problem from every log, and to suggest additional tracing options until we can pinpoint the problem.
For case 1, we often need additional tracing enabled for the area of the JIT in which we crashed, in addition to having the JIT IL trees at hand.
Due to the dynamic nature of the JVM runtime environment, and the fact that the JIT compiler is guided by profiling information, a JIT compilation of a method in one JVM invocation may behave differently than a JIT compilation of the same method in a subsequent invocation of the JVM, even when the same environment and application are being run. This is a problem for servicing such issues if the first incident data collection did not capture enough information to service the issue effectively and provide a resolution.
The typical result of failing to obtain useful logging on first incident is that developers/service engineers must work with the stakeholder to reproduce the issue with additional tracing. This can take time and resources for both parties. A properly generated jitdump has a very high chance of reproducing the exact same compilation as the original, but with tracing enabled, because it runs in the same JVM process which produced the original faulty compilation. Therefore it is highly desirable to generate a useful jitdump on first incident to speed up the investigation of issues in the JIT.
[1] https://www.eclipse.org/openj9/docs/xdump/#dump-agents
[2] https://www.eclipse.org/openj9/docs/xdump/#default-dump-agents
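For anyone who wants to see the default agent configuration described above from a test harness, here is a minimal sketch. It is not taken from the OpenJ9 code base: the class `JitDumpLaunch` and the method `buildCommand` are hypothetical names made up for illustration. It only assembles a child JVM command line that spells out the jitdump agent on the gpf and abort events with `-Xdump:jit:events=gpf+abort` (this should match what [2] lists as the default, so it is redundant on a stock OpenJ9 build) and adds `-Xdump:what` so the JVM prints its registered dump agents at startup. Both options are assumed to behave as described in [1]; a non-OpenJ9 JVM will reject them.

```java
import java.util.ArrayList;
import java.util.List;

public class JitDumpLaunch {
    // Hypothetical helper (not part of OpenJ9): builds the command line for a
    // child JVM with the jitdump agent requested explicitly.
    public static List<String> buildCommand(String javaExe, String mainClass) {
        List<String> cmd = new ArrayList<>();
        cmd.add(javaExe);
        // Request a jitdump on general protection faults and aborts.
        // Assumption: this mirrors the documented default, see [2].
        cmd.add("-Xdump:jit:events=gpf+abort");
        // Ask the JVM to list the dump agents registered at startup.
        cmd.add("-Xdump:what");
        cmd.add(mainClass);
        return cmd;
    }

    public static void main(String[] args) {
        // Print the command line instead of running it, so the sketch does not
        // depend on an OpenJ9 JVM being installed.
        System.out.println(String.join(" ", buildCommand("java", "MyApp")));
    }
}
```

The resulting list can be handed to `java.lang.ProcessBuilder` to start the child JVM; `MyApp` is a placeholder main class.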
Problems
There are several limitations when jitdump trace files are created:
Goal
The goal of this effort is to resolve the problems outlined in the previous section, and to always generate a useful jitdump so that developers/service engineers can make use of the trace information obtained during first incident data collection. Success will be measured by the reduction in the time it takes developers/service engineers to obtain a JIT trace log containing enough information to make progress on fixing a defect. Another goal of this effort is to improve the documentation and code quality of the jitdump process in the JIT compiler.
Issues / PRs