JVM crash with SIGSEGV #57
This will be tough to troubleshoot without a core file or an hs_err log. Is there no way to get these artifacts from the Fargate container? Perhaps using ECS Exec? Could the container mount some durable storage, perhaps using EFS with ECS? Are you able to set the command-line flags for the java process? Could you try running with |
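The suggested flags are cut off above. As a hedged illustration (not the original suggestion), the fatal error log can usually be redirected to a mounted volume with a standard HotSpot option; the EFS mount path below is an assumption:

```sh
# Hedged sketch: write the HotSpot fatal error log to durable storage.
# /mnt/efs is an assumed EFS mount point, not taken from this thread.
java -XX:ErrorFile=/mnt/efs/crash/hs_err_pid%p.log -jar app.jar
```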
Thanks for the quick reply, @earthling-amzn! I'll set up tooling to be prepared for the next crash and report back. Due to the irregularity of the crashes it might take a few days until I have more data. Thank you for your patience and understanding! |
@earthling-amzn Good news! We've been able to observe another crash, and your proposed option let us capture a crash log. With my limited understanding, it seems to be related to our use of Kotlin Coroutine Flows, specifically the collect() method (at least for this instance of the issue). Please do not hesitate to reach out if I can support the process! Thank you for your time and effort! |
Thank you for sharing the crash log. To me, it looks like an issue with C2. I'm not very familiar with Kotlin Coroutine Flows, so it would be helpful if you had a small bit of code to reproduce the crash. Do you know of any public projects that might use Flows? I could look there for benchmarks or tests to reproduce the crash. |
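For readers unfamiliar with the pattern under discussion, here is a minimal, purely illustrative Kotlin sketch of a suspend function that collects a Flow in a loop; it is not the reporter's code and is not a known reproducer of the crash:

```kotlin
import kotlinx.coroutines.flow.*
import kotlinx.coroutines.runBlocking

// Illustrative only: a suspend fun with a loop that repeatedly collects a
// Flow, i.e. the general shape of code discussed in this issue.
suspend fun pollUpdates(source: Flow<Int>) {
    repeat(1_000) {
        source.collect { value ->
            if (value % 100 == 0) println("processed $value")
        }
    }
}

fun main() = runBlocking {
    pollUpdates(flowOf(1, 2, 3))
}
```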
It would be helpful to have the replay log from the compiler. Could you have the JVM write that file out to persistent storage? The DataDog agent also does a fair amount of bytecode instrumentation, which could also confuse the JIT compiler. You might want to explore options there to disable instrumentation. |
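The option name is cut off above; judging from a later comment in this thread, the relevant flag is -XX:ReplayDataFile. A sketch of pointing it at persistent storage (the mount path is an assumption):

```sh
# Hedged sketch: write the compiler replay file to a mounted volume instead of
# the task's ephemeral storage (the mount path is an assumption).
java -XX:ReplayDataFile=/mnt/efs/crash/replay.log -jar app.jar
```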
@earthling-amzn thanks again for the quick response! Topics covered in this reply: Reproduction; Public projects; Compiler replay log; Exercise code outside of a container; Fastdebug build; DataDog agent. Thanks again for your hard work and support on this issue! |
I just want to clarify that we don't want to blame the DataDog agent for the crash. It's just that, through instrumentation, the agent might create unusual bytecode patterns which the JIT compiler might not be prepared for. Excluding (or not) the DD agent as a reason for this crash might help to isolate the problem and potentially create a reproducer. Thanks for your support, |
Thanks for the clarification, Volker! I'd like to ensure that I'm able to provide individual data for each change I'm making. Since the crashes are highly infrequent, it will probably take some time until I've been able to gather data on the different scenarios. Scenario 1 (currently waiting for crash): no changes, capture compiler log |
Here is a link to download a fastdebug build. The link will expire in 7 days (Feb 28th, 2022). Please note that although the fastdebug build is an optimized build, it has asserts enabled, so it will run somewhat slower than the release build. The hope is that an assert will catch the condition leading to the crash early and terminate the VM with a helpful message. |
@earthling-amzn Thank you for providing the fastdebug build! Thanks in advance! |
Sorry about that. Try this one. |
Have you seen this crash in earlier versions of the JDK? |
Thank you, the second link worked. I'll probably set it up tomorrow (due to meetings today), and a teammate of mine should be in touch soon. We've only seen this issue in JDK 17; we recently upgraded from Corretto 11 to Corretto 17. We've only seen this happen in this specific service, although the setup for our services is pretty similar (Spring Boot + Kotlin + DataDog Agent on ECS). |
Unfortunately we've not been able to capture the compiler replay log. We've integrated the fastdebug build in one of our environments and will report back with more information on the next occurrence of the issue. Please note that a colleague will continue the communication with you, as I will be leaving the team. |
@earthling-amzn It's been a while, but we had to try out a few things... We excluded the Datadog agent and let the service run for a while with the fastdebug build. We have now been able to reproduce the crash once with the fastdebug build you provided. Find the log file here (some information has been anonymized): jvm-crash-2022-04-11.log. You're probably mainly interested in the following:
Hope this helps! Let me know in case of further questions, as I'll be taking over the communication from @aablsk. |
That's very interesting and helps narrow the search. I don't suppose you have the compilation replay file given by |
This crash sure looks like: Hotspot C1 compiler crashes on Kotlin suspend fun with loop, which is patched in the 17.0.3 release. 17.0.3 is scheduled for release on April 19th, 2022. This is all good news, but I'm a little concerned that the original crash for this issue was in C2. You might want to disable tiered compilation with |
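The flag itself is truncated above; the standard switch for this is -XX:-TieredCompilation, which skips C1 entirely and compiles hot methods directly with C2. For reference:

```sh
# Run with tiered compilation disabled: C1 is skipped and hot methods are
# compiled directly by C2.
java -XX:-TieredCompilation -jar app.jar
```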
Thanks for the hint. And sorry, no I don't have the replay file. I disabled tiered compilation when running the |
Since running the … However, with JRE 17.0.3, the JVM still crashes in production:
|
Do you have the rest of that crash report? The replay file would also be very helpful to root cause the issue. |
Find the full crash report here. I don't have the replay file unfortunately, because the service is running on AWS Fargate without a persistent volume. |
This is the same error as the one you originally reported: C2 fails to compile 'kotlinx.coroutines.reactive.PublisherAsFlow::collectImpl (410 bytes)'. With -XX:ReplayDataFile=./replay_pid1.log, it's very likely we can reproduce this error. Is it possible for you to write that file somewhere with persistent storage? |
I implemented persisting the replay data file and will let you know when it is available. |
I now have a replay file at hand. Should I upload it here? Is it fine if I anonymize some information (e.g. replace the package names)? Otherwise, how can I safely provide you with this file? |
You may anonymize the file and upload it here and we'll see how far we get with it. We'll also look into ways to better exchange confidential files. |
Here it is: 2022-05-05_replay_anonymized.log. I hope you can get some value out of it. Let me know if you need anything else. |
hi, @fknrio
Those classes are generated dynamically. I don't have the class files so we can't trigger compilation on my side.
Can you try that? Or do you have a simple reproducer (either in source code or a jar file) so we can step into it? |
hi, @fknrio, |
Hi @navyxliu, I configured the appropriate options and will share the respective files the next time the crash occurs. Unfortunately, I don't have a simple reproducer, because for us too it only happens in a single service (although others are built very similarly). |
We're having the same crash on Corretto-17.0.3.6.1 from the amazoncorretto:17 image (not the Alpine one)
|
@fknrio, @robert-csdisco's case looks very similar to yours. His app crashed at … I will try to build the debuginfo of 17.0.3.6.1 and see if I can understand RSP[0] better. |
@navyxliu Thank you for looking into it and filing an upstream issue. I have the coredump at hand, but cannot share it due to confidential information. If you require some information, just let me know (providing details on how to get it). |
hi, @elizarov, Some customers report that they observe the PC become zero or near zero in the C2CompilerThread. C2 has trouble compiling … The closure of classfiles is something like this, from @fknrio's report. Have you seen this before? I wonder if you have a reproducer for this in your bug database.
Thank you. |
@navyxliu I have not seen this particular one before. |
hi, @fknrio, I think I am stuck here. So far, all I know is that the argument … I can't process your file with sensitive data. If I share the debuginfo file of the JVM with you, would it be okay for you to use gdb to load up the coredump and give us stacktraces? Or could you work on a coredump without sensitive data? thanks, |
Hi @navyxliu, thanks for your analysis. If you guide me, I can share stacktraces. Or we could do an online session together, which might mean less back and forth. |
hi, @fknrio , --lx |
Hi @navyxliu, any update on this? We still face these crashes. |
hi, @fknrio ,

Start with gdb

What do you need to analyze a coredump? Essentially, you need only the executable and the coredump. Corretto binaries ship with symbols, which will help you decode stacktraces. Debuginfo files are optional; they provide DWARF information and will help you understand optimized code and frames.

Theoretically, it's possible to do coredump analysis on a different platform, but that requires extra care with the system libraries; at a minimum, you need to prepare the symbols and debuginfo of libc. For simplicity, we assume you are using the exact same Linux system that generated the coredump file. To parse the coredump, you need exactly the same Java executable; otherwise the symbols and their offsets will not match precisely, and you may not be able to parse the coredump correctly. In this example, we are using Corretto 17.0.3.6.1 for Linux x86_64. Here is the command. |
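The command block itself did not survive in this copy; a typical invocation simply passes the exact java binary and the core file to gdb (paths are illustrative):

```sh
# Load the coredump together with the exact java executable that produced it
# (paths are illustrative).
gdb /usr/lib/jvm/java-17-amazon-corretto/bin/java core.12345
```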
After gdb loads it, you can dump the stacktrace of any thread. To switch to another thread, use 'thread <id>'. Use 'info threads' to list all threads. |
In this example, we can see that Unsafe_putInt of libjvm.so triggered a segmentation fault. If we are lucky, we will know what went wrong. If not, we need to resort to the DWARF info and inspect individual frames, which would require you to obtain the debuginfo of libjvm.so first. We are preparing debuginfo files and will start shipping them in the next release. Contact me if you need the debuginfo files for the current release. |
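If separate debuginfo files are available, gdb can be told where to find them before the executable and core are loaded; a sketch with an assumed debuginfo directory:

```sh
# Point gdb at separately shipped debuginfo (e.g. DWARF data for libjvm.so)
# before the executable and core are loaded; the directory is an assumption.
gdb -iex "set debug-file-directory /opt/corretto-17-debuginfo" \
    /usr/lib/jvm/java-17-amazon-corretto/bin/java core.12345
```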
If this bug is critical for you, you can work around it using the following option, which disables the very compilation that triggered the crash in your application: -XX:CompileCommand=exclude,kotlinx.coroutines.flow.AbstractFlow::collect |
Hi @navyxliu, thank you for the description. I tried the flag. For the analysis, I ran a shell in the exact Docker image that crashed, installed gdb, loaded the generated coredump file, and here is the stacktrace of the crashing thread:
Does this help? Otherwise it seems I'm missing the corresponding debuginfo (at least gdb also complains about |
@fknrio, The stacktrace looks reasonable to me. It's very similar to what we found before. The problem happens in the ideal graph, which is the intermediate representation of C2; this depends on your code and profiling info. Without a reproducer, I can't get the ideal graph or the reason why it has trouble in … If you can confirm that |
Make sure that you only fetch broken_compilation.log after the java process has terminated. Compiler logs are only serialized in the termination phase. |
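The option that produces broken_compilation.log is not visible in this copy of the thread; the usual way to obtain a HotSpot compilation log is the diagnostic LogCompilation switch, shown here only as an assumption about what was configured:

```sh
# Assumed configuration: emit the JIT compilation log to persistent storage.
# The log is only fully written out when the JVM terminates.
java -XX:+UnlockDiagnosticVMOptions -XX:+LogCompilation \
     -XX:LogFile=/mnt/efs/crash/broken_compilation.log -jar app.jar
```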
I implemented the workaround. Unfortunately, it is not straightforward to create a reproducer without confidential data. Will keep you posted. |
We are also seeing this issue on Linux and Windows. We will assess whether we can try the workarounds mentioned above. We've been able to reproduce this consistently with our production code: we have a specific integration test failing on Linux and Windows during JIT compilation (approximately 50% of the time). OS: Linux, version 5.15.0-1022-aws, and also Windows Server 2012 R2, version 6.3 |
How portable is your integration test? Is it something we could run? Could you turn it into something we could run? Are you also using Kotlin? |
Hello, our test is not currently portable and we're still assessing if we are able to make it portable. We are using Kotlin. |
Hi there. Excluding the method that was triggering the crash worked for us as a workaround. We'd like to help move this investigation forward and eventually re-enable the C2 compiler for this method. The failure does occasionally occur when attempting to compile other methods, but much less frequently. So this workaround isn't iron-clad, but it has improved the situation. The issue is fairly regular on our CI, but attempts to locally reproduce have been fruitless, and making the failing test portable doesn't appear viable. How should we move forward trying to root-cause this? |
I understand your concern. This is the inherent hardship when it comes to the C2 compiler. Unlike C/C++, the Java JIT compiles code with profiling information, so we need both the static class files and the dynamic information to reproduce exactly the same compilation.
The first thing I would try is our debug builds in your environment: fastdebug first, and then slowdebug if the former can trigger the problem. Debug builds enable assertions and will catch abnormal conditions as early as possible. Sometimes they also reveal more information in the hotspot crash report. If there's no easy way to reproduce the issue, I would focus on collecting information when the crash happens. Specifically, there are two artifacts we need to save:
Theoretically, we are able to play back the exact compilation using the replay file and your classes. If you manage to do so, it can be treated as a reproducer! Share that with us and we will diagnose the compilation error. If hotspot doesn't take a replay file for you, use the following command to force it to do so.
There are a few cases in which even the replay file can't reproduce the problem. If so, we can only resort to a coredump. One thing to call out is that a coredump file is the image of your process; it may contain sensitive data residing in your memory, so please refer to your security policy before sharing it with us. If you have the coredump file, you could also diagnose it on your side. With the coredump, we use gdb or hsdb to inspect it and extract more information. hsdb has a handy tool to wrap up relevant things and yield an executable; if you can share that with us, it's also helpful. Here is a video that shows how Volker diagnosed a crash using hsdb+core. Last but not least, there's a tool called rr. It is an event recorder: it can record everything and allow you to play it back on a supported platform. You may try that if your system is applicable. --xliu, Corretto Team |
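As a supplement to the hsdb mention above: since JDK 9 the Serviceability Agent is exposed through the jhsdb launcher, so a core can also be inspected without gdb (paths illustrative):

```sh
# Open the coredump in the HotSpot Serviceability Agent debugger (GUI).
jhsdb hsdb --exe /usr/lib/jvm/java-17-amazon-corretto/bin/java --core core.12345

# Or dump Java-level stack traces from the core non-interactively.
jhsdb jstack --exe /usr/lib/jvm/java-17-amazon-corretto/bin/java --core core.12345
```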
@navyxliu |
I am facing errors similar to this: #57 (comment). Any suggestions, please? |
@kzalewski11 I'm looking into the best way for you to provide us the files, sorry for the slow response time. @amankr1279 it would be best if you could open a separate issue and provide us with any error output you have, such as the crash output, hs_err.log, etc |
Is
|
https://bugs.openjdk.org/browse/JDK-8303279 seems to track the same issue. |
Fixed by openjdk/jdk#14600 in JDK 22. Still needs to be backported to 17 & 21. |
Backports requested
|
Describe the bug
What: After updating to amazoncorretto:17 we've seen irregular JVM crashes for one workload, with the log below. The crash usually happens within the first 5 minutes after starting the workload. Up until the crash, the workload works as expected.
How often: twice, with a period of 7 days in between
Where: the workload runs as an ECS Fargate task
Dumps: none, as the dumps were only written to ephemeral storage so far (if that worked as expected)
To Reproduce
No reliable reproduction as this happens very rarely.
Expected behavior
JVM does not crash.
When the JVM crashes, it is able to report the error correctly.
Platform information
For VM crashes, please attach the error report file. By default the file name is hs_err_pid<pid>.log, where <pid> is the process ID of the process. --> Unfortunately not available currently, as this has only been written to the ephemeral storage of the Fargate task container.
Thank you for considering this report! If there is additional information I can provide to help with resolving this, please do not hesitate to reach out!