JVM crash with SIGSEGV #57

Open
aablsk opened this issue Feb 16, 2022 · 61 comments

@aablsk

aablsk commented Feb 16, 2022

Describe the bug

What: After updating to amazoncorretto:17, we've seen irregular JVM crashes in one workload, with the log below. The crash usually happens within the first 5 minutes after starting the workload. Up until the crash, the workload works as expected.

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x0000000000000000, pid=1, tid=14
#
# JRE version: OpenJDK Runtime Environment Corretto-17.0.2.8.1 (17.0.2+8) (build 17.0.2+8-LTS)
# Java VM: OpenJDK 64-Bit Server VM Corretto-17.0.2.8.1 (17.0.2+8-LTS, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, serial gc, linux-amd64)
# Problematic frame:
# C  0x0000000000000000
#
# Core dump will be written. Default location: //core.1
#
# An error report file with more information is saved as:
# //hs_err_pid1.log
#
# Compiler replay data is saved as:
# //replay_pid1.log
#
# If you would like to submit a bug report, please visit:
#   https://github.com/corretto/corretto-17/issues/
#
[error occurred during error reporting (), id 0xb, SIGSEGV (0xb) at pc=0x00007fd7072cb23b]

How often: twice, 7 days apart
Where: the workload runs as an ECS Fargate task
Dumps: none so far, as the dumps were only written to ephemeral storage (if that worked as expected)

To Reproduce

No reliable reproduction as this happens very rarely.

Expected behavior

JVM does not crash.
When JVM crashes, it is able to report the error correctly.

Platform information

OS: Amazon Linux 2
Version: Corretto-17.0.2.8.1 (17.0.2+8) (build 17.0.2+8-LTS) (see log above)
Base-image: public.ecr.aws/amazoncorretto/amazoncorretto:17

For VM crashes, please attach the error report file. By default the file name is hs_err_pid<pid>.log, where <pid> is the process ID of the process. --> Unfortunately not available currently, as it has only been written to the ephemeral storage of the Fargate task container.

Thank you for considering this report! If there is additional information I can provide to help with resolving this, please do not hesitate to reach out!

@aablsk changed the title from "JVM crash with SIGSEGV in C" to "JVM crash with SIGSEGV" on Feb 16, 2022
@earthling-amzn
Contributor

earthling-amzn commented Feb 16, 2022

This will be tough to troubleshoot without a core file or an hs_err log. Is there no way to get these artifacts from the Fargate container, perhaps using ECS Exec? Could the container mount some durable storage, perhaps using EFS with ECS?

Are you able to set the command line flags for the java process? Could you try running with -XX:+ErrorFileToStdout?
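
If changing the java command line inside the image is awkward, one possible route is the JAVA_TOOL_OPTIONS environment variable, which the JVM reads at startup. A minimal sketch, assuming the variable can be set in the container's environment before java starts (in ECS it could equally be set in the task definition rather than through a shell export):

```shell
# The JVM picks up options from JAVA_TOOL_OPTIONS in addition to its command line.
# -XX:+ErrorFileToStdout sends the hs_err report to stdout, so it lands in the container logs.
JAVA_TOOL_OPTIONS="-XX:+ErrorFileToStdout"
export JAVA_TOOL_OPTIONS
```

On startup the JVM prints a "Picked up JAVA_TOOL_OPTIONS: ..." line, which is a quick way to confirm the flag arrived.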

@aablsk
Author

aablsk commented Feb 17, 2022

Thanks for the quick reply, @earthling-amzn!

I'll set up tooling to be prepared for the next crash and report back. Due to the irregularity of the crashes it might take a few days until I have more data. Thank you for your patience and understanding!

@aablsk
Author

aablsk commented Feb 17, 2022

@earthling-amzn Good news! We've been able to observe another crash and your proposed option with -XX:+ErrorFileToStdout resulted in an error log (see below). Please note that I have removed some information and marked it with {REDACTED}.

With my limited understanding, it seems to be related to our use of Kotlin Co-Routine Flows, specifically the collect() method (at least this instance of the issue)?

Please do not hesitate to reach out, if I can support the process!

Thank you for your time and effort!

error_log_jvm_crash_corretto_17.log

@earthling-amzn
Contributor

Thank you for sharing the crash log. To me, it looks like an issue with C2. I'm not very familiar with Kotlin Co-Routine Flows, so it would be helpful if you had a small bit of code to reproduce the crash. Do you know of any public projects that might use Flows? I could look there for benchmarks or tests to reproduce the crash.

@earthling-amzn
Contributor

It would be helpful to have the replay log from the compiler. Could you have the JVM write that file out to persistent storage with -XX:ReplayDataFile=<path>? Are you able to exercise this code outside of a container? If we gave you a fastdebug build of the JVM (i.e., one with assertions enabled), would you be able to run that in your container?

The DataDog agent also does a fair amount of bytecode instrumentation, which could confuse the JIT compiler. You might want to explore options there to disable instrumentation.
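
To keep the replay file and the error report after the container is gone, both can be pointed at a mounted volume. A minimal sketch, assuming a durable mount at /mnt/efs/jvm-crashes (a hypothetical path) and a placeholder app.jar:

```shell
# Hypothetical mount point for durable storage (for example an EFS volume).
CRASH_DIR=/mnt/efs/jvm-crashes

# -XX:ReplayDataFile sets where the compiler replay data is written;
# -XX:ErrorFile does the same for the hs_err report (%p expands to the process ID).
JAVA_CRASH_OPTS="-XX:ReplayDataFile=${CRASH_DIR}/replay.log -XX:ErrorFile=${CRASH_DIR}/hs_err_pid%p.log"

# app.jar is a placeholder for the real application:
#   java $JAVA_CRASH_OPTS -jar app.jar
```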

@aablsk
Author

aablsk commented Feb 18, 2022

@earthling-amzn thanks again for the quick response!

Reproduction
Unfortunately we still have not found a reliable way to reproduce the issue, which makes it very hard to build a limited scope reproduction code example.
We still have not been able to reproduce the issue locally either, which might either be bad luck or some difference in environment (OSX + ARM locally vs Linux + x64 in our deployments).
As soon as we find a reliable way to reproduce I will make sure to build a minimal reproduction example and share it with you.

Public projects
I'm not aware of any projects that use something akin to our usage which consists of spring-reactor + kotlin co-routines in this case. I'll do some research on the weekend on this topic and share my findings.

Compiler replay log
We've added the requested flag and are waiting for another occurrence of the issue. Will report back as soon as I have more data.

Exercise code outside of a container
Yes we're able to do this, but as mentioned before have not been able to reproduce the crash outside of a container running in Fargate.

Fastdebug build
We should be able to run a fastdebug build of the JVM in our staging environment, if you could provide it either as an AL2 Docker image that we can build upon (preferred, as it is closer to our usage) or as binaries from which we could build our own AL2+Corretto base image.

DataDog agent
I will have a look at this, thanks for the advice!

Thanks again for your hard work and support on this issue!

@simonis
Contributor

simonis commented Feb 18, 2022

I just want to clarify that we don't want to blame the DataDog agent for the crash. It's just that, through instrumentation, the agent might create unusual bytecode patterns which the JIT compiler might not be prepared for. Excluding (or not) the DD agent as a reason for this crash might help to isolate the problem and potentially create a reproducer.

Thanks for your support,
Volker

@aablsk
Author

aablsk commented Feb 18, 2022

Thanks for the clarification, Volker!

I'd like to ensure that I'm able to provide individual data for each change I'm making. Since the crashes are highly infrequent, it will probably take some time until I've been able to gather data on the different scenarios.

Scenario 1 (currently waiting for crash): no changes, capture compiler log
Scenario 2: exclude datadog agent
Scenario 3: include fastdebug JVM build(?)

@earthling-amzn
Contributor

earthling-amzn commented Feb 21, 2022

Here is a link to download a fastdebug build. The link will expire in 7 days (Feb 28th, 2022). Please note that although the fastdebug build is an optimized build, it has asserts enabled, so it will run somewhat slower than the release build. Hopefully an assert will catch the condition leading to the crash before it crashes and then terminate the VM with a helpful message.

@aablsk
Author

aablsk commented Feb 22, 2022

@earthling-amzn Thank you for providing the fastdebug build!
Unfortunately I get an ExpiredToken error when trying to access the link. Could you please regenerate the link?

Thanks in advance!

@earthling-amzn
Contributor

Sorry about that. Try this one.

@earthling-amzn
Contributor

Have you seen this crash in earlier versions of the JDK?

@aablsk
Author

aablsk commented Feb 23, 2022

Thank you, the second link worked. I'll probably set it up tomorrow (due to meetings today), and a teammate of mine should be in touch soon.

We've only seen this issue in JDK 17. We've been recently upgrading from Corretto 11 to Corretto 17. We've only seen this happen in this specific service. Setup for services is pretty similar (Spring Boot + Kotlin + DataDog Agent on ECS).

@aablsk
Author

aablsk commented Feb 24, 2022

Unfortunately we've not been able to capture the compiler replay log with -XX:ReplayDataFile= as the process seems to be terminated before this can happen.

We've integrated the fastdebug build in one of our environments and will report back with more information on the next occurrence of the issue.

Please note that a colleague will continue the communication with you as I will be leaving the team.
Thank you for your understanding!

@fknrio

fknrio commented Apr 11, 2022

@earthling-amzn It's been a while, but we had to try out a few things... We excluded the Datadog agent and let the service run for a while with the fastdebug build. We have now been able to reproduce the crash once with the fastdebug build you provided.

Find the log file here (some information has been anonymized): jvm-crash-2022-04-11.log

Probably you're mainly interested in the following?

#  Internal Error (/home/jenkins/node/workspace/Corretto17/generic_linux/x64/build/Corretto17Src/installers/linux/universal/tar/corretto-build/buildRoot/src/hotspot/share/c1/c1_Instruction.cpp:848), pid=1, tid=22
#  assert(existing_value == new_state->local_at(index) || (existing_value->as_Phi() != __null && existing_value->as_Phi()->as_Phi()->block() == this)) failed: phi function required

Hope this helps! Let me know in case of further questions, as I'll take up the communication from @aablsk.

@earthling-amzn
Contributor

That's very interesting and helps narrow the search. I don't suppose you have the compilation replay file given by -XX:ReplayDataFile=./replay_pid1.log ?

@earthling-amzn
Contributor

This crash sure looks like "Hotspot C1 compiler crashes on Kotlin suspend fun with loop", which is patched in the 17.0.3 release. 17.0.3 is scheduled for release on April 19th, 2022.

This is all good news, but I'm a little concerned that the original crash for this issue was in C2. You might want to disable tiered compilation with -XX:-TieredCompilation. This will effectively disable the C1 compiler (where this latest crash occurred) and will have all code compiled by C2 (where the crash in the original report occurred). Maybe just disable tiered compilation where you are running the fastdebug build?

@fknrio

fknrio commented Apr 12, 2022

Thanks for the hint. And sorry, no I don't have the replay file.

I disabled tiered compilation when running the fastdebug build and will monitor if the crash occurs again.

@fknrio

fknrio commented Apr 29, 2022

Since we started running the fastdebug build with tiered compilation disabled and without the Datadog agent, the crash has not occurred again on our development system. The system is not under high load though.

However, with JRE 17.0.3, the JVM still crashes on production:

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x0000000000000000, pid=1, tid=14
#
# JRE version: OpenJDK Runtime Environment Corretto-17.0.3.6.1 (17.0.3+6) (build 17.0.3+6-LTS)
# Java VM: OpenJDK 64-Bit Server VM Corretto-17.0.3.6.1 (17.0.3+6-LTS, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, serial gc, linux-amd64)
# Problematic frame:
# C  0x0000000000000000
#
# Core dump will be written. Default location: //core.1
#
# If you would like to submit a bug report, please visit:
#   https://github.com/corretto/corretto-17/issues/
#
---------------  S U M M A R Y ------------
Command Line: -XX:MaxRAMPercentage=70 -XX:+ErrorFileToStdout -XX:ReplayDataFile=./replay_pid1.log -javaagent:./dd-java-agent.jar cloud.rio.marketplace.productactivation.ProductActivationApplicationKt
Host: Intel(R) Xeon(R) Platinum 8175M CPU @ 2.50GHz, 2 cores, 1G, Amazon Linux release 2 (Karoo)
Time: Thu Apr 28 07:50:45 2022 UTC elapsed time: 79.554398 seconds (0d 0h 1m 19s)
---------------  T H R E A D  ---------------
Current thread (0x00007f79c806db50):  JavaThread "C2 CompilerThread0" daemon [_thread_in_native, id=14, stack(0x00007f799beff000,0x00007f799c000000)]
Current CompileTask:
C2:  79554 21820   !   4       kotlinx.coroutines.reactive.PublisherAsFlow::collectImpl (410 bytes)
Stack: [0x00007f799beff000,0x00007f799c000000],  sp=0x00007f799bffba68,  free space=1010k
siginfo: si_signo: 11 (SIGSEGV), si_code: 1 (SEGV_MAPERR), si_addr: 0x0000000000000000
...

@earthling-amzn
Contributor

Do you have the rest of that crash report? The replay file would also be very helpful to root cause the issue.

@fknrio

fknrio commented Apr 29, 2022

Find the full crash report here.

I don't have the replay file unfortunately, because the service is running on AWS Fargate without a persistent volume.

@navyxliu

This is the same error as you originally reported: C2 fails to compile 'kotlinx.coroutines.reactive.PublisherAsFlow::collectImpl (410 bytes)'.

With -XX:ReplayDataFile=./replay_pid1.log, it's very likely we can reproduce this error. Is it possible for you to write it somewhere with persistent storage?

@fknrio

fknrio commented May 3, 2022

I implemented persisting the replay data file and will let you know when it is available.

@fknrio

fknrio commented May 5, 2022

I now have a replay file at hand. Should I upload it here? Is it fine if I anonymize some information (i.e. replace the package names)? Otherwise, how can I provide you safely with this file?
Or is there anything else I need to do?

@earthling-amzn
Contributor

You may anonymize the file and upload it here and we'll see how far we get with it. We'll also look into ways to better exchange confidential files.

@fknrio

fknrio commented May 6, 2022

Here it is: 2022-05-05_replay_anonymized.log

I hope you can get some value out of it. Let me know if you need anything else.

@navyxliu

Hi @fknrio,
I tried to reproduce the crash from your replay file. One blocker is that your compilation unit contains 2 lambda classes.

 6 16 reactor/core/publisher/Mono$$Lambda$2661+0x0000000801a70570 <init> 
 6 16 reactor/core/publisher/Flux$$Lambda$2755+0x0000000801ab0450 <init> 

Those classes are generated dynamically. I don't have the class files, so I can't trigger the compilation on my side.
Here is one workaround for this issue. You can pass the following option to java. 'DUMP_CLASS_FILES' is a directory, and you need to create it before executing. This will force java to dump all lambda classes to 'DUMP_CLASS_FILES'.

-Djdk.internal.lambda.dumpProxyClasses=DUMP_CLASS_FILES

Can you try that? Or do you have a simple reproducer (either in source code or as a jar file) so we can step into it?
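
The two steps (create the directory, then pass the option) can be sketched as follows. This is only an illustration: DUMP_CLASS_FILES is the directory name from the comment above, while the variable names and app.jar are placeholders.

```shell
# Directory that will receive the dumped lambda classes; any writable path works.
DUMP_DIR=DUMP_CLASS_FILES

# Per the comment above, the directory has to exist before the JVM starts.
mkdir -p "$DUMP_DIR"

# System property that makes the JDK dump generated lambda proxy classes there.
LAMBDA_DUMP_OPT="-Djdk.internal.lambda.dumpProxyClasses=${DUMP_DIR}"

# app.jar is a placeholder for the real application:
#   java "$LAMBDA_DUMP_OPT" -jar app.jar
```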

@navyxliu

Hi @fknrio,
It's also possible to recover the missing classes from a core file. If it's difficult to reproduce this problem from source code, how about sharing the coredump file with us?

@fknrio

fknrio commented May 17, 2022

Hi @navyxliu, I configured the appropriate options and will share the respective files once the crash occurs the next time.

Unfortunately, I don't have a simple reproducer, because for us too it only happens in a single service (although others are built very similarly).

@navyxliu

@edeesis,
The most distinguishing feature of Alpine is musl libc; GNU/Linux uses glibc.

I am not sure your case is the same as @fknrio's here. He uses containerized Amazon Linux 2 and got a crash in the C2 compiler thread.

Could you upload hs_err_pid1.log or even the coredump file?

@ghost

ghost commented May 25, 2022

We're having the same crash on Corretto-17.0.3.6.1 from the amazoncorretto:17 image (not the Alpine one).

Current thread (0x00007f06380e40b0):  JavaThread "C2 CompilerThread0" daemon [_thread_in_native, id=22, stack(0x00007f061c87d000,0x00007f061c97e000)]


Current CompileTask:
C2:30514161 10686   !   4       kotlinx.coroutines.flow.AbstractFlow::collect (189 bytes)

@navyxliu

@fknrio,
Thank you for your patience. This is a really tricky problem.
I analyzed your crash report and replay file. First of all, it looks like your app lasted longer this time (47m). The replay file is still broken. I filed a JBS issue about it: JDK-8287046

@robert-csdisco's case looks very similar to yours. His app crashed at AbstractFlow::collect, which is close but not exactly the same. If you have a way to reproduce this problem, that would be super helpful.

I will try to build the debuginfo of 17.0.3.6.1 and see if I can understand RSP[0] better.

@fknrio

fknrio commented May 27, 2022

@navyxliu Thank you for looking into it and filing an upstream issue. I have the coredump at hand, but cannot share it due to confidential information. If you require some information, just let me know (with providing details how to get them).

@navyxliu

Hi @elizarov,

Some customers report that they observe the PC becoming zero or near zero in a C2 compiler thread. C2 has trouble compiling kotlinx.coroutines.reactive.PublisherAsFlow::collectImpl, or "collect".

The closure of class files is something like this, from @fknrio's report. Have you seen this before? I wonder if you have a reproducer for this in your bug database.

-cp ./kotlin-stdlib-1.6.21.jar:kotlinx-coroutines-reactor-1.5.2.jar:kotlinx-coroutines-reactive-1.5.2.jar:kotlinx-coroutines-core-jvm-1.5.2.jar:kotlin-stdlib-jdk8-1.6.21.jar

Thank you.
--lx

@elizarov

@navyxliu I have not seen this particular one before.

@navyxliu

navyxliu commented Jun 1, 2022

Hi @fknrio,

I think I am stuck here. So far, all I know is that the argument sub_t of SubTypeCheckNode::sub() is neither a klass_ptr nor an oop_ptr; it's an any_ptr. See the details here. I don't know how this happened; maybe kotlinc generates different code.

I can't process your file with sensitive data. If I share the debuginfo file of the JVM with you, would it be okay for you to use gdb to load the coredump and give us stacktraces? Or could you work on a coredump without sensitive data?

thanks,
--lx

@fknrio

fknrio commented Jun 2, 2022

Hi @navyxliu, thanks for your analysis. If you guide me, I can share stacktraces. Or could we do an online session together? (That might mean less back and forth.)

@navyxliu

navyxliu commented Jun 3, 2022

hi, @fknrio ,
We will post a wiki page about how to load a coredump file in gdb and resolve symbols using Corretto debuginfo.
Stay tuned.

--lx

@fknrio

fknrio commented Jun 28, 2022

Hi @navyxliu, any update on this? We still face these crashes.

@navyxliu

Hi @fknrio,
I need a reproducer or at least the stacktrace of the crashing thread to debug this issue.
Here is a quick note I wrote on how to parse the coredump along with the executable. Can you try that in the same Docker image?

Start with gdb

What do you need to analyze a coredump? Essentially, you need only the executable and the coredump. Corretto binaries ship with symbols, which will help you decode stacktraces. Debuginfo files are optional. They provide DWARF information and will help you understand optimized code and frames.

Theoretically, it's possible to do coredump analysis on a different platform. That would require extra care with system libraries. At minimum, you need to prepare the symbols and debuginfo of libc. For simplicity, we assume you are using the exact same Linux system that generated the coredump file.

To parse the coredump file, you need the exact same Java executable; otherwise the symbols and their offsets will not match precisely, and you may not be able to parse the coredump correctly. In this example, we are using Corretto 17.0.3.6.1 for Linux x86_64. Here is the command.

gdb ./amazon-corretto-17.0.3.6.1-linux-x64/bin/java /tmp/core.107410.107411

After gdb loads it, you can dump the stacktrace of any thread. To switch to another thread, use 'thread <id>'. Use 'info threads' to list all threads.

[Current thread is 1 (Thread 0x7f0a83e58700 (LWP 107411))]
(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x00007f0a83297148 in __GI_abort () at abort.c:79
#2  0x00007f0a828820f5 in os::abort(bool, void*, void const*) () from /local/home/xxinliu/Devel/AnalyzingHotSpotCrashes/examples/amazon-corretto-17.0.3.6.1-linux-x64/lib/server/libjvm.so
#3  0x00007f0a82bb8c60 in VMError::report_and_die(int, char const*, char const*, __va_list_tag*, Thread*, unsigned char*, void*, void*, char const*, int, unsigned long) () from /local/home/xxinliu/Devel/AnalyzingHotSpotCrashes/examples/amazon-corretto-17.0.3.6.1-linux-x64/lib/server/libjvm.so
#4  0x00007f0a82bb973b in VMError::report_and_die(Thread*, unsigned int, unsigned char*, void*, void*, char const*, ...) () from /local/home/xxinliu/Devel/AnalyzingHotSpotCrashes/examples/amazon-corretto-17.0.3.6.1-linux-x64/lib/server/libjvm.so
#5  0x00007f0a82bb976e in VMError::report_and_die(Thread*, unsigned int, unsigned char*, void*, void*) () from /local/home/xxinliu/Devel/AnalyzingHotSpotCrashes/examples/amazon-corretto-17.0.3.6.1-linux-x64/lib/server/libjvm.so
#6  0x00007f0a82a597ee in JVM_handle_linux_signal () from /local/home/xxinliu/Devel/AnalyzingHotSpotCrashes/examples/amazon-corretto-17.0.3.6.1-linux-x64/lib/server/libjvm.so
#7  <signal handler called>
#8  0x00007f0a82b6aceb in Unsafe_PutInt () from /local/home/xxinliu/Devel/AnalyzingHotSpotCrashes/examples/amazon-corretto-17.0.3.6.1-linux-x64/lib/server/libjvm.so
#9  0x00007f0a6549d53a in ?? ()
#10 0x0000000000000000 in ?? ()
(gdb)

In this example, we can see that Unsafe_PutInt in libjvm.so triggered a segmentation fault. If we are lucky, this is enough to tell what went wrong. If not, we need to resort to DWARF info and inspect individual frames. That would require you to obtain the debuginfo of libjvm.so first. We are preparing debuginfo files and will start shipping them in the next release. Contact me if you need the debuginfo files of the current release.
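
When the backtraces need to be captured inside a container and shared as text, gdb can also run non-interactively. A minimal sketch, assuming gdb is installed in the image; dump_backtraces and the GDB override variable are names made up for this example:

```shell
# Write the thread list and every thread's backtrace from a core file to a text file.
# $1: java executable that produced the core, $2: core file, $3: output file.
# GDB can point at an alternative gdb binary; it defaults to "gdb" on the PATH.
dump_backtraces() {
  "${GDB:-gdb}" -batch \
    -ex "info threads" \
    -ex "thread apply all bt" \
    "$1" "$2" > "$3" 2>&1
}

# Example with the paths used above:
#   dump_backtraces ./amazon-corretto-17.0.3.6.1-linux-x64/bin/java /tmp/core.107410.107411 all_threads.txt
```

It is worth reviewing the output file before posting it, since it is derived from the process image.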

@navyxliu

If this bug is critical for you, you can work around it using the following command. This will disable the very compilation which triggered the crash in your application.

-XX:CompileCommand=exclude,kotlinx.coroutines.flow.AbstractFlow::collect
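
The method to exclude differs per crash; the error report names it on the line after "Current CompileTask:". A minimal sketch that turns that line into the matching flag, assuming the log layout shown in the reports in this thread (exclude_flag_from_hs_err is a made-up helper name):

```shell
# Read a HotSpot error report on stdin and print a CompileCommand exclude flag
# for the method that was being compiled when the JVM crashed.
exclude_flag_from_hs_err() {
  awk '
    /^Current CompileTask:/ { in_task = 1; next }   # the task line follows this header
    in_task && NF > 0 {
      for (i = 1; i <= NF; i++) {
        if (index($i, "::") > 0) {                  # first token shaped like Class::method
          print "-XX:CompileCommand=exclude," $i
          exit
        }
      }
      in_task = 0                                   # no method token found on that line
    }
  '
}

# Example (hs_err_pid1.log is the report file name used in this thread):
#   exclude_flag_from_hs_err < hs_err_pid1.log
```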

@fknrio

fknrio commented Jul 7, 2022

Hi @navyxliu, thank you for the description.

I tried the flag -XX:CompileCommand=exclude,kotlinx.coroutines.flow.AbstractFlow::collect but the JVM still crashed. Shouldn't it be kotlinx.coroutines.reactive.PublisherAsFlow::collectImpl instead?

For analysis: I ran a shell in the exact docker image that crashed, installed gdb and used the generated coredump file, and here is the stacktrace of the crashing thread:

(gdb) bt
#0  0x00007fb477b5eca0 in raise () from /lib64/libc.so.6
#1  0x00007fb477b60148 in abort () from /lib64/libc.so.6
#2  0x00007fb47714b0f5 in os::abort(bool, void*, void const*) () from /usr/lib/jvm/java-17-amazon-corretto/lib/server/libjvm.so
#3  0x00007fb477481c60 in VMError::report_and_die(int, char const*, char const*, __va_list_tag*, Thread*, unsigned char*, void*, void*, char const*, int, unsigned long) () from /usr/lib/jvm/java-17-amazon-corretto/lib/server/libjvm.so
#4  0x00007fb47748273b in VMError::report_and_die(Thread*, unsigned int, unsigned char*, void*, void*, char const*, ...) () from /usr/lib/jvm/java-17-amazon-corretto/lib/server/libjvm.so
#5  0x00007fb47748276e in VMError::report_and_die(Thread*, unsigned int, unsigned char*, void*, void*) () from /usr/lib/jvm/java-17-amazon-corretto/lib/server/libjvm.so
#6  0x00007fb4773227ee in JVM_handle_linux_signal () from /usr/lib/jvm/java-17-amazon-corretto/lib/server/libjvm.so
#7  <signal handler called>
#8  0x0000000000000000 in ?? ()
#9  0x00007fb47738c5dd in SubTypeCheckNode::sub(Type const*, Type const*) const () from /usr/lib/jvm/java-17-amazon-corretto/lib/server/libjvm.so
#10 0x00007fb476d1a0e6 in split_if(IfNode*, PhaseIterGVN*) () from /usr/lib/jvm/java-17-amazon-corretto/lib/server/libjvm.so
#11 0x00007fb476d20e2a in IfNode::Ideal(PhaseGVN*, bool) () from /usr/lib/jvm/java-17-amazon-corretto/lib/server/libjvm.so
#12 0x00007fb47718bdb9 in PhaseIterGVN::transform_old(Node*) () from /usr/lib/jvm/java-17-amazon-corretto/lib/server/libjvm.so
#13 0x00007fb477188856 in PhaseIterGVN::optimize() () from /usr/lib/jvm/java-17-amazon-corretto/lib/server/libjvm.so
#14 0x00007fb476aded48 in Compile::Optimize() () from /usr/lib/jvm/java-17-amazon-corretto/lib/server/libjvm.so
#15 0x00007fb476ae082d in Compile::Compile(ciEnv*, ciMethod*, int, bool, bool, bool, bool, bool, DirectiveSet*) () from /usr/lib/jvm/java-17-amazon-corretto/lib/server/libjvm.so
#16 0x00007fb476a11aea in C2Compiler::compile_method(ciEnv*, ciMethod*, int, bool, DirectiveSet*) () from /usr/lib/jvm/java-17-amazon-corretto/lib/server/libjvm.so
#17 0x00007fb476aea6ac in CompileBroker::invoke_compiler_on_method(CompileTask*) () from /usr/lib/jvm/java-17-amazon-corretto/lib/server/libjvm.so
#18 0x00007fb476aeb398 in CompileBroker::compiler_thread_loop() () from /usr/lib/jvm/java-17-amazon-corretto/lib/server/libjvm.so
#19 0x00007fb477402ece in JavaThread::run() () from /usr/lib/jvm/java-17-amazon-corretto/lib/server/libjvm.so
#20 0x00007fb477405f72 in Thread::call_run() () from /usr/lib/jvm/java-17-amazon-corretto/lib/server/libjvm.so
#21 0x00007fb47713f601 in thread_native_entry(Thread*) () from /usr/lib/jvm/java-17-amazon-corretto/lib/server/libjvm.so
#22 0x00007fb4780e144b in start_thread () from /lib64/libpthread.so.0
#23 0x00007fb477c1840f in clone () from /lib64/libc.so.6

Does this help? Otherwise it seems I'm missing the corresponding debuginfo (at least gdb complains about "Missing separate debuginfo for /usr/lib/jvm/java-17-amazon-corretto/lib/server/libjvm.so", and more). Can you provide the debuginfo of the current release (17.0.3.6.1)?

@navyxliu

@fknrio,
In your case, C2 has difficulty compiling 'kotlinx.coroutines.reactive.PublisherAsFlow::collectImpl'. You can skip that compilation using -XX:CompileCommand=exclude,kotlinx.coroutines.reactive.PublisherAsFlow::collectImpl. That's a workaround.

The stacktrace looks reasonable to me. It's very similar to what we found before.

The problem happens in the ideal graph, which is the intermediate representation of C2. It depends on your code and profiling info. Without a reproducer, I can't get the ideal graph and reason about why SubTypeCheckNode::sub has trouble with it, in particular what the IfNode looks like at frame 11.

If you can confirm, using the 'exclude' above, that kotlinx.coroutines.reactive.PublisherAsFlow::collectImpl is the only source of this problem, try to record the compilation log for it.

-XX:+LogCompilation -XX:LogFile=broken_compilation.log -XX:CompileCommand=log,kotlinx.coroutines.reactive.PublisherAsFlow::collectImpl

Make sure that you only fetch broken_compilation.log after the java process has terminated. Compiler logs are only serialized in the termination phase.
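
One way to respect that ordering in a container is a small wrapper that copies the files only after java has exited. A minimal sketch under stated assumptions: persist_artifacts is a made-up helper, and app.jar and /mnt/efs/jvm-crashes are placeholders for the real application and a durable mount.

```shell
# Copy crash artifacts from a working directory to durable storage.
# $1: directory the JVM wrote into, $2: destination directory.
persist_artifacts() {
  src_dir=$1
  dest_dir=$2
  mkdir -p "$dest_dir" || return 1
  for f in "$src_dir"/hs_err_pid*.log "$src_dir"/replay_pid*.log "$src_dir"/broken_compilation.log; do
    # An unmatched glob stays literal, so skip anything that is not a real file.
    if [ -f "$f" ]; then
      cp "$f" "$dest_dir"/
    fi
  done
  return 0
}

# Possible use as a container entrypoint; the copy runs only once java has terminated:
#   java -XX:+LogCompilation -XX:LogFile=broken_compilation.log -jar app.jar
#   status=$?
#   persist_artifacts . /mnt/efs/jvm-crashes
#   exit "$status"
```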

@fknrio

fknrio commented Jul 13, 2022

I implemented the workaround with PublisherAsFlow::collectImpl and will check if the crash also occurs with the exclusion defined. In parallel, I added the LogCompilation options in a second instance of the service without the exclusion.

Unfortunately it is not straightforward to create a reproducible without confidential data.

Will keep you posted.

@fknrio

fknrio commented Jul 18, 2022

@navyxliu
So far the workaround (-XX:CompileCommand=exclude,kotlinx.coroutines.reactive.PublisherAsFlow::collectImpl) works fine.

I recorded the log compilation of a crash as you suggested and attach it here in anonymized form (please download, link expires). Does this help?

@karla-barraza

We are also seeing this issue on Linux and Windows. We will assess if we can try the workarounds mentioned above.

We’ve been able to reproduce this consistently with our production code. We have a specific integration test failing on Linux and Windows during JIT compilation (approximately 50% of the time)

OS: Linux, version 5.15.0-1022-aws and also Windows Server 2012 R2, version 6.3
Version: Corretto-17.0.4.9.1

@earthling-amzn
Contributor

How portable is your integration test? Is it something we could run? Could you turn it into something we could run? Are you also using Kotlin?

@karla-barraza

Hello, our test is not currently portable and we're still assessing if we are able to make it portable. We are using Kotlin.

@kzalewski11

Hi there. Excluding the method that was triggering the crash worked for us as a workaround. We'd like to help move this investigation forward and eventually re-enable the C2 compiler for this method.

The failure does occasionally occur when attempting to compile other methods, but much less frequently. So this workaround isn't iron-clad, but it has improved the situation. The issue is fairly regular on our CI, but attempts to locally reproduce have been fruitless, and making the failing test portable doesn't appear viable.

How should we move forward trying to root-cause this?

@navyxliu

navyxliu commented Dec 16, 2022

@kzalewski11

I understand your concern. This is the inherent hardship when it comes to the C2 compiler. Unlike C/C++ compilers, the Java JIT compiles code with profiling information, so we need both static class files and dynamic information to reproduce the exact same compilation.

How should we move forward trying to root-cause this?

The first thing I would try is our debug builds in your environment. Fastdebug first and then slowdebug if the former can trigger the problem. Debug builds enable assertions and will capture abnormal conditions as early as possible. Sometimes they also reveal more information in the hotspot crash report.

If there's no easy way to reproduce the issue, I would focus on collecting information when the crash happens. To be specific, there are 2 artifacts we need to save:

  1. replay file. A replay file records compiler directives and profiling information of a compilation unit.
  2. coredump.

Theoretically, we are able to play back the exact compilation using the replay file and your classes. If you manage to do so, it can be treated as a reproducer! Share that with us and we will diagnose the compilation error. If HotSpot doesn't write a replay file for you, use the following command to force it to do so.

-XX:CompileCommand=DumpReplay,package.class::method

There are a few cases where even the replay file can't reproduce the problem. If so, we can only resort to the coredump. One thing to call out is that a coredump file is an image of your process. It may contain sensitive data residing in your memory, so please refer to your security policy before sharing it with us. If you have the coredump file, you could also diagnose on your side. With the coredump, we use gdb or hsdb to inspect it and extract more information. hsdb has a handy tool to wrap up relevant things and yield an executable. If you can share that with us, it's also helpful. Here is a video that shows how Volker diagnosed a crash using hsdb+core.

Last but not least, there's a tool called rr. It is an event recorder: it can record everything and allows you to play it back on a certain platform. You may try that if it is applicable to your system.

--xliu, Corretto Team

@kzalewski11

@navyxliu
Thanks for the suggestions. I've pulled a core dump and replay file from the build and am making sure our security policy allows us to send it over. If so, how should I securely get you the files?

@amankr1279

I am facing errors similar to this:- #57 (comment). Any suggestions please?

@benty-amzn
Contributor

@kzalewski11 I'm looking into the best way for you to provide us the files, sorry for the slow response time.

@amankr1279 it would be best if you could open a separate issue and provide us with any error output you have, such as the crash output, hs_err.log, etc

@ghost

ghost commented Jun 1, 2023

Is -XX:CompileCommand=exclude,kotlinx.coroutines.flow.AbstractFlow::collect still the workaround? I managed to get the same crash today on a new app. The app is a grpc service that takes a list, and calls a grpc service that takes/returns a flow, collects to a list, and returns.

# JRE version: OpenJDK Runtime Environment Corretto-17.0.7.7.1 (17.0.7+7) (build 17.0.7+7-LTS)
# Java VM: OpenJDK 64-Bit Server VM Corretto-17.0.7.7.1 (17.0.7+7-LTS, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-aarch64)
# Problematic frame:
# V  [libjvm.so+0xd4fb6c]  SubTypeCheckNode::sub(Type const*, Type const*) const+0x2cc

@simonis
Contributor

simonis commented Jun 12, 2023

https://bugs.openjdk.org/browse/JDK-8303279 seems to track the same issue.

@simonis
Contributor

simonis commented Jul 13, 2023

Fixed by openjdk/jdk#14600 in JDK 22.

Still needs to be backported to 17 & 21.

@benty-amzn
Contributor

benty-amzn commented Jul 13, 2023

Backports requested

21: openjdk/jdk21u#9 (already fixed in 21 here: openjdk/jdk21@f792475)

17: openjdk/jdk17u-dev#1580
