JVM crash with Java 7 + Agent 1.3.0 #458
Comments
Here is a similar report: https://discuss.elastic.co/t/apm-agent-for-java-is-causing-system-crashes/165211 It also crashes in the Is it possible for you to update to the latest Java 1.7 update release? |
We have a theory what's going on. We have recently found a bug in This method is called from We hope that when merging in the fix, the error goes away. In the meantime, could you check if you are using any of the disallowed characters ( |
We don't have any Elastic APM specific code in our application package. Here's the tail of the Elastic APM agent log from the time of the crash. Notice descending timestamps. This is from the same service as hs_err_pid43918.log Crash happened at "Tue Jan 29 01:35:06 2019" according to logs.
|
I updated one host to JDK 7u211. Didn't want to update the other one yet. And also as Tero said, I don't remember us having any APM-specific code in apps. |
And it crashed again with 7u211. Works almost like clockwork...
|
Hmm... There might be a second failure scenario: an infinite CPU loop. The second app (with the older JDK) has now hung twice in a row. The hangs seem to happen after about 50000 seconds of uptime. Both times htop showed 200% CPU usage, and there seem to be two threads stuck. jstack shows lots of BLOCKED threads and a few active ones (all with the same stack trace)
The application queries data over LDAP, so this looks normal, but in both hangs these are the only running threads. |
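For readers who want the same overview from inside the JVM, here is a minimal sketch of what the jstack summary above describes: counting threads per state and printing the stacks of the RUNNABLE ones. The class name `ThreadStateSummary` and its methods are hypothetical and only illustrate the idea. It uses the standard `Thread.getAllStackTraces()` API, and it is not a replacement for a full jstack dump of a hung process.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical helper, not part of the Elastic APM agent.
class ThreadStateSummary {

    /** Counts live threads per Thread.State, similar to skimming a jstack dump. */
    static Map<Thread.State, Integer> countByState() {
        Map<Thread.State, Integer> counts =
                new EnumMap<Thread.State, Integer>(Thread.State.class);
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            Thread.State state = t.getState();
            Integer old = counts.get(state);
            counts.put(state, old == null ? 1 : old + 1);
        }
        return counts;
    }

    /** Prints the stack trace of every thread that is currently RUNNABLE. */
    static void printRunnableStacks() {
        for (Map.Entry<Thread, StackTraceElement[]> e
                : Thread.getAllStackTraces().entrySet()) {
            if (e.getKey().getState() != Thread.State.RUNNABLE) {
                continue;
            }
            System.out.println("\"" + e.getKey().getName() + "\"");
            for (StackTraceElement frame : e.getValue()) {
                System.out.println("    at " + frame);
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(countByState());
        printRunnableStacks();
    }
}
```

If several RUNNABLE threads keep showing the same frames across repeated snapshots, that points at the code spinning on the CPU.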
@Aketzu I am really sorry to hear about the second scenario. For this one, a full thread dump would be useful. If you have a profiler that can show us where CPU time is mostly spent when the app reaches this hung state, that would be super useful. For the crash scenario, please try this snapshot and see if the crash is reproduced - elastic-apm-agent-1.3.1-SNAPSHOT.jar.zip |
Looks like the issue still happens with this snapshot: |
Bummer :( |
@TeroPihlaja really sorry to hear that! In addition, please add Lastly, if you can try running the same on Java 8, that can be very insightful. Thanks for your invaluable input! |
We are running 1.3.0 with Java 8. This issue seems to only happen on Java 7. I will try your suggestions and see what happens. |
Another idea would be to disable compilation for the |
|
OK, that was a long shot. But what about additional data in the server logs? Anything new there that can provide additional info about JNI issues? Or any gdb logs? |
Looks like those command line argument changes weren't actually in use yet. I will try again and let you know tomorrow (after it crashes again). There's no gdb installed on this system and I'm not too familiar with it. If there's something specific you'd like to check from gdb I could probably install it on the server. |
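One way to confirm that command line changes are really in effect is to ask the running JVM for its input arguments. The sketch below assumes you can run a small piece of code inside the affected process. The class name `JvmFlagCheck` is hypothetical, while `ManagementFactory.getRuntimeMXBean().getInputArguments()` is the standard JDK API.

```java
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the Elastic APM agent.
class JvmFlagCheck {

    /** Returns the arguments that start with the given prefix, in their original order. */
    static List<String> matching(List<String> jvmArgs, String prefix) {
        List<String> hits = new ArrayList<String>();
        for (String arg : jvmArgs) {
            if (arg.startsWith(prefix)) {
                hits.add(arg);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Arguments the JVM was actually started with (excludes main-class arguments).
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        String prefix = args.length > 0 ? args[0] : "-XX:";
        for (String arg : matching(jvmArgs, prefix)) {
            System.out.println(arg);
        }
    }
}
```

If an expected flag is missing from the output, the process was most likely started before the change or from a different startup script.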
There is nothing specific I am looking to get from gdb. I hoped this configuration could provide more info about the time of the crash. Looking forward to hearing whether there's any additional info in the logs after the next crash. |
Have you tried out the |
Not yet. I don't want to test many things at the same time. |
Here's the logs with the I will try |
Tested with Still seems to crash: |
@TeroPihlaja thanks, that may be very valuable input. It seems that the compilation of Assumption: apm-agent-java/apm-agent-core/src/main/java/co/elastic/apm/agent/metrics/builtin/SystemMetrics.java Line 119 in d83d37b
Besides the compiled method name ( invoke_J_D ), two additional factors make me think this JVM bug was exposed by our metrics mechanism:
This assumption may be validated by setting Thanks! |
metrics_interval=1s caused crash in 1668 seconds. Now running with invoke_J_D exclusion. |
Command line is there also in log file. Basically:
And it seems to be working. 20599 stdout shows:
Feels like all LambdaForm methods crash... |
Thanks!! If you just want to test the agent without metrics, you can disable metrics collection by setting However, if you still have the patience to bring this to full resolution, a few more steps will be very helpful:
Again, many thanks! |
Attached pid20599 test.log It seems strange,
|
OK, let's see what happens when only JIT compilation of |
New crash now with
|
And I apparently mistyped L_J as L_D... Anyhow, now with the actual L_J it crashes in invoke_D_D. |
Ohh, I am so sorry about that 😫 |
Now it has been running 14 hours without crashes (with metrics_interval=1s) so looks pretty good. stdout shows
|
Really glad to hear! If it's not a problem, please let it run another day to make sure it is resolved, and then you will probably want to restore the metrics collection interval. Looking forward to hearing back. |
Note to self: notify the users who reported the same problem at https://discuss.elastic.co/t/apm-agent-for-java-is-causing-system-crashes/165211 once the workaround is verified |
I think we should use reflection instead of method handles if the agent is running on Java 7. |
This is a workaround for a bug in the C2 compiler in Java 7 which leads to a segfault closes elastic#458
I have created a PR which uses reflection as opposed to Thanks so much for your help! Btw: I was unable to reproduce the JVM crash on JDK 1.7.0u80 and Mac OS. |
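As an illustration of the approach suggested above (reflection on Java 7, method handles elsewhere), here is a minimal sketch. It is not the code from the PR: the class `OsMetricReader` and its method names are hypothetical, and the standard `getAvailableProcessors` getter stands in for whatever metric the agent actually reads.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;
import java.lang.reflect.Method;

// Hypothetical sketch, not the agent's actual implementation.
class OsMetricReader {

    /** "java.specification.version" is "1.7" on Java 7, "1.8" on Java 8, "9" and up later. */
    static boolean preferReflection(String specVersion) {
        return "1.7".equals(specVersion);
    }

    /** Reads a numeric getter through plain reflection; NaN if it cannot be read. */
    static double viaReflection(Object bean, Class<?> iface, String getter) {
        try {
            Method m = iface.getMethod(getter);
            return ((Number) m.invoke(bean)).doubleValue();
        } catch (Exception e) {
            return Double.NaN;
        }
    }

    /** Reads the same getter through a MethodHandle; NaN if it cannot be read. */
    static double viaMethodHandle(Object bean, Class<?> iface, String getter,
                                  Class<?> returnType) {
        try {
            MethodHandle mh = MethodHandles.lookup()
                    .findVirtual(iface, getter, MethodType.methodType(returnType));
            Object value = mh.invoke(bean); // primitive result is boxed by invoke()
            return ((Number) value).doubleValue();
        } catch (Throwable t) { // MethodHandle.invoke is declared to throw Throwable
            return Double.NaN;
        }
    }

    /** Picks the lookup strategy based on the running Java version. */
    static double read(Object bean, Class<?> iface, String getter, Class<?> returnType) {
        String spec = System.getProperty("java.specification.version");
        return preferReflection(spec)
                ? viaReflection(bean, iface, getter)
                : viaMethodHandle(bean, iface, getter, returnType);
    }

    public static void main(String[] args) {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        System.out.println(
                read(os, OperatingSystemMXBean.class, "getAvailableProcessors", int.class));
    }
}
```

Real code would resolve the `Method` or `MethodHandle` once and cache it instead of looking it up on every read. The sketch repeats the lookup only to stay short.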
About 18h uptime now with 1.3.1-SNAPSHOT without any compile exclusions so seems to work. 1.3.0 with exclusions still ended up crashing after 17h uptime. |
Looks good! Thanks so much for your help! |
This is a workaround for a bug in the C2 compiler in Java 7 which leads to a segfault closes #458
@Aketzu @TeroPihlaja 1.4.0 is released, feel free to switch to it at your convenience. |
Wow, this is a strange issue. Any idea what the root-cause is? We encountered a similar crash/hung JVM with a custom agent, and it was hard to diagnose (with jdk 8u25). We didn't use |
@jvimal which agent version are you using? |
The root cause is a JVM bug which gets triggered when the JIT optimizes the MethodHandles. Those are used under the hood for Lambdas. |
@lreuven - I am not using elastic agent. I wrote an agent, and encountered the exact same issue mentioned here. @felixbarny Yeah I figured lambdas were causing the issue. However, I was curious if you had more details (e.g., a link to the JVM bug for more details). The discussion on this issue thread was super helpful! This was the only detailed thread I could find -- all the other threads I checked just mentioned that upgrading the jvm should resolve such issues. |
Describe the bug
We updated the APM agent to 1.3.0 yesterday, and after about 18 hours two instances hit JVM crashes that look pretty similar.
Both are using Java 7 (7.0_75 and 7.0_141). One is running on Glassfish 4.0 and the other on Payara 4.1.1.171.1.
Debug logs
Also there is
co.elastic.apm.agent.report.serialize.DslJsonSerializer::sanitizeTagKey
in both compilation events.
Full logs:
hs_err_pid122146.log
hs_err_pid43918.log
This might be related to #444