Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Jetty 12.0.6: NPE in error handling leading to 100% cpu usage by Jetty threads #11431

Closed
konstantin-mikheev opened this issue Feb 21, 2024 · 5 comments · Fixed by #11435
Closed
Assignees
Labels
Bug For general bugs on Jetty side

Comments

@konstantin-mikheev
Copy link

konstantin-mikheev commented Feb 21, 2024

Jetty version(s)
12.0.6, 12.0.5

Jetty Environment
ee10

Java version/vendor (use: java -version)
OpenJDK Runtime Environment Corretto-21.0.0.35.1 (build 21+35-LTS)

OS type/version
Rocky Linux release 8.8 (Green Obsidian)

Description
We are using embedded Jetty in Spring Boot 3.2.2

After migrating to Jetty 12 we got this error leading to the infinite stream of messages, full gc and node being not able to process traffic. Happened in a few DCs from one day to one week after deploy. Seems similar in appearance to #11326

Initially we got it on 12.0.5 and after reading 12.0.6 release notes forced 12.0.6 but still got the same issue.

Stack trace from 12.0.6

/status 
java.lang.NullPointerException: Cannot invoke "org.eclipse.jetty.http.HttpField.getHeader()" because "f" is null
	at org.eclipse.jetty.http.HttpFields$Mutable.remove(HttpFields.java:1418) ~[jetty-http-12.0.6.jar:12.0.6]
	at org.eclipse.jetty.ee10.servlet.ServletContextResponse.resetContent(ServletContextResponse.java:385) ~[jetty-ee10-servlet-12.0.6.jar:12.0.6]
	at org.eclipse.jetty.ee10.servlet.ServletChannel.handle(ServletChannel.java:509) ~[jetty-ee10-servlet-12.0.6.jar:12.0.6]
	at org.eclipse.jetty.server.handler.ContextHandler$ScopedContext.run(ContextHandler.java:1204) ~[jetty-server-12.0.6.jar:12.0.6]
	at org.eclipse.jetty.server.handler.ContextHandler$ScopedContext.run(ContextHandler.java:1197) ~[jetty-server-12.0.6.jar:12.0.6]
	at org.eclipse.jetty.ee10.servlet.ServletChannelState.runInContext(ServletChannelState.java:1254) ~[jetty-ee10-servlet-12.0.6.jar:12.0.6]
	at org.eclipse.jetty.ee10.servlet.ServletChannelState.asyncError(ServletChannelState.java:818) ~[jetty-ee10-servlet-12.0.6.jar:12.0.6]
	at org.eclipse.jetty.server.handler.ContextHandler$ScopedContext.accept(ContextHandler.java:1185) ~[jetty-server-12.0.6.jar:12.0.6]
	at org.eclipse.jetty.server.handler.ContextRequest.lambda$addFailureListener$1(ContextRequest.java:53) ~[jetty-server-12.0.6.jar:12.0.6]
	at org.eclipse.jetty.server.internal.HttpChannelState.lambda$onFailure$2(HttpChannelState.java:445) ~[jetty-server-12.0.6.jar:12.0.6]
	at org.eclipse.jetty.util.thread.SerializedInvoker$Link.run(SerializedInvoker.java:191) ~[jetty-util-12.0.6.jar:12.0.6]
	at org.eclipse.jetty.http2.server.internal.HttpStreamOverHTTP2.lambda$onFailure$2(HttpStreamOverHTTP2.java:590) ~[jetty-http2-server-12.0.6.jar:12.0.6]
	at org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.runTask(AdaptiveExecutionStrategy.java:478) ~[jetty-util-12.0.6.jar:12.0.6]
	at org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.consumeTask(AdaptiveExecutionStrategy.java:441) ~[jetty-util-12.0.6.jar:12.0.6]
	at org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.tryProduce(AdaptiveExecutionStrategy.java:293) ~[jetty-util-12.0.6.jar:12.0.6]
	at org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.run(AdaptiveExecutionStrategy.java:201) ~[jetty-util-12.0.6.jar:12.0.6]
	at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:410) ~[jetty-util-12.0.6.jar:12.0.6]
	at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:971) ~[jetty-util-12.0.6.jar:12.0.6]
	at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.doRunJob(QueuedThreadPool.java:1201) ~[jetty-util-12.0.6.jar:12.0.6]
	at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1156) ~[jetty-util-12.0.6.jar:12.0.6]
	at java.base/java.lang.Thread.run(Thread.java:1583) ~[na:na]

Judging by logs it was usually happening in one thread (named like qtp1047556034-2613) and looks like some while(true) kind of log. We got a few gigabytes of those messages within couple of minutes.

What we also noticed was the appearance of additional exceptions after migration not always leading to any consequences but those precursor exceptions always has happened just before this NPE

Serialized invocation error
java.lang.IllegalStateException: s=IDLE rs=COMPLETED os=ABORTED is=IDLE awp=false se=false i=false al=2
	at org.eclipse.jetty.ee10.servlet.ServletChannelState.completed(ServletChannelState.java:1052) ~[jetty-ee10-servlet-12.0.6.jar:12.0.6]
	at org.eclipse.jetty.util.Callback$3.failed(Callback.java:168) ~[jetty-util-12.0.6.jar:12.0.6]
	at org.eclipse.jetty.ee10.servlet.HttpOutput.onWriteComplete(HttpOutput.java:252) ~[jetty-ee10-servlet-12.0.6.jar:12.0.6]
	at org.eclipse.jetty.ee10.servlet.HttpOutput$WriteCompleteCB.failed(HttpOutput.java:1760) ~[jetty-ee10-servlet-12.0.6.jar:12.0.6]
	at org.eclipse.jetty.ee10.servlet.HttpOutput$CompleteWriteCompleteCB.failed(HttpOutput.java:1777) ~[jetty-ee10-servlet-12.0.6.jar:12.0.6]
	at org.eclipse.jetty.server.handler.ContextHandler$ScopedContext.accept(ContextHandler.java:1185) ~[jetty-server-12.0.6.jar:12.0.6]
	at org.eclipse.jetty.server.handler.ContextResponse$1.failed(ContextResponse.java:47) ~[jetty-server-12.0.6.jar:12.0.6]
	at org.eclipse.jetty.util.Callback$6.failed(Callback.java:277) ~[jetty-util-12.0.6.jar:12.0.6]
	at org.eclipse.jetty.server.internal.HttpChannelState$ChannelResponse.lambda$lockedFailWrite$0(HttpChannelState.java:1111) ~[jetty-server-12.0.6.jar:12.0.6]
	at org.eclipse.jetty.util.thread.SerializedInvoker$Link.run(SerializedInvoker.java:191) ~[jetty-util-12.0.6.jar:12.0.6]
	at org.eclipse.jetty.util.thread.Invocable$1.run(Invocable.java:230) ~[jetty-util-12.0.6.jar:12.0.6]
	at org.eclipse.jetty.http2.server.internal.HttpStreamOverHTTP2.lambda$onFailure$2(HttpStreamOverHTTP2.java:590) ~[jetty-http2-server-12.0.6.jar:12.0.6]
	at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:971) ~[jetty-util-12.0.6.jar:12.0.6]
	at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.doRunJob(QueuedThreadPool.java:1201) ~[jetty-util-12.0.6.jar:12.0.6]
	at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1156) ~[jetty-util-12.0.6.jar:12.0.6]
	at java.base/java.lang.Thread.run(Thread.java:1583) ~[na:na]

We have not got those in Jetty 11 at all and we got much less on 12.0.6 compared to 12.0.5. On 12.0.5 it was occurring almost always just after launch but on 12.0.6 you could wait couple of hour for it to arrive.

How to reproduce?
Unfortunately we have not managed to reproduce it yet. It was not happening on test environment and on DCs with good network connectivity

Usually it coincides with packet loss towards the DC where it happens and it happens on all instances within DC simultaneously. At the time usually the only requests handled by jetty were healthchecks. Basically there are a lot (hundreds) of tiny requests disturbed by packet loss.

We needed to wait from one day to about a week after deploy for the issue to happen.

@konstantin-mikheev konstantin-mikheev added the Bug For general bugs on Jetty side label Feb 21, 2024
@gregw gregw self-assigned this Feb 21, 2024
@joakime joakime moved this to 🏗 In progress in Jetty 12.0.7 - FROZEN Feb 21, 2024
@gregw
Copy link
Contributor

gregw commented Feb 21, 2024

We believe we have a fix for #11326 that will be in the 12.0.7 release at the end of the month.

The ISE is also something that we have been addressing, but it has mostly been a harmless second call to callback.failed() after a callback has already been succeeded or failed.
The NPE is also a problem in the error handling, which also looks like something has already cleared the headers as we are trying to write out an error.

So it may be that 12.0.7 will have fixed these, but I'll have a bit more of a look at this area before the release and at the very least put in a bit better handling of errors whilst handling errors. For both of those exceptions it would be good to see the actual error that was being handled.

@joakime
Copy link
Contributor

joakime commented Feb 21, 2024

@gregw what about the NPE in org.eclipse.jetty.http.HttpField.getHeader() during a ServletContextResponse.resetContent call?

@gregw
Copy link
Contributor

gregw commented Feb 22, 2024

@joakime yeah, that NPE should not happen with normal handling of a Mutable HttpFields class. So it is plausible that this is again double handling of an error, so two threads are modifying the HttpFields at once. The null kind of suggests one thread is adding a field, whilst the other is trying to clear it.

So I do suspect the error handling that we have already cleaned up somewhat, but I can't yet say for sure this is fixed.

@konstantin-mikheev can you tell us more about the actual failure. I.e. details about what you mean with "infinite stream of messages". Plus next time it happens, can you take a full thread dump.

@gregw
Copy link
Contributor

gregw commented Feb 22, 2024

Ugh! The NPE is an exception thrown whilst handling an exception thrown whilst handling an exception! It is turtles all the way down!

@konstantin-mikheev
Copy link
Author

konstantin-mikheev commented Feb 22, 2024

"infinite stream of messages" meant we got 100k identical NPE messages in logs per minute for the next few minutes. Node had died and was restarted so no thread dump unfortunately.

During this timeframe besides those 100k per minute NPEs we got 6 IllegalStateExceptions from org.eclipse.jetty.ee10.servlet.ServletChannelState. Besides those there are only 34 other messages all of those not really interesting: couple of timeouts towards other locations and retries while trying to update some settings. Usually those logs mean packetloss.

I have looked into ploss statistics by location and found out at the last case there was no changes in ploss rate when incident started. Out of about 30 locations there was one with about 20% ploss which was already going for approximately 50 minutes. There were couple of ploss spikes to other locations within 30 minutes before NPEs happened but they were not happening at the time when NPEs started to appear.

For now I have ported project back to Jetty 11. I will probably left one instance with Jetty 12 to try to obtain thread dump but could not promise anything unfortunately.

What also could I add: I have looked into cpu usage distribution and there is an increase from 5 to 20% of the "kernel mode" cpu usage for couple of minutes right when those issues started. I will investigate more what could it be.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug For general bugs on Jetty side
Projects
No open projects
Status: ✅ Done
Development

Successfully merging a pull request may close this issue.

3 participants