Graceful shutdown does not wait for resumed requests #5105

Closed
thomasdraebing opened this issue Jul 31, 2020 · 23 comments

@thomasdraebing (Contributor) commented Jul 31, 2020

Jetty version
9.4.30.v20200611

Java version

> java -version
openjdk version "1.8.0_212"
OpenJDK Runtime Environment (AdoptOpenJDK)(build 1.8.0_212-b04)
OpenJDK 64-Bit Server VM (AdoptOpenJDK)(build 25.212-b04, mixed mode)

OS type/version
OSX 10.15.6

Description
When a resumed request is running and the server is shut down gracefully, the StatisticsHandler does not take this request into account when deciding whether to let the server shut down. The reason seems to be that _requestStats._current.value is incremented the first time StatisticsHandler.handle() is called and decremented afterwards; when the request is later resumed, _requestStats._current.value is not incremented again, so the request is not considered during shutdown, which since PR #2721 only checks for current requests. Reverting that change fixed the issue for me.
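
A small illustrative model of the accounting just described may help; this is a simplified paraphrase of the behaviour reported here, not the actual StatisticsHandler source, and all names are schematic stand-ins:

```java
import java.util.concurrent.atomic.AtomicLong;

// Illustrative model (not Jetty code) of the counter flow described above.
class RequestAccounting
{
    final AtomicLong currentRequests = new AtomicLong();   // cf. _requestStats._current
    final AtomicLong currentDispatches = new AtomicLong(); // cf. _dispatchedStats._current

    // First (initial) dispatch of a request that will later suspend/resume.
    void initialDispatch(boolean detectedAsAsync)
    {
        currentRequests.incrementAndGet();
        currentDispatches.incrementAndGet();
        // ... application handling runs, the request suspends ...
        currentDispatches.decrementAndGet();
        if (!detectedAsAsync)
            currentRequests.decrementAndGet(); // the request already looks finished
    }

    // Later async re-dispatch ("resume"): only the dispatch counter moves.
    void resumedDispatch()
    {
        currentDispatches.incrementAndGet();
        // ... long-running work, e.g. streaming a clone ...
    }

    boolean mayShutDownNow()
    {
        // Since PR #2721, only current requests are checked, so a resumed
        // request that was mis-detected during its initial dispatch is
        // invisible here.
        return currentRequests.get() == 0;
    }
}
```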

For context, the issue occurred for me running Gerrit (2.16 or later) and trying to shut down gracefully during a clone request (handled by JGit on the server side).

Just reverting the PR mentioned above would again cause issues with graceful shutdowns in other cases. What would be a good way to solve this? Should the StatisticsHandler only let the server shut down if there are no current requests AND no current dispatches? Should a resumed request increment the current request stat, while intermediately decrementing the total request stat? Or would adding a separate stat for currently resumed requests be the way to go?

Thanks and best regards,
Thomas

@lucamilanesio:

@sbordet could you provide some guidance to @thomasdraebing on how to resolve this problem? This issue is breaking the rolling-restart experience because incoming Git/HTTP requests get broken. Thanks for your feedback.

@joakime (Contributor) commented Jul 31, 2020

Wonder if a new GracefulShutdownHandler is in order.

One that simply uses the information from HttpChannel.Listener to know whether the request is complete.
For the purposes of graceful shutdown, we don't care about the nuances of what the request is doing (request / dispatched / async / wait / etc.); we only care whether it has completed.
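
A rough sketch of that idea (a hypothetical class, assuming only the onRequestBegin/onComplete callbacks of HttpChannel.Listener; not an actual Jetty handler):

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

import org.eclipse.jetty.server.HttpChannel;
import org.eclipse.jetty.server.Request;

// Hypothetical tracker: count only begin/complete events, ignore the
// request's internal state (dispatched/async/suspended/etc.).
public class InFlightRequestTracker implements HttpChannel.Listener
{
    private final AtomicLong inFlight = new AtomicLong();
    private volatile CountDownLatch idle;

    @Override
    public void onRequestBegin(Request request)
    {
        inFlight.incrementAndGet();
    }

    @Override
    public void onComplete(Request request)
    {
        CountDownLatch latch = idle;
        if (inFlight.decrementAndGet() == 0 && latch != null)
            latch.countDown();
    }

    // Called during graceful shutdown: wait until every request has completed.
    // Simplified; a real implementation would need to handle races and
    // reject new requests while draining.
    public boolean awaitIdle(long timeout, TimeUnit unit) throws InterruptedException
    {
        idle = new CountDownLatch(1);
        return inFlight.get() == 0 || idle.await(timeout, unit);
    }
}
```

In 9.4 such a listener would typically be added as a bean on the connector so it sees every channel's events.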

@thomasdraebing (Contributor Author):

A GracefulShutdownHandler sounds like a good idea to me. I also agree that a graceful shutdown should wait for requests in all states to complete.

@janbartel (Contributor):

@sbordet is the fix for this to take into account the _asyncDispatches, as well as the requestStats?

@thomasdraebing (Contributor Author):

I tried to implement @joakime's suggestion in #5156. So far, it does most of what I would expect, but it does not work in one case:

If a request is being served when the shutdown is triggered and is then cancelled by the client after the shutdown has been triggered, the server continues to wait for the request to finish until the timeout. Do you have any idea how I could detect that the request was cancelled by the client?

Comments and feedback on the draft are very welcome, especially since I am a total newbie to Jetty.

@gregw (Contributor) commented Aug 17, 2020

I don't quite understand the OP's analysis. When an async request is dispatched ("resumed"), the request stats are not incremented. However, when that async request completes, the request count is checked, and if it is zero the shutdown will complete.

There can never be a current dispatch unless there is a current request, so checking "no current requests AND no current dispatches" is equivalent to "no current requests"... unless we have a bug somewhere.

I'm against a separate GracefulShutdownHandler, as keeping global counts is expensive in terms of contended memory/locks etc. It would be bad to do that twice if you want both stats and graceful shutdown.

We need to know more about why this fails/succeeds for the OP before we can say more.

@gregw (Contributor) commented Aug 17, 2020

@lachlan-roberts as you've recently worked on graceful shutdown, can you tune into this issue and help out (or take over from) @sbordet?

@davido commented Aug 17, 2020

@thomasdraebing

Do we have a reproducer? Ideally, a unit test that would fail on gerrit@HEAD?

@gregw (Contributor) commented Aug 17, 2020

@thomasdraebing with regard to the issue of detecting client closes: this is not really possible with HTTP/1.

When a request is suspended and not doing any IO activity, there is no way to know if the connection has been closed. In order to detect a closure, the container would have to try to read or write something. If it tried to read something then it would detect closures... but it might also read an actual character... which would have to be buffered... then read again... and suddenly we have a potential infinite buffer and a DoS attack vector. So we can't read from the connection whilst the request is suspended. The container also cannot write to the connection... but the application could, regularly sending either a 102 Processing response (if the client can handle it) or some white space (if the content type allows). Not elegant!

To solve this problem at the container level, use HTTP/2, which will give you an AsyncListener.onError callback if the connection closes whilst a request is suspended.
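
For illustration, a minimal sketch of such an AsyncListener using the standard Servlet API (the class name is made up; only onError/onTimeout do anything here):

```java
import java.io.IOException;
import javax.servlet.AsyncEvent;
import javax.servlet.AsyncListener;

// Minimal sketch: complete a suspended request as soon as the container
// reports an error (e.g. the client closed the connection under HTTP/2),
// so it no longer holds up a graceful shutdown.
public final class AbortOnErrorListener implements AsyncListener
{
    @Override
    public void onError(AsyncEvent event) throws IOException
    {
        // Cancel any background work tied to this request here, then
        // finish the async cycle.
        event.getAsyncContext().complete();
    }

    @Override
    public void onTimeout(AsyncEvent event) throws IOException
    {
        event.getAsyncContext().complete();
    }

    @Override public void onComplete(AsyncEvent event) {}
    @Override public void onStartAsync(AsyncEvent event) {}
}
```

It would be registered right after the request is suspended, e.g. request.startAsync().addListener(new AbortOnErrorListener()).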

@gregw (Contributor) commented Aug 17, 2020

So if the problem is that your graceful shutdown is being held up by such closed connections, then perhaps we do need to provide a "mostly graceful" option that shuts down the container once there are no dispatched requests. This would have to be optional behaviour, and we'd need the same behaviour in both the StatisticsHandler and any GracefulShutdownHandler.

Perhaps your PR could be modified to move that optional logic into a GracefulShutdownHandler, and then the StatisticsHandler could extend that class so we get the same behaviour.

My criteria are:

  • we need the same behaviour in both handlers (if we are to have another handler)
  • we cannot have two counts if we want both graceful shutdown and stats
  • minimise code duplication (perhaps with extension), but let's not get too complex just to avoid a little duplication.
  • if we have two handlers, we need to make them mutually exclusive in any enabling modules.

@gregw linked a pull request on Aug 17, 2020 that will close this issue
@thomasdraebing (Contributor Author):

@davido No, unfortunately we don't have a Gerrit test for this. I did the testing manually, but I could push a change adding a test that reproduces the issue. I would have done this anyway once we updated Jetty or applied a different fix.

For now, let me describe how I am able to reproduce the issue in the context of Gerrit:

  • In Gerrit, configure httpd.gracefulStopTimeout to a value larger than 0 (e.g. 1m; this is used as the parameter for Server.setStopTimeout())
  • Start Gerrit and clone some project
  • While the clone is ongoing, shut down Gerrit

The expected behaviour would be that Gerrit waits for the clone request to finish, or for the timeout to be reached, before stopping Jetty; instead the request is cancelled immediately.
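
For anyone without a Gerrit setup, a rough standalone sketch of the same scenario could look like this (servlet, path, and timings are made up for illustration; it is not a Gerrit or Jetty test):

```java
import javax.servlet.AsyncContext;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

import org.eclipse.jetty.server.Server;
import org.eclipse.jetty.server.handler.StatisticsHandler;
import org.eclipse.jetty.servlet.ServletContextHandler;
import org.eclipse.jetty.servlet.ServletHolder;

// Hypothetical reproducer sketch: a long-running async request that should
// keep a graceful shutdown waiting.
public class GracefulShutdownRepro
{
    public static void main(String[] args) throws Exception
    {
        Server server = new Server(8080);
        server.setStopTimeout(60_000);          // analogous to httpd.gracefulStopTimeout = 1m
        StatisticsHandler stats = new StatisticsHandler();
        server.setHandler(stats);

        ServletContextHandler context = new ServletContextHandler();
        context.addServlet(new ServletHolder(new SlowAsyncServlet()), "/slow");
        stats.setHandler(context);

        server.start();
        // 1. issue a request against /slow (e.g. with curl)
        // 2. while it is still running, call server.stop()
        // Expected: stop() waits for the request (or the stop timeout);
        // observed before the fix: the request is cut off immediately.
    }

    static class SlowAsyncServlet extends HttpServlet
    {
        @Override
        protected void doGet(HttpServletRequest req, HttpServletResponse resp)
        {
            AsyncContext async = req.startAsync();
            async.start(() -> {
                try
                {
                    Thread.sleep(30_000);       // simulate a long clone/upload-pack
                    async.getResponse().getWriter().println("done");
                }
                catch (Exception ignored) {}
                finally
                {
                    async.complete();
                }
            });
        }
    }
}
```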

A git bisect showed that this issue started when Gerrit updated Jetty from 9.3.18.v20170406 to 9.4.12.v20180830.

Basically, what happened with the request was: it reached the StatisticsHandler and triggered the increment of _requestStats._current.value in line 150. This stat is then quickly decremented again in line 193, while the request is still running. The request is then handled again, and since it no longer has the initial state, only _dispatchedStats.increment() is called. The handler then waits for the request to finish. Thus, the current request stat is 0 and the current dispatched stat is 1 when the shutdown is triggered, so the StatisticsHandler thinks it is OK to shut down immediately.

Note that this issue only happens for Git requests, not for REST API requests. Thus, it could also be an issue with how JGit, the Git implementation Gerrit uses, handles HTTP requests, which is no longer compatible after #2721.

@gregw (Contributor) commented Aug 17, 2020

@thomasdraebing if it is true that "the current request stat is 0 and the current dispatched stat is 1", then that is definitely a Jetty bug. @lachlan-roberts can you check the code and see if you can find a way for this to happen? I can't see it, other than perhaps in transitory states.

@thomasdraebing (Contributor Author):

@gregw Thanks for all your comments and suggestions! I agree that the code duplication should be reduced, e.g. by having the StatisticsHandler extend the GracefulShutdownHandler. I also agree that both handlers should still provide a method to shut down gracefully, if only for backward compatibility.

If, however, both handlers use the same condition to decide when the shutdown has succeeded, I don't see a lot of benefit in having two distinct handlers, except that a little less data is collected when using the GracefulShutdownHandler. I'm not sure whether this would be worth it. If adding an option to the StatisticsHandler to also check currently dispatched requests would solve the issue, I think I would prefer to just add this option to the StatisticsHandler and not introduce the GracefulShutdownHandler at all.

@thomasdraebing (Contributor Author):

However, if you are right and this is a bug somewhere else in Jetty, a new handler or a change in the StatisticsHandler might not be required. Let's wait and see what @lachlan-roberts finds out.

@gregw (Contributor) commented Aug 17, 2020

@lachlan-roberts @sbordet et al,
I read the code as follows:

The only way I see to get "the current request stat is 0 and the current dispatched stat is 1" is for the async request to complete while it is dispatched? I don't think that can happen.

@davido commented Aug 17, 2020

Basically, what happened with the request was: it reached the StatisticsHandler and triggered the increment of _requestStats._current.value in line 150. This stat is then quickly decremented again in line 193, while the request is still running.

Can you paste here two stack traces, one that triggers the line 150 code path and another that triggers the line 193 code path?

@thomasdraebing (Contributor Author):

I think the issue is that in the requests I observed, isSuspended() returned false during the initial handle. Thus, _requestStats got decremented and the listener was not added to the state. In my PR, I also had to extend the condition to also check isDispatched() to make it work for my use case.

@thomasdraebing (Contributor Author) commented Aug 17, 2020

@davido (The line numbers differ a bit; the ones above are from master. In Gerrit, we use this version of the StatisticsHandler.)

During GET /<project>.git/info/refs?service=git-upload-pack:

Thread [HTTP-88] (Suspended (breakpoint at line 159 in StatisticsHandler))	
	StatisticsHandler.handle(String, Request, HttpServletRequest, HttpServletResponse) line: 159	
	Server(HandlerWrapper).handle(String, Request, HttpServletRequest, HttpServletResponse) line: 127	
	Server.handle(HttpChannel) line: 501	
	HttpChannelOverHttp(HttpChannel).lambda$handle$1() line: 383	
	293788116.dispatch() line: not available	
	HttpChannelOverHttp(HttpChannel).dispatch(DispatcherType, HttpChannel$Dispatchable) line: 556	
	HttpChannelOverHttp(HttpChannel).handle() line: 375	
	HttpConnection.onFillable() line: 272	
	AbstractConnection$ReadCallback.succeeded() line: 311	
	AbstractEndPoint$1(FillInterest).fillable() line: 103	
	ChannelEndPoint$1.run() line: 104	
	QueuedThreadPool.runJob(Runnable) line: 806	
	QueuedThreadPool$Runner.run() line: 938	
	Thread.run() line: 748	

and

Thread [HTTP-88] (Suspended (breakpoint at line 202 in StatisticsHandler))	
	StatisticsHandler.handle(String, Request, HttpServletRequest, HttpServletResponse) line: 202	
	Server(HandlerWrapper).handle(String, Request, HttpServletRequest, HttpServletResponse) line: 127	
	Server.handle(HttpChannel) line: 501	
	HttpChannelOverHttp(HttpChannel).lambda$handle$1() line: 383	
	293788116.dispatch() line: not available	
	HttpChannelOverHttp(HttpChannel).dispatch(DispatcherType, HttpChannel$Dispatchable) line: 556	
	HttpChannelOverHttp(HttpChannel).handle() line: 375	
	HttpConnection.onFillable() line: 272	
	AbstractConnection$ReadCallback.succeeded() line: 311	
	AbstractEndPoint$1(FillInterest).fillable() line: 103	
	ChannelEndPoint$1.run() line: 104	
	QueuedThreadPool.runJob(Runnable) line: 806	
	QueuedThreadPool$Runner.run() line: 938	
	Thread.run() line: 748	

During POST /<project>.git/git-upload-pack:

Thread [HTTP-88] (Suspended (breakpoint at line 159 in StatisticsHandler))	
	StatisticsHandler.handle(String, Request, HttpServletRequest, HttpServletResponse) line: 159	
	Server(HandlerWrapper).handle(String, Request, HttpServletRequest, HttpServletResponse) line: 127	
	Server.handle(HttpChannel) line: 501	
	HttpChannelOverHttp(HttpChannel).lambda$handle$1() line: 383	
	293788116.dispatch() line: not available	
	HttpChannelOverHttp(HttpChannel).dispatch(DispatcherType, HttpChannel$Dispatchable) line: 556	
	HttpChannelOverHttp(HttpChannel).handle() line: 375	
	HttpConnection.onFillable() line: 272	
	AbstractConnection$ReadCallback.succeeded() line: 311	
	AbstractEndPoint$1(FillInterest).fillable() line: 103	
	ChannelEndPoint$1.run() line: 104	
	QueuedThreadPool.runJob(Runnable) line: 806	
	QueuedThreadPool$Runner.run() line: 938	
	Thread.run() line: 748	

and

Thread [HTTP-88] (Suspended (breakpoint at line 202 in StatisticsHandler))	
	StatisticsHandler.handle(String, Request, HttpServletRequest, HttpServletResponse) line: 202	
	Server(HandlerWrapper).handle(String, Request, HttpServletRequest, HttpServletResponse) line: 127	
	Server.handle(HttpChannel) line: 501	
	HttpChannelOverHttp(HttpChannel).lambda$handle$1() line: 383	
	293788116.dispatch() line: not available	
	HttpChannelOverHttp(HttpChannel).dispatch(DispatcherType, HttpChannel$Dispatchable) line: 556	
	HttpChannelOverHttp(HttpChannel).handle() line: 375	
	HttpConnection.onFillable() line: 272	
	AbstractConnection$ReadCallback.succeeded() line: 311	
	AbstractEndPoint$1(FillInterest).fillable() line: 103	
	ChannelEndPoint$1.run() line: 104	
	QueuedThreadPool.runJob(Runnable) line: 806	
	QueuedThreadPool$Runner.run() line: 938	
	Thread.run() line: 748		

However, this does not catch the second call of handle() during POST /<project>.git/git-upload-pack, since the conditions guarding those lines are not met anymore. For GET /<project>.git/info/refs?service=git-upload-pack it only reaches handle() once.

@thomasdraebing (Contributor Author):

I found a minimal fix for my issue. It appears that the async request can already be started before the finally block in the StatisticsHandler is reached. In that case, the StatisticsHandler does not see it as an async request, handles it as a synchronous request, and directly marks it as finished. Checking not only whether the request is suspended but also whether an async request has been started fixes this.

@lachlan-roberts (Contributor):

@thomasdraebing I agree; I was looking into this earlier, and I think the state.isSuspended() check can really just be replaced with state.isAsyncStarted().

I am currently preparing a PR which will make this change, but also add an optional configuration to not wait for async requests when doing a graceful shutdown, and instead shut down when there are no actively dispatched requests.
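
In essence the check changes as follows (a simplified illustration of the condition; the actual change is in the PR referenced below):

```java
import org.eclipse.jetty.server.HttpChannelState;

// Simplified illustration of the condition change: decide whether a request
// should still be counted as "current" after handle() returns.
final class AsyncDetection
{
    // Old behaviour: only a request that is suspended at this exact moment
    // is treated as asynchronous.
    static boolean keepCountedOld(HttpChannelState state)
    {
        return state.isSuspended();
    }

    // New behaviour: any request whose async cycle has been started is treated
    // as asynchronous, even if it already resumed before the finally block in
    // StatisticsHandler.handle() ran.
    static boolean keepCountedNew(HttpChannelState state)
    {
        return state.isAsyncStarted();
    }
}
```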

lachlan-roberts added a commit that referenced this issue Aug 19, 2020
If the request is async dispatched, the check state.isSuspended() is not
correct to determine if the request was async or not. The check
state.isAsyncStarted() should be used instead.

Signed-off-by: Lachlan Roberts <[email protected]>
lachlan-roberts added a commit that referenced this issue Aug 19, 2020
…quests in StatisticsHandler

Signed-off-by: Lachlan Roberts <[email protected]>
lachlan-roberts added a commit that referenced this issue Aug 19, 2020
StatisticsHandler no longer gives 503 responses after shutdown.

Signed-off-by: Lachlan Roberts <[email protected]>
@lachlan-roberts (Contributor):

This has been fixed by PR #5175 and will be available in a 9.4.32 release.

@thomasdraebing (Contributor Author):

Thanks for the discussion and the quick fix!

@lucamilanesio:

That's great news, thanks everyone for helping to fix this issue.

lucamilanesio pushed a commit to GerritCodeReview/gerrit that referenced this issue Oct 7, 2020
This release fixes the bug
"Graceful shutdown does not wait for resumed requests"
jetty/jetty.project#5105

Full release notes:
https://github.com/eclipse/jetty.project/releases/tag/jetty-9.4.32.v20200930

Change-Id: Ie3d28263c6e803ced5b8ec059a8b7e9cca5d3b32