-
Notifications
You must be signed in to change notification settings - Fork 10.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Java] Catch exceptions from users and change close state correctly #29249
Conversation
Hello human! Please make sure you've included the Shiproom Template in a comment or (preferably) the PR description. Also, make sure this PR is not marked as a draft and is ready-to-merge. |
@@ -125,6 +125,9 @@ public void onClosing(WebSocket webSocket, int code, String reason) { | |||
stateLock.unlock(); | |||
} | |||
checkStartFailure(null); | |||
|
|||
// Send the close frame response if this was a server initiated close, otherwise noops | |||
webSocket.close(1000, ""); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could remove this if we want to limit the change. When doing the original change I noticed that on server initiated websocket closures my test app would stay around for awhile because the websocket wasn't cleaned up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we test this somehow?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Manually, yes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How can I test this manually?
@@ -517,33 +517,37 @@ private void stopConnection(String errorMessage) { | |||
RuntimeException exception = null; | |||
this.state.lock(); | |||
try { | |||
ConnectionState connectionState = this.state.getConnectionStateUnsynchronized(false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the connection state is null, this will throw. We have run into this, I'm wondering if we can be more defensive and allow null here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the feedback. I'm trying to understand how this is happening. stopConnection is only ever called when the transport fires its closed handlers. You can see where it's wired up here:
Lines 276 to 279 in 88bf143
connectionState.transport = transport; | |
transport.setOnReceive(this.callback); | |
transport.setOnClose((message) -> stopConnection(message)); |
At the point stopConnection is wired up as the transport's close handler, the ConnectionState is already in scope. That means we can fix any issues with a null ConnectionState here by simply capturing the ConnectionState in the setOnClose lambda and passing the ConnectionState as an argument to stopConnection. That way there's absolutely no risk of referencing the wrong ConnectionState or it being null. You could then verify the passed-in ConnectionState matches what's returned from getConnectionStateUnsynchronized() before calling changeState(HubConnectionState.DISCONNECTED) or mutating anything else outside of the ConnectionState object.
That said, @BrennanConroy @aaronoe Do either of you have any ideas about how ConnectionState could be null here today? As far as I can tell, this is the only method that currently sets ConnectionState to null, and it cannot be called before it's hooked to the transport's close event which happens after the ConnectionState is set.
There are two possibilities I can think of at the moment:
- The transport triggering stopConnection multiple times for a single connection.
- The connection starting and stopping twice. If the first call to stopConnection is delayed long enough, the ConnectionState might have been overwritten by the second connection before the first call to stopConnection fires. Theoretically, you shouldn't be able to restart the connection before stopConnection puts the connection back to the DISCONNECTED state. Unfortunately, this isn't the only place that happens:
Lines 330 to 336 in 88bf143
}, error -> { | |
this.state.lock(); | |
try { | |
ConnectionState activeState = this.state.getConnectionStateUnsynchronized(true); | |
if (activeState == connectionState) { | |
this.state.changeState(HubConnectionState.CONNECTING, HubConnectionState.DISCONNECTED); | |
} |
The above is very dangerous if there's a chance of both that and stopConnection getting triggered for the same connection.
I'm thinking instead of just checking the HubConnectionState in start() we should also verify ConnectionState is null before continuing.
Lines 224 to 234 in 88bf143
public Completable start() { | |
CompletableSubject localStart = CompletableSubject.create(); | |
this.state.lock.lock(); | |
try { | |
if (this.state.getHubConnectionState() != HubConnectionState.DISCONNECTED) { | |
logger.debug("The connection is in the '{}' state. Waiting for in-progress start to complete or completing this start immediately.", this.state.getHubConnectionState()); | |
return this.state.getConnectionStateUnsynchronized(false).startTask; | |
} | |
this.state.changeState(HubConnectionState.DISCONNECTED, HubConnectionState.CONNECTING); |
Looking at all this also has me wondering what happens if transport.start() fails so the transport never fires stopConnection. Is the ConnectionState still cleared? It appears not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do either of you have any ideas about how ConnectionState could be null here today?
Yes, if there is an exception during the websockets onClosing
method it will call onFailure
. This is why I have wrapped the user onClosed
callbacks, and modified the changing state in stopConnection
so that both vectors will no longer create exceptions.
Looking at all this also has me wondering what happens if transport.start() fails so the transport never fires stopConnection. Is the ConnectionState still cleared? It appears not.
Correct, but the state will be disconnected and the next start will overwrite the ConnectionState.
The above is very dangerous if there's a chance of both that and stopConnection getting triggered for the same connection.
That code only runs if transport.start
results in an error, in which case stopConnection
should never get called.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For reference, these are the stacktraces for the two separate (but similar) issues:
java.lang.RuntimeException: The HubConnection failed to transition from the 'CONNECTED' state to the 'DISCONNECTED' state because it was actually in the 'CONNECTING' state.
at com.microsoft.signalr.HubConnection$ReconnectingConnectionState.changeState(HubConnection.java:1512)
at com.microsoft.signalr.HubConnection.stopConnection(HubConnection.java:538)
at com.microsoft.signalr.HubConnection.lambda$start$6(HubConnection.java:279)
at com.microsoft.signalr.-$$Lambda$HubConnection$plWe2n559m5117SGBt0gopoPs3g.invoke(-.java:2)
at com.microsoft.signalr.WebSocketTransport.onClose(WebSocketTransport.java:92)
at com.microsoft.signalr.WebSocketTransport.lambda$start$1(WebSocketTransport.java:55)
at com.microsoft.signalr.-$$Lambda$WebSocketTransport$vlt_DXG3u2t5OJvAYlxBdjX-Br4.invoke(-.java:2)
at com.microsoft.signalr.OkHttpWebSocketWrapper$SignalRWebSocketListener.onFailure(OkHttpWebSocketWrapper.java:124)
at okhttp3.internal.ws.RealWebSocket.failWebSocket(RealWebSocket.java:592)
at okhttp3.internal.ws.RealWebSocket$connect$1.onResponse(RealWebSocket.java:174)
at okhttp3.internal.connection.RealCall$AsyncCall.run(RealCall.java:519)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1167)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:641)
at java.lang.Thread.run(Thread.java:919)
and
java.lang.RuntimeException: Connection is not active.
at com.microsoft.signalr.HubConnection$ReconnectingConnectionState.getConnectionStateUnsynchronized(HubConnection.java:1484)
at com.microsoft.signalr.HubConnection.stopConnection(HubConnection.java:522)
at com.microsoft.signalr.HubConnection.lambda$start$6(HubConnection.java:279)
at com.microsoft.signalr.-$$Lambda$HubConnection$plWe2n559m5117SGBt0gopoPs3g.invoke(-.java:2)
at com.microsoft.signalr.WebSocketTransport.onClose(WebSocketTransport.java:90)
at com.microsoft.signalr.WebSocketTransport.lambda$start$1(WebSocketTransport.java:55)
at com.microsoft.signalr.-$$Lambda$WebSocketTransport$vlt_DXG3u2t5OJvAYlxBdjX-Br4.invoke(-.java:2)
at com.microsoft.signalr.OkHttpWebSocketWrapper$SignalRWebSocketListener.onFailure(OkHttpWebSocketWrapper.java:148)
at okhttp3.internal.ws.RealWebSocket.failWebSocket(RealWebSocket.java:592)
at okhttp3.internal.ws.RealWebSocket$WriterTask.runOnce(RealWebSocket.java:622)
at okhttp3.internal.concurrent.TaskRunner.runTask(TaskRunner.java:116)
at okhttp3.internal.concurrent.TaskRunner.access$getLogger$cp(TaskRunner.java:42)
at okhttp3.internal.concurrent.TaskRunner.access$runTask(TaskRunner.java:42)
at okhttp3.internal.concurrent.TaskRunner$runnable$1.run(TaskRunner.java:65)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1167)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:641)
at java.lang.Thread.run(Thread.java:919)
I am also wondering if this is the right websocket callback being used. Looking at the docs, there is both onClosed and onClosing. onClosed specifically calls out that only one call is made to that method, but onClosing does not call that out specifically. Could you confirm that the correct callback is being used here:
Line 104 in 5bac7b0
public void onClosing(WebSocket webSocket, int code, String reason) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am also wondering if this is the right websocket callback being used.
Interesting point about onClosing vs onClosed. onClosed does sound more correct. Either way, we should add a check in stopConnection that logs a critical error if it's called more than once.
Correct, but the state will be disconnected and the next start will overwrite the ConnectionState.
We should either always clear it when the connection becomes disconnected or always leave it. I think clearing it makes more sense.
That code only runs if transport.start results in an error, in which case stopConnection should never get called.
Wouldn't the handshake failing or timing out also cause that code to run? I think I remember there being an issue with stopConnection and the start error handler both being called before. Was this fixed? Or am I mistaken?
this.state.changeState(HubConnectionState.CONNECTED, HubConnectionState.DISCONNECTED); | ||
// We can be in the CONNECTING or CONNECTED state here, depending on if the handshake response was received or not. | ||
// connectionState.close() above will exit the Start call with an error if it's still running | ||
this.state.changeState(HubConnectionState.DISCONNECTED); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given the comment, shouldn't we be changing the state before calling connectionState.close() so that if someone has a retry loop on HubConnection.start() without a delay, it doesn't fail due still being in the CONNECTING state?
Unless the following is want prevents the retry loop from failing:
Lines 330 to 336 in 9b61039
}, error -> { | |
this.state.lock(); | |
try { | |
ConnectionState activeState = this.state.getConnectionStateUnsynchronized(true); | |
if (activeState == connectionState) { | |
this.state.changeState(HubConnectionState.CONNECTING, HubConnectionState.DISCONNECTED); | |
} |
In that case, the problem is even worse. Instead of the retry loop spinning trying to start the connection while its still in the CONNECTING state, it would start connecting again only for this line to incorrectly put the new connection in the DISCONNECTED state because of the old connection closed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if someone has a retry loop on HubConnection.start() without a delay, it doesn't fail due still being in the CONNECTING state?
There is a lock, so they wont get anywhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And it's not like a TCS where the continuation can run inline?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which continuation are we talking about? They can run inline.
src/SignalR/clients/java/signalr/core/src/main/java/com/microsoft/signalr/HubConnection.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think my other comments can be addressed in master. We can consider backporting later if necessary.
81fcd86
to
6e08572
Compare
Description
If an exception happens during the websockets "onClosing" method then it will call the "onFailure" method which was not expected. So some code that was meant to only run once would run twice and cause issues.
Customer Impact
Reported by internal partner.
Client can crash if they hit this scenario.
A workaround is to only use LongPolling for the client, WebSockets is the best transport so this isn't a great workaround.
Regression?
Regressed from 3.1 behavior
Risk
Mostly defensive code, removing risk of exceptions flowing through the code base.
Verification
Packaging changes reviewed?