Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Java] Catch exceptions from users and change close state correctly #29249

Merged
merged 3 commits into from
Jan 15, 2021

Conversation

BrennanConroy
Copy link
Member

@BrennanConroy BrennanConroy commented Jan 13, 2021

Description

If an exception happens during the websockets "onClosing" method then it will call the "onFailure" method which was not expected. So some code that was meant to only run once would run twice and cause issues.

Customer Impact

Reported by internal partner.
Client can crash if they hit this scenario.
A workaround is to only use LongPolling for the client, WebSockets is the best transport so this isn't a great workaround.

Regression?

  • Yes
  • No

Regressed from 3.1 behavior

Risk

  • High
  • Medium
  • Low

Mostly defensive code, removing risk of exceptions flowing through the code base.

Verification

  • Manual (required)
  • Automated

Packaging changes reviewed?

  • Yes
  • No
  • N/A

@BrennanConroy BrennanConroy added Servicing-consider Shiproom approval is required for the issue area-signalr Includes: SignalR clients and servers labels Jan 13, 2021
@BrennanConroy BrennanConroy added this to the 5.0.x milestone Jan 13, 2021
@ghost
Copy link

ghost commented Jan 13, 2021

Hello human! Please make sure you've included the Shiproom Template in a comment or (preferably) the PR description. Also, make sure this PR is not marked as a draft and is ready-to-merge.

@@ -125,6 +125,9 @@ public void onClosing(WebSocket webSocket, int code, String reason) {
stateLock.unlock();
}
checkStartFailure(null);

// Send the close frame response if this was a server initiated close, otherwise noops
webSocket.close(1000, "");
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could remove this if we want to limit the change. When doing the original change I noticed that on server initiated websocket closures my test app would stay around for awhile because the websocket wasn't cleaned up.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we test this somehow?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Manually, yes

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How can I test this manually?

@@ -517,33 +517,37 @@ private void stopConnection(String errorMessage) {
RuntimeException exception = null;
this.state.lock();
try {
ConnectionState connectionState = this.state.getConnectionStateUnsynchronized(false);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the connection state is null, this will throw. We have run into this, I'm wondering if we can be more defensive and allow null here?

Copy link
Member

@halter73 halter73 Jan 13, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the feedback. I'm trying to understand how this is happening. stopConnection is only ever called when the transport fires its closed handlers. You can see where it's wired up here:

connectionState.transport = transport;
transport.setOnReceive(this.callback);
transport.setOnClose((message) -> stopConnection(message));

At the point stopConnection is wired up as the transport's close handler, the ConnectionState is already in scope. That means we can fix any issues with a null ConnectionState here by simply capturing the ConnectionState in the setOnClose lambda and passing the ConnectionState as an argument to stopConnection. That way there's absolutely no risk of referencing the wrong ConnectionState or it being null. You could then verify the passed-in ConnectionState matches what's returned from getConnectionStateUnsynchronized() before calling changeState(HubConnectionState.DISCONNECTED) or mutating anything else outside of the ConnectionState object.

That said, @BrennanConroy @aaronoe Do either of you have any ideas about how ConnectionState could be null here today? As far as I can tell, this is the only method that currently sets ConnectionState to null, and it cannot be called before it's hooked to the transport's close event which happens after the ConnectionState is set.

There are two possibilities I can think of at the moment:

  1. The transport triggering stopConnection multiple times for a single connection.
  2. The connection starting and stopping twice. If the first call to stopConnection is delayed long enough, the ConnectionState might have been overwritten by the second connection before the first call to stopConnection fires. Theoretically, you shouldn't be able to restart the connection before stopConnection puts the connection back to the DISCONNECTED state. Unfortunately, this isn't the only place that happens:

}, error -> {
this.state.lock();
try {
ConnectionState activeState = this.state.getConnectionStateUnsynchronized(true);
if (activeState == connectionState) {
this.state.changeState(HubConnectionState.CONNECTING, HubConnectionState.DISCONNECTED);
}

The above is very dangerous if there's a chance of both that and stopConnection getting triggered for the same connection.

I'm thinking instead of just checking the HubConnectionState in start() we should also verify ConnectionState is null before continuing.

public Completable start() {
CompletableSubject localStart = CompletableSubject.create();
this.state.lock.lock();
try {
if (this.state.getHubConnectionState() != HubConnectionState.DISCONNECTED) {
logger.debug("The connection is in the '{}' state. Waiting for in-progress start to complete or completing this start immediately.", this.state.getHubConnectionState());
return this.state.getConnectionStateUnsynchronized(false).startTask;
}
this.state.changeState(HubConnectionState.DISCONNECTED, HubConnectionState.CONNECTING);

Looking at all this also has me wondering what happens if transport.start() fails so the transport never fires stopConnection. Is the ConnectionState still cleared? It appears not.

Copy link
Member Author

@BrennanConroy BrennanConroy Jan 13, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do either of you have any ideas about how ConnectionState could be null here today?

Yes, if there is an exception during the websockets onClosing method it will call onFailure. This is why I have wrapped the user onClosed callbacks, and modified the changing state in stopConnection so that both vectors will no longer create exceptions.

Looking at all this also has me wondering what happens if transport.start() fails so the transport never fires stopConnection. Is the ConnectionState still cleared? It appears not.

Correct, but the state will be disconnected and the next start will overwrite the ConnectionState.

The above is very dangerous if there's a chance of both that and stopConnection getting triggered for the same connection.

That code only runs if transport.start results in an error, in which case stopConnection should never get called.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For reference, these are the stacktraces for the two separate (but similar) issues:

java.lang.RuntimeException: The HubConnection failed to transition from the 'CONNECTED' state to the 'DISCONNECTED' state because it was actually in the 'CONNECTING' state.
  at com.microsoft.signalr.HubConnection$ReconnectingConnectionState.changeState(HubConnection.java:1512)
  at com.microsoft.signalr.HubConnection.stopConnection(HubConnection.java:538)
  at com.microsoft.signalr.HubConnection.lambda$start$6(HubConnection.java:279)
  at com.microsoft.signalr.-$$Lambda$HubConnection$plWe2n559m5117SGBt0gopoPs3g.invoke(-.java:2)
  at com.microsoft.signalr.WebSocketTransport.onClose(WebSocketTransport.java:92)
  at com.microsoft.signalr.WebSocketTransport.lambda$start$1(WebSocketTransport.java:55)
  at com.microsoft.signalr.-$$Lambda$WebSocketTransport$vlt_DXG3u2t5OJvAYlxBdjX-Br4.invoke(-.java:2)
  at com.microsoft.signalr.OkHttpWebSocketWrapper$SignalRWebSocketListener.onFailure(OkHttpWebSocketWrapper.java:124)
  at okhttp3.internal.ws.RealWebSocket.failWebSocket(RealWebSocket.java:592)
  at okhttp3.internal.ws.RealWebSocket$connect$1.onResponse(RealWebSocket.java:174)
  at okhttp3.internal.connection.RealCall$AsyncCall.run(RealCall.java:519)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1167)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:641)
  at java.lang.Thread.run(Thread.java:919)

and

java.lang.RuntimeException: Connection is not active.
  at com.microsoft.signalr.HubConnection$ReconnectingConnectionState.getConnectionStateUnsynchronized(HubConnection.java:1484)
  at com.microsoft.signalr.HubConnection.stopConnection(HubConnection.java:522)
  at com.microsoft.signalr.HubConnection.lambda$start$6(HubConnection.java:279)
  at com.microsoft.signalr.-$$Lambda$HubConnection$plWe2n559m5117SGBt0gopoPs3g.invoke(-.java:2)
  at com.microsoft.signalr.WebSocketTransport.onClose(WebSocketTransport.java:90)
  at com.microsoft.signalr.WebSocketTransport.lambda$start$1(WebSocketTransport.java:55)
  at com.microsoft.signalr.-$$Lambda$WebSocketTransport$vlt_DXG3u2t5OJvAYlxBdjX-Br4.invoke(-.java:2)
  at com.microsoft.signalr.OkHttpWebSocketWrapper$SignalRWebSocketListener.onFailure(OkHttpWebSocketWrapper.java:148)
  at okhttp3.internal.ws.RealWebSocket.failWebSocket(RealWebSocket.java:592)
  at okhttp3.internal.ws.RealWebSocket$WriterTask.runOnce(RealWebSocket.java:622)
  at okhttp3.internal.concurrent.TaskRunner.runTask(TaskRunner.java:116)
  at okhttp3.internal.concurrent.TaskRunner.access$getLogger$cp(TaskRunner.java:42)
  at okhttp3.internal.concurrent.TaskRunner.access$runTask(TaskRunner.java:42)
  at okhttp3.internal.concurrent.TaskRunner$runnable$1.run(TaskRunner.java:65)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1167)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:641)
  at java.lang.Thread.run(Thread.java:919)

I am also wondering if this is the right websocket callback being used. Looking at the docs, there is both onClosed and onClosing. onClosed specifically calls out that only one call is made to that method, but onClosing does not call that out specifically. Could you confirm that the correct callback is being used here:

public void onClosing(WebSocket webSocket, int code, String reason) {

Copy link
Member

@halter73 halter73 Jan 14, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am also wondering if this is the right websocket callback being used.

Interesting point about onClosing vs onClosed. onClosed does sound more correct. Either way, we should add a check in stopConnection that logs a critical error if it's called more than once.

Correct, but the state will be disconnected and the next start will overwrite the ConnectionState.

We should either always clear it when the connection becomes disconnected or always leave it. I think clearing it makes more sense.

That code only runs if transport.start results in an error, in which case stopConnection should never get called.

Wouldn't the handshake failing or timing out also cause that code to run? I think I remember there being an issue with stopConnection and the start error handler both being called before. Was this fixed? Or am I mistaken?

@leecow leecow modified the milestones: 5.0.x, 5.0.3 Jan 14, 2021
@leecow leecow added Servicing-approved Shiproom has approved the issue and removed Servicing-consider Shiproom approval is required for the issue labels Jan 14, 2021
this.state.changeState(HubConnectionState.CONNECTED, HubConnectionState.DISCONNECTED);
// We can be in the CONNECTING or CONNECTED state here, depending on if the handshake response was received or not.
// connectionState.close() above will exit the Start call with an error if it's still running
this.state.changeState(HubConnectionState.DISCONNECTED);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given the comment, shouldn't we be changing the state before calling connectionState.close() so that if someone has a retry loop on HubConnection.start() without a delay, it doesn't fail due still being in the CONNECTING state?

Unless the following is want prevents the retry loop from failing:

}, error -> {
this.state.lock();
try {
ConnectionState activeState = this.state.getConnectionStateUnsynchronized(true);
if (activeState == connectionState) {
this.state.changeState(HubConnectionState.CONNECTING, HubConnectionState.DISCONNECTED);
}

In that case, the problem is even worse. Instead of the retry loop spinning trying to start the connection while its still in the CONNECTING state, it would start connecting again only for this line to incorrectly put the new connection in the DISCONNECTED state because of the old connection closed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if someone has a retry loop on HubConnection.start() without a delay, it doesn't fail due still being in the CONNECTING state?

There is a lock, so they wont get anywhere.

Copy link
Member

@halter73 halter73 Jan 15, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And it's not like a TCS where the continuation can run inline?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which continuation are we talking about? They can run inline.

Copy link
Member

@halter73 halter73 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think my other comments can be addressed in master. We can consider backporting later if necessary.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-signalr Includes: SignalR clients and servers Servicing-approved Shiproom has approved the issue
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants