Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Service Bus] Bug fix: batching receiver upon a disconnect #13374

Merged
11 commits merged into from
Jan 26, 2021

Conversation

HarshaNalluru
Copy link
Member

@HarshaNalluru HarshaNalluru commented Jan 25, 2021

Bug

  • receiveMessages method returned zero messages upon a connection refresh caused by a network disconnect(simulated in the test).
  • OperationTimeout error on message settlement after the disconnect

These two failures made the "disconnect" test occasionally fail for the batching receiver.

Cause

onDetached on the batchReceivers is called 300ms after the connection refresh causing the recovered receive link to be closed.

  • If the message returned from the service took close to 300ms until reached the receiver since the refresh, onDetached is called to close the link leading to the loss of messages.
  • If the 300ms had elapsed right before the settlement, we'd see the OperationTimeout error on settlement since the receive link is closed.

Investigated here #13339

Fix

  • Call onDetached for the batching receivers before calling the refresh connection
  • And retain calling onDetached for the streaming receivers after the refresh connection

Changes in the PR

@ghost ghost added the Service Bus label Jan 25, 2021
if (!state.wasConnectionCloseCalled && state.numReceivers) {
logger.verbose(
`[${connectionContext.connection.id}] connection.close() was not called from the sdk and there were ${state.numReceivers} ` +
`receivers. We should reconnect.`
);
await delay(Constants.connectionReconnectDelay);
Copy link
Member Author

@HarshaNalluru HarshaNalluru Jan 25, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the delay for all kinds of receivers.

await callOnDetachedOnReceivers(
connectionContext,
connectionError || contextError,
"streaming"
Copy link
Member Author

@HarshaNalluru HarshaNalluru Jan 25, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Calling onDetached on streaming receivers after the refreshConnection since onDetached would recover the streaming receiver and that would only be possible after the connection is refreshed.

Copy link
Member

@richardpark-msft richardpark-msft Jan 25, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Walking through this mentally it seems like it should be fine. I believe the sequence we'll end up with is something like this:

'connection refreshing promise started'

  1. onDetach called on all receivers
    1. batchingReceiver.onDetach
    2. batchingReceiver.closeLink
    3. batchingReceiver.openLock is now locked
  2. connection refresh initiates
  3. connection closed
  4. new connection created
  5. refresh connection promise resolved
    1. (continues) batchingReceiver.closeLink
    2. batchingReceiver.openLock is now acquired
    3. batchingReceiver.closeLink completes (against old connection, but that's fine
    4. batchingReceiver.openLink happens (against new connection, which is what you intend)

Does that match your analysis?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One correction..

The closeLink ends completely and then the refresh is initiated.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for sharing this reasoning! Any chance you can put it as a comment as well for our future selves?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added more comments. :)

@@ -93,7 +93,7 @@ export class BatchingReceiver extends MessageReceiver {
);
}

await this._batchingReceiverLite.close(connectionError);
this._batchingReceiverLite.close(connectionError);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at this now it seems 'close' is a bit misnamed - it's more like "cancel" or something of that nature. (ie, it doesn't change the link state at all).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(not something you need to fix in this PR, just curious if you agree)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm... maybe terminate? (since it refers to ending the things than canceling or closing)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Logged #13390

Copy link
Member

@richardpark-msft richardpark-msft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me!

@HarshaNalluru HarshaNalluru marked this pull request as ready for review January 25, 2021 20:54

for (const receiverName of Object.keys(connectionContext.messageReceivers)) {
const receiver = connectionContext.messageReceivers[receiverName];
if (receiver && receiver.receiverType === receiverType) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this affect session receivers at all? Just want to make sure we're still calling onDetached on all receivers like we used to now that we're checking receiverType as well.

Copy link
Member Author

@HarshaNalluru HarshaNalluru Jan 26, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

onDetached was(and is) only called for the streaming and batching receivers.

(Session receivers can only be accessed through connectionContext.messageSessions. We never called onDetached for sessions, there is no onDetached method that is implemented for sessions.)

Previously, onDetached was called for "batching" and "streaming" receivers after the refresh connection.
Now, onDetached is called for "batching" before the refresh connection, unlike the streaming receivers.

I left a TODO to investigate further on the "sessions" part, I'll log an issue for it too.

Copy link
Member Author

@HarshaNalluru HarshaNalluru Jan 26, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is this related issue #8875 which intends to deal with the work related to disconnects for sessions, that issue should take care of the remaining work for sessions, not logging a new issue.

Added a comment with questions - #8875 (comment)

@HarshaNalluru
Copy link
Member Author

/azp run js - servicebus - tests

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@ghost
Copy link

ghost commented Jan 26, 2021

Hello @HarshaNalluru!

Because this pull request has the auto-merge label, I will be glad to assist with helping to merge this pull request once all check-in policies pass.

p.s. you can customize the way I help with merging this pull request, such as holding this pull request until a specific person approves. Simply @mention me (@msftbot) and give me an instruction to get started! Learn more here.

This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants