
[AMQP] Ensure the EndpointStates terminate upon ReceiverClient closure #29212

Closed
anuchandy opened this issue Jun 3, 2022 · 2 comments

anuchandy commented Jun 3, 2022

The ReceiveLinkProcessor relies on the termination of the current ReceiverClient's EndpointStates to decide whether to obtain a new ReceiverClient and continue streaming events downstream, or to terminate the downstream. We therefore want to ensure that EndpointStates termination happens in every scenario; otherwise, the downstream can hang.
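The following is a minimal Reactor-based sketch of that dependency (the type and method names are simplified placeholders, not the actual AmqpReceiveLinkProcessor code): the processor only acts when the EndpointStates publisher terminates, so if neither the complete nor the error signal ever arrives, no replacement receiver is requested and the downstream hangs.

```java
import reactor.core.publisher.Flux;
import reactor.core.publisher.Sinks;

// Simplified illustration (hypothetical names): the processor reacts only to a
// *terminal* signal on the receiver's EndpointStates. If that terminal signal
// never arrives, neither terminal callback below runs, no replacement receiver
// is requested, and the downstream consumer hangs indefinitely.
public final class EndpointStateDependencyDemo {
    enum EndpointState { UNINITIALIZED, ACTIVE, CLOSED }

    public static void main(String[] args) {
        Sinks.Many<EndpointState> endpointStates = Sinks.many().replay().latest();
        Flux<EndpointState> statesFromReceiver = endpointStates.asFlux();

        statesFromReceiver.subscribe(
            state -> System.out.println("State transition: " + state),
            error -> System.out.println("Terminal error -> get a new receiver or propagate: " + error),
            () -> System.out.println("Terminal completion -> get a new receiver or complete downstream"));

        endpointStates.tryEmitNext(EndpointState.ACTIVE);
        // The processor is unblocked only if one of these terminal signals is emitted:
        endpointStates.tryEmitComplete();
        // endpointStates.tryEmitError(new RuntimeException("link detached"));
    }
}
```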

Unfortunately, there are two scenarios in which the EndpointStates termination will not happen -

  1. The client fails to schedule the local-close.
  2. The remote-close ack for a client-initiated local-close is missing. This can occur if,
    • a). the remote-close never arrives.
    • b). the parent session/connection hosting the link was already closed or in a faulted state.
    • c). the ProtonJ event-loop thread never picked up the local-close work (because it was shutting down).

We want to slightly update the ReceiverClient's closure route so that the termination of EndpointStates is guaranteed. Ideally, after the update, the flow looks like this (a sketch of the idea follows the diagram) -

(Attached flow diagram: ReceiverClientClosureEPState, showing the ReceiverClient closure route ending with EndpointStates termination)
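Here is a minimal Reactor-based sketch of the idea (not the actual SDK change; scheduleLocalClose, awaitRemoteCloseAck, and the timeout value are hypothetical placeholders): whatever happens on the close path, EndpointStates receives a terminal signal, so the link processor can always make progress.

```java
import java.time.Duration;
import reactor.core.publisher.Mono;
import reactor.core.publisher.Sinks;

// Sketch only: guarantee a terminal signal on EndpointStates in every closure
// path. The timeout covers the missing remote-close ack cases (2.a, 2.b, 2.c);
// onErrorResume covers the failure to schedule the local-close (case 1).
final class GuaranteedTerminationSketch {
    private final Sinks.Many<String> endpointStates = Sinks.many().replay().latest();
    private static final Duration CLOSE_TIMEOUT = Duration.ofSeconds(30); // illustrative value

    Mono<Void> closeAsync() {
        return scheduleLocalClose()
            .then(awaitRemoteCloseAck().timeout(CLOSE_TIMEOUT))
            .onErrorResume(error -> Mono.empty())
            // In every path (ack received, timed out, or scheduling failed),
            // terminate EndpointStates so the link processor can act on it.
            .doFinally(signal -> endpointStates.emitComplete(Sinks.EmitFailureHandler.FAIL_FAST));
    }

    private Mono<Void> scheduleLocalClose() {
        return Mono.empty(); // placeholder: enqueue local-close on the ProtonJ event loop
    }

    private Mono<Void> awaitRemoteCloseAck() {
        return Mono.never(); // placeholder: completes when the remote-close is acknowledged
    }
}
```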

/cc @conniey @ki1729

anuchandy added the Event Hubs, Service Bus, pillar-reliability, and amqp labels on Jun 3, 2022
anuchandy added this to the [2022] July milestone on Jun 3, 2022
anuchandy self-assigned this on Jun 3, 2022
anuchandy commented:

Attached is a section extracted from the customer log, where case "2.c)" happened, leading to the stoppage of event receiving from 3 out of 4 partitions.

Demystifying the incident flow from logs:

  1. The application was consuming from 4 partitions (P0, P1, P2, P3).
  2. A single connection CON0 was hosting 4 ReactorReceivers (one per partition).
  3. The service sent a remote-close for P2 (likely an idle timeout; there was a 10-minute quiet period between the last received event and the remote-close).
  4. The P2 ReactorReceiver reacted to the remote-close and marked itself as disposed.
  5. The P2 AmqpReceiveLinkProcessor reacted to this disposal by requesting a new ReactorReceiver for P2.
  6. The application used "azure-messaging-eventhubs:5.11.1", which has the code path prone to bug_27716.
  7. The P2 ReactorReceiver hit this code path.
  8. As explained in bug_27716, the "no current delivery" exception from the P2 ReactorReceiver bypassed and faulted the shared parent connection CON0, leading to the shutdown preparation of CON0.
  9. CON0 notified P0, P1, and P3 of the shutdown. P2 was already disposed (step 4); hence, it was not notified about the shutdown.
  10. In response to the shutdown signal, the P0, P1, and P3 ReactorReceivers enqueued (scheduled) local-close work to the ProtonJ EventLoop.
  11. CON0 continued the shutdown process and requested the ProtonJ EventLoop to run one last iteration.
  12. The ProtonJ EventLoop ran, but it didn't pick up the local-close(s) scheduled in step 10; it indicated that it was shutting down even though there were tasks left to process (see the analogue sketch after this list).
  13. Not much can be done here; we can't wait forever for the shutdown step to complete (the library wants to finish CON0 and recover).
  14. CON0 completed the shutdown process.
  15. Meanwhile, the request in step 5 was honored, resulting in a new ReactorReceiver for P2 on a new connection CON1.
  16. Since the local-close work from the P0, P1, and P3 ReactorReceivers was never executed, the corresponding AmqpReceiveLinkProcessors never detected the termination of those 3 ReactorReceivers.
  17. The missing signal meant those AmqpReceiveLinkProcessors never requested new ReactorReceivers, causing the downstream application receivers for P0, P1, and P3 to hang.
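The following is an analogue of steps 10 through 12 using plain java.util.concurrent (not the actual ProtonJ EventLoop API): work enqueued to an executor that is shutting down may never run, and anything waiting on that work's side effects (here, the local-close that would terminate EndpointStates) waits forever.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Analogue only: shows queued "local-close" tasks being dropped when the
// executor shuts down before reaching them, mirroring how the ProtonJ
// EventLoop's final iteration skipped the scheduled local-close work.
public final class DroppedWorkAnalogue {
    public static void main(String[] args) {
        ExecutorService eventLoop = Executors.newSingleThreadExecutor();

        // An iteration that is still in progress when shutdown is requested.
        eventLoop.submit(() -> {
            try { Thread.sleep(1_000); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });
        eventLoop.submit(() -> System.out.println("local-close for P0"));
        eventLoop.submit(() -> System.out.println("local-close for P1"));
        eventLoop.submit(() -> System.out.println("local-close for P3"));

        // Shutdown begins before the queued local-close tasks run; they are
        // returned here and never executed, so their side effects never happen.
        List<Runnable> neverRun = eventLoop.shutdownNow();
        System.out.println("Dropped tasks: " + neverRun.size());
    }
}
```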

Notes:

Had the customer used the next version, i.e., azure-messaging-eventhubs:5.11.2, the application wouldn't have hit the bug_27716 code path and wouldn't have faced this incident.

But the underlying issue of AmqpReceiveLinkProcessor missing the terminal signal was still a bug; it's just that the combination of bug_27716 and the edge case of the ProtonJ EventLoop ignoring some work while shutting down is what revealed it.

An additional thought: as observed, P2 idle-timed out (step 3), but the other partitions were also inactive for 10 minutes; it's possible that the service sent remote-close for those as well, but those closures may not have bubbled up because the client had started the shutdown (there are no low-level logs to prove it, though).

anuchandy commented:

This is addressed in PR #29201.

github-actions bot locked and limited conversation to collaborators on Jun 21, 2023