
[AMQP] Ensure the EndpointStates terminate upon ReceiverClient closure #29212

Closed
anuchandy opened this issue Jun 3, 2022 · 2 comments

anuchandy commented Jun 3, 2022

The ReceiveLinkProcessor relies on the termination of the current ReceiverClient's EndpointStates to decide whether to obtain a new ReceiverClient and continue streaming events downstream, or to terminate the downstream. We therefore want to ensure that EndpointStates termination happens in every scenario; otherwise, the downstream can hang.
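The following is a minimal Reactor-based sketch of that dependency (the type and method names are simplified placeholders, not the actual AmqpReceiveLinkProcessor code): the processor only acts when the EndpointStates publisher terminates, so if neither the complete nor the error signal ever arrives, no replacement receiver is requested and the downstream hangs.

```java
import reactor.core.publisher.Flux;
import reactor.core.publisher.Sinks;

// Simplified illustration (hypothetical names): the processor reacts only to a
// *terminal* signal on the receiver's EndpointStates. If that terminal signal
// never arrives, neither terminal callback below runs, no replacement receiver
// is requested, and the downstream consumer hangs indefinitely.
public final class EndpointStateDependencyDemo {
    enum EndpointState { UNINITIALIZED, ACTIVE, CLOSED }

    public static void main(String[] args) {
        Sinks.Many<EndpointState> endpointStates = Sinks.many().replay().latest();
        Flux<EndpointState> statesFromReceiver = endpointStates.asFlux();

        statesFromReceiver.subscribe(
            state -> System.out.println("State transition: " + state),
            error -> System.out.println("Terminal error -> get a new receiver or propagate: " + error),
            () -> System.out.println("Terminal completion -> get a new receiver or complete downstream"));

        endpointStates.tryEmitNext(EndpointState.ACTIVE);
        // The processor is unblocked only if one of these terminal signals is emitted:
        endpointStates.tryEmitComplete();
        // endpointStates.tryEmitError(new RuntimeException("link detached"));
    }
}
```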

Unfortunately, there are two scenarios in which the EndpointStates termination will not happen -

  1. The client fails to schedule the local-close.
  2. The remote-close ack for a client-initiated local-close is missing. This can occur if,
    • a). the remote-close never arrives.
    • b). the parent session/connection hosting the link was already closed or in a faulted state.
    • c). the ProtonJ event-loop thread never picked up the local-close work (because it was shutting down).

We want to slightly update the ReceiverClient's closure route so that the termination of EndpointStates is guaranteed. Ideally, after the update, the flow looks like this (a sketch of the idea follows the diagram) -

(Attached flow diagram: ReceiverClientClosureEPState, showing the ReceiverClient closure route ending with EndpointStates termination)
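Here is a minimal Reactor-based sketch of the idea (not the actual SDK change; scheduleLocalClose, awaitRemoteCloseAck, and the timeout value are hypothetical placeholders): whatever happens on the close path, EndpointStates receives a terminal signal, so the link processor can always make progress.

```java
import java.time.Duration;
import reactor.core.publisher.Mono;
import reactor.core.publisher.Sinks;

// Sketch only: guarantee a terminal signal on EndpointStates in every closure
// path. The timeout covers the missing remote-close ack cases (2.a, 2.b, 2.c);
// onErrorResume covers the failure to schedule the local-close (case 1).
final class GuaranteedTerminationSketch {
    private final Sinks.Many<String> endpointStates = Sinks.many().replay().latest();
    private static final Duration CLOSE_TIMEOUT = Duration.ofSeconds(30); // illustrative value

    Mono<Void> closeAsync() {
        return scheduleLocalClose()
            .then(awaitRemoteCloseAck().timeout(CLOSE_TIMEOUT))
            .onErrorResume(error -> Mono.empty())
            // In every path (ack received, timed out, or scheduling failed),
            // terminate EndpointStates so the link processor can act on it.
            .doFinally(signal -> endpointStates.emitComplete(Sinks.EmitFailureHandler.FAIL_FAST));
    }

    private Mono<Void> scheduleLocalClose() {
        return Mono.empty(); // placeholder: enqueue local-close on the ProtonJ event loop
    }

    private Mono<Void> awaitRemoteCloseAck() {
        return Mono.never(); // placeholder: completes when the remote-close is acknowledged
    }
}
```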

/cc @conniey @ki1729

anuchandy added the Event Hubs, Service Bus, pillar-reliability, and amqp labels on Jun 3, 2022
anuchandy added this to the [2022] July milestone on Jun 3, 2022
anuchandy self-assigned this on Jun 3, 2022
anuchandy commented:

Attached is a section extracted from the customer log, where case "2.c)" happened, leading to the stoppage of event receiving from 3 out of 4 partitions.

Demystifying the incident flow from logs:

  1. The application was consuming from 4 partitions (P0, P1, P2, P3).
  2. A single connection CON0 was hosting 4 ReactorReceivers (one per partition).
  3. The service sent a remote-close for P2 (likely an idle timeout; there was a 10-minute quiet period between the last received event and the remote-close).
  4. The P2 ReactorReceiver reacted to the remote-close and marked itself as disposed.
  5. The P2 AmqpReceiveLinkProcessor reacted to this disposal by requesting a new ReactorReceiver for P2.
  6. The application used "azure-messaging-eventhubs:5.11.1", which has the code path prone to bug_27716.
  7. The P2 ReactorReceiver hit this code path.
  8. As explained in bug_27716, the "no current delivery" exception from the P2 ReactorReceiver bypassed and faulted the shared parent connection CON0, leading to the shutdown preparation of CON0.
  9. CON0 notified P0, P1, and P3 of the shutdown. P2 was already disposed (step 4); hence, it was not notified about the shutdown.
  10. In response to the shutdown signal, the P0, P1, and P3 ReactorReceivers enqueued (scheduled) local-close work to the ProtonJ EventLoop.
  11. CON0 continued the shutdown process and requested the ProtonJ EventLoop to run one last iteration.
  12. The ProtonJ EventLoop ran, but it didn't pick up the local-close(s) scheduled in step 10; it indicated that it was shutting down even though there were tasks left to process (see the analogue sketch after this list).
  13. Not much can be done here; we can't wait forever for the shutdown step to complete (the library wants to finish CON0 and recover).
  14. CON0 completed the shutdown process.
  15. Meanwhile, the request in step 5 was honored, resulting in a new ReactorReceiver for P2 on a new connection CON1.
  16. Since the local-close work from the P0, P1, and P3 ReactorReceivers was never executed, the corresponding AmqpReceiveLinkProcessors never detected the termination of those 3 ReactorReceivers.
  17. The missing signal meant those AmqpReceiveLinkProcessors never requested new ReactorReceivers, causing the downstream application receivers for P0, P1, and P3 to hang.
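The following is an analogue of steps 10 through 12 using plain java.util.concurrent (not the actual ProtonJ EventLoop API): work enqueued to an executor that is shutting down may never run, and anything waiting on that work's side effects (here, the local-close that would terminate EndpointStates) waits forever.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Analogue only: shows queued "local-close" tasks being dropped when the
// executor shuts down before reaching them, mirroring how the ProtonJ
// EventLoop's final iteration skipped the scheduled local-close work.
public final class DroppedWorkAnalogue {
    public static void main(String[] args) {
        ExecutorService eventLoop = Executors.newSingleThreadExecutor();

        // An iteration that is still in progress when shutdown is requested.
        eventLoop.submit(() -> {
            try { Thread.sleep(1_000); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });
        eventLoop.submit(() -> System.out.println("local-close for P0"));
        eventLoop.submit(() -> System.out.println("local-close for P1"));
        eventLoop.submit(() -> System.out.println("local-close for P3"));

        // Shutdown begins before the queued local-close tasks run; they are
        // returned here and never executed, so their side effects never happen.
        List<Runnable> neverRun = eventLoop.shutdownNow();
        System.out.println("Dropped tasks: " + neverRun.size());
    }
}
```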

Notes:

Had the customer used the next version, i.e., azure-messaging-eventhubs:5.11.2, the application wouldn't have hit the bug_27716 code path and wouldn't have faced this incident.

But the underlying issue of AmqpReceiveLinkProcessor missing the terminal signal was still a bug; it's just that the combination of bug_27716 and the edge case of the ProtonJ EventLoop ignoring some work while shutting down is what revealed it.

An additional thought: as observed, P2 idle-timed out (step 3), but the other partitions were also inactive for 10 minutes; it's possible that the service sent remote-close for those as well, but those closures may not have bubbled up because the client had started the shutdown (there are no low-level logs to prove it, though).

anuchandy commented:

This is addressed in PR #29201.

github-actions bot locked and limited conversation to collaborators on Jun 21, 2023