
[event-hubs] Load balancers can end up in an eternal fight for partitions #7017

Closed
richardpark-msft opened this issue Jan 17, 2020 · 3 comments
Labels: Client (This issue points to a problem in the data-plane of the library.) · Event Hubs


richardpark-msft commented Jan 17, 2020

Related to issue #6945, it's possible to end up in the situation described below (and there may be a simpler solution for that issue as well).

Assume two consumers, 'a' and 'b', and this partition layout (4 partitions total):

aaab

In this case, to reach balance, 'a' needs to drop to 2 partitions and 'b' needs to obtain an extra partition.

Today, ownership works by having each consumer continually re-assert ownership of the partitions it already owns. That makes the following sequence of events possible:

  1. 'b' comes up and determines it needs to steal a partition.
  2. 'a' decides it should keep 0, 1, and 2: it won't claim new partitions, but it keeps its existing ones.
  3. 'a' actually claims 0, 1, and 2 again, beating 'b' to the punch.
  4. 'b' now fails to claim '2' because the etag for '2' has changed.

The issue is that nothing changes the order in which 'a' and 'b' attempt to claim partitions, so this situation could persist forever.
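To make the failure mode concrete, here's a minimal sketch of an etag-guarded ownership store (hypothetical types and names, not the SDK's API): because 'a' re-asserts its partitions before 'b' gets its claim in, 'b''s conditional write fails, and nothing in the loop changes that ordering on the next pass.

```ts
interface Ownership {
  partitionId: string;
  ownerId: string;
  etag: string;
}

class InMemoryOwnershipStore {
  private ownerships = new Map<string, Ownership>();

  list(): Ownership[] {
    return [...this.ownerships.values()];
  }

  // Conditional write: succeeds only if the caller's etag matches the stored one.
  claim(partitionId: string, ownerId: string, expectedEtag?: string): Ownership | undefined {
    const current = this.ownerships.get(partitionId);
    if (current && current.etag !== expectedEtag) {
      return undefined; // another consumer updated this record first
    }
    const updated: Ownership = {
      partitionId,
      ownerId,
      etag: Math.random().toString(36).slice(2),
    };
    this.ownerships.set(partitionId, updated);
    return updated;
  }
}

// Starting layout: 'a' owns 0, 1, 2 and 'b' owns 3.
const store = new InMemoryOwnershipStore();
for (const p of ["0", "1", "2"]) store.claim(p, "a");
store.claim("3", "b");

// 'b' reads ownership and decides to steal partition "2"...
const snapshotSeenByB = store.list();

// ...but 'a' re-asserts its existing partitions first, rotating their etags.
for (const owned of store.list().filter((o) => o.ownerId === "a")) {
  store.claim(owned.partitionId, "a", owned.etag);
}

// 'b' now claims "2" with a stale etag and loses the race.
const staleEtag = snapshotSeenByB.find((o) => o.partitionId === "2")?.etag;
const result = store.claim("2", "b", staleEtag);
console.log(result ? "'b' stole partition 2" : "'b' lost the race; layout is still aaab");
```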

One possibility is to add a random amount of jitter so that, much like retried HTTP calls, the situation has a chance to resolve without active coordination between consumers. This would allow the consumers to be knocked out of lockstep.

This could also help in the "new owners being shut out" case, by giving 'b' a chance to get in and claim at least one partition, alerting other consumers to its presence.
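A minimal sketch of the jitter idea (the names here are placeholders, not SDK API): each consumer sleeps its base interval plus a random offset between load-balancing passes, so two consumers that start in lockstep eventually diverge.

```ts
// Hypothetical loop driver; `runLoadBalancingPass` stands in for the claim logic above.
const baseIntervalMs = 10_000;
const maxJitterMs = 2_000;

async function loadBalancingLoop(runLoadBalancingPass: () => Promise<void>): Promise<void> {
  for (;;) {
    await runLoadBalancingPass();
    const jitter = Math.random() * maxJitterMs;
    await new Promise<void>((resolve) => setTimeout(resolve, baseIntervalMs + jitter));
  }
}
```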


richardpark-msft commented Jan 17, 2020


chradek commented Jul 1, 2020

This scenario should be mitigated as part of #9706.

Prior to #9706, each iteration of the EventProcessor loop did the following:

  1. listOwnerships was called to get the current list of partition ownerships.
  2. The list of all partitionIds on the Event Hub was retrieved.
  3. The list of partitions to claim was calculated, including those already owned by the client.
  4. The client attempted to claim those partitions.

With #9706, each iteration of the EventProcessor loop does the following instead:

  1. The list of all partitionIds on the Event Hub is retrieved.
  2. listOwnerships is called to get the current list of partition ownerships.
  3. The list of partitions to claim is calculated, excluding those already owned by the client.
  4. The client attempts to claim the new partitions first, then reclaims the partitions it already owns.

This helps in two ways.

First, we've removed an I/O call between retrieving the list of ownerships and calculating the partitions to claim. The load-balancing strategy now works with the most up-to-date partition ownership information available, whereas previously there was a larger window in which the ownership data could change.

Second, clients now claim new partitions before reclaiming the partitions they already own, which gives higher priority to stealing a partition. Previously there was no preference for claiming new partitions over reclaiming existing ones, so ownership could change while the client was still reclaiming partitions it already owned.
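A sketch of the revised ordering, reusing the hypothetical `InMemoryOwnershipStore` and `Ownership` types from the earlier sketch (the strategy function is a placeholder, not the SDK's): partition ids and ownerships are fetched back to back, and new claims are issued before existing ownership is re-asserted.

```ts
async function runLoadBalancingPass(
  ownerId: string,
  getPartitionIds: () => Promise<string[]>,
  store: InMemoryOwnershipStore,
  // Placeholder for the load-balancing strategy: returns the partitions this client should own.
  decidePartitionsToClaim: (partitionIds: string[], ownerships: Ownership[]) => string[]
): Promise<void> {
  const partitionIds = await getPartitionIds(); // 1. all partition ids on the Event Hub
  const ownerships = store.list();              // 2. current ownerships, with no I/O call in between
  const mine = new Set(
    ownerships.filter((o) => o.ownerId === ownerId).map((o) => o.partitionId)
  );

  // 3. Work out which partitions to claim, separating new claims from reclaims.
  const desired = decidePartitionsToClaim(partitionIds, ownerships);
  const newClaims = desired.filter((p) => !mine.has(p));
  const reclaims = desired.filter((p) => mine.has(p));

  // 4. Claim new (stolen or unowned) partitions first, then re-assert existing ownership.
  for (const partitionId of [...newClaims, ...reclaims]) {
    const etag = ownerships.find((o) => o.partitionId === partitionId)?.etag;
    store.claim(partitionId, ownerId, etag);
  }
}
```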


chradek commented Jul 7, 2020

Closing now that event-hubs 5.3.0-preview.1, which includes mitigations for this, has been released.

@chradek chradek closed this as completed Jul 7, 2020
@github-actions github-actions bot locked and limited conversation to collaborators Apr 12, 2023