
[event-hubs] Load balancers can end up in an eternal fight for partitions #7017

Closed
richardpark-msft opened this issue Jan 17, 2020 · 3 comments
Labels: Client (This issue points to a problem in the data-plane of the library.) · Event Hubs


richardpark-msft commented Jan 17, 2020

Related to issue #6945, it's possible to end up in the situation described below (and there may be a simpler solution for that issue as well).

Assume two consumers, 'a' and 'b', and this partition layout (4 partitions total):

aaab

In this case, to reach balance, 'a' needs to drop to 2 partitions and 'b' needs to obtain an extra partition.

Today, ownership works by having each consumer continually re-assert ownership of the partitions it already owns. That makes the following sequence of events possible:

  1. 'b' comes up and determines it needs to steal a partition.
  2. 'a' decides it should keep 0, 1, and 2: it won't claim new partitions, but it keeps its existing ones.
  3. 'a' actually claims 0, 1, and 2 again, beating 'b' to the punch.
  4. 'b' now fails to claim '2' because the etag for '2' has changed.

The issue is that nothing changes the order in which 'a' and 'b' attempt to claim partitions, so this situation could persist forever.
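To make the failure mode concrete, here's a minimal sketch of an etag-guarded ownership store (hypothetical types and names, not the SDK's API): because 'a' re-asserts its partitions before 'b' gets its claim in, 'b''s conditional write fails, and nothing in the loop changes that ordering on the next pass.

```ts
interface Ownership {
  partitionId: string;
  ownerId: string;
  etag: string;
}

class InMemoryOwnershipStore {
  private ownerships = new Map<string, Ownership>();

  list(): Ownership[] {
    return [...this.ownerships.values()];
  }

  // Conditional write: succeeds only if the caller's etag matches the stored one.
  claim(partitionId: string, ownerId: string, expectedEtag?: string): Ownership | undefined {
    const current = this.ownerships.get(partitionId);
    if (current && current.etag !== expectedEtag) {
      return undefined; // another consumer updated this record first
    }
    const updated: Ownership = {
      partitionId,
      ownerId,
      etag: Math.random().toString(36).slice(2),
    };
    this.ownerships.set(partitionId, updated);
    return updated;
  }
}

// Starting layout: 'a' owns 0, 1, 2 and 'b' owns 3.
const store = new InMemoryOwnershipStore();
for (const p of ["0", "1", "2"]) store.claim(p, "a");
store.claim("3", "b");

// 'b' reads ownership and decides to steal partition "2"...
const snapshotSeenByB = store.list();

// ...but 'a' re-asserts its existing partitions first, rotating their etags.
for (const owned of store.list().filter((o) => o.ownerId === "a")) {
  store.claim(owned.partitionId, "a", owned.etag);
}

// 'b' now claims "2" with a stale etag and loses the race.
const staleEtag = snapshotSeenByB.find((o) => o.partitionId === "2")?.etag;
const result = store.claim("2", "b", staleEtag);
console.log(result ? "'b' stole partition 2" : "'b' lost the race; layout is still aaab");
```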

One possibility is to add a random amount of jitter so that, much like retried HTTP calls, the situation has a chance to resolve without active coordination between consumers. This would allow the consumers to be knocked out of lockstep.

This could also help in the "new owners being shut out" case, by giving 'b' a chance to get in and claim at least one partition, alerting other consumers to its presence.
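A minimal sketch of the jitter idea (the names here are placeholders, not SDK API): each consumer sleeps its base interval plus a random offset between load-balancing passes, so two consumers that start in lockstep eventually diverge.

```ts
// Hypothetical loop driver; `runLoadBalancingPass` stands in for the claim logic above.
const baseIntervalMs = 10_000;
const maxJitterMs = 2_000;

async function loadBalancingLoop(runLoadBalancingPass: () => Promise<void>): Promise<void> {
  for (;;) {
    await runLoadBalancingPass();
    const jitter = Math.random() * maxJitterMs;
    await new Promise<void>((resolve) => setTimeout(resolve, baseIntervalMs + jitter));
  }
}
```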


richardpark-msft commented Jan 17, 2020


chradek commented Jul 1, 2020

This scenario should be mitigated as part of #9706.

Prior to #9706, each iteration of the EventProcessor loop did the following:

  1. listOwnerships was called to get the current list of partition ownerships.
  2. The list of all partitionIds on the Event Hub was retrieved.
  3. The list of partitions to claim was calculated, including those already owned by the client.
  4. The client attempted to claim those partitions.

With #9706, each iteration of the EventProcessor loop does the following instead:

  1. The list of all partitionIds on the Event Hub is retrieved.
  2. listOwnerships is called to get the current list of partition ownerships.
  3. The list of partitions to claim is calculated, excluding those already owned by the client.
  4. The client attempts to claim the new partitions first, then reclaims the partitions it already owns.

This helps in two ways.

First, we've removed an I/O call between retrieving the list of ownerships and calculating the partitions to claim. The load-balancing strategy now works with the most up-to-date partition ownership information available, whereas previously there was a larger window in which the ownership data could change.

Second, clients now claim new partitions before reclaiming the partitions they already own, which gives higher priority to stealing a partition. Previously there was no preference for claiming new partitions over reclaiming existing ones, so ownership could change while the client was still reclaiming partitions it already owned.
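A sketch of the revised ordering, reusing the hypothetical `InMemoryOwnershipStore` and `Ownership` types from the earlier sketch (the strategy function is a placeholder, not the SDK's): partition ids and ownerships are fetched back to back, and new claims are issued before existing ownership is re-asserted.

```ts
async function runLoadBalancingPass(
  ownerId: string,
  getPartitionIds: () => Promise<string[]>,
  store: InMemoryOwnershipStore,
  // Placeholder for the load-balancing strategy: returns the partitions this client should own.
  decidePartitionsToClaim: (partitionIds: string[], ownerships: Ownership[]) => string[]
): Promise<void> {
  const partitionIds = await getPartitionIds(); // 1. all partition ids on the Event Hub
  const ownerships = store.list();              // 2. current ownerships, with no I/O call in between
  const mine = new Set(
    ownerships.filter((o) => o.ownerId === ownerId).map((o) => o.partitionId)
  );

  // 3. Work out which partitions to claim, separating new claims from reclaims.
  const desired = decidePartitionsToClaim(partitionIds, ownerships);
  const newClaims = desired.filter((p) => !mine.has(p));
  const reclaims = desired.filter((p) => mine.has(p));

  // 4. Claim new (stolen or unowned) partitions first, then re-assert existing ownership.
  for (const partitionId of [...newClaims, ...reclaims]) {
    const etag = ownerships.find((o) => o.partitionId === partitionId)?.etag;
    store.claim(partitionId, ownerId, etag);
  }
}
```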


chradek commented Jul 7, 2020

Closing now that event-hubs 5.3.0-preview.1, which includes mitigations for this, has been released.

@chradek chradek closed this as completed Jul 7, 2020
@github-actions github-actions bot locked and limited conversation to collaborators Apr 12, 2023