[event-hubs] Load balancers can end up in an eternal fight for partitions #7017
Comments
Java is already doing this (thanks to @srnagar for pointing me to it): So we will follow suit.
This scenario should be mitigated as part of #9706. Prior to #9706, during the EventProcessor loop the following would happen:
With #9706, the EventProcessor loop does the following instead:
This helps in two ways. First, we've removed an IO call between retrieving the list of ownerships and calculating the partitions to claim. The load balancing strategy therefore works from the most up-to-date partition ownership info it can get, whereas previously there was more opportunity for the ownership data to change in between. Second, clients now claim new partitions before reclaiming partitions they already own, which gives higher priority to stealing a partition. Previously there was no preference for claiming new partitions over reclaiming existing ones, so ownership could change while the client was still reclaiming partitions it already owned.
Closing now that event-hubs 5.3.0-preview.1 has been released with mitigations for this.
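The claim ordering described above can be sketched roughly as follows. This is only an illustration of "new partitions first, renewals second"; `OwnershipSketch` and `buildClaimOrder` are hypothetical names invented for this example and are not the event-hubs API or the actual change in #9706.

```typescript
// Hypothetical, simplified ownership record (assumption: only these two fields matter here).
interface OwnershipSketch {
  partitionId: string;
  ownerId: string;
}

// Returns partition ids in the order a client would try to claim them:
// newly selected partitions first, then the partitions it already owns.
function buildClaimOrder(
  ownerId: string,
  ownerships: OwnershipSketch[],
  newPartitionIds: string[]
): string[] {
  const alreadyOwned = ownerships
    .filter((o) => o.ownerId === ownerId)
    .map((o) => o.partitionId);
  // Skip anything the client already owns so it is not claimed twice.
  const fresh = newPartitionIds.filter((id) => alreadyOwned.indexOf(id) === -1);
  return fresh.concat(alreadyOwned);
}
```

With this ordering, the attempt to steal a partition is issued while the ownership snapshot is freshest, and the renewals of already-owned partitions follow.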
Related to issue #6945: it's possible to end up in the situation described here (and there may be a simpler solution for that issue as well).
Assume two consumers, 'a' and 'b', and this partition layout (4 partitions total):
aaab
In this case to reach balance we need 'a' to go to 2 partitions, and 'b' to obtain an extra partition.
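As a small worked example of the arithmetic, the balanced target for a layout string like `aaab` can be computed as below. The helper names `countByOwner` and `targetCounts` are made up for this sketch and do not come from the SDK.

```typescript
// Count how many partitions each consumer owns in a layout string such as "aaab".
function countByOwner(layout: string): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const owner of layout.split("")) {
    counts[owner] = (counts[owner] || 0) + 1;
  }
  return counts;
}

// For an even split, every consumer owns at least `min` partitions and
// `consumersAtMax` of them own one extra (`max`).
function targetCounts(
  partitionCount: number,
  consumerCount: number
): { min: number; max: number; consumersAtMax: number } {
  const min = Math.floor(partitionCount / consumerCount);
  const consumersAtMax = partitionCount % consumerCount;
  const max = consumersAtMax === 0 ? min : min + 1;
  return { min, max, consumersAtMax };
}

const exampleCounts = countByOwner("aaab"); // a owns 3, b owns 1
const exampleTarget = targetCounts(4, Object.keys(exampleCounts).length); // 2 partitions each
```

For 4 partitions and 2 consumers the target is 2 each, so 'a' is one over and 'b' is one under.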
Today, the way that ownership works is that each consumer will continue to assert ownership over partitions they continue to own. This means that it's possible for this sequence of events:
The issue here is that nothing will change the order in which 'a' and 'b' attempt to claim partitions, so this situation could persist forever.
One possibility is to add a random amount of jitter, much like with HTTP calls, giving the situation a chance to resolve without active coordination between consumers. This allows for the possibility of knocking the consumers out of lockstep.
This could potentially help in the "new owners being shut out" case as well, by giving 'b' a chance to get in and claim at least one partition, alerting other consumers to its presence.
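A minimal sketch of the jitter idea, assuming each consumer waits a base interval plus a random extra delay between load-balancing iterations. `nextLoopDelayMs` and its parameters are hypothetical and are not taken from the event-hubs code.

```typescript
// Hypothetical helper: delay before the next load-balancing iteration.
// Adds a random jitter in [0, maxJitterMs) to the base interval so that two
// consumers running the same loop drift out of lockstep over time.
// `random` is injectable so the function can be tested deterministically.
function nextLoopDelayMs(
  baseIntervalMs: number,
  maxJitterMs: number,
  random: () => number = Math.random
): number {
  return baseIntervalMs + Math.floor(random() * maxJitterMs);
}
```

Each consumer would call this once per iteration, so the relative order in which 'a' and 'b' attempt their claims is no longer fixed.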