Fix deterministic server selection in balancer by adding tie-breaking logic #17764

cmick · 2025-02-27T16:57:33Z

Description

The connection-count query balancer in Apache Druid (enabled with druid.broker.balancer.type=connectionCount) selects servers deterministically when multiple servers have the same number of active connections. This results in uneven query distribution, particularly in the following cases:

A single query always targets the same node.
If the nodes have the same number of connections, a single Historical is always "preferred".
If the number of queries surpasses the number of active connections on Historicals, the Broker may end up targeting only one Historical node, leading to potential performance bottlenecks.
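
The cases above can be illustrated with a minimal sketch. It is not Druid's actual balancer code: the `FirstMinPicker` class, the `Server` type, and the server names are hypothetical, and the sketch only assumes a minimum scan that keeps the first server seen when counts are equal.

```java
import java.util.List;

final class FirstMinPicker {
    // Hypothetical stand-in for a Historical and its active-connection count.
    static final class Server {
        final String name;
        final int activeConnections;

        Server(String name, int activeConnections) {
            this.name = name;
            this.activeConnections = activeConnections;
        }
    }

    // Returns the server with the fewest active connections.
    // The strict '<' means the first server in iteration order wins every tie.
    static Server pick(List<Server> servers) {
        Server best = servers.get(0);
        for (Server s : servers) {
            if (s.activeConnections < best.activeConnections) {
                best = s;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<Server> servers = List.of(
            new Server("historical-1", 0),
            new Server("historical-2", 0),
            new Server("historical-3", 0));
        // All counts are equal, so every call returns the same server.
        for (int i = 0; i < 3; i++) {
            System.out.println(pick(servers).name);
        }
    }
}
```

With all counts tied, each call returns the first server in the list, which is the "preferred" Historical effect described above.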

This is likely the same problem described in the old issue #3777.

Proposed Fix

This PR introduces a simple tie-breaking mechanism in the balancer to prevent deterministic selection when multiple servers have the same connection count. This will improve overall load balancing by:

Keeping the behavior unchanged when the nodes have different numbers of active connections (the process with the fewest active connections is always picked).
Falling back to random query balancing when the processes have the same number of connections.
Ensuring that when query volume increases beyond active connections on Historicals, the Broker does not repeatedly select the same Historical.
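
One way such a tie-break could look is sketched below. This is an illustration under stated assumptions, not the code in this PR: `TieBreakingPicker`, `Server`, and the injected `Random` are hypothetical names chosen for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

final class TieBreakingPicker {
    // Hypothetical stand-in for a Historical and its active-connection count.
    static final class Server {
        final String name;
        final int activeConnections;

        Server(String name, int activeConnections) {
            this.name = name;
            this.activeConnections = activeConnections;
        }
    }

    // Picks a server with the fewest active connections.
    // If several servers share the minimum, one of them is chosen uniformly at random.
    static Server pick(List<Server> servers, Random random) {
        if (servers.isEmpty()) {
            throw new IllegalArgumentException("no servers to pick from");
        }
        int min = Integer.MAX_VALUE;
        List<Server> candidates = new ArrayList<>();
        for (Server s : servers) {
            if (s.activeConnections < min) {
                // New strict minimum: discard earlier candidates.
                min = s.activeConnections;
                candidates.clear();
                candidates.add(s);
            } else if (s.activeConnections == min) {
                // Tie with the current minimum: keep as a candidate.
                candidates.add(s);
            }
        }
        return candidates.get(random.nextInt(candidates.size()));
    }
}
```

When one server has a strictly lower count, the candidate list has a single element and the choice is the same as before; randomness only applies among servers tied for the minimum.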

Impact:

The fix only affects clusters with druid.broker.balancer.type=connectionCount set (the default is random):

Improves query distribution across servers, reducing load imbalance.
Prevents performance degradation due to overloading a single Historical.

Testing:

Added unit tests to validate improved balancing behavior.
Manually tested in a clustered setup to ensure even query distribution.

Related Issues:

Observed issues with connectionCount load balancing #3777

Let me know if any additional details should be included!

Release note

This PR has:

been self-reviewed.
- using the concurrency checklist (Remove this item if the PR doesn't have any relation to concurrency.)
added documentation for new or modified features or behaviors.
a release note entry in the PR description.
added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
added comments explaining the "why" and the intent of the code wherever it would not be obvious for an unfamiliar reader.
added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
added integration tests.
been tested in a test Druid cluster.

…ases Previously, when multiple servers had the same number of active connections, the balancer would always select the same server deterministically. This led to uneven distribution in certain cases. This commit introduces a simple tie-breaking mechanism to ensure a more balanced distribution.

cmick force-pushed the feature/improve-connectioncount-query-balancing branch from 2ff15b7 to 0efd663 Compare February 28, 2025 13:11
