Fix deterministic server selection in balancer by adding tie-breaking logic #17764

Open: cmick wants to merge 1 commit into base: master

cmick commented Feb 27, 2025

Description

The connection-count query balancer in Apache Druid (enabled with druid.broker.balancer.type=connectionCount) currently selects servers deterministically when multiple servers have the same number of active connections. This results in uneven query distribution, particularly in the following cases:

  • A single query always targets the same node.
  • If the nodes have the same number of connections, a single Historical is always "preferred".
  • If the number of queries surpasses the number of active connections on Historicals, the Broker may end up targeting only one Historical node, leading to potential performance bottlenecks.
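
The cases above can be reproduced with a minimal sketch. It assumes the picker is a plain minimum scan over the candidate servers; `DeterministicPickDemo`, `Server`, and `pick` are hypothetical names for illustration, not Druid classes, and the actual balancer code may differ.

```java
import java.util.Arrays;
import java.util.List;

class DeterministicPickDemo {
    // Hypothetical stand-in for a server known to the Broker; not a Druid class.
    static final class Server {
        final String name;
        final int openConnections;

        Server(String name, int openConnections) {
            this.name = name;
            this.openConnections = openConnections;
        }
    }

    // Plain minimum scan: the strict '<' keeps the first server seen on a tie.
    static Server pick(List<Server> servers) {
        Server best = servers.get(0);
        for (Server s : servers) {
            if (s.openConnections < best.openConnections) {
                best = s;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // All three servers are idle, so every pick is a three-way tie.
        List<Server> servers = Arrays.asList(
            new Server("historical-1", 0),
            new Server("historical-2", 0),
            new Server("historical-3", 0));
        for (int i = 0; i < 5; i++) {
            System.out.println(pick(servers).name);
        }
    }
}
```

Because the connection counts are tied on every call, each pick in this sketch returns historical-1, and the other two servers receive nothing.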

This is likely the same problem described in the older issue #3777.

Proposed Fix

This PR introduces a simple tie-breaking mechanism in the balancer to prevent deterministic selection when multiple servers have the same connection count. This will improve overall load balancing by:

  • Keeping the behavior unchanged when the nodes have different numbers of active connections (the process with the fewest active connections is always picked).
  • Falling back to random query balancing when the processes have the same number of connections.
  • Ensuring that when query volume increases beyond active connections on Historicals, the Broker does not repeatedly select the same Historical.
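
The tie-breaking behavior listed above could look roughly like the following. This is a sketch of the idea only, not the code in this PR: `TieBreakingPickSketch`, `Server`, and `pick` are hypothetical names, and passing in a `Random` is an assumption made to keep the example testable.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.Random;

class TieBreakingPickSketch {
    // Hypothetical stand-in for a server known to the Broker; not a Druid class.
    static final class Server {
        final String name;
        final int openConnections;

        Server(String name, int openConnections) {
            this.name = name;
            this.openConnections = openConnections;
        }
    }

    // Collects every server that shares the minimum connection count,
    // then picks one of them uniformly at random.
    static Server pick(Collection<Server> servers, Random random) {
        if (servers.isEmpty()) {
            throw new IllegalArgumentException("no servers to pick from");
        }
        int min = Integer.MAX_VALUE;
        List<Server> candidates = new ArrayList<>();
        for (Server s : servers) {
            if (s.openConnections < min) {
                // New strict minimum: discard the previous candidates.
                min = s.openConnections;
                candidates.clear();
                candidates.add(s);
            } else if (s.openConnections == min) {
                // Tie with the current minimum: keep as a candidate.
                candidates.add(s);
            }
        }
        return candidates.get(random.nextInt(candidates.size()));
    }

    public static void main(String[] args) {
        List<Server> servers = Arrays.asList(
            new Server("historical-1", 0),
            new Server("historical-2", 0),
            new Server("historical-3", 3));
        Random random = new Random();
        for (int i = 0; i < 5; i++) {
            System.out.println(pick(servers, random).name);
        }
    }
}
```

When one server has strictly fewer connections, the candidate list holds only that server, so the existing behavior is preserved. Randomness applies only among servers tied for the minimum, which is why historical-3 is never chosen in the usage above.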

Impact:

The fix affects only clusters that set druid.broker.balancer.type=connectionCount (the default is random):

  • Improves query distribution across servers, reducing load imbalance.
  • Prevents performance degradation due to overloading a single Historical.

Testing:

  • Added unit tests to validate improved balancing behavior.
  • Manually tested in a clustered setup to ensure even query distribution.

Related Issues:


Let me know if any additional details should be included!

Release note

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added comments explaining the "why" and the intent of the code wherever it would not be obvious to an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

…ases

Previously, when multiple servers had the same number of active connections, the balancer would always select the same server deterministically. This led to uneven distribution in certain cases.
This commit introduces a simple tie-breaking mechanism to ensure a more balanced distribution.
cmick force-pushed the feature/improve-connectioncount-query-balancing branch from 2ff15b7 to 0efd663 on February 28, 2025 13:11