
Load balancer health checks problematic hosts #1709

Merged: 15 commits into apple:main on Aug 17, 2021

Conversation

@chemicL (Contributor) commented Jul 29, 2021

Motivation:

The current implementation of RoundRobinLoadBalancer cycles through all addresses that the ServiceDiscoverer provides and opens connections regardless of how the individual hosts behind those addresses behave. No passive health checking is performed, and no feedback from connection establishment is given to the RRLB that would let it make smarter decisions about how hosts are chosen for connections or for directing requests. The purpose of the RRLB is to do exactly one thing: cycle through hosts and direct traffic, attempting to distribute the load fairly under this assumption.
However, there are occasions when a particular ServiceDiscoverer (e.g. a DNS-based one) doesn't provide up-to-date health information about hosts. Meanwhile, some addresses might not be responding, yet are still considered active from the perspective of the discovery mechanism. Such addresses lead to unsuccessful connection establishment attempts and introduce unnecessary latency in the request path.
In this PR, a mechanism for detecting such failures is introduced. Hosts that the RRLB consecutively fails to establish connections with are taken out of the selection process until a connection is established. A background task tries, at a specified interval, to connect to the given host. Upon success, the connection can be used for routing traffic, and the host returns to the pool and takes part in selection again. The mechanism described here is a specific type of health checking and can possibly be made more tunable in the future. Currently, the user controls the interval at which the health checks are performed, the number of consecutive failures for a host to be considered unhealthy, and the background io.servicetalk.concurrent.api.Executor for running the checks.

Modifications:

  • Consecutive failed connection attempts to ACTIVE hosts are counted in the RRLB's internal Host state,
  • After the threshold is met, a background task is scheduled which attempts a connection at the specified interval,
  • Meanwhile, the affected address is not considered for directing traffic or opening connections,
  • Whenever the background task successfully establishes a connection, that connection is used for directing requests and the host returns to the list of hosts eligible for selection in the request path,
  • RoundRobinLoadBalancerFactory.Builder was enhanced to incorporate this mechanism.

Result:

Problematic hosts are not used in the request path and are actively health checked in the background until they are reachable again. The overall latency should decrease for DNS ServiceDiscoverer users who run into a situation where some addresses returned from DNS queries are unreachable.
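
To make the new configuration surface concrete, here is a hypothetical usage sketch. The method names (healthCheckInterval, healthCheckFailedConnectionsThreshold, backgroundExecutor) appear in the diffs reviewed below; the build() call, the type arguments, and the executor variable are assumptions:

    RoundRobinLoadBalancerFactory<String, LoadBalancedConnection> lbFactory =
            new RoundRobinLoadBalancerFactory.Builder<String, LoadBalancedConnection>()
                    // Interval at which the background task re-attempts a
                    // connection to an unhealthy host.
                    .healthCheckInterval(Duration.ofSeconds(5))
                    // Consecutive connect failures after which a host is taken
                    // out of the selection pool.
                    .healthCheckFailedConnectionsThreshold(3)
                    // io.servicetalk.concurrent.api.Executor running the checks.
                    .backgroundExecutor(executor)
                    .build();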

@Scottmitch (Member) left a comment:

Took a quick look and left a few initial comments.

    // Next, NoAvailableHostException is thrown when the host is unhealthy,
    // but we still wait until the health check is scheduled and only then stop retrying.
    t instanceof DeliberateException || testExecutor.scheduledTasksPending() == 0,

    // try to prevent stack overflow
@chemicL (Contributor, Author):

When I ran this 1000 times, 1 test failed due to a stack overflow caused by the nature of resubscribing in the retry operator. I believe adding a small delay will allow the concurrent executor to trigger the health check before an overflow happens.
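
As a hypothetical illustration of that idea (not the PR's actual code): inserting a small asynchronous delay between retries yields control to the executor instead of re-subscribing recursively on the same stack. The single and executor names are assumed:

    // Retry while the host is unhealthy, but go through a short timer so each
    // re-subscription happens on a fresh stack and the health check can run.
    single.retryWhen((count, cause) ->
            cause instanceof NoAvailableHostException
                    ? executor.timer(Duration.ofMillis(10)) // small delay, new stack frame
                    : Completable.failed(cause));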

@chemicL chemicL marked this pull request as ready for review August 11, 2021 17:16
@bondolo bondolo marked this pull request as draft August 12, 2021 18:04
@bondolo bondolo marked this pull request as ready for review August 12, 2021 18:04
     */
    public RoundRobinLoadBalancerFactory.Builder<ResolvedAddress, C> healthCheckInterval(Duration interval) {
        this.healthCheckInterval = requireNonNull(interval);
        if (interval.isNegative()) {
Member:

  1. Consider checking the value before assigning it to the internal variable. Otherwise, tricky users can try-catch it. If you change the order, requireNonNull is no longer necessary.
  2. Is a zero duration allowed? Should we enforce some minimum duration (like 100ms) to avoid too-frequent health checks?

@chemicL (Contributor, Author):

  1. Thanks for pointing it out. Regarding requireNonNull, I'm leaving it as a wrapper on top, as requireNonNull(interval).isNegative(). Since the package has the ElementsAreNonnullByDefault annotation, an explicit null check causes warnings in the IDE. But when the API is called from a language (e.g. Kotlin) that potentially ignores that annotation, we still need to check for null. Am I right?
  2. Zero is allowed; it uses the default. I wouldn't impose a minimum value. The API should be flexible here; it's difficult to come up with a minimum that will work in all cases. Assuming there's network traffic involved, the limiting factor will be the time it takes to repeat a connection attempt. If users want to set a value that's not Duration.ZERO but too small for them, they should know what they're doing. We provide a sensible default; anything different than that should be in users' hands.

@idelpivnitskiy (Member) commented Aug 13, 2021:

  1. If the interval is null, interval.isNegative() will throw an NPE, which gives the same result as requireNonNull(interval).isNegative().
  2. ok
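
A minimal sketch of the validate-first ordering being suggested here, assuming the field name and method signature from the diff above (illustrative, not the PR's final code):

    public RoundRobinLoadBalancerFactory.Builder<ResolvedAddress, C> healthCheckInterval(Duration interval) {
        // Validate before assigning: a thrown exception leaves the builder
        // unmodified, so a try-catch around this call cannot observe a
        // half-configured state. interval.isNegative() throws NPE on null,
        // which makes an explicit requireNonNull redundant.
        if (interval.isNegative()) {
            throw new IllegalArgumentException("interval: " + interval + " (expected > 0)");
        }
        this.healthCheckInterval = interval;
        return this;
    }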

    }

    public Completable reconnect() {
        return defer(() -> connectionFactory.newConnection(host.address, null))
Member:

Doesn't look like defer is required here.

@chemicL (Contributor, Author):

As a matter of fact, it is. It might seem it isn't, and in most cases there won't be any issue. But the fact that the call connectionFactory.newConnection(host.address, null) can return a different Single every time it's called makes the re-subscribing mechanism prone to errors if that call is not re-exercised upon every retry. A practical example can be observed when you run the RRLB tests after removing the defer() call. But aside from the quirks of the test, I just believe it's semantically incorrect to assume the factory returns a proper re-subscribable Single that would behave the same way as a new one obtained by calling newConnection.

Member:

In Reactive Streams, each re-subscribe should trigger a new evaluation of the Single and produce a meaningful result. The result might be different (like a connection with a new id), but it's still a result that should be treated like any previous result. In this case, your test is written incorrectly, without properly following the RS spec:

        Function<String, Single<TestLoadBalancedConnection>> factory =
                new Function<String, Single<TestLoadBalancedConnection>>() {

            @Override
            public Single<TestLoadBalancedConnection> apply(final String s) {
                if (s.equals(failingHost)) {
                    requests.incrementAndGet();
                    if (momentInTime.get() >= connections.size()) {
                        return properConnection;
                    }
                    return connections.get(momentInTime.get());
                }
                return properConnection;
            }
        };

Your apply method returns a Single, but it creates state outside of the Single, which is not executed on each re-subscribe. Look at any other filter, for example RequestTargetEncoderHttpRequesterFilter: it uses defer to recompute requestTarget on each re-subscribe. Otherwise, the retry or repeat operators won't work as expected. You should do the same here and wrap the internals of your apply function with defer instead of using defer inside the RRLB.
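
For reference, a sketch of what that suggestion might look like, reusing the names from the snippet above (illustrative, not the PR's final test code):

        Function<String, Single<TestLoadBalancedConnection>> factory = s ->
                // defer re-evaluates the supplier on every subscribe, so each
                // retry observes the current momentInTime instead of the state
                // captured when apply() first ran.
                Single.defer(() -> {
                    if (s.equals(failingHost)) {
                        requests.incrementAndGet();
                        if (momentInTime.get() >= connections.size()) {
                            return properConnection;
                        }
                        return connections.get(momentInTime.get());
                    }
                    return properConnection;
                });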

@chemicL (Contributor, Author):

That's a great explanation, thanks for taking the time to elaborate. That makes sense; I did as you recommended.

@idelpivnitskiy (Member) left a comment:

Last comments, then LGTM. Good job!

    @@ -381,6 +400,177 @@ public void newConnectionIsClosedWhenSelectorRejects() throws Exception {
        awaitIndefinitely(connection.onClose());
    }

    @Test
    public void unhealthyHostTakenOutOfPoolForSelection() throws Exception {
        serviceDiscoveryPublisher.onComplete();
Member:

Here and in other tests, you complete serviceDiscoveryPublisher but later invoke onNext. It violates the spec. Interestingly, TestPublisher doesn't validate such misuse. Please open a GH issue for that.

@chemicL (Contributor, Author):

Interesting, I actually copied the pattern from selectStampedeUnsaturableConnection, but I thought I was re-creating the serviceDiscoveryPublisher in the method creating a new load balancer. That might have gotten lost. We should improve this behaviour, you're right.


Member:

Thanks for the created issue! In your current tests, please adjust the behavior to be correct (i.e. don't terminate the publisher if you intend to use it later). You can skip onComplete() in your tests; it's not necessary for tests that are not focused on SD termination. In general, termination of the SD events stream is not expected by the RRLB.

@chemicL (Contributor, Author):

As a matter of fact, this call is necessary for the tests to behave correctly whenever a new instance of the load balancer is created. The reason is that TestPublisher prevents simultaneous subscriptions by default. Calling onComplete() allows the new LB to reuse the existing serviceDiscoveryPublisher. The onComplete() method does not actually shut down the Publisher, only its Subscriber:

    /**
     * Completes the {@link Subscriber}.
     *
     * @see Subscriber#onComplete()
     */
    public void onComplete()

I think the expected changes can be discussed in the created issue.
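
A hypothetical sketch of the pattern being described, where newLoadBalancer is an assumed test helper that subscribes a fresh RRLB to the publisher:

    TestPublisher<ServiceDiscovererEvent<String>> sdEvents = new TestPublisher<>();
    // First load balancer subscribes to the publisher.
    LoadBalancer<TestLoadBalancedConnection> lb1 = newLoadBalancer(sdEvents);
    // Completes only the current Subscriber; the TestPublisher itself remains
    // usable, so a second load balancer can subscribe without tripping the
    // single-subscription check.
    sdEvents.onComplete();
    LoadBalancer<TestLoadBalancedConnection> lb2 = newLoadBalancer(sdEvents);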

@chemicL chemicL changed the title Load balancer unhealthy hosts Load balancer health checks problematic hosts Aug 16, 2021
    @@ -141,10 +152,13 @@ public RoundRobinLoadBalancer(final Publisher<? extends ServiceDiscovererEvent<R
         * set to {@code false} should be eagerly closed. When {@code false}, the expired addresses will be used
         * for sending requests, but new connections will not be requested, allowing the server to drive
         * the connection closure and shifting traffic to other addresses.
         * @param healthCheckConfig configuration for the health checking mechanism, which monitors hosts that
Contributor:

healthCheckConfig is nullable, but it is not explained what happens when it is null or why you might not provide a health check config.

@chemicL (Contributor, Author):

Thanks for noticing, I added a note about it.

Contributor:

What is the result of disabling it? Will the hosts never become eligible?

@chemicL (Contributor, Author):

I decided to point to the factory, which is a public API and contains a more thorough explanation.

@bondolo (Contributor) left a comment:

Looks good to me, just javadoc clarifications.


     * @see #healthCheckFailedConnectionsThreshold(int)
     */
    public RoundRobinLoadBalancerFactory.Builder<ResolvedAddress, C> healthCheckInterval(Duration interval) {
        if (requireNonNull(interval).isNegative()) {
Contributor:

Should also disallow zero and perhaps nonsensical values (<1s or >1 day).

@chemicL (Contributor, Author):

I'm not sure about the bounds. I'd say it's up to the user to determine what is nonsensical; in some scenarios <1s might make sense. Not sure about >1 day, but then again, is 23 hrs good?
For zero, in my initial proposal it served the purpose of restoring the default. But I'm ok with guarding against zero too.

     * @return {@code this}.
     * @see #healthCheckFailedConnectionsThreshold(int)
     */
    public RoundRobinLoadBalancerFactory.Builder<ResolvedAddress, C> backgroundExecutor(
Contributor:

The shared executor which is used if no executor is provided should be mentioned.

     * {@link #healthCheckFailedConnectionsThreshold(int)} can be used to disable the health checking mechanism
     * and always consider all hosts for establishing new connections.
     * </p>
     * @param interval interval at which a background health check will be scheduled.
Contributor:

Should mention that providing an interval is optional and an unspecified default value will be used if not specified.

     * background health checking. Use negative value to disable the health checking mechanism.
     * @return {@code this}.
     * @see #backgroundExecutor(Executor)
     * @see #healthCheckFailedConnectionsThreshold(int)
Contributor:

Should point at #healthCheckInterval(Duration).

     * unhealthy and a connection establishment will take place in the background. Until finished, the host will
     * not take part in load balancing selection.
     * Use a negative value of the argument to disable health checking.
     * @param threshold number of consecutive connection failures to consider a host unhealthy and eligible for
Contributor:

What does zero mean? Does 1 mean a single failure, or one additional failure following a failure?

@chemicL (Contributor, Author):

Thanks for this comment. There were actually inconsistencies here. I adjusted the description (a number of consecutive failures greater than or equal to the provided value triggers the health check).
I'm throwing an exception for 0 now. 1 means any failure triggers the health check. Hope it's clearer now.
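
A minimal sketch of the semantics just described, using hypothetical names (consecutiveFailures, onConnectAttempt, scheduleHealthCheck, threshold) that are not the PR's actual fields:

    private final AtomicInteger consecutiveFailures = new AtomicInteger();

    void onConnectAttempt(boolean succeeded) {
        if (succeeded) {
            consecutiveFailures.set(0); // any success resets the streak
            return;
        }
        // threshold == 1 means the very first failure triggers a health check;
        // zero is rejected with an exception at configuration time.
        if (consecutiveFailures.incrementAndGet() >= threshold) {
            scheduleHealthCheck();
        }
    }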

@chemicL chemicL merged commit f455e6f into apple:main Aug 17, 2021
chemicL pushed a commit that referenced this pull request Aug 17, 2021