
Load balancer health checks problematic hosts #1709

Merged: 15 commits into apple:main on Aug 17, 2021

Conversation

@chemicL (Contributor) commented Jul 29, 2021

Motivation:

The current implementation of RoundRobinLoadBalancer cycles through all addresses that the ServiceDiscoverer provides and opens connections regardless of how the individual hosts behind those addresses behave. No passive health checking is performed, and no feedback from connection establishment is given to the RRLB that would let it make smarter decisions about how hosts are chosen for connections or for directing requests. The purpose of the RRLB is to do exactly one thing: cycle through hosts and direct traffic, attempting to distribute the load fairly under this assumption.
However, there are occasions when a particular ServiceDiscoverer (e.g. a DNS-based one) doesn't provide up-to-date health information about hosts. Meanwhile, some addresses might not be responding, yet are still considered active from the perspective of the discovery mechanism. Such addresses lead to unsuccessful connection establishment attempts and introduce unnecessary latency in the request path.
In this PR, a mechanism for detecting such failures is introduced. Hosts that the RRLB consecutively fails to establish connections with are taken out of the selection process until a connection is established. A background task tries, at a specified interval, to connect to the given host. Upon success, the connection can be used for routing traffic, and the host returns to the pool and takes part in selection again. The mechanism described here is a specific type of health checking and can possibly be made more tunable in the future. Currently, the user controls the interval at which the health checks are performed, the number of consecutive failures for a host to be considered unhealthy, and the background io.servicetalk.concurrent.api.Executor for running the checks.

Modifications:

  • Consecutive failed connection attempts to ACTIVE hosts are counted in the RRLB's internal Host state,
  • After the threshold is met, a background task is scheduled which attempts a connection at the specified interval,
  • Meanwhile, the affected address is not considered for directing traffic or opening connections,
  • Whenever the background task successfully establishes a connection, that connection is used for directing requests and the host returns to the list of hosts eligible for selection in the request path,
  • RoundRobinLoadBalancerFactory.Builder was enhanced to incorporate this mechanism.

Result:

Problematic hosts are not used in the request path and are actively health checked in the background until they are reachable again. The overall latency should decrease for DNS ServiceDiscoverer users who run into a situation where some addresses returned from DNS queries are unreachable.
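
To make the new configuration surface concrete, here is a hypothetical usage sketch. The method names (healthCheckInterval, healthCheckFailedConnectionsThreshold, backgroundExecutor) appear in the diffs reviewed below; the build() call, the type arguments, and the executor variable are assumptions:

    RoundRobinLoadBalancerFactory<String, LoadBalancedConnection> lbFactory =
            new RoundRobinLoadBalancerFactory.Builder<String, LoadBalancedConnection>()
                    // Interval at which the background task re-attempts a
                    // connection to an unhealthy host.
                    .healthCheckInterval(Duration.ofSeconds(5))
                    // Consecutive connect failures after which a host is taken
                    // out of the selection pool.
                    .healthCheckFailedConnectionsThreshold(3)
                    // io.servicetalk.concurrent.api.Executor running the checks.
                    .backgroundExecutor(executor)
                    .build();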

@Scottmitch (Member) left a comment:

Took a quick look and left a few initial comments.

    // Next, NoAvailableHostException is thrown when the host is unhealthy,
    // but we still wait until the health check is scheduled and only then stop retrying.
    t instanceof DeliberateException || testExecutor.scheduledTasksPending() == 0,

    // try to prevent stack overflow
@chemicL (Contributor, Author):

When I ran this 1000 times, 1 test failed due to a stack overflow caused by the nature of resubscribing in the retry operator. I believe adding a small delay will allow the concurrent executor to trigger the health check before an overflow happens.
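
As a hypothetical illustration of that idea (not the PR's actual code): inserting a small asynchronous delay between retries yields control to the executor instead of re-subscribing recursively on the same stack. The single and executor names are assumed:

    // Retry while the host is unhealthy, but go through a short timer so each
    // re-subscription happens on a fresh stack and the health check can run.
    single.retryWhen((count, cause) ->
            cause instanceof NoAvailableHostException
                    ? executor.timer(Duration.ofMillis(10)) // small delay, new stack frame
                    : Completable.failed(cause));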

@chemicL chemicL marked this pull request as ready for review August 11, 2021 17:16
@bondolo bondolo marked this pull request as draft August 12, 2021 18:04
@bondolo bondolo marked this pull request as ready for review August 12, 2021 18:04
     */
    public RoundRobinLoadBalancerFactory.Builder<ResolvedAddress, C> healthCheckInterval(Duration interval) {
        this.healthCheckInterval = requireNonNull(interval);
        if (interval.isNegative()) {
Member:

  1. Consider checking the value before assigning it to the internal variable. Otherwise, tricky users can try-catch it. If you change the order, requireNonNull is no longer necessary.
  2. Is a zero duration allowed? Should we enforce some minimum duration (like 100ms) to avoid too-frequent health checks?

@chemicL (Contributor, Author):

  1. Thanks for pointing it out. Regarding requireNonNull, I'm leaving it as a wrapper on top, as requireNonNull(interval).isNegative(). Since the package has the ElementsAreNonnullByDefault annotation, an explicit null check causes warnings in the IDE. But when the API is called from a language (e.g. Kotlin) that potentially ignores that annotation, we still need to check for null. Am I right?
  2. Zero is allowed; it uses the default. I wouldn't impose a minimum value. The API should be flexible here; it's difficult to come up with a minimum that will work in all cases. Assuming there's network traffic involved, the limiting factor will be the time it takes to repeat a connection attempt. If users want to set a value that's not Duration.ZERO but too small for them, they should know what they're doing. We provide a sensible default; anything different than that should be in users' hands.

@idelpivnitskiy (Member) commented Aug 13, 2021:

  1. If the interval is null, interval.isNegative() will throw an NPE, which gives the same result as requireNonNull(interval).isNegative().
  2. ok
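
A minimal sketch of the validate-first ordering being suggested here, assuming the field name and method signature from the diff above (illustrative, not the PR's final code):

    public RoundRobinLoadBalancerFactory.Builder<ResolvedAddress, C> healthCheckInterval(Duration interval) {
        // Validate before assigning: a thrown exception leaves the builder
        // unmodified, so a try-catch around this call cannot observe a
        // half-configured state. interval.isNegative() throws NPE on null,
        // which makes an explicit requireNonNull redundant.
        if (interval.isNegative()) {
            throw new IllegalArgumentException("interval: " + interval + " (expected > 0)");
        }
        this.healthCheckInterval = interval;
        return this;
    }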

    }

    public Completable reconnect() {
        return defer(() -> connectionFactory.newConnection(host.address, null))
Member:

Doesn't look like defer is required here.

@chemicL (Contributor, Author):

As a matter of fact, it is. It might seem it isn't, and in most cases there won't be any issue. But the fact that the call connectionFactory.newConnection(host.address, null) can return a different Single every time it's called makes the re-subscribing mechanism prone to errors if that call is not re-exercised upon every retry. A practical example can be observed when you run the RRLB tests after removing the defer() call. But aside from the quirks of the test, I just believe it's semantically incorrect to assume the factory returns a proper re-subscribable Single that would behave the same way as a new one obtained by calling newConnection.

Member:

In Reactive Streams, each re-subscribe should trigger a new evaluation of the Single and produce a meaningful result. The result might be different (like a connection with a new id), but it's still a result that should be treated like any previous result. In this case, your test is written incorrectly, without properly following the RS spec:

        Function<String, Single<TestLoadBalancedConnection>> factory =
                new Function<String, Single<TestLoadBalancedConnection>>() {

            @Override
            public Single<TestLoadBalancedConnection> apply(final String s) {
                if (s.equals(failingHost)) {
                    requests.incrementAndGet();
                    if (momentInTime.get() >= connections.size()) {
                        return properConnection;
                    }
                    return connections.get(momentInTime.get());
                }
                return properConnection;
            }
        };

Your apply method returns a Single, but it creates state outside of the Single, which is not executed on each re-subscribe. Look at any other filter, for example RequestTargetEncoderHttpRequesterFilter: it uses defer to recompute requestTarget on each re-subscribe. Otherwise, the retry or repeat operators won't work as expected. You should do the same here and wrap the internals of your apply function with defer instead of using defer inside the RRLB.
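
For reference, a sketch of what that suggestion might look like, reusing the names from the snippet above (illustrative, not the PR's final test code):

        Function<String, Single<TestLoadBalancedConnection>> factory = s ->
                // defer re-evaluates the supplier on every subscribe, so each
                // retry observes the current momentInTime instead of the state
                // captured when apply() first ran.
                Single.defer(() -> {
                    if (s.equals(failingHost)) {
                        requests.incrementAndGet();
                        if (momentInTime.get() >= connections.size()) {
                            return properConnection;
                        }
                        return connections.get(momentInTime.get());
                    }
                    return properConnection;
                });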

@chemicL (Contributor, Author):

That's a great explanation, thanks for taking the time to elaborate. That makes sense; I did as you recommended.

@idelpivnitskiy (Member) left a comment:

Last comments, then LGTM. Good job!

    @@ -381,6 +400,177 @@ public void newConnectionIsClosedWhenSelectorRejects() throws Exception {
        awaitIndefinitely(connection.onClose());
    }

    @Test
    public void unhealthyHostTakenOutOfPoolForSelection() throws Exception {
        serviceDiscoveryPublisher.onComplete();
Member:

Here and in other tests, you complete serviceDiscoveryPublisher but later invoke onNext. It violates the spec. Interestingly, TestPublisher doesn't validate such misuse. Please open a GH issue for that.

@chemicL (Contributor, Author):

Interesting, I actually copied the pattern from selectStampedeUnsaturableConnection, but I thought I was re-creating the serviceDiscoveryPublisher in the method creating a new load balancer. That might have gotten lost. We should improve this behaviour, you're right.


Member:

Thanks for the created issue! In your current tests, please adjust the behavior to be correct (i.e. don't terminate the publisher if you intend to use it later). You can skip onComplete() in your tests; it's not necessary for tests that are not focused on SD termination. In general, termination of the SD events stream is not expected by the RRLB.

@chemicL (Contributor, Author):

As a matter of fact, this call is necessary for the tests to behave correctly whenever a new instance of the load balancer is created. The reason is that TestPublisher prevents simultaneous subscriptions by default. Calling onComplete() allows the new LB to reuse the existing serviceDiscoveryPublisher. The onComplete() method does not actually shut down the Publisher, only its Subscriber:

    /**
     * Completes the {@link Subscriber}.
     *
     * @see Subscriber#onComplete()
     */
    public void onComplete()

I think the expected changes can be discussed in the created issue.
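
A hypothetical sketch of the pattern being described, where newLoadBalancer is an assumed test helper that subscribes a fresh RRLB to the publisher:

    TestPublisher<ServiceDiscovererEvent<String>> sdEvents = new TestPublisher<>();
    // First load balancer subscribes to the publisher.
    LoadBalancer<TestLoadBalancedConnection> lb1 = newLoadBalancer(sdEvents);
    // Completes only the current Subscriber; the TestPublisher itself remains
    // usable, so a second load balancer can subscribe without tripping the
    // single-subscription check.
    sdEvents.onComplete();
    LoadBalancer<TestLoadBalancedConnection> lb2 = newLoadBalancer(sdEvents);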

@chemicL chemicL changed the title Load balancer unhealthy hosts Load balancer health checks problematic hosts Aug 16, 2021
    @@ -141,10 +152,13 @@ public RoundRobinLoadBalancer(final Publisher<? extends ServiceDiscovererEvent<R
         * set to {@code false} should be eagerly closed. When {@code false}, the expired addresses will be used
         * for sending requests, but new connections will not be requested, allowing the server to drive
         * the connection closure and shifting traffic to other addresses.
         * @param healthCheckConfig configuration for the health checking mechanism, which monitors hosts that
Contributor:

healthCheckConfig is nullable, but it is not explained what happens when it is null or why you might not provide a health check config.

@chemicL (Contributor, Author):

Thanks for noticing, I added a note about it.

Contributor:

What is the result of disabling it? Will the hosts never become eligible?

@chemicL (Contributor, Author):

I decided to point to the factory, which is a public API and contains a more thorough explanation.

@bondolo (Contributor) left a comment:

Looks good to me, just javadoc clarifications.


     * @see #healthCheckFailedConnectionsThreshold(int)
     */
    public RoundRobinLoadBalancerFactory.Builder<ResolvedAddress, C> healthCheckInterval(Duration interval) {
        if (requireNonNull(interval).isNegative()) {
Contributor:

Should also disallow zero and perhaps nonsensical values (<1s or >1 day).

@chemicL (Contributor, Author):

I'm not sure about the bounds. I'd say it's up to the user to determine what is nonsensical; in some scenarios <1s might make sense. Not sure about >1 day, but then again, is 23 hrs good?
For zero, in my initial proposal it served the purpose of restoring the default. But I'm ok with guarding against zero too.

     * @return {@code this}.
     * @see #healthCheckFailedConnectionsThreshold(int)
     */
    public RoundRobinLoadBalancerFactory.Builder<ResolvedAddress, C> backgroundExecutor(
Contributor:

The shared executor which is used if no executor is provided should be mentioned.

     * {@link #healthCheckFailedConnectionsThreshold(int)} can be used to disable the health checking mechanism
     * and always consider all hosts for establishing new connections.
     * </p>
     * @param interval interval at which a background health check will be scheduled.
Contributor:

Should mention that providing an interval is optional and an unspecified default value will be used if not specified.

     * background health checking. Use negative value to disable the health checking mechanism.
     * @return {@code this}.
     * @see #backgroundExecutor(Executor)
     * @see #healthCheckFailedConnectionsThreshold(int)
Contributor:

Should point at #healthCheckInterval(Duration).

     * unhealthy and a connection establishment will take place in the background. Until finished, the host will
     * not take part in load balancing selection.
     * Use a negative value of the argument to disable health checking.
     * @param threshold number of consecutive connection failures to consider a host unhealthy and eligible for
Contributor:

What does zero mean? Does 1 mean a single failure, or one additional failure following a failure?

@chemicL (Contributor, Author):

Thanks for this comment. There were actually inconsistencies here. I adjusted the description (a number of consecutive failures greater than or equal to the provided value triggers the health check).
I'm throwing an exception for 0 now. 1 means any failure triggers the health check. Hope it's clearer now.
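
A minimal sketch of the semantics just described, using hypothetical names (consecutiveFailures, onConnectAttempt, scheduleHealthCheck, threshold) that are not the PR's actual fields:

    private final AtomicInteger consecutiveFailures = new AtomicInteger();

    void onConnectAttempt(boolean succeeded) {
        if (succeeded) {
            consecutiveFailures.set(0); // any success resets the streak
            return;
        }
        // threshold == 1 means the very first failure triggers a health check;
        // zero is rejected with an exception at configuration time.
        if (consecutiveFailures.incrementAndGet() >= threshold) {
            scheduleHealthCheck();
        }
    }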

@chemicL chemicL merged commit f455e6f into apple:main Aug 17, 2021
chemicL pushed a commit that referenced this pull request Aug 17, 2021