-
Notifications
You must be signed in to change notification settings - Fork 181
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Load balancer health checks problematic hosts #1709
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
took a quick look and left a few initial comments.
servicetalk-loadbalancer/src/main/java/io/servicetalk/loadbalancer/RoundRobinLoadBalancer.java
Outdated
Show resolved
Hide resolved
servicetalk-loadbalancer/src/main/java/io/servicetalk/loadbalancer/RoundRobinLoadBalancer.java
Outdated
Show resolved
Hide resolved
servicetalk-loadbalancer/src/main/java/io/servicetalk/loadbalancer/RoundRobinLoadBalancer.java
Outdated
Show resolved
Hide resolved
servicetalk-loadbalancer/src/main/java/io/servicetalk/loadbalancer/RoundRobinLoadBalancer.java
Outdated
Show resolved
Hide resolved
servicetalk-loadbalancer/src/main/java/io/servicetalk/loadbalancer/RoundRobinLoadBalancer.java
Outdated
Show resolved
Hide resolved
servicetalk-loadbalancer/src/main/java/io/servicetalk/loadbalancer/RoundRobinLoadBalancer.java
Outdated
Show resolved
Hide resolved
servicetalk-loadbalancer/src/main/java/io/servicetalk/loadbalancer/RoundRobinLoadBalancer.java
Outdated
Show resolved
Hide resolved
servicetalk-loadbalancer/src/main/java/io/servicetalk/loadbalancer/RoundRobinLoadBalancer.java
Show resolved
Hide resolved
...lk-loadbalancer/src/main/java/io/servicetalk/loadbalancer/RoundRobinLoadBalancerFactory.java
Outdated
Show resolved
Hide resolved
...lk-loadbalancer/src/main/java/io/servicetalk/loadbalancer/RoundRobinLoadBalancerFactory.java
Outdated
Show resolved
Hide resolved
...lk-loadbalancer/src/main/java/io/servicetalk/loadbalancer/RoundRobinLoadBalancerFactory.java
Show resolved
Hide resolved
...lk-loadbalancer/src/main/java/io/servicetalk/loadbalancer/RoundRobinLoadBalancerFactory.java
Outdated
Show resolved
Hide resolved
servicetalk-loadbalancer/src/main/java/io/servicetalk/loadbalancer/RoundRobinLoadBalancer.java
Outdated
Show resolved
Hide resolved
servicetalk-loadbalancer/src/main/java/io/servicetalk/loadbalancer/RoundRobinLoadBalancer.java
Outdated
Show resolved
Hide resolved
servicetalk-loadbalancer/src/main/java/io/servicetalk/loadbalancer/RoundRobinLoadBalancer.java
Outdated
Show resolved
Hide resolved
servicetalk-loadbalancer/src/main/java/io/servicetalk/loadbalancer/RoundRobinLoadBalancer.java
Outdated
Show resolved
Hide resolved
servicetalk-loadbalancer/src/main/java/io/servicetalk/loadbalancer/RoundRobinLoadBalancer.java
Outdated
Show resolved
Hide resolved
servicetalk-loadbalancer/src/main/java/io/servicetalk/loadbalancer/RoundRobinLoadBalancer.java
Outdated
Show resolved
Hide resolved
// Next, NoAvailableHostException is thrown when the host is unhealthy, | ||
// but we still wait until the health check is scheduled and only then stop retrying. | ||
t instanceof DeliberateException || testExecutor.scheduledTasksPending() == 0, | ||
// try to prevent stack overflow |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When I ran this 1000 times, 1 test failed due to stack overflow caused by the nature of resubscribing in the retry operator. I believe adding a small delay will allow the concurrent executor trigger the health check before an overflow happens.
servicetalk-loadbalancer/src/main/java/io/servicetalk/loadbalancer/RoundRobinLoadBalancer.java
Outdated
Show resolved
Hide resolved
servicetalk-loadbalancer/src/main/java/io/servicetalk/loadbalancer/RoundRobinLoadBalancer.java
Outdated
Show resolved
Hide resolved
servicetalk-loadbalancer/src/main/java/io/servicetalk/loadbalancer/RoundRobinLoadBalancer.java
Outdated
Show resolved
Hide resolved
*/ | ||
public RoundRobinLoadBalancerFactory.Builder<ResolvedAddress, C> healthCheckInterval(Duration interval) { | ||
this.healthCheckInterval = requireNonNull(interval); | ||
if (interval.isNegative()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Consider checking value before assigning to internal variable. Otherwise, tricky users can try-catch it. If you change the order,
requireNonNull
is no longer necessary. - Is zero duration allowed? Should we enforce some min duration (like 100ms) to avoid too frequent health-checks?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Thanks for pointing it out. Regarding
requireNonNull
, I'm leaving it as a wrapper on top asrequireNonNull(interval).isNegative()
. As the package has theElementsAreNonnullByDefault
an explicit null check causes warnings in the IDE. But when the API is called from a language (e.g. Kotlin) that potentially ignores that annotation we still need to check for null. Am I right? - Zero is allowed, it uses the default. I wouldn't impose a minimum value. The API should be flexible here, it's difficult to come up with a minimum that will work in all cases. Assuming there's network traffic involved, the limiting factor will be the time it takes to repeat a connection attempt. If users want to set a value that's not
Duration.ZERO
, but too small for them, they should know what they're during. We provide a sensible default, anything different than that should be in users' hands.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- If the
interval
isnull
,interval.isNegative()
will throw NPE, which gives the same result asrequireNonNull(interval).isNegative()
. - ok
...lk-loadbalancer/src/main/java/io/servicetalk/loadbalancer/RoundRobinLoadBalancerFactory.java
Show resolved
Hide resolved
servicetalk-loadbalancer/src/main/java/io/servicetalk/loadbalancer/RoundRobinLoadBalancer.java
Outdated
Show resolved
Hide resolved
servicetalk-loadbalancer/src/main/java/io/servicetalk/loadbalancer/RoundRobinLoadBalancer.java
Outdated
Show resolved
Hide resolved
servicetalk-loadbalancer/src/main/java/io/servicetalk/loadbalancer/RoundRobinLoadBalancer.java
Outdated
Show resolved
Hide resolved
servicetalk-loadbalancer/src/main/java/io/servicetalk/loadbalancer/RoundRobinLoadBalancer.java
Outdated
Show resolved
Hide resolved
} | ||
|
||
public Completable reconnect() { | ||
return defer(() -> connectionFactory.newConnection(host.address, null)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't look like defer
is required here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As a matter of fact it is. It might feel it isn't and in most cases there won't be any issue. But the fact that the call connectionFactory.newConnection(host.address, null)
can return a different Single
every time it's called makes the re-subscribing mechanism prone to errors if that call is not re-exercised upon every retry. A practical example can be observed when you run the RRLB tests after removing the defer()
call. But aside from the quirks of the test, I just believe it's semantically incorrect to assume the factory returns a proper re-subscribable Single
that would behave the same way when a new one was obtained via calling newConnection
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In Reactive Streams each re-subscribe should trigger new evaluation of the Single
and produce the meaningful result. The result might be different (like a connection with new id), but it's still a result that should be treated as any previous result. In this case, your test is written incorrectly, without properly following RS spec:
Function<String, Single<TestLoadBalancedConnection>> factory =
new Function<String, Single<TestLoadBalancedConnection>>() {
@Override
public Single<TestLoadBalancedConnection> apply(final String s) {
if (s.equals(failingHost)) {
requests.incrementAndGet();
if (momentInTime.get() >= connections.size()) {
return properConnection;
}
return connections.get(momentInTime.get());
}
return properConnection;
}
};
Your apply
method returns a Single
, but it creates a state outside of the Single
, which is not executed on each re-subscribe. Look at any other filter. For example, RequestTargetEncoderHttpRequesterFilter
. It uses defer
to recompute requestTarget
on each re-subscribe. Otherwise, retry or repeat operator won't work as expected. You should do the same here and wrap the internals of your apply
function with defer instead of using defer
inside RRLB.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a great explanation, thanks for taking the time to elaborate. That makes sense, I did as you recommended.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Last comments then LGTM. Good job!
servicetalk-loadbalancer/src/main/java/io/servicetalk/loadbalancer/RoundRobinLoadBalancer.java
Outdated
Show resolved
Hide resolved
...lk-loadbalancer/src/main/java/io/servicetalk/loadbalancer/RoundRobinLoadBalancerFactory.java
Outdated
Show resolved
Hide resolved
...lk-loadbalancer/src/main/java/io/servicetalk/loadbalancer/RoundRobinLoadBalancerFactory.java
Show resolved
Hide resolved
servicetalk-loadbalancer/src/main/java/io/servicetalk/loadbalancer/RoundRobinLoadBalancer.java
Outdated
Show resolved
Hide resolved
@@ -381,6 +400,177 @@ public void newConnectionIsClosedWhenSelectorRejects() throws Exception { | |||
awaitIndefinitely(connection.onClose()); | |||
} | |||
|
|||
@Test | |||
public void unhealthyHostTakenOutOfPoolForSelection() throws Exception { | |||
serviceDiscoveryPublisher.onComplete(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here and in other tests, you complete serviceDiscoveryPublisher
but later invoke onNext
. It violates the spec. Interesting, that TestPublisher
doesn't validate such misuse. Please, open a GH issue for that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting, I actually copied the pattern from selectStampedeUnsaturableConnection
, but I thought I'm re-creating the serviceDiscoveryPublisher
in the method creating a new load balancer. But that might have gotten lost. We should improve this behaviour, you're right.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the created issue! In your current tests, please adjust the behavior to be correct (i.e. don't terminate the publisher if you intend to use it later). You can skip onComplete()
in your tests, it's not necessary for tests that are not focused on SD termination. In general, termination of SD events stream is not expected by RRLB.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As a matter of fact this call is necessary for tests to behave correctly whenever a new instance of the load balancer is created. The reason for it is the fact that TestPublisher prevents simultaneous subscriptions by default. Calling onComplete()
allows the new LB to reuse the existing serviceDiscoveryPublisher
. The onComplete()
method is not actually shutting down the Publisher
but its' Subscriber
:
/**
* Completes the {@link Subscriber}.
*
* @see Subscriber#onComplete()
*/
public void onComplete()
I think the expected changes can be discussed in the created issue.
...etalk-loadbalancer/src/test/java/io/servicetalk/loadbalancer/RoundRobinLoadBalancerTest.java
Outdated
Show resolved
Hide resolved
...etalk-loadbalancer/src/test/java/io/servicetalk/loadbalancer/RoundRobinLoadBalancerTest.java
Outdated
Show resolved
Hide resolved
...etalk-loadbalancer/src/test/java/io/servicetalk/loadbalancer/RoundRobinLoadBalancerTest.java
Outdated
Show resolved
Hide resolved
@@ -141,10 +152,13 @@ public RoundRobinLoadBalancer(final Publisher<? extends ServiceDiscovererEvent<R | |||
* set to {@code false} should be eagerly closed. When {@code false}, the expired addresses will be used | |||
* for sending requests, but new connections will not be requested, allowing the server to drive | |||
* the connection closure and shifting traffic to other addresses. | |||
* @param healthCheckConfig configuration for the health checking mechanism, which monitors hosts that |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
healthCheckConfig
is nullable but it is not explained what happens when it is null or why you might not provide a health check config.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for noticing, I added a note about it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the result of disabling it? Will the hosts never become eligible?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I decided to point to the factory, which is a public API and contains a more thorough explanation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me, just javadoc clarifications
@@ -141,10 +152,13 @@ public RoundRobinLoadBalancer(final Publisher<? extends ServiceDiscovererEvent<R | |||
* set to {@code false} should be eagerly closed. When {@code false}, the expired addresses will be used | |||
* for sending requests, but new connections will not be requested, allowing the server to drive | |||
* the connection closure and shifting traffic to other addresses. | |||
* @param healthCheckConfig configuration for the health checking mechanism, which monitors hosts that |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the result of disabling it? Will the hosts never become eligible?
* @see #healthCheckFailedConnectionsThreshold(int) | ||
*/ | ||
public RoundRobinLoadBalancerFactory.Builder<ResolvedAddress, C> healthCheckInterval(Duration interval) { | ||
if (requireNonNull(interval).isNegative()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should also disallow zero and perhaps nonsensical values (<1s or >1 day)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure about the bounds. I'd say it's up to the user to determine what is nonsensical – in some scenarios <1s might make sense. Not sure about >1 day, but then again, is 23 hrs good?
For zero, in my initial proposal it served the purpose of restoring the default. But I'm ok with preventing against zero too.
* @return {@code this}. | ||
* @see #healthCheckFailedConnectionsThreshold(int) | ||
*/ | ||
public RoundRobinLoadBalancerFactory.Builder<ResolvedAddress, C> backgroundExecutor( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The shared executor which is used if no executor is provided should be mentioned.
* {@link #healthCheckFailedConnectionsThreshold(int)} can be used to disable the health checking mechanism | ||
* and always consider all hosts for establishing new connections. | ||
* </p> | ||
* @param interval interval at which a background health check will be scheduled. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should mention that providing an interval is optional and an uspecified default value will be used if not specified.
* background health checking. Use negative value to disable the health checking mechanism. | ||
* @return {@code this}. | ||
* @see #backgroundExecutor(Executor) | ||
* @see #healthCheckFailedConnectionsThreshold(int) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should point at #healthCheckInterval(Duration)
* unhealthy and a connection establishment will take place in the background. Until finished, the host will | ||
* not take part in load balancing selection. | ||
* Use a negative value of the argument to disable health checking. | ||
* @param threshold number of consecutive connection failures to consider a host unhealthy and eligible for |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does zero mean? Does 1 mean a single failure or one additional failure following a failure?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this comment. Actually there were inconsistencies here. I adjusted the description (# consecutive failures greater or equal to provided value triggers health check).
I'm throwing an exception for 0 now. 1 means any failure triggers the health check. Hope it's clearer now.
Motivation: The current implementation of `RoundRobinLoadBalancer` cycles through all addresses that the `ServiceDiscoverer` provides and opens connection regardless of the behavior of the individual hosts behind those addresses. No passive health checking is performed and no feedback from connection establishment is provided to the RRLB to make any smart decisions with regards to the way hosts are chosen for connections or directing requests. The purpose of RRLB is to do exactly one thing - cycle through hosts and direct traffic attempting to distribute the load fairly with regards to this assumption. However, there are occasions when a particular `ServiceDiscoverer` (e.g. DNS-based) doesn't provide up-to-date health information about hosts. Meanwhile, some addresses might be not responding, but are considered active from the perspective of the discovery mechanism. Such addresses lead to unsuccessful connection establishment attempts and introduce unnecessary latency in the request path. In this PR, a mechanism for detecting such failures is introduced. Hosts that the RRLB consecutively fails to establish connections with are taken out of the selection process until a connection is established. A background task tries, at specified intervals, to connect to the given host. Upon success, the connection can be used for routing traffic and the host comes back to the pool and takes part in the selection. The mechanism described here is a specific type of health checking and can possibly be improved in the future to be more tunable. Currently, the user controls the interval at which the health checks are performed, the consecutive failures count for a host to be considered unhealthy, and the background `io.servicetalk.concurrent.api.Executor` for running the checks. Modifications: - Consecutive connection attempts to ACTIVE hosts are counted in the internal RRLB's Host state, - After a threshold is met, a background task is scheduled which will attempt a connection at a specified interval, - Meanwhile, the particular address is not considered for directing traffic and opening connections, - Whenever the background task successfully establishes a connection, that connection is used for directing requests and the host comes back to the list of eligible for selection in the request path, - `RoundRobinConnectionFactory.Builder` was enhanced to incorporate this mechanism. Result: Problematic hosts are not used in the requests path and are actively health checked in the background until they are reachable again. The overall latency should increase for DNS `ServiceDiscoverer` users which stumble upon a situation where some addresses returned from the DNS queries are unreachable.
Motivation:
The current implementation of
RoundRobinLoadBalancer
cycles through all addresses that theServiceDiscoverer
provides and opens connection regardless of the behavior of the individual hosts behind those addresses. No passive health checking is performed and no feedback from connection establishment is provided to the RRLB to make any smart decisions with regards to the way hosts are chosen for connections or directing requests. The purpose of RRLB is to do exactly one thing - cycle through hosts and direct traffic attempting to distribute the load fairly with regards to this assumption.However, there are occasions when a particular
ServiceDiscoverer
(e.g. DNS-based) doesn't provide up-to-date health information about hosts. Meanwhile, some addresses might be not responding, but are considered active from the perspective of the discovery mechanism. Such addresses lead to unsuccessful connection establishment attempts and introduce unnecessary latency in the request path.In this PR, a mechanism for detecting such failures is introduced. Hosts that the RRLB consecutively fails to establish connections with are taken out of the selection process until a connection is established. A background task tries, at specified intervals, to connect to the given host. Upon success, the connection can be used for routing traffic and the host comes back to the pool and takes part in the selection. The mechanism described here is a specific type of health checking and can possibly be improved in the future to be more tunable. Currently, the user controls the interval at which the health checks are performed, the consecutive failures count for a host to be considered unhealthy, and the background
io.servicetalk.concurrent.api.Executor
for running the checks.Modifications:
RoundRobinConnectionFactory.Builder
was enhanced to incorporate this mechanism.Result:
Problematic hosts are not used in the requests path and are actively health checked in the background until they are reachable again. The overall latency should increase for DNS
ServiceDiscoverer
users which stumble upon a situation where some addresses returned from the DNS queries are unreachable.