Implement WarningTimeout for TCP/HTTP Checks #4522
This seems like a reasonable enhancement to feed some health info back into Consul in cases like an instance of a service being overloaded or having poor network connectivity. What do you think about the name of the field being something like "WarningThreshold" instead, since the request isn't really "timing out" in the normal sense?
WarningThreshold seems fine to me as well, I will try an implementation asap.
My two cents: this should be modeled like the "health checks" of AWS load balancer/target group options. A feature I feel could sometimes be useful is an indication of "steady state", i.e. when "N" probes return healthy. As an enhancement, there could possibly be tunables to configure things like the count required for "steady state", etc. Cheers,
@pierresouchay any thoughts about the steady state idea?
@shantanugadgil Adding a new healthcheck type is probably difficult because it involves lots of changes in Consul as well as within the ecosystem: most tools expect only 3 states. The main idea behind this change is to allow systems not supporting HTTP 429 to support the warning state easily (or to detect server/OS slowness for TCP).
@pierresouchay I wasn't suggesting adding a new type of health check. When combined with "alias" type health checks, it could probably also cover the often-asked question about the "ready" state versus the "healthy" state. Say, a … Cheers,
@shantanugadgil Well, in that case yes, but probably not within this PR. More generally, we need to be able to transmit more information to LB systems; for instance, I would like to implement #3973 in order to allow a LB to gradually increase the traffic targeting a host (by deducing that the service is Passing, but only since 1s ago) to avoid sending lots of traffic immediately. We could use notes as a hack for what you describe, but having proper semantics would be a great addition. I don't know if it should be at the service level or at the health check level, but this really deserves discussion. I would be glad to contribute to such a discussion if you create an issue with your ideas regarding this. Kind Regards
I'm not sure about this. Not sure why I didn't see it yet, but my feeling is that it's too simplistic and in practice it will result in a lot of "flapping" or service discovery churn as described. It's only a warning, not a critical failure, so it's not true "flapping", but assuming that something is actually paying attention to these, then any time a service happens to be answering a big request or doing a GC at the same moment the check runs, it will update all load balancer weights everywhere. It seems like as soon as anyone starts using this in production they'll find it actually causes more disruption than it solves and makes load fluctuate wildly, or has no effect at all.

The other issue I see is that often a check against a "healthz" style endpoint doesn't do any actual work, so in many cases having it return quickly is not a good indication that the service is not under any load. This is no worse than now, granted, but to make this a robust load-steering system that should be accounted for.

While this might be useful as proposed for some folks, it feels like a very "poor man's" solution to traffic steering, and I don't think we want to set the precedent that we will solve it in this way, since current Consul "health" checks are not really high-fidelity enough to meet expectations for dynamic traffic steering/load shedding. The big issue is that it's re-purposing an implicit sign of bad health which can easily have both false positives (see below) and false negatives, with no way to smooth that result over time or separate signal from noise, etc.

We could build a real solution for that, including a way for applications to either explicitly provide some measure of current load back to Consul, or for Consul to probe in a more high-fidelity way (or at the very least keep some history of failures to enable better determination of when something is "degraded" vs. just one unlucky request). I'm open to discussion on the simplest thing there that seems actually viable/useful, but note the next point.

I think the better approach for Consul in general in the medium term will come from L7 support in Envoy and potentially other proxies, which can all do this kind of thing way better since they have a complete picture of the full request latency performance for each instance over time, etc.

Thoughts?
@banks While you might be right for several things, such as GC, an overloaded system might still take time to accept a single connection when really overloaded.

First, we have to note that this warning state is ONLY set when the healthcheck is responding correctly.

In our shop, some services did properly implement 429 to trigger the warning state (since we want to avoid those nodes receiving too much traffic), but it is very difficult to ask developers to set up proper thresholds for responding with the correct HTTP code. There are even some languages where it is very difficult to achieve; think of a simple PHP program checking some of its requirements.

For sure, this option should be used with care, but it easily allows controlling a third state without performing significant changes to apps. I have in mind, for instance, a few instances of a service connected to a database cluster with stickiness. One of the databases is very slow to respond; having the ability to route most of the traffic, using weights, to the instances of the service not stuck to the very slow database allows this database to recover, for instance.

While load is a good indicator (we are going to test this soon => return warning state when load is too high for a long time), it's not the only one, and being able to change weight based on bad performance seems like a good first step.
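To make the proposed semantics concrete, here is a minimal sketch in Go (not Consul's actual implementation; `runHTTPCheck`, the endpoint URL, and the threshold values are illustrative) of how a passing-but-slow HTTP probe could be demoted to warning, while a failed probe stays critical:

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

const (
	HealthPassing  = "passing"
	HealthWarning  = "warning"
	HealthCritical = "critical"
)

// runHTTPCheck probes url and returns a Consul-style check status.
// A successful probe that takes longer than warningThreshold is
// demoted to warning, per this proposal.
func runHTTPCheck(url string, timeout, warningThreshold time.Duration) string {
	client := &http.Client{Timeout: timeout}
	start := time.Now()
	resp, err := client.Get(url)
	if err != nil {
		// Probe failed outright (refused, timed out, ...): critical.
		return HealthCritical
	}
	defer resp.Body.Close()
	elapsed := time.Since(start)

	status := HealthCritical
	switch {
	case resp.StatusCode >= 200 && resp.StatusCode <= 299:
		status = HealthPassing
	case resp.StatusCode == 429:
		// Consul's HTTP checks already map 429 Too Many Requests to warning.
		status = HealthWarning
	}

	// The behaviour proposed in this issue: a probe that succeeded but
	// was slower than the threshold becomes warning instead of passing.
	if status == HealthPassing && warningThreshold > 0 && elapsed > warningThreshold {
		status = HealthWarning
	}
	return status
}

func main() {
	fmt.Println(runHTTPCheck("http://localhost:8080/healthz", 10*time.Second, 200*time.Millisecond))
}
```

Note the property Pierre describes: the threshold can only demote a passing result, never promote a failing one, so a slow failure remains critical.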
I'd feel better about it if we had either a simulation/model or real-world tests that showed this model is actually going to work.

Our experience from Serf is that overly simplistic "failure detectors", especially ones that rely on implicit feedback, end up causing more trouble than they solve, and the network effects they have on the overall traffic shape and load if they trigger often are very hard to predict.

I'd suggest, if this is an important feature (I can see how it is), that we try to consider a really good way to solve it robustly first, rather than just throw a simple thing like this up without thinking through/testing/modelling how it will actually behave, and document it as "subtle" or an "advanced feature".

Does that seem reasonable?
It implements setting the Warning state for HTTP 200 responses, or when a TCP connection succeeds but is too slow, as described in hashicorp#4522. Combined with hashicorp#4468, it will very easily allow reducing calls/sec on slow nodes in a completely automatic way.
Hello @pierresouchay, are you still interested in adding this feature?
@i0rek Well, interested, yes, as I had to implement a script check to do it when I had the issue (PHP scripts on a machine, so without the ability to fine-tune timeouts). But you don't seem interested in the current PR #4576; since it is not a big deal for us, unless you are interested or provide direction, I won't work on it anymore. But if you think it might be fine, I'll be OK to rebase and do the appropriate fixes. On our infrastructure, it is not a big deal as we have good control over most health checks, so we can do it another way. Furthermore, our other (more important) patch to handle temporary timeouts has been merged: #5739 -> so it is not such a big deal for us.
Thank you for the detailed explanation. We are not interested right now in adding that feature ourselves, and given that it is not a big deal for you either, I am closing this issue and the corresponding PR.
Feature Description
In order to allow TCP/HTTP health checks to pass to the warning state when checks take too much time to answer, I propose adding a WarningTimeout attribute (a duration, for instance "200ms").
When the health check is run, nothing is changed, but if the duration of a "Passing" health check is more than the WarningTimeout attribute, then the "Warning" status is used instead of "Passing".
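For illustration, a check definition using the proposed attribute might look like the following. This is a sketch only: the WarningTimeout field was never merged into Consul (and "WarningThreshold" was also suggested as a name above); the name, URL, and durations are placeholders.

```json
{
  "check": {
    "name": "api-http",
    "http": "http://localhost:8080/healthz",
    "interval": "10s",
    "timeout": "1s",
    "WarningTimeout": "200ms"
  }
}
```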
Use Case(s)
Allow easy use of #4468 when HTTP applications take too much time to answer.
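Combined with the per-state service weights from #4468, an instance that drops to warning would automatically receive a smaller share of traffic. A sketch of the combined configuration, again assuming the hypothetical WarningTimeout field (service name, port, and durations are illustrative):

```json
{
  "service": {
    "name": "api",
    "port": 8080,
    "weights": {
      "passing": 10,
      "warning": 1
    },
    "check": {
      "http": "http://localhost:8080/healthz",
      "interval": "10s",
      "timeout": "1s",
      "WarningTimeout": "200ms"
    }
  }
}
```

With these weights, discovery results would favor passing instances roughly 10:1 over warning ones, shedding load from slow nodes without removing them from rotation entirely.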