
Implement WarningTimeout for TCP/HTTP Checks #4522

Closed
pierresouchay opened this issue Aug 13, 2018 · 13 comments
Labels: type/enhancement Proposed improvement or new feature

Comments

@pierresouchay
Contributor

Feature Description

To allow TCP/HTTP health checks to move to the warning state when they take too long to answer, I propose adding a WarningTimeout attribute (a duration, for instance "200ms").

When the health check runs, nothing else changes; but if the duration of a passing health check exceeds WarningTimeout, the status is reported as "warning" instead of "passing".
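
A minimal sketch of the proposed semantics, assuming illustrative names (`evalStatus` and the `WarningTimeout` field are not part of Consul's actual API):

```go
package main

import (
	"fmt"
	"time"
)

const (
	HealthPassing = "passing"
	HealthWarning = "warning"
)

// evalStatus demotes a passing probe result to warning when the
// probe took longer than warningTimeout; other statuses pass through.
func evalStatus(status string, elapsed, warningTimeout time.Duration) string {
	if status == HealthPassing && warningTimeout > 0 && elapsed > warningTimeout {
		return HealthWarning
	}
	return status
}

func main() {
	start := time.Now()
	// ... the TCP dial or HTTP GET would happen here ...
	elapsed := time.Since(start)
	fmt.Println(evalStatus(HealthPassing, elapsed, 200*time.Millisecond))
}
```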

Use Case(s)

Makes it easy to use #4468 when HTTP applications take too long to answer.

@kyhavlov
Contributor

This seems like a reasonable enhancement to feed some health info back into Consul in cases like an instance of a service being overloaded or having poor network connectivity. What do you think about the name of the field being something like "WarningThreshold" instead, since the request isn't really "timing out" in the normal sense?

@kyhavlov kyhavlov added the type/enhancement Proposed improvement or new feature label Aug 16, 2018
@pierresouchay
Contributor Author

WarningThreshold seems fine to me as well; I will try an implementation ASAP.

@shantanugadgil
Contributor

My two cents:

This should be modeled on the health checks of AWS's load balancer / target group options.
(I am sure folks with more exposure can pitch in more ideas.)

A feature I feel could sometimes be useful is an indication of "steady state", i.e. reporting healthy only once "N" probes in a row have returned healthy (see the sketch after this comment).

As an enhancement, there could be tunables to configure things like the "N" count for the steady state, etc.

Cheers,
Shantanu
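
A rough sketch of that steady-state idea, assuming a hypothetical tunable (the `required` count below, e.g. a "SuccessesBeforeHealthy" option, does not exist in Consul):

```go
package main

import "fmt"

// steadyState reports healthy only after `required` consecutive
// passing probes; any failure resets the streak.
type steadyState struct {
	required  int // hypothetical tunable, e.g. "SuccessesBeforeHealthy"
	successes int
}

func (s *steadyState) observe(passing bool) bool {
	if !passing {
		s.successes = 0
		return false
	}
	s.successes++
	return s.successes >= s.required
}

func main() {
	s := &steadyState{required: 3}
	for _, p := range []bool{true, true, false, true, true, true} {
		fmt.Println(s.observe(p)) // healthy only on the final probe
	}
}
```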

@shantanugadgil
Contributor

@pierresouchay any thoughts about the steady state idea?

@pierresouchay
Contributor Author

@shantanugadgil adding a new health check type is probably difficult because it involves lots of changes in Consul as well as across the ecosystem: most tools expect only three states.

The main idea behind this change is to let systems that do not support HTTP 429 report the warning state easily (or to detect server/OS slowness for TCP).
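
For context, Consul's HTTP checks already record an HTTP 429 response as warning (2xx is passing, anything else is critical); the hard part is getting the application to produce the 429. A hypothetical overload-aware handler, as a sketch (the in-flight counter and its threshold are invented for illustration):

```go
package main

import (
	"log"
	"net/http"
	"sync/atomic"
)

var inFlight int64 // would be maintained by the app's request middleware

// healthz answers 200 normally, but 429 when the app considers
// itself overloaded; Consul's HTTP check records 429 as "warning".
func healthz(w http.ResponseWriter, r *http.Request) {
	if atomic.LoadInt64(&inFlight) > 100 { // arbitrary threshold
		w.WriteHeader(http.StatusTooManyRequests)
		return
	}
	w.WriteHeader(http.StatusOK)
}

func main() {
	http.HandleFunc("/healthz", healthz)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```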

@shantanugadgil
Contributor

@pierresouchay I wasn't suggesting adding a new type of health check.
All I was saying was to add more fields to the current health checks, giving them the "capabilities" that help when people compare them to "cloudy" health check offerings. 😁

When combined with "alias" type health checks, it could probably also cover the often-asked question about a "ready" state versus a "healthy" state (see the sketch after this comment).

Say a level1 health check performs only basic tests (e.g. "ps" output for process visibility, or a TCP check).
Another health check, say a level2 health check, doesn't start firing until the level1 check is passing.

Cheers,
Shantanu
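
One way to picture that level1/level2 gating, purely as an illustration (this is not an existing Consul feature; the `check` type below is invented for the sketch):

```go
package main

import "fmt"

// check is a probe plus an optional prerequisite check that must be
// passing before this probe fires at all.
type check struct {
	name     string
	probe    func() bool
	requires *check
	passing  bool
}

func (c *check) run() {
	if c.requires != nil && !c.requires.passing {
		c.passing = false // gated: level2 waits for level1 to pass
		return
	}
	c.passing = c.probe()
}

func main() {
	level1 := &check{name: "process-visible", probe: func() bool { return true }}
	level2 := &check{name: "deep-health", requires: level1, probe: func() bool { return true }}
	level1.run()
	level2.run()
	fmt.Println(level1.passing, level2.passing) // true true
}
```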

@pierresouchay
Contributor Author

@shantanugadgil Well, in that case yes, but probably not within this PR. More generally, we need to be able to transmit more information to LB systems; for instance, I would like to implement #3973 in order to allow a LB to gradually increase the traffic targeting a host (by deducing that the service is passing, but only since 1s ago) to avoid sending it lots of traffic immediately.
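
A back-of-the-envelope version of that ramp-up, assuming the load balancer knows when the instance last became passing (the function and parameter names are invented for the sketch):

```go
package main

import (
	"fmt"
	"time"
)

// rampWeight scales an instance's weight linearly from 1 up to
// maxWeight over rampPeriod after the instance became passing, so a
// freshly healthy node is not flooded with traffic immediately.
func rampWeight(passingSince time.Time, rampPeriod time.Duration, maxWeight int) int {
	elapsed := time.Since(passingSince)
	if elapsed >= rampPeriod {
		return maxWeight
	}
	w := int(float64(maxWeight) * float64(elapsed) / float64(rampPeriod))
	if w < 1 {
		w = 1
	}
	return w
}

func main() {
	// Passing for 1s of a 10s ramp: roughly 10% of full weight.
	fmt.Println(rampWeight(time.Now().Add(-time.Second), 10*time.Second, 100))
}
```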

We could use the check's notes as a hack for what you describe, but having proper semantics would be a great addition. I don't know if it should be at the service level or at the health check level, but this really deserves discussion.

I would be glad to contribute to such a discussion if you create an issue with your ideas regarding this.

Kind Regards

@banks
Member

banks commented Sep 10, 2018

I'm not sure about this. Not sure why I didn't see it before, but my feeling is that, as described, it's too simplistic and in practice will result in a lot of "flapping" or service discovery churn.

It's only a warning, not a critical failure, so it's not true "flapping"; but assuming that something is actually paying attention to these statuses, any time a service happens to be answering a big request or doing a GC at the moment the check runs, it will update load balancer weights everywhere.

It seems like as soon as anyone starts using this in production, they'll find it either causes more disruption than it solves and makes load fluctuate wildly, or has no effect at all.

The other issue I see is that a check against a "healthz"-style endpoint often doesn't do any actual work, so in many cases its returning quickly is not a good indication that the service is under no load. Granted, this is no worse than today, but a robust load steering system would have to account for it.

While this might be useful as proposed for some folks, it feels like a very "poor man's" solution to traffic steering, and I don't think we want to set the precedent that we will solve it this way, since current Consul health checks are not really high fidelity enough to meet expectations for dynamic traffic steering/load shedding. The big issue is that it re-purposes check latency as an implicit sign of bad health, which can easily produce both false positives (see below) and false negatives, with no way to smooth that result over time or separate signal from noise.

We could build a real solution for that, including a way for applications to explicitly report some measure of their current load back to Consul, or for Consul to probe in a more high-fidelity way (or at the very least keep some history of failures to enable a better determination of when something is "degraded" versus just one unlucky request). I'm open to discussing the simplest thing there that seems actually viable/useful, but note the next point.
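
As one illustration of "keeping some history", the demotion could require several consecutive slow probes before flipping to warning; a sketch, not a concrete proposal (names and thresholds are invented):

```go
package main

import (
	"fmt"
	"time"
)

// slowStreak flags degradation only after k consecutive probes
// exceed the latency threshold, filtering out a one-off GC pause
// or a single unlucky request.
type slowStreak struct {
	k         int
	threshold time.Duration
	streak    int
}

func (s *slowStreak) degraded(elapsed time.Duration) bool {
	if elapsed > s.threshold {
		s.streak++
	} else {
		s.streak = 0
	}
	return s.streak >= s.k
}

func main() {
	s := &slowStreak{k: 3, threshold: 200 * time.Millisecond}
	for _, ms := range []int{250, 300, 150, 250, 260, 270} {
		fmt.Println(s.degraded(time.Duration(ms) * time.Millisecond))
	}
}
```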

I think the better approach for Consul in general, in the medium term, will come from L7 support in Envoy and potentially other proxies, which can all do this kind of thing far better since they have a complete picture of the full request latency of each instance over time.

Thoughts?

@pierresouchay
Contributor Author

@banks While you might be right about several things, such as GC pauses, a really overloaded system may still take a long time to accept even a single connection.

First, note that this warning state is ONLY set when the health check is otherwise responding correctly.
In our shop, some services did properly implement 429 to trigger the warning state (since we want to avoid those nodes receiving too much traffic), but it is very difficult to ask developers to set up proper thresholds for returning the correct HTTP code. In some languages it is even very hard to achieve; think of a simple PHP program checking some of its requirements.

For sure, this option should be used with care, but it allows easy control of a third state without significant changes to apps. I have in mind, for instance, a few instances of a service connected to a database cluster with stickiness. One of the databases is very slow to respond; being able to route most of the traffic, using weights, to the service instances not stuck to the very slow database would allow that database to recover.

While load is a good indicator (we are going to test this soon => return the warning state when load has been too high for a long time), it is not the only one, and being able to change weights based on bad performance seems like a good first step.
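
That load-based variant can already be approximated today with a script check, since Consul maps a script's exit code 0 to passing, 1 to warning, and anything else to critical. A hypothetical Linux-only sketch (the threshold is arbitrary):

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

func main() {
	data, err := os.ReadFile("/proc/loadavg") // Linux-only
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(2) // critical: cannot determine load
	}
	load1, err := strconv.ParseFloat(strings.Fields(string(data))[0], 64)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(2)
	}
	fmt.Printf("load1=%.2f\n", load1)
	if load1 > 8.0 { // arbitrary threshold for illustration
		os.Exit(1) // warning
	}
	// exit 0: passing
}
```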

@banks
Member

banks commented Sep 10, 2018 via email

pierresouchay added a commit to pierresouchay/consul that referenced this issue May 10, 2019
It implements setting the warning state when an HTTP check returns 200, or when a TCP connection succeeds but is too slow, as described in hashicorp#4522.

Combined with hashicorp#4468 it will make it very easy to reduce calls/sec to slow nodes in a completely automatic way.
@hanshasselberg
Member

Hello @pierresouchay, are you still interested in adding this feature?

@hanshasselberg hanshasselberg self-assigned this Dec 19, 2019
@hanshasselberg hanshasselberg added the waiting-reply Waiting on response from Original Poster or another individual in the thread label Dec 19, 2019
@pierresouchay
Contributor Author

@i0rek Well, interested yes, as I had to implement a script check to do this when I hit the issue (PHP scripts on a machine, so no ability to fine-tune timeouts).
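
Such a workaround might look like a small script check that times the HTTP probe itself and maps the result onto Consul's script-check exit codes (0 passing, 1 warning, otherwise critical); a sketch where the URL and threshold are placeholders:

```go
package main

import (
	"fmt"
	"net/http"
	"os"
	"time"
)

func main() {
	start := time.Now()
	resp, err := http.Get("http://localhost:8080/healthz") // placeholder URL
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(2) // critical: the probe failed outright
	}
	resp.Body.Close()
	elapsed := time.Since(start)
	fmt.Printf("status=%d elapsed=%s\n", resp.StatusCode, elapsed)
	if resp.StatusCode != http.StatusOK {
		os.Exit(2) // critical
	}
	if elapsed > 200*time.Millisecond { // placeholder threshold
		os.Exit(1) // warning: passing, but slow
	}
	// exit 0: passing
}
```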

But you don't seem interested in the current PR #4576. Since it is not a big deal for us, unless you are interested or provide direction, I won't work on it anymore. If you think it might be fine, I'll be OK to rebase and make the appropriate fixes.

On our infrastructure it is not a big deal, as we have good control over most health checks, so we can do it another way. Furthermore, our other (more important) patch to handle temporary timeouts has been merged: #5739, so this is no longer such a big deal for us.

@stale stale bot removed the waiting-reply Waiting on response from Original Poster or another individual in the thread label Jan 10, 2020
@hanshasselberg
Member

Thank you for the detailed explanation. We are not interested in adding that feature ourselves right now, and given that it is not a big deal for you either, I am closing this issue and the corresponding PR.
