
Implement WarningTimeout for TCP/HTTP Checks #4522

Closed
pierresouchay opened this issue Aug 13, 2018 · 13 comments
Labels: type/enhancement Proposed improvement or new feature

Comments

@pierresouchay
Contributor

Feature Description

To allow TCP/HTTP health checks to move to the warning state when they take too long to answer, I propose adding a WarningTimeout attribute (a duration, for instance "200ms").

When the health check runs, nothing else changes; but if the duration of a passing health check exceeds WarningTimeout, the status is reported as "warning" instead of "passing".
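
A minimal sketch of the proposed semantics, assuming illustrative names (`evalStatus` and the `WarningTimeout` field are not part of Consul's actual API):

```go
package main

import (
	"fmt"
	"time"
)

const (
	HealthPassing = "passing"
	HealthWarning = "warning"
)

// evalStatus demotes a passing probe result to warning when the
// probe took longer than warningTimeout; other statuses pass through.
func evalStatus(status string, elapsed, warningTimeout time.Duration) string {
	if status == HealthPassing && warningTimeout > 0 && elapsed > warningTimeout {
		return HealthWarning
	}
	return status
}

func main() {
	start := time.Now()
	// ... the TCP dial or HTTP GET would happen here ...
	elapsed := time.Since(start)
	fmt.Println(evalStatus(HealthPassing, elapsed, 200*time.Millisecond))
}
```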

Use Case(s)

Makes it easy to use #4468 when HTTP applications take too long to answer.

@kyhavlov
Contributor

This seems like a reasonable enhancement to feed some health info back into Consul in cases like an instance of a service being overloaded or having poor network connectivity. What do you think about the name of the field being something like "WarningThreshold" instead, since the request isn't really "timing out" in the normal sense?

@kyhavlov kyhavlov added the type/enhancement Proposed improvement or new feature label Aug 16, 2018
@pierresouchay
Contributor Author

WarningThreshold seems fine to me as well; I will try an implementation ASAP.

@shantanugadgil
Contributor

My two cents:

This should be modeled on the health checks of AWS's load balancer / target group options.
(I am sure folks with more exposure can pitch in more ideas.)

A feature I feel could sometimes be useful is an indication of "steady state", i.e. reporting healthy only once "N" probes in a row have returned healthy (see the sketch after this comment).

As an enhancement, there could be tunables to configure things like the "N" count for the steady state, etc.

Cheers,
Shantanu
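
A rough sketch of that steady-state idea, assuming a hypothetical tunable (the `required` count below, e.g. a "SuccessesBeforeHealthy" option, does not exist in Consul):

```go
package main

import "fmt"

// steadyState reports healthy only after `required` consecutive
// passing probes; any failure resets the streak.
type steadyState struct {
	required  int // hypothetical tunable, e.g. "SuccessesBeforeHealthy"
	successes int
}

func (s *steadyState) observe(passing bool) bool {
	if !passing {
		s.successes = 0
		return false
	}
	s.successes++
	return s.successes >= s.required
}

func main() {
	s := &steadyState{required: 3}
	for _, p := range []bool{true, true, false, true, true, true} {
		fmt.Println(s.observe(p)) // healthy only on the final probe
	}
}
```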

@shantanugadgil
Contributor

@pierresouchay any thoughts about the steady state idea?

@pierresouchay
Contributor Author

@shantanugadgil adding a new health check type is probably difficult because it involves lots of changes in Consul as well as across the ecosystem: most tools expect only three states.

The main idea behind this change is to let systems that do not support HTTP 429 report the warning state easily (or to detect server/OS slowness for TCP).
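
For context, Consul's HTTP checks already record an HTTP 429 response as warning (2xx is passing, anything else is critical); the hard part is getting the application to produce the 429. A hypothetical overload-aware handler, as a sketch (the in-flight counter and its threshold are invented for illustration):

```go
package main

import (
	"log"
	"net/http"
	"sync/atomic"
)

var inFlight int64 // would be maintained by the app's request middleware

// healthz answers 200 normally, but 429 when the app considers
// itself overloaded; Consul's HTTP check records 429 as "warning".
func healthz(w http.ResponseWriter, r *http.Request) {
	if atomic.LoadInt64(&inFlight) > 100 { // arbitrary threshold
		w.WriteHeader(http.StatusTooManyRequests)
		return
	}
	w.WriteHeader(http.StatusOK)
}

func main() {
	http.HandleFunc("/healthz", healthz)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```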

@shantanugadgil
Contributor

@pierresouchay I wasn't suggesting adding a new type of health check.
All I was saying was to add more fields to the current health checks, giving them the "capabilities" that help when people compare them to "cloudy" health check offerings. 😁

When combined with "alias" type health checks, it could probably also cover the often-asked question about a "ready" state versus a "healthy" state (see the sketch after this comment).

Say a level1 health check performs only basic tests (e.g. "ps" output for process visibility, or a TCP check).
Another health check, say a level2 health check, doesn't start firing until the level1 check is passing.

Cheers,
Shantanu
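
One way to picture that level1/level2 gating, purely as an illustration (this is not an existing Consul feature; the `check` type below is invented for the sketch):

```go
package main

import "fmt"

// check is a probe plus an optional prerequisite check that must be
// passing before this probe fires at all.
type check struct {
	name     string
	probe    func() bool
	requires *check
	passing  bool
}

func (c *check) run() {
	if c.requires != nil && !c.requires.passing {
		c.passing = false // gated: level2 waits for level1 to pass
		return
	}
	c.passing = c.probe()
}

func main() {
	level1 := &check{name: "process-visible", probe: func() bool { return true }}
	level2 := &check{name: "deep-health", requires: level1, probe: func() bool { return true }}
	level1.run()
	level2.run()
	fmt.Println(level1.passing, level2.passing) // true true
}
```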

@pierresouchay
Contributor Author

@shantanugadgil Well, in that case yes, but probably not within this PR. More generally, we need to be able to transmit more information to LB systems; for instance, I would like to implement #3973 in order to allow a LB to gradually increase the traffic targeting a host (by deducing that the service is passing, but only since 1s ago) to avoid sending it lots of traffic immediately.
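
A back-of-the-envelope version of that ramp-up, assuming the load balancer knows when the instance last became passing (the function and parameter names are invented for the sketch):

```go
package main

import (
	"fmt"
	"time"
)

// rampWeight scales an instance's weight linearly from 1 up to
// maxWeight over rampPeriod after the instance became passing, so a
// freshly healthy node is not flooded with traffic immediately.
func rampWeight(passingSince time.Time, rampPeriod time.Duration, maxWeight int) int {
	elapsed := time.Since(passingSince)
	if elapsed >= rampPeriod {
		return maxWeight
	}
	w := int(float64(maxWeight) * float64(elapsed) / float64(rampPeriod))
	if w < 1 {
		w = 1
	}
	return w
}

func main() {
	// Passing for 1s of a 10s ramp: roughly 10% of full weight.
	fmt.Println(rampWeight(time.Now().Add(-time.Second), 10*time.Second, 100))
}
```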

We could use the check's notes as a hack for what you describe, but having proper semantics would be a great addition. I don't know if it should be at the service level or at the health check level, but this really deserves discussion.

I would be glad to contribute to such a discussion if you create an issue with your ideas regarding this.

Kind Regards

@banks
Member

banks commented Sep 10, 2018

I'm not sure about this. Not sure why I didn't see it before, but my feeling is that, as described, it's too simplistic and in practice will result in a lot of "flapping" or service discovery churn.

It's only a warning, not a critical failure, so it's not true "flapping"; but assuming that something is actually paying attention to these statuses, any time a service happens to be answering a big request or doing a GC at the moment the check runs, it will update load balancer weights everywhere.

It seems like as soon as anyone starts using this in production, they'll find it either causes more disruption than it solves and makes load fluctuate wildly, or has no effect at all.

The other issue I see is that a check against a "healthz"-style endpoint often doesn't do any actual work, so in many cases its returning quickly is not a good indication that the service is under no load. Granted, this is no worse than today, but a robust load steering system would have to account for it.

While this might be useful as proposed for some folks, it feels like a very "poor man's" solution to traffic steering, and I don't think we want to set the precedent that we will solve it this way, since current Consul health checks are not really high fidelity enough to meet expectations for dynamic traffic steering/load shedding. The big issue is that it re-purposes check latency as an implicit sign of bad health, which can easily produce both false positives (see below) and false negatives, with no way to smooth that result over time or separate signal from noise.

We could build a real solution for that, including a way for applications to explicitly report some measure of their current load back to Consul, or for Consul to probe in a more high-fidelity way (or at the very least keep some history of failures to enable a better determination of when something is "degraded" versus just one unlucky request). I'm open to discussing the simplest thing there that seems actually viable/useful, but note the next point.
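
As one illustration of "keeping some history", the demotion could require several consecutive slow probes before flipping to warning; a sketch, not a concrete proposal (names and thresholds are invented):

```go
package main

import (
	"fmt"
	"time"
)

// slowStreak flags degradation only after k consecutive probes
// exceed the latency threshold, filtering out a one-off GC pause
// or a single unlucky request.
type slowStreak struct {
	k         int
	threshold time.Duration
	streak    int
}

func (s *slowStreak) degraded(elapsed time.Duration) bool {
	if elapsed > s.threshold {
		s.streak++
	} else {
		s.streak = 0
	}
	return s.streak >= s.k
}

func main() {
	s := &slowStreak{k: 3, threshold: 200 * time.Millisecond}
	for _, ms := range []int{250, 300, 150, 250, 260, 270} {
		fmt.Println(s.degraded(time.Duration(ms) * time.Millisecond))
	}
}
```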

I think the better approach for Consul in general, in the medium term, will come from L7 support in Envoy and potentially other proxies, which can all do this kind of thing far better since they have a complete picture of the full request latency of each instance over time.

Thoughts?

@pierresouchay
Contributor Author

@banks While you might be right about several things, such as GC pauses, a really overloaded system may still take a long time to accept even a single connection.

First, note that this warning state is ONLY set when the health check is otherwise responding correctly.
In our shop, some services did properly implement 429 to trigger the warning state (since we want to avoid those nodes receiving too much traffic), but it is very difficult to ask developers to set up proper thresholds for returning the correct HTTP code. In some languages it is even very hard to achieve; think of a simple PHP program checking some of its requirements.

For sure, this option should be used with care, but it allows easy control of a third state without significant changes to apps. I have in mind, for instance, a few instances of a service connected to a database cluster with stickiness. One of the databases is very slow to respond; being able to route most of the traffic, using weights, to the service instances not stuck to the very slow database would allow that database to recover.

While load is a good indicator (we are going to test this soon => return the warning state when load has been too high for a long time), it is not the only one, and being able to change weights based on bad performance seems like a good first step.
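
That load-based variant can already be approximated today with a script check, since Consul maps a script's exit code 0 to passing, 1 to warning, and anything else to critical. A hypothetical Linux-only sketch (the threshold is arbitrary):

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

func main() {
	data, err := os.ReadFile("/proc/loadavg") // Linux-only
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(2) // critical: cannot determine load
	}
	load1, err := strconv.ParseFloat(strings.Fields(string(data))[0], 64)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(2)
	}
	fmt.Printf("load1=%.2f\n", load1)
	if load1 > 8.0 { // arbitrary threshold for illustration
		os.Exit(1) // warning
	}
	// exit 0: passing
}
```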

@banks
Member

banks commented Sep 10, 2018 via email

pierresouchay added a commit to pierresouchay/consul that referenced this issue May 10, 2019
It implements setting the warning state when an HTTP check returns 200, or when a TCP connection succeeds but is too slow, as described in hashicorp#4522.

Combined with hashicorp#4468 it will make it very easy to reduce calls/sec to slow nodes in a completely automatic way.
@hanshasselberg
Member

Hello @pierresouchay, are you still interested in adding this feature?

@hanshasselberg hanshasselberg self-assigned this Dec 19, 2019
@hanshasselberg hanshasselberg added the waiting-reply Waiting on response from Original Poster or another individual in the thread label Dec 19, 2019
@pierresouchay
Contributor Author

@i0rek Well, interested yes, as I had to implement a script check to do this when I hit the issue (PHP scripts on a machine, so no ability to fine-tune timeouts).
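
Such a workaround might look like a small script check that times the HTTP probe itself and maps the result onto Consul's script-check exit codes (0 passing, 1 warning, otherwise critical); a sketch where the URL and threshold are placeholders:

```go
package main

import (
	"fmt"
	"net/http"
	"os"
	"time"
)

func main() {
	start := time.Now()
	resp, err := http.Get("http://localhost:8080/healthz") // placeholder URL
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(2) // critical: the probe failed outright
	}
	resp.Body.Close()
	elapsed := time.Since(start)
	fmt.Printf("status=%d elapsed=%s\n", resp.StatusCode, elapsed)
	if resp.StatusCode != http.StatusOK {
		os.Exit(2) // critical
	}
	if elapsed > 200*time.Millisecond { // placeholder threshold
		os.Exit(1) // warning: passing, but slow
	}
	// exit 0: passing
}
```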

But you don't seem interested in the current PR #4576. Since it is not a big deal for us, unless you are interested or provide direction, I won't work on it anymore. If you think it might be fine, I'll be OK to rebase and make the appropriate fixes.

On our infrastructure it is not a big deal, as we have good control over most health checks, so we can do it another way. Furthermore, our other (more important) patch to handle temporary timeouts has been merged: #5739, so this is no longer such a big deal for us.

@stale stale bot removed the waiting-reply Waiting on response from Original Poster or another individual in the thread label Jan 10, 2020
@hanshasselberg
Member

Thank you for the detailed explanation. We are not interested in adding that feature ourselves right now, and given that it is not a big deal for you either, I am closing this issue and the corresponding PR.
