override_host_set conflates different EDS health statuses #23213

markdroth · 2022-09-22T17:49:18Z

This is something we discovered while in the process of implementing Envoy's stateful session affinity feature (see #16698) in gRPC.

From reading the code, it appears that there's a bug in Envoy's handling of the override_host_status field. The field allows setting individual health statuses for which the host-override behavior should be honored. However, internally, Envoy does not actually track all of the health statuses individually; in particular, it treats all of DRAINING, UNHEALTHY, and TIMEOUT as being exactly the same thing, and it treats UNKNOWN and HEALTHY as being the same thing (see code). So for example, if someone sets the override_host_status set to include DRAINING but not UNHEALTHY or TIMEOUT, Envoy will actually honor the host-override for all three of those states. I think this is a bug; it seems very clear from the API what is actually intended here.

I think the underlying problem here is that Envoy does not internally record the actual EDS health status in the Host object; instead, it does a lossy conversion to Host::Health. I suspect that the right fix is to have the Host object record the original EDS health status with full fidelity and then look at that when evaluating the override_host_status set.

CC @wbpcode @htuch @yanavlasov @sanjaypujare

The text was updated successfully, but these errors were encountered:

wbpcode · 2022-09-27T03:53:15Z

I suspect that the right fix is to have the Host object record the original EDS health status with full fidelity and then look at that when evaluating the override_host_status set.

Sounds great. But how to handle the status set by active health checker or outlier detection? 🤔 Should we just keep two statuses flag or allow the health checker to overwrite the EDS status?

markdroth · 2022-09-27T15:12:53Z

I would think that an active health checker failure or an outlier detection ejection would map to the EDS UNHEALTHY state.

sanjaypujare · 2022-09-27T16:22:31Z

or may be maintain 2 status fields as @wbpcode suggested?

markdroth · 2022-09-27T21:11:49Z

I think having two status fields would make the API more complicated, because then we would have to expose a way of controlling whether a given host will be used as an override based on the value of both fields.

The only status that is currently exposed in the xDS API is the EDS health status enum, and that enum is capable of representing the health states that Envoy drives from places other than EDS as well, so it seems to me that that's the right thing to use here.

wbpcode · 2022-10-09T09:31:06Z

/assign

Any way, we should solve this problem.

htuch · 2022-10-10T03:20:48Z

FTR I think it makes sense to keep track of the original EDS status in Host, so +1 to that. At the very least this would make sense when reporting endpoint status in endpoint dumps, but there are other advantages as Mark points out.

…h status (#23432) Part of work to fix #23213. Because the canonical health() method is used widely and is enough for most of cases. We can treat it as abbreviated status. So, we won't update the health() directly. This PR added a new healthStatus() method to the host to provide more specific health status (with type of EDS envoy::config::core::v3::HealthStatus). This status is hybrid of EDS status and runtime active status (from active health checker or outlier detection). Runtime active status will be taken as a priority. This is simplest implementation for this feature. Risk Level: none. not change any exist code but add a new method. Testing: unit. Docs Changes: n/a. Release Notes: n/a. Platform Specific Features: n/a. Signed-off-by: wbpcode <[email protected]>

markdroth · 2022-10-20T04:14:34Z

I don't think this is fully fixed yet.

wbpcode · 2022-10-20T04:48:24Z

Yeah, an additional minor patch is necessary to match the override status with the new health status.

I am doing it now.

github-actions · 2022-11-19T08:01:31Z

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

wbpcode · 2022-11-19T11:44:07Z

active

markdroth added bug triage Issue requires triage labels Sep 22, 2022

adisuissa added area/health_checking and removed triage Issue requires triage labels Sep 23, 2022

repokitteh-read-only bot assigned wbpcode Oct 9, 2022

wbpcode mentioned this issue Oct 11, 2022

upstream: new healthStatus method to provide more detailed host health status #23432

Merged

htuch closed this as completed in #23432 Oct 20, 2022

markdroth reopened this Oct 20, 2022

github-actions bot added the stale stalebot believes this issue/PR has not been touched recently label Nov 19, 2022

github-actions bot removed the stale stalebot believes this issue/PR has not been touched recently label Nov 19, 2022

wbpcode mentioned this issue Nov 20, 2022

upstream: verify all statuses of the override host #24094

Merged

wbpcode closed this as completed in #24094 Dec 1, 2022

htuch mentioned this issue Jul 13, 2023

upstream: fix bugs of healthStatus() method #28081

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

override_host_set conflates different EDS health statuses #23213

override_host_set conflates different EDS health statuses #23213

markdroth commented Sep 22, 2022

wbpcode commented Sep 27, 2022 •

edited

Loading

markdroth commented Sep 27, 2022

sanjaypujare commented Sep 27, 2022

markdroth commented Sep 27, 2022

wbpcode commented Oct 9, 2022

htuch commented Oct 10, 2022

markdroth commented Oct 20, 2022

wbpcode commented Oct 20, 2022

github-actions bot commented Nov 19, 2022

wbpcode commented Nov 19, 2022

override_host_set conflates different EDS health statuses #23213

override_host_set conflates different EDS health statuses #23213

Comments

markdroth commented Sep 22, 2022

wbpcode commented Sep 27, 2022 • edited Loading

markdroth commented Sep 27, 2022

sanjaypujare commented Sep 27, 2022

markdroth commented Sep 27, 2022

wbpcode commented Oct 9, 2022

htuch commented Oct 10, 2022

markdroth commented Oct 20, 2022

wbpcode commented Oct 20, 2022

github-actions bot commented Nov 19, 2022

wbpcode commented Nov 19, 2022

wbpcode commented Sep 27, 2022 •

edited

Loading