Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

override_host_set conflates different EDS health statuses #23213

Closed
markdroth opened this issue Sep 22, 2022 · 10 comments · Fixed by #23432 or #24094
Closed

override_host_set conflates different EDS health statuses #23213

markdroth opened this issue Sep 22, 2022 · 10 comments · Fixed by #23432 or #24094

Comments

@markdroth
Copy link
Contributor

This is something we discovered while in the process of implementing Envoy's stateful session affinity feature (see #16698) in gRPC.

From reading the code, it appears that there's a bug in Envoy's handling of the override_host_status field. The field allows setting individual health statuses for which the host-override behavior should be honored. However, internally, Envoy does not actually track all of the health statuses individually; in particular, it treats all of DRAINING, UNHEALTHY, and TIMEOUT as being exactly the same thing, and it treats UNKNOWN and HEALTHY as being the same thing (see code). So for example, if someone sets the override_host_status set to include DRAINING but not UNHEALTHY or TIMEOUT, Envoy will actually honor the host-override for all three of those states. I think this is a bug; it seems very clear from the API what is actually intended here.

I think the underlying problem here is that Envoy does not internally record the actual EDS health status in the Host object; instead, it does a lossy conversion to Host::Health. I suspect that the right fix is to have the Host object record the original EDS health status with full fidelity and then look at that when evaluating the override_host_status set.

CC @wbpcode @htuch @yanavlasov @sanjaypujare

@markdroth markdroth added bug triage Issue requires triage labels Sep 22, 2022
@adisuissa adisuissa added area/health_checking and removed triage Issue requires triage labels Sep 23, 2022
@wbpcode
Copy link
Member

wbpcode commented Sep 27, 2022

I suspect that the right fix is to have the Host object record the original EDS health status with full fidelity and then look at that when evaluating the override_host_status set.

Sounds great. But how to handle the status set by active health checker or outlier detection? 🤔 Should we just keep two statuses flag or allow the health checker to overwrite the EDS status?

@markdroth
Copy link
Contributor Author

I would think that an active health checker failure or an outlier detection ejection would map to the EDS UNHEALTHY state.

@sanjaypujare
Copy link
Contributor

or may be maintain 2 status fields as @wbpcode suggested?

@markdroth
Copy link
Contributor Author

I think having two status fields would make the API more complicated, because then we would have to expose a way of controlling whether a given host will be used as an override based on the value of both fields.

The only status that is currently exposed in the xDS API is the EDS health status enum, and that enum is capable of representing the health states that Envoy drives from places other than EDS as well, so it seems to me that that's the right thing to use here.

@wbpcode
Copy link
Member

wbpcode commented Oct 9, 2022

/assign

Any way, we should solve this problem.

@htuch
Copy link
Member

htuch commented Oct 10, 2022

FTR I think it makes sense to keep track of the original EDS status in Host, so +1 to that. At the very least this would make sense when reporting endpoint status in endpoint dumps, but there are other advantages as Mark points out.

htuch pushed a commit that referenced this issue Oct 20, 2022
…h status (#23432)

Part of work to fix #23213.

Because the canonical health() method is used widely and is enough for most of cases. We can treat it as abbreviated status. So, we won't update the health() directly.

This PR added a new healthStatus() method to the host to provide more specific health status (with type of EDS envoy::config::core::v3::HealthStatus).

This status is hybrid of EDS status and runtime active status (from active health checker or outlier detection). Runtime active status will be taken as a priority.

This is simplest implementation for this feature.

Risk Level: none. not change any exist code but add a new method.
Testing: unit.
Docs Changes: n/a.
Release Notes: n/a.
Platform Specific Features: n/a.

Signed-off-by: wbpcode <[email protected]>
@markdroth
Copy link
Contributor Author

I don't think this is fully fixed yet.

@markdroth markdroth reopened this Oct 20, 2022
@wbpcode
Copy link
Member

wbpcode commented Oct 20, 2022

Yeah, an additional minor patch is necessary to match the override status with the new health status.

I am doing it now.

@github-actions
Copy link

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale stalebot believes this issue/PR has not been touched recently label Nov 19, 2022
@wbpcode
Copy link
Member

wbpcode commented Nov 19, 2022

active

@github-actions github-actions bot removed the stale stalebot believes this issue/PR has not been touched recently label Nov 19, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment