Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

failure_mode_allow=true is not fully supported for the grpc authz server. #34705

Closed
konstantin-baidin-y42 opened this issue Jun 12, 2024 · 8 comments · Fixed by #34951
Closed
Labels
area/ext_authz bug stale stalebot believes this issue/PR has not been touched recently

Comments

@konstantin-baidin-y42
Copy link
Contributor

konstantin-baidin-y42 commented Jun 12, 2024

Title: failure_mode_allow=true is not fully supported for the grpc authz server.

Description:
There is inconsistency in how the failure_mode_allow=true option is applied to external authorization servers depending on the protocol: http and grpc.

In case of http authz server, any 5xx statuses lead to accepting the incoming request (fail-open).

In case of grpc authz server, statuses like 14 (Unavailable) lead to rejecting the incoming request. If there is any grpc proxy between envoy and authz server, then this proxy can return 14 (Unavailable) when authz server is unavailable. It leads to the fail-close behaviour even when failure_mode_allow is enabled. As a result, we have random rejections while with failure_mode_allow we want the opposite - don't reject requests because of random failures.

Expected behaviour is to use failue_mode_allow when receiveng grpc statuses like 2 (Unknown), 13 (Internal), 14 (Unavailable).

The following PR #4199 was intended to handle these cases when failure_mode_allow=true and authz server is unhealthy, but it was solved for HTTP authz server only.

Repro steps:

  1. Configure envoy to have failure_mode_allow=true
  2. Make the authz server to return grpc status 14 (Unavailable)
  3. Perform http request to a service in the cluster
  4. Envoy returns 403 (fail-close), while it was expected to perform fail-open.
@konstantin-baidin-y42 konstantin-baidin-y42 added bug triage Issue requires triage labels Jun 12, 2024
@alyssawilk
Copy link
Contributor

cc @esmet @tyxia @ggreenway

@alyssawilk alyssawilk removed the triage Issue requires triage label Jun 13, 2024
@ggreenway
Copy link
Contributor

Seems likely that these error codes are getting misinterpreted as a denied response, or hitting some alternate error path from the normal one.

@konstantin-baidin-y42
Copy link
Contributor Author

Hello, I tested it manually with logging level tracing and found that responses with error status codes are processed in the GrpcClientImpl::onSuccess callback instead of the GrpcClientImpl::onFailure. So I added a check of the status code and similar actions as in the onFailure callback.
I made a PR with a draft of the fix. Can you please take a look? #34951

martinduke pushed a commit to martinduke/envoy that referenced this issue Aug 8, 2024
…er as failure to fix failure_mode_allowed feature (envoyproxy#34951)

Fixes envoyproxy#34705

Signed-off-by: konstantin-baidin-y42 <[email protected]>
Signed-off-by: Martin Duke <[email protected]>
@konstantin-baidin-y42
Copy link
Contributor Author

Since my fix was reverted in #35637, can you please re-open this issue again? @ggreenway
Or should I create a duplicate one?

@konstantin-baidin-y42
Copy link
Contributor Author

Also, do I understand right that this time I should create the same PR, just with the runtime guard to be false by default as I did initially?

@ggreenway ggreenway reopened this Aug 12, 2024
@ggreenway
Copy link
Contributor

You'd need to make it with a config setting to control the behavior, which defaults to the old behavior. Runtime flags are meant to be temporary.

asingh-g pushed a commit to asingh-g/envoy that referenced this issue Aug 20, 2024
…er as failure to fix failure_mode_allowed feature (envoyproxy#34951)

Fixes envoyproxy#34705

Signed-off-by: konstantin-baidin-y42 <[email protected]>
Signed-off-by: asingh-g <[email protected]>
Copy link

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale stalebot believes this issue/PR has not been touched recently label Sep 12, 2024
Copy link

This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions.

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Sep 19, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/ext_authz bug stale stalebot believes this issue/PR has not been touched recently
Projects
None yet
3 participants