Timeouts & other failures on the dataplane #1031

Open
alexsnaps opened this issue Nov 18, 2024 · 0 comments

Currently, requests handled by the wasm-shim (Auth & RL as of now) can run into "issues": configuration issues, network issues with upstream services (e.g. Limitador or Authorino), or timeouts talking to these services. We have, as part of the wasm-shim's configuration, a way to either allow requests through in the face of such a failure, or deny them (see #866 for more details). But I'm unsure this is the proper model to apply here.
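
As a point of reference, here's a minimal sketch of that knob, assuming per-service configuration along the lines of #866 (the field and type names are illustrative, not the shim's actual schema):

```rust
// Illustrative only: the failure-handling knob the wasm-shim exposes today,
// modeled per service (Authorino for auth, Limitador for rate limiting).
enum FailureMode {
    Allow, // fail open: let the request through despite the failure
    Deny,  // fail closed: reject the request
}

struct ServiceConfig {
    endpoint: String, // e.g. the Authorino or Limitador cluster
    failure_mode: FailureMode,
}

fn main() {
    let auth = ServiceConfig {
        endpoint: "authorino-cluster".into(),
        failure_mode: FailureMode::Deny,
    };
    // The open question: is a single allow/deny switch per service the right model?
    match auth.failure_mode {
        FailureMode::Allow => println!("{}: failures let requests through", auth.endpoint),
        FailureMode::Deny => println!("{}: failures reject requests", auth.endpoint),
    }
}
```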

What's a failure?

There are multiple ways serving a request from a downstream client can fail (see the sketch after this list):

  • Bad config (e.g. some invalid CEL, either syntactically or "semantically", i.e. the data format/type it operates on)
  • Marshaling error (e.g. badly formatted JSON literals)
  • Transient networking issues (a service upstream of Limitador or Authorino being down, e.g. Redis being temporarily unavailable)
  • Timeouts (i.e. the wasm-shim not getting a response back from a service in a timely fashion)
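
To make the taxonomy concrete, here is one possible way of modeling these failure classes in Rust; the type and variant names are illustrative, not the shim's actual internals:

```rust
// Illustrative only: one possible grouping of dataplane failures.
enum DataplaneFailure {
    /// Invalid CEL, syntactically or "semantically" (wrong type/shape of data).
    BadConfig { expression: String, reason: String },
    /// e.g. badly formatted JSON literals.
    Marshaling { detail: String },
    /// A service upstream of Limitador/Authorino is unreachable (e.g. Redis down).
    TransientNetwork { service: String },
    /// No response from Limitador/Authorino within the allotted time.
    Timeout { service: String, budget_ms: u64 },
}

fn main() {
    let failure = DataplaneFailure::Timeout {
        service: "limitador".into(),
        budget_ms: 200,
    };
    match failure {
        DataplaneFailure::BadConfig { .. } | DataplaneFailure::Marshaling { .. } => {
            println!("actionable by the policy author / data owner")
        }
        DataplaneFailure::TransientNetwork { .. } | DataplaneFailure::Timeout { .. } => {
            println!("environmental; needs a runtime allow/deny decision")
        }
    }
}
```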

Some of these errors are for the policy author to fix. Ideally all syntax errors would be caught earlier than they are today (ideally an invalid policy wouldn't even be admitted into the cluster unless all its CEL expressions are valid), but others can't be validated until at least one request is handled (e.g. expecting auth.identity.groups to be a List of String). And that such an expectation remains an invariant for the rest of time can't ever be guaranteed.
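
For illustration, here's a sketch of that split using the cel-interpreter crate (whether this matches the shim's actual evaluation path is an assumption): syntax errors surface at compile time, so they could be rejected at admission; "semantic" errors only surface against real request data:

```rust
use cel_interpreter::{Context, Program};

fn main() {
    // Syntactically broken CEL fails to compile, so it could be rejected
    // at admission time, before any request is served.
    assert!(Program::compile("'admin' in auth.identity.groups[").is_err());

    // Syntactically valid CEL compiles fine...
    let program = Program::compile("'admin' in auth.identity.groups").unwrap();

    // ...but whether `auth.identity.groups` exists, and is a List of String,
    // is only known once the expression runs against actual request data.
    let context = Context::default(); // no `auth` bound here
    assert!(program.execute(&context).is_err());
}
```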

Some errors are just one-offs, e.g. an invalid JSON literal or some other data corruption that prevents a value from being resolved properly. So while some of the errors above can be tested for and resolved in development or staging environments, they can never be guaranteed to completely disappear once in production.

Networking issues are just bound to happen. My guess is they are best dealt with through observability, yet we need a clear contract with our users on what will happen when they occur. The current behavior of defaulting to denying traffic when authentication fails, while letting it through when rate limiting fails, is probably a good enough approach. But should we, e.g., still resolve authenticated rate limits when we know auth failed?
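
For instance, the current default contract could be summarized like this (a sketch, not the shim's actual code):

```rust
enum ServiceType {
    Auth,      // Authorino
    RateLimit, // Limitador
}

#[derive(Debug)]
enum FailureMode {
    Allow,
    Deny,
}

// Current defaults: failing auth open would grant access to everyone, while
// failing rate limiting closed would drop all traffic, hence they differ.
fn default_failure_mode(service: ServiceType) -> FailureMode {
    match service {
        ServiceType::Auth => FailureMode::Deny,
        ServiceType::RateLimit => FailureMode::Allow,
    }
}

fn main() {
    // Open question from above: if auth already failed (and the request was
    // denied), evaluating rate limits keyed on auth.identity makes little sense.
    println!("{:?}", default_failure_mode(ServiceType::RateLimit));
}
```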

Finally, timeouts... How much time is enough time for a policy to be resolved is effectively a function of the policy itself and the environment it is resolved against. For instance, an effective AuthPolicy that goes to 5 different external services will just take longer than one that "only" hits one service consistently. Having Limitador hit a Redis that's across the globe will take longer than using its in-memory storage.

Did I miss some error types?

What would a user expect in the face of such failures?

There, my guess is: it depends. Ideally, in development, the response from the gateway would contain all that's needed to solve the issue: 'admin' in auth.identity.groups from policy XYZ can't be evaluated because auth.identity.groups isn't a field on auth.identity; or is of the wrong type; or... some explanation. In production... maybe the configuration decides what needs to happen? Is this a RateLimitPolicy? Then allow the request through...

Marshaling errors & other "data dependent" issues would possibly be best addressed with some webhook, or otherwise some "user observable mechanism" deciding what to do, possibly fixing the data itself (even transiently, for the request to fall back to other decision making)? But there needs to be some mechanism for the user to gather that data over time, possibly triggering some alerting past some threshold, so that someone can eventually do something about it, irrespective of what happens to the downstream request (as long as that is configurable by the user).
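
As a sketch of that last part (entirely hypothetical, no such mechanism exists in the shim today): a counter per error class that trips an alert past a threshold, independently of what the failure mode did to each individual request:

```rust
use std::collections::HashMap;

// Hypothetical sketch: count data-dependent failures per error class and
// surface an alert once a threshold is crossed, regardless of whether each
// individual request was allowed through or denied.
struct FailureLedger {
    counts: HashMap<String, u64>,
    threshold: u64,
}

impl FailureLedger {
    fn new(threshold: u64) -> Self {
        Self { counts: HashMap::new(), threshold }
    }

    /// Returns true when this class of failure just crossed the threshold.
    fn record(&mut self, error_class: &str) -> bool {
        let count = self.counts.entry(error_class.to_string()).or_insert(0);
        *count += 1;
        *count == self.threshold
    }
}

fn main() {
    let mut ledger = FailureLedger::new(3);
    for _ in 0..3 {
        if ledger.record("marshaling:invalid-json-literal") {
            println!("alert: recurring marshaling failures, someone should look");
        }
    }
}
```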

Transient networking issues fall in the same camp, I think, but I'm willing to be proven wrong.

Finally, timeouts seem like they would need to be configurable at a per-policy level. Though, given the user has no control over the effective policy being handled at the dataplane layer, we could use a per-policy time budget that then translates into the actual timeout at the effective-policy level...
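
One possible reading of that, as a sketch (the aggregation rule here, summing budgets, is just an assumption; taking the max would be equally arguable):

```rust
// Hypothetical: each policy contributing to the effective policy declares a
// time budget; the dataplane derives the actual timeout from those budgets.
struct PolicyBudget {
    policy: &'static str,
    budget_ms: u64,
}

fn effective_timeout_ms(budgets: &[PolicyBudget]) -> u64 {
    // Assumption: budgets add up, since the effective policy may have to call
    // every contributing service in turn.
    budgets.iter().map(|b| b.budget_ms).sum()
}

fn main() {
    let budgets = [
        PolicyBudget { policy: "AuthPolicy/XYZ", budget_ms: 150 },
        PolicyBudget { policy: "RateLimitPolicy/ABC", budget_ms: 50 },
    ];
    println!("effective timeout: {}ms", effective_timeout_ms(&budgets));
}
```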

Other ideas?

tl;dr

I think simply exposing a timeout (see #1027) or a failure mode per service isn't enough to give users proper control over the system's behavior in the face of failure. We should consider a more holistic approach to both of these problems, which would possibly also make users' lives easier during the development phase.
