Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Single merged policy #4

Closed
wants to merge 5 commits into from
Closed

Single merged policy #4

wants to merge 5 commits into from

Conversation

alexsnaps
Copy link
Member

No description provided.

@alexsnaps alexsnaps marked this pull request as draft November 10, 2022 20:11
Copy link
Contributor

@eguzki eguzki left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall I like the proposal and I am willing to implement that.

However, I see a potantial caveat on this approach regarding RLP's targeting a gateway. As a cluster administrator I have detected unusual traffic for suspect-subdomain.domain.com a I want to rate lmit that traffic for "Host: suspect-subdomain.domain.com". Assuming there is a HTTPRoute for this tenant for *.domain.com

rfcs/0000-policy_target_scope.md Outdated Show resolved Hide resolved
`FakeRLPolicy` attached to `my-gw`, the resulting policy would look like this:

```yaml
enabled: true
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

enabled should be false?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it is set as true as an override on the GW attached policy. That's what I describe in the paragraph above...

(semantically or atomically?)?
- Can we inject the proper config in each of them? Only if all is applicable? What if they are not?
What when that changes over time?
- Can these then be backed by a single e.g. `Limitador`? (in that case I think yes, as we could
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would say yes, multiple gateways can be backend by a single Limitador. Currently, the namespace field has a prefix withthe name/NS of the gateway, so limits are well scoped to each gateway. This an implementation detail that can be discussed, of course.

rfcs/0000-policy_target_scope.md Outdated Show resolved Hide resolved
specification according to the `override` and `default` section of the `Policy`'s `spec`. Let's use a
simple example:

```yaml
Copy link
Collaborator

@maleck13 maleck13 Nov 29, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so I like the simplicity being aimed at here but I worry we will need to allow some form of conditional here
example if we were to always send common descriptors (if we agree they are common) e.g Headers. Would we see a way for someone to define (as an example)

limits:
        - conditions:
            - "req.path=='/checkout'"
            - "req.method=='POST'"
          maxValue: 5
          seconds: 10
          variables: []

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But I feel you are aiming at something very different. I had thought perhaps by applying a policy to all traffic, it had meant all traffic would be sent to the implementing service and then the implementing service could make a conditional decision

So there are a couple of things to consider here:

- status reporting on "not yet" existing `HTTPRouteRule`, e.g. a Policy declares a set of `matches` that don't (yet?)
exist. The policy must _not_ be enforced... yet. But could eventually. This needs proper reporting.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An alternative could be report the policy as invalid, that could simplify things at least at the beginning... a typical error would be RateLimitPolicy/AuthPolicy rules don't match the HTTPRouteRule definition

[drawbacks]: #drawbacks

- The use cases this approach doesn't address…
- Not very DRY... the need to keep `HTTPRouteRule`s in sync between the Policy and the `HTTPRoute` is suboptimal.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is probably the most discouraging point, but the good thing is we could mitigate a bit with status reporting and ux tooling.

@guicassolato
Copy link
Contributor

I'd like to propose an exercise to put this to proof.

First, of course, we'd need to agree on the exact set of use cases to cover. IMO, the following one seems "real" enough. At the same time, it's simple (with only 2 Gateways, 2 HTTPRoutes, not too many matching rules) and yet possibly intricate enough.

Exercise

Given the following initial network resources:

apiVersion: gateway.networking.k8s.io/v1beta1
kind: Gateway
metadata:
  name: gateway-1
  namespace: gwns
spec:
  gatewayClassName: internal
  listeners:
    - hostname: *.acme.internal
---
apiVersion: gateway.networking.k8s.io/v1beta1
kind: Gateway
metadata:
  name: gateway-2
  namespace: gwns
spec:
  gatewayClassName: internet
  listeners:
    - hostname: *.acme.com
---
apiVersion: gateway.networking.k8s.io/v1beta1
kind: HTTPRoute
metadata:
  name: toys
  namespace: toys
spec:
  parentRefs:
    - kind: Gateway
      namespace: gwns
      name: gateway-1
    - kind: Gateway
      namespace: gwns
      name: gateway-2
  hostnames:
    - toys.acme.com
    - toys.acme.internal
  rules:
    - backendRefs:
        - kind: Service
          name: toys
    - matches:
        - headers:
            - name: X-Env
              value: canary
      backendRefs:
        - kind: Service
          name: toys-canary
---
apiVersion: gateway.networking.k8s.io/v1beta1
kind: HTTPRoute
metadata:
  name: telemetry
  namespace: telemetry
spec:
  parentRefs:
    - kind: Gateway
      namespace: gwns
      name: gateway-1
  hostnames:
    - *.telemetry.acme.internal
  rules:
    - matches:
        - method: GET
        - method: POST
      backendRefs:
        - kind: Service
          name: telemetry

What are the set of policies (RLP and KAP) and possibly required changes to the network resources to implement the following use cases?

  1. RLP override 500 rps on all routes matching *.acme.com.
  2. RLP default 100 rps on POST requests to all routes matching *.acme.com.
  3. KAP override IP deny-list on all requests to *.acme.com.
  4. KAP default X.509 certificate authentication on all requests to *.acme.internal.
  5. RLP default 150 rps on all requests to toys.acme.(com|internal).
  6. RLP default 50 rps on all requests to toys.acme.(com|internal) containing X-Env: canary request header.
  7. RLP override unlimited rps on all requests to toys.acme.internal/admin/*.
  8. KAP override API key authentication on all requests to toys.acme.com.
  9. KAP override DELETE requests forbidden at toys.acme.com.
  10. KAP override all requests forbidden at toys.acme.com/admin/*.
  11. KAP override JSON pattern-matching authorisation to check path param {org_name} matches value stored in API key or X.509 cert on all requests to toys.acme.(com|internal)/[^admin/]orgs/{org_name}/*.
  12. KAP override K8s SAR authorisation on all requests to toys.acme.internal/admin/*.
  13. KAP default OIDC/JWT authentication on all endpoints matching *.telemetry.acme.internal.
  14. KAP override additional API key authentication to all requests to foo.telemetry.acme.internal.

@maleck13 maleck13 mentioned this pull request Feb 22, 2023
requires to know how to applicable it becomes to the union and/or the intersection of the said
subsets. While a heuristic could be applied, there are use-cases where the other option will be
desirable. The question then becomes which policy is the authority as to decide which it should be. In
order to avoid having to answer these questions, this proposal looks at not letting a policy
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this proposal looks at not letting a policy discriminate which part of traffic it applies to: it's the whole traffic, always.

this statement is still valid?


Say, we now have a `Backend` (e.g. `foo`) wired to the `my-gw` using some `HTTPRoute` (e.g.
`foo-route`), none of which have yet any `FakeRLPolicy` attached. Yet, since the `Gateway` used as a
policy defined, Kuadrant will create a `RateLimit` `CR` and eventually configure `Limitador` to apply
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RateLimit CR

a bit outdated, isn't it?

starting with `/foo`), the rate limiting will apply to the same traffic: `example.com/foo*`, so that
`foo` will not see more than 42 rps on port `8080`.

Given the `override` on the `my-gw` `FakeRLPolicy` there no way for our route to disable rate limiting
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given the override on the my-gw FakeRLPolicy there no way for our route to disable rate limiting on the traffic

This is, IMHO, the main pain point of overall defaults/overrides approach applied to rate limiting (auth is another history here). It is not that implementing defaults/overrides for rate liming can be complex because it is the outcome of unions and/or intersections of rules. It is not that, only. It is that the gateway provider can say (with overrides) that this traffic is going to be 42 no matter what I say as a API owner. And my service may not afford 42, only 22 (to say something).

For the rate limiting context, the "most restrictive wins" approach seems to me much more natural and appropriate. No matter which rol I play I can say I want X rps. And the "most restrictive wins" approach grants me that not matter what, my service will not get more than X rps. Maybe, because some other level rate limiting policy, my service will not get more than, let's say X/2 rps.. but my limit still holds, no more than X rps. My contract with the traffic system specified in the policy is never broken. If the traffic is not going to be higher than X/2. it is also true that it is not going to be more than X rps.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But it would "just" be up to the GW to express this properly, i.e. using a default rather than an override. Also, there might be a need to say 10 (i.e. less than 42), say when being DDoS'ed, so that'd be an override... The fact that the API lets users express what we think are "none sensible" things, doesn't invalidate it.

But I think we could also have Kuadrant "merge" the policies based of the most restrictive wins and "only" support defaults initially.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, there might be a need to say 10 (i.e. less than 42), say when being DDoS'ed, so that'd be an override...

That is a valid use case and it is covered by "most restrictive policy wins"

But I think we could also have Kuadrant "merge" the policies based of the most restrictive wins

That is what I call "conditional defaults/overrides" which is worth elaborating/investigating more, but I find it as a complex API to expose to the customers.

"only" support defaults initially.

Is this a valid use case? I mean, a gateway provider saying I want (default) 10 rps but I do not care if you as API provider sets 1000 rps. My point being, if the gateway provider applies 10 rps, that 10 is not a magical number it is something specific for the existing resources and constraints.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably need more real world use cases to be sure. A clear use case is a gateway provider/infra provider setting a global limit for the gateway to protect it lets say 1000 RPS and wanting that to be a "override" limit. This is a use case we have in other services where we have used limitador to do this at the gateway level as the infra provider in order to enforce a limit based the tier.
So then the question and use case comes for a default RL. This is less clear as @eguzki points out. That said, I can imagine as a "gateway admin", I don't want any one service to be able to just accidentally consume this entire limit without clearly expressing this intention. I don't want any endpoint deployed to my gateway that some random client can accidentally get into a loop and swamp the gateway. So I now set a sane but restrictive default based on remote ip or some common property to a low number lets say 50 RPS. Now I know if an endpoint is created without any RLP, it can only at worst consume 50 RPS.
So now it is up to the service provider/developer to say explicitly "I need more than 50 RPS I need 200 ~RPS" for my service. So the piece I see as missing here is setting an "upper limit" override. IE by default its 100 but you can override it up to 250 but beyond that we need to have a serious conversation.
@eguzki Does the above make sense in your world? I don't claim to have all the correct use cases here but the above seems sane to me. Interested in your thoughts on the default and being able to set a "upper limit" override.
I think it is also important that we don't think of RLP as the only control involved. As a gateway admin, depending on the environment, I am going to want to setup monitoring so that I can identify what limits are set, and what traffic is flowing to certain endpoints. This will allow me to identify a potential problem before it happens. Also I might have guidelines in place for the teams using my gateway and perhaps I even have an SRE member reviewing any new limits set as part of a git ops flow etc. I think we need to think of RLP as part of a whole rather than the whole.

port: 8080
```

It is important to consider that only _one_ rule will ever be match for any given requests. It makes for a simpler model to have only one "Policy" to also only ever being match, aligning the both behaviors. To apply a policy on the first rule only, the user needs to duplicate that match on the policy targeting `bar-route`:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the user needs to duplicate that match on the policy targeting bar-route

This is essentially what we have now, isn't it?

@alexsnaps
Copy link
Member Author

superseded by #58

@alexsnaps alexsnaps closed this Feb 27, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
No open projects
Status: Needs refinement
Development

Successfully merging this pull request may close these issues.

5 participants