RLP Enforced condition #414
Some thoughts on this: in order to check that the RateLimitPolicy is enforced in Limitador, there are multiple options:
or
Options 1 and 2 require shelling into the Limitador pod in order to compare the files / access the Limitador API, unless (if we go with option 2) we want to expose a service endpoint for Limitador in the Limitador operator at https://github.com/Kuadrant/limitador-operator/blob/main/pkg/limitador/k8s_objects.go#L51. Either way, I feel this logic should live in the Limitador operator itself, to check whether the limits are correctly loaded into Limitador. Interested in @eguzki @didierofrivia @alexsnaps thoughts on this and on what's the preferred approach 🤔
I'd only say that we shouldn't go to the storage option of Limitador to check this, since that is going too far for the operator. The first option, as simple as it sounds, could be a good first approach IMHO. Then we could decide to evolve it into a more thorough probe... However, I agree with you that the logic should reside in the Limitador operator.
My take on this: the RLP controller should rely on the Limitador CR status. The Limitador operator should populate the Limitador CR status with the available state. I am thinking that, the same way the replication controller populates the deployment status objects based on readiness/liveness Kubernetes probes (some of which are HTTP-based endpoints), the Limitador operator should be doing something similar. Ideally, the operator would set up a probe mechanism to periodically test the
@eguzki I think the simplest approach for the probe mechanism is for it to be part of the reconciliation loop in the Limitador controller in the Limitador operator 🤔 We can requeue the reconciliation until the limits config map is synced / updated in the pod, which the available status in the Limitador CR would imply. With this, any updates to the Limitador CR would trigger a reconciliation loop and update the status accordingly. In Kuadrant Operator, RateLimitPolicy would not be enforced until these conditions are true in the Limitador CR. If we go with this approach and would like to probe periodically, we can configure a default requeue interval after a successful reconcile in the Limitador controller.
Respectfully, I disagree, for two reasons mainly:
Happy to discuss this further and look for an approach we all feel comfortable with.
Fair, though I feel this is quite a constraint, as some Kubernetes controllers manage external resources that live not in the Kubernetes cluster but on cloud providers, such as the DNS Operator. Though this point assumes we are going with the approach of HTTP requests to /limits/{namespace} in Limitador, rather than also comparing the config map in the Kubernetes cluster against the config map in the Limitador pod.
Yes, I would rather not create a goroutine for this either. Agreed, periodic requeue is not required if we go with this reconciliation route. The check for enforcement can be triggered purely by watched events from the Limitador CR, requeuing only until the config maps are synced. Ideally, any user of the Limitador operator interacts with the CR (or watched resources) to update rate limits. Periodic requeue, I feel, is only necessary if we think a user may work around this and force the rate limit config map out of sync by interacting directly with the Limitador API, thus bypassing watched events 🤔
Another idea, which I feel is overkill, is a sort of middle service incorporated into the Limitador deployment as a separate container. Its sole responsibility would be checking that the Limitador config map in Kubernetes and the limits loaded in the Limitador application are in sync. This check would be called periodically as a readiness or liveness check (though this has potential downsides: if the pod is not ready, it stops receiving traffic until limits are synced, unless that is something we want). It's possible to add this to Limitador itself as a health check endpoint instead, though that endpoint would then be coupled to being deployed in a Kubernetes cluster, which I think we don't want. Anyhow, this seems like over-engineering for what we need, but just a thought.
I agree it would be overkill. Ideally, this would be a deployment readiness probe that marks the deployment as not ready, and the Limitador operator flags the CR as not ready in the status. The issue is that I do not see how we can implement such a complex readiness probe in the deployment. We would need to pass the limits as env vars to the pod, and then some custom script living in the Limitador container would read the limits from the env var and verify the responses of multiple HTTP requests to

That leaves us with the following options:
Summoning @roivaz here. You have some use cases where you monitor that the content of a configmap/secret being mounted in the container is up to date, and if it takes too long, you force it with some hack based on annotations. Any input or ideas?
When a ConfigMap mounted as a volume in a container changes, the container doesn't see that change immediately. The reason is that, for performance reasons, kubelet only resyncs the pod under 2 circumstances: periodically, on its sync interval, or when the pod object itself is updated (for example, when an annotation changes).
The solution to this problem can't be to change the resync interval, as that affects the whole cluster. I don't even know if this is something you can configure in OCP. This still doesn't solve the problem of "knowing" whether the update was applied; it just ensures that under normal circumstances the change happens immediately.
Groundhog day effect. We had this very same discussion one year ago #140 (comment)
Cool, though as you guys mentioned this doesn't quite solve the ask of this issue, which is knowing if RLP limits are enforced (i.e. knowing that config map changes are synced to limitador) 🤔 |
But arguably, if immediate sync of the configmap is implemented, the need for actually "knowing" whether the configmap has been reloaded is greatly reduced, because almost 100% of the time it will be synced. In fact, when it does not sync immediately, it would be due to some cluster malfunction, which other operators and monitoring in the cluster are responsible for detecting and resolving. An option is for Limitador to have an endpoint that returns the currently loaded config, or just a hash of the loaded config (not sure if this has already been proposed). Envoy, for example, has a similar admin endpoint where you can retrieve the currently loaded config.
That's a fair assumption; in this case the RLP can simply depend on the current "Ready" condition status. Though a pod restart would defeat the purpose of the file watcher implementation that @eguzki mentioned, so I'm not sure we want to do this 🤷
Hmm, that's an option too, though it would involve making HTTP requests to the application 🤔 For reference, I've done a quick implementation / PoC that just checks the CM limits file against the file loaded in the Limitador pod, Kuadrant/limitador-operator#125, as part of the status reconcile loop. I feel this is the simplest approach, based on watched events and without having to make requests to the Limitador application, if anyone wants to check it out.
The pod is not restarted when you add/update an annotation to force the resync of the configmap. The annotation is updated directly in the Pod resource/s, not in the owning Deployment, and that does not trigger a rollout. |
Cool, nice! 👍 @eguzki any thoughts on this approach? If config map changes are immediately synced to pods, we can rely on the Limitador CR
That would be a good enhancement, as the changes would take effect quite soon, which is the complaint right now. Maybe this is good enough. Let me add some thoughts.
Certainly I do not want to over-engineer here; I'd rather be pragmatic. But good to discuss anyway.
Hmm, I guess if the change event is lost, then the config map in the pod will be out of sync until k8s eventually syncs it. There'll be a time delay if this happens, but RLP & Limitador CR won't detect this and will keep its current
I don't think so. The main 'issue' is probably having to ensure all replicas are created and that all are updated to the latest annotation value.
Good enough for me, at least for now.
Completed as part of #523
Part of Kuadrant/architecture#38
Related issues: #140
Implement the `Enforced` condition type for RateLimitPolicy