Non-leading replica fails due to not started cert-controller #1445
/assign @trasc
I'm able to reproduce this in Kind:
The non-leader pod restarts after ~1.5 min and keeps restarting. Logs:
I see the restarts on v0.5.0 as well. I'm not able to reproduce this on v0.4.2.
Maybe it is the probe implementations themselves that are wrong.
I've recently started using Kueue and I'm intrigued by its implementation. Given that it operates as an admission webhook, I'm wondering if there should be a redundancy system for the pods that serve the requests. Wouldn't it make sense for the non-leader pods to at least handle the admission webhooks, while the leader pod also takes care of background processing? The readiness probe would then also report healthy on a non-leader.
Yeah, ideally all replicas should reply to webhooks. However, we recently introduced this feature: https://github.com/kubernetes-sigs/kueue/tree/main/keps/168-2-pending-workloads-visibility. In this case, only the leader can respond. Another alternative would be for non-leaders to also maintain the queues (we do this in kube-scheduler), so that they can also respond to api-extensions requests. I'm not actually sure what behavior controller-runtime applies.
controller-runtime exposes the … In the case of the OPA cert-controller, there is a …

For the visibility extension API server, we would need to make sure it's safe to run multiple instances of ClusterQueueReconciler concurrently, or find a way to only run the read-only part?
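For reference, controller-runtime's `Manager` does expose an `Elected()` channel that is closed once the replica wins leader election (or immediately if leader election is disabled). Below is a minimal sketch of a readiness check built on it; the check name is hypothetical and this is not Kueue's actual code. Note that gating readiness on election like this is exactly what would keep non-leader replicas permanently unready, which is the trade-off being discussed:

```go
package probes

import (
	"errors"
	"net/http"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/healthz"
)

// addLeaderReadyz registers a readiness check that only passes once
// this replica has won leader election. It merely demonstrates the
// Elected() primitive; the check name "leader-elected" is made up.
func addLeaderReadyz(mgr ctrl.Manager) error {
	check := healthz.Checker(func(_ *http.Request) error {
		select {
		case <-mgr.Elected(): // closed once this replica is the leader
			return nil
		default:
			return errors.New("not the elected leader yet")
		}
	})
	return mgr.AddReadyzCheck("leader-elected", check)
}
```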
/assign
@astefanutti do you want to work on this?
@trasc yes, I can work on it. There is only the part about the visibility API server HA where I may need your input, but that may be tackled separately.
/unassign @trasc
That might be tricky. The CQReconciler updates status based on usage. If each replica maintains the cache and queues, in theory the usage should be in sync, but we could face race conditions. Ideally we would like to only run the event handlers, but not the reconciler itself. Is there a boolean given by the interface that can tell us whether it's the leader? Then we can just return …
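There isn't a boolean accessor on the `Manager` itself, but controller-runtime's `manager.LeaderElectionRunnable` interface lets each runnable declare whether it must wait for election. A sketch, with a hypothetical runnable for illustration:

```go
package runnables

import (
	"context"

	"sigs.k8s.io/controller-runtime/pkg/manager"
)

// eventHandlersOnly is a hypothetical runnable that should run on
// every replica (e.g. to keep the cache and queues warm), leader or not.
type eventHandlersOnly struct{}

func (eventHandlersOnly) Start(ctx context.Context) error {
	<-ctx.Done() // run until the manager shuts down
	return nil
}

// NeedLeaderElection reports false, so the manager starts this
// runnable immediately on non-leader replicas too; runnables without
// this method (or returning true) are held back until election is won.
func (eventHandlersOnly) NeedLeaderElection() bool { return false }

// Compile-time checks that both interfaces are satisfied.
var (
	_ manager.Runnable               = eventHandlersOnly{}
	_ manager.LeaderElectionRunnable = eventHandlersOnly{}
)
```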
Do you mean you'll fix ca-cert itself? I wonder if this is somehow related to the e2e failures that we are seeing: #1372 (comment)
For the sake of truthfulness:
/retitle Non-leading replica is restarted due to inaccurate probe implementation
I mean I'll fix the following issue in cert-controller itself (assuming my analysis is correct), and upgrade the cert-controller dependency in Kueue:
I need to look more at how the e2e tests are set up. I'll check it once I have the fix.
I've opened open-policy-agent/cert-controller#166, which fixes the cache timeout issue in cert-controller.
@astefanutti can you explain why the binary is terminating, given the bug in cert-controller? I thought this had something to do with the probes, but our probes just use Ping, which would return Ready/Live if the binary is able to respond to the HTTP request.
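For context, a minimal sketch of that kind of probe wiring with controller-runtime's `healthz.Ping`, which unconditionally returns nil; the bind address, election ID, and check names here are illustrative, not necessarily Kueue's:

```go
package main

import (
	"log"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/healthz"
)

func main() {
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		HealthProbeBindAddress: ":8081", // port is illustrative
		LeaderElection:         true,
		LeaderElectionID:       "example-lock", // ID is illustrative
	})
	if err != nil {
		log.Fatal(err)
	}
	// healthz.Ping always returns nil, so both probes pass on any
	// replica that can answer the HTTP request, leader or not.
	if err := mgr.AddHealthzCheck("healthz", healthz.Ping); err != nil {
		log.Fatal(err)
	}
	if err := mgr.AddReadyzCheck("readyz", healthz.Ping); err != nil {
		log.Fatal(err)
	}
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		log.Fatal(err)
	}
}
```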
@alculquicondor right, I don't think it has anything to do with the probes. What happens in non-leader elected mode is the following:
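The numbered explanation from this comment is not preserved above, but the general controller-runtime mechanism is worth noting: when any runnable's `Start` returns an error, the manager begins shutdown and `mgr.Start` returns, so `main` exits and the kubelet restarts the container regardless of the probes. A contrived sketch of that failure mode; the runnable, timeout, and error are made up for illustration and do not reproduce cert-controller's exact code:

```go
package main

import (
	"context"
	"errors"
	"log"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

// failingRunnable stands in for a component that gives up waiting on a
// dependency (e.g. one that never becomes ready on a non-leader replica).
type failingRunnable struct{}

func (failingRunnable) Start(ctx context.Context) error {
	select {
	case <-time.After(90 * time.Second): // made-up timeout
		return errors.New("timed out waiting for dependency")
	case <-ctx.Done():
		return nil
	}
}

// Runs on every replica, leader or not.
func (failingRunnable) NeedLeaderElection() bool { return false }

func main() {
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{})
	if err != nil {
		log.Fatal(err)
	}
	if err := mgr.Add(failingRunnable{}); err != nil {
		log.Fatal(err)
	}
	// When the runnable errors, the manager shuts down and Start
	// returns the error, so the process exits and is restarted even
	// though its health probes were passing the whole time.
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		log.Fatal(err)
	}
}
```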
Thanks a lot for fixing this!
What happened:
If I run two replicas, the manager crashes after a while; it looks like the health probe fails and the pod is restarted.
What you expected to happen:
Both pods keep running fine.
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know?:
Environment:
- Kubernetes version (use `kubectl version`): Server Version: v1.28.3-eks-4f4795d
- Kueue version (use `git describe --tags --dirty --always`): v0.5.1
- OS (e.g. `cat /etc/os-release`): Bottlerocket OS
- Kernel (e.g. `uname -a`):