[BUG] calico-kube-controllers deployment is labeled twice with the CriticalAddonsOnly toleration #4282
Comments
Same for v1.28.9 |
@sabbour this validation could break the above mentioned deployment |
Any update on this? |
Would love to see this resolved, this is creating log spam and alerts on our prometheus stack due to duplicate labels. |
@Aaron-ML I am also using the kube-prometheus-stack and have downgraded prometheus to v2.51.2 until it is fixed... |
We've mitigated it for now by temporarily removing the alert related to prometheus ingest failures. Hopefully this gets resolved soon. |
@chasewilson any update available? |
@chasewilson can you please provide an update? |
Any updates on this issue? |
@wedaly I know we'd investigated this. Could you add some clarity here? |
AKS creates the operator.tigera.io/v1 Installation resource that tells tigera-operator how to install Calico. In the installation CR, we're setting:
tigera-operator code appends this to the list of default tolerations for calico-kube-controllers, which already includes this toleration: https://github.com/tigera/operator/blob/b01279889cd2a625fde862afb7b41e27b9dcce19/pkg/render/kubecontrollers/kube-controllers.go#L648 I don't know the full context of why AKS sets this field in the installation CR, but it's been this way for a long time (I think as long ago as 2021). I'm not yet sure why we added that or if it's safe to remove, as I can see |
@wedaly in that case the toleration is being added twice: once by the AKS Installation resource and once by tigera-operator. The line you linked shows that, in addition to the tolerations passed in through the config, the operator appends some of its own defaults:
Tolerations: append(c.cfg.Installation.ControlPlaneTolerations, rmeta.TolerateCriticalAddonsAndControlPlane...),
If you follow that path, you can see that the toleration is already defined there:
TolerateCriticalAddonsOnly = corev1.Toleration{
Key: "CriticalAddonsOnly",
Operator: corev1.TolerationOpExists,
}
Therefore it should be safe to remove it from the AKS Installation resource. |
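The append behaviour described in the comment above can be illustrated with a minimal Go sketch. It is not the operator's code: `Toleration` is a simplified stand-in for `corev1.Toleration`, `mergeTolerations` and `dedupeTolerations` are hypothetical helpers, and the contents of the default list are assumed for illustration only. It shows that a plain `append` keeps an entry that appears in both lists, and how an exact-match de-duplication pass would collapse it.

```go
package main

import "fmt"

// Toleration is a simplified, hypothetical stand-in for corev1.Toleration.
// It only carries comparable string fields so it can be used as a map key.
type Toleration struct {
	Key      string
	Operator string
	Value    string
	Effect   string
}

// criticalAddonsOnly mirrors the toleration quoted in the thread.
var criticalAddonsOnly = Toleration{Key: "CriticalAddonsOnly", Operator: "Exists"}

// mergeTolerations concatenates user-supplied and default tolerations with a
// plain append, so an entry present in both lists ends up in the result twice.
func mergeTolerations(user, defaults []Toleration) []Toleration {
	out := make([]Toleration, 0, len(user)+len(defaults))
	out = append(out, user...)
	return append(out, defaults...)
}

// dedupeTolerations drops exact duplicates while keeping first-seen order.
func dedupeTolerations(in []Toleration) []Toleration {
	seen := make(map[Toleration]struct{}, len(in))
	out := make([]Toleration, 0, len(in))
	for _, t := range in {
		if _, ok := seen[t]; ok {
			continue
		}
		seen[t] = struct{}{}
		out = append(out, t)
	}
	return out
}

func main() {
	// Assumed inputs: the Installation CR sets the toleration, and the
	// (illustrative) default list contains it as well.
	user := []Toleration{criticalAddonsOnly}
	defaults := []Toleration{
		criticalAddonsOnly,
		{Key: "node-role.kubernetes.io/control-plane", Effect: "NoSchedule"},
	}

	merged := mergeTolerations(user, defaults)
	fmt.Println("after append:", len(merged))
	fmt.Println("after dedupe:", len(dedupeTolerations(merged)))
}
```

Removing the toleration from the Installation resource, as suggested in the thread, avoids the duplicate at the source; the de-duplication helper is only shown to make the effect visible.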
Digging through the commit history in AKS, I see that the toleration was added as a repair item for a production issue during the migration to tigera-operator. The repair item is linked to this issue in GH: projectcalico/calico#4525. However, I'm not sure how adding the toleration is related to the symptoms described in that issue. And all AKS clusters on supported k8s versions should be using tigera-operator now. Seems like it should be safe to remove the toleration from the installation CR now. |
@wedaly any update here? |
@chasewilson @wedaly can we please get an update? This is currently holding us back from being able to update Prometheus. |
Apologies for the delayed response. The current plan is to remove However, this change has the side-effect of adding two additional tolerations to Calico's typha deployment to tolerate every taint (https://github.com/tigera/operator/blob/8cbb161896a4ca641f885e668528cdb52de83f84/pkg/render/typha.go#L400). We believe this is safe, but any change like this carries some risk as it could affect many clusters. For this reason, we are planning to remove I realize this doesn't provide an immediate solution to folks on earlier k8s versions that want to upgrade Prometheus, but we need to balance the severity of this bug against the risks of making a config change that would affect many AKS clusters. |
Describe the bug
On AKS clusters with Calico enabled, a calico-system namespace is created, and within it a calico-kube-controllers deployment. This deployment currently carries the CriticalAddonsOnly toleration twice. This leads to an error in Prometheus starting with v2.52.0, because that version introduced a check for duplicate samples.
The duplicate toleration triggers that check: the kube-state-metrics pod emits the same metric twice, once for each occurrence of the CriticalAddonsOnly toleration. I had created an issue on the Prometheus project, as I was expecting it to be a Prometheus issue, which it isn't: prometheus/prometheus#14089
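As a rough illustration of why two identical tolerations end up as a duplicate sample, the following Go sketch renders one series per toleration and reports any series that occurs more than once in a single scrape. It is not how Prometheus or kube-state-metrics implement this; `seriesKey` and `duplicateSeries` are hypothetical helpers, and the metric name `kube_pod_tolerations`, the label set and the pod name are assumptions chosen for the example.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// seriesKey renders a metric name and its labels in a canonical form
// (labels sorted by name), so identical series produce identical strings.
func seriesKey(name string, labels map[string]string) string {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	var b strings.Builder
	b.WriteString(name)
	b.WriteByte('{')
	for i, k := range keys {
		if i > 0 {
			b.WriteByte(',')
		}
		fmt.Fprintf(&b, "%s=%q", k, labels[k])
	}
	b.WriteByte('}')
	return b.String()
}

// duplicateSeries returns, in sorted order, every series key that occurs
// more than once within one scrape.
func duplicateSeries(keys []string) []string {
	counts := make(map[string]int, len(keys))
	for _, k := range keys {
		counts[k]++
	}
	var dups []string
	for k, n := range counts {
		if n > 1 {
			dups = append(dups, k)
		}
	}
	sort.Strings(dups)
	return dups
}

func main() {
	// Assumed label set: one series is emitted per toleration on the pod.
	toleration := map[string]string{
		"namespace": "calico-system",
		"pod":       "calico-kube-controllers-example", // hypothetical pod name
		"key":       "CriticalAddonsOnly",
		"operator":  "Exists",
	}

	// The toleration is listed twice, so the same series is rendered twice.
	scrape := []string{
		seriesKey("kube_pod_tolerations", toleration),
		seriesKey("kube_pod_tolerations", toleration),
	}

	for _, d := range duplicateSeries(scrape) {
		fmt.Println("duplicate sample:", d)
	}
}
```

Because both entries carry exactly the same labels, nothing distinguishes the two samples, which is the situation the duplicate-sample check rejects.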
Prometheus log output
Environment (please complete the following information):