Error "failed to call webhook: the server rejected our request for an unknown reason" after upgrade from 0.37.4 to 1.0.4. #7134
Comments
Hello! I experienced the exact same issue when upgrading from 0.37.3 to 1.0.5. The only difference in my case is that I had to wait for 1.0.5, which enables the migration to v1 with the webhooks (though it isn't clear to me which webhooks we are talking about; the ones already listed by @AndrzejWisniewski, I guess), since ArgoCD is currently preventing the deletion of the webhooks. Even after removing them manually and restarting the deployment following the upgrade, we saw the same errors. After this we can no longer provision any new nodes, but the rollback procedure to 0.37.3 works well; we have tried upgrading to different 1.0.x versions on our dev cluster. We are running EKS 1.30 and our current deployment of Karpenter is done via an ArgoCD App: Chart.yaml
values.yaml
I would be glad to give more details if required. Thanks! |
Seeing the exact same error from an upgrade last night from v0.37.3 to v1.0.5 on K8s v1.28 (AWS EKS). Our deployment path uses "helm template" to generate lightly templated resource files (just name prefixes, IRSA annotations, etc.) that our own simple manual deployment tool (Kontemplate) fills in and replaces. I am digging in this morning and found this thread. Our values.yaml is just:
The errors start with a TLS handshake error and then a collection of reconciler errors:
|
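For anyone following a similar render-then-apply path, generating the manifests from the chart usually looks roughly like the sketch below. The chart location is the public Karpenter OCI registry; the version, namespace, and values file name are assumptions to adjust for your environment.

```sh
# Render the Karpenter chart to plain manifests that a tool such as Kontemplate
# can post-process. Version, namespace, and values file are illustrative only.
helm template karpenter oci://public.ecr.aws/karpenter/karpenter \
  --version 1.0.5 \
  --namespace kube-system \
  --values values.yaml > karpenter-rendered.yaml
```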
Here's what my NodePool looks like on that cluster. I need to dig through the docs to see if the annotations will give me a clue. The NodePool looks like it is already karpenter.sh/v1. Looking at my other clusters that are still on v0.37.3 (done just before .4 dropped), they all show karpenter.sh/v1 as well and look identical to the one below from my Karpenter v1.0.5 on EKS v1.28 environment.
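One way to check which API version is actually stored (rather than the version kubectl happens to request and echo back) is to look at the CRD status; a quick check, assuming the default Karpenter CRD names:

```sh
# The apiVersion kubectl prints on a resource is the version it requested;
# the CRD status shows which versions are actually stored in etcd.
kubectl get crd nodepools.karpenter.sh -o jsonpath='{.status.storedVersions}{"\n"}'
kubectl get crd ec2nodeclasses.karpenter.k8s.aws -o jsonpath='{.status.storedVersions}{"\n"}'
```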
|
Has anyone found a fix so far other than rolling back to a previously working version? |
I couldn't even roll back cleanly. The finalizer on the CRDs was trying to talk to the /termination endpoint and, since the webhooks were toast, it hung. I applied my 0.37.3 back over the top and that worked. I also cleanly uninstalled 0.37.3 and then installed 1.0.5 on the cluster, and that worked, but it's not what I want to do for production. I kinda flailed around whacking on things in the deployment, so YMMV, but it appeared to be a certificate/CA issue on the 1.0.5 webhook. I saw some errors about X.509 not being able to determine the CA, but didn't capture those details, though. |
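For anyone hitting the same stuck rollback, it can help to first confirm whether the CRDs really are blocked on a finalizer before doing anything destructive. A minimal check, assuming the default Karpenter CRD names:

```sh
# Show any finalizers and deletion timestamps on the Karpenter CRDs.
# A CRD with a deletionTimestamp and a lingering finalizer stays stuck until
# the finalizer's controller responds again or the finalizer is removed.
for crd in nodepools.karpenter.sh nodeclaims.karpenter.sh ec2nodeclasses.karpenter.k8s.aws; do
  kubectl get crd "$crd" \
    -o jsonpath='{.metadata.name}{" finalizers="}{.metadata.finalizers}{" deletionTimestamp="}{.metadata.deletionTimestamp}{"\n"}'
done
```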
Sorry for the delayed response here. I suspect this may be an issue we're already aware of and are getting a fix out for, but I have a few clarifying questions:
@AndrzejWisniewski how did you go about confirming this? Was this done by checking that the resources were not included in the updated helm deployment, or were you able to confirm that they were removed from the cluster? There's some more context here, but because knative adds owner references to the objects, Argo won't prune the objects even if it was the original creator.
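If it helps anyone confirm this on their own cluster, those owner references are visible directly on the webhook configuration objects (names taken from the list in the issue description below; this is just an illustrative check):

```sh
# Objects carrying ownerReferences are treated by Argo CD as owned by another
# controller, so Argo will not prune them even if it originally created them.
kubectl get validatingwebhookconfiguration validation.webhook.karpenter.sh \
  -o jsonpath='{.metadata.ownerReferences}{"\n"}'
kubectl get mutatingwebhookconfiguration defaulting.webhook.karpenter.k8s.aws \
  -o jsonpath='{.metadata.ownerReferences}{"\n"}'
```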
@laserpedro This is extremely surprising to me. Are you able to share what you did to remove the webhooks, and what errors you continued to see? |
I suspected as much considering some recent commits I've seen on your main branch, but I'd really like more transparency on what exactly is going on here. Is there a link to a GitHub issue for this problem you're aware of? Truth be told, we've spent the last two weeks trying to upgrade to Karpenter 1.x and we've encountered problem after problem after problem. |
Following up on my previous post, I've instructed my team to put the upgrade on pause until the fix for this issue (whatever it turns out to be) is released. |
@jmdeal thank you for watching this issue. This morning I tried the upgrade again to reproduce the behavior I witnessed:
I guess this time it is the conversion webhook that cannot establish a connection with the control plane (10.247.x.x matches my control plane CIDR range, 100.67.x.x matches my data plane CIDR range). Please correct me if my understanding is not right here. (I am also seeing a sync error from ArgoCD on the CPU limits, where "1000" is cast as 1k, but that is more related to ArgoCD I guess; let's solve one problem at a time.) |
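To double-check whether the conversion webhook is the one in play: the conversion webhook is configured on the CRD itself, and its stanza shows the service and CA bundle the API server uses when converting between karpenter.sh API versions, which is where a TLS or connection failure like the one above would surface. A quick way to inspect it, assuming the default CRD name:

```sh
# Print the conversion section of the NodePool CRD (strategy, target service,
# path, and caBundle used by the API server for version conversion).
kubectl get crd nodepools.karpenter.sh -o yaml | grep -A 15 'conversion:'
```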
We just released new patch versions of the pre-v1.0 releases that fix this issue so that these configuration resources aren't leaked. Please move to one of the following versions before going to 1.0.x, since these versions remove the ownerReference that is causing Argo to leak the resources and causing the failure on upgrade:
Also, this seems related to points discussed in #6982 and #6847. See this comment for a description of why this occurs. |
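For completeness, the path described above is a two-step upgrade; the sketch below uses helm directly, and the version numbers are placeholders rather than the actual patch releases, which are the ones listed in the comment above:

```sh
# Step 1: upgrade to the patched pre-v1.0 release for your minor first
# (placeholder version; substitute the patch release listed above), so the
# ownerReference is removed and the webhook resources are no longer leaked.
helm upgrade karpenter oci://public.ecr.aws/karpenter/karpenter \
  --namespace kube-system --version <patched-0.37.x> -f values.yaml

# Step 2: only then upgrade to a 1.0.x release.
helm upgrade karpenter oci://public.ecr.aws/karpenter/karpenter \
  --namespace kube-system --version 1.0.5 -f values.yaml
```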
This appears to be this issue: #6898. This is a spurious error that happens generally in k8s when using webhooks; you should be able to safely ignore it. There's more context in an older issue: kubernetes-sigs/karpenter#718. If you think there's an issue with Karpenter's operations related to this error, please leave an update in #6898. I'm going to close this issue out now that the releases with the fixes have been made. |
Description
Observed Behavior:
After upgrading from 0.37.4 to 1.0.4 I can see a lot of errors like these in the Karpenter logs:
"... Internal error occurred: failed calling webhook "validation.webhook.karpenter.sh": failed to call webhook: the server rejected our request for an unknown reason ..." and
Internal error occurred: failed calling webhook \"defaulting.webhook.karpenter.k8s.aws\": failed to call webhook: the server rejected our request for an unknown reason"}
and I can confirm that:
validation.webhook.karpenter.sh
validation.webhook.config.karpenter.sh
defaulting.webhook.karpenter.k8s.aws
validation.webhook.karpenter.k8s.aws
are removed during 1.0.4 deployment.
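A quick way to confirm this on the cluster itself (rather than only in the rendered chart) is to list the webhook configurations directly; a minimal check:

```sh
# Lists any Karpenter validating/mutating webhook configurations still present
# after the upgrade; an empty result means the configurations named above
# were removed from the cluster.
kubectl get validatingwebhookconfigurations,mutatingwebhookconfigurations \
  | grep -i karpenter || echo "no karpenter webhook configurations found"
```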
Reproduction Steps (Please include YAML):
Values.yaml for ArgoCD:
I also use such kustomization patches (as Karpenter is deployed in the karpenterns namespace):
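The patches themselves were not captured above; purely as an illustration, a namespace override of this kind is usually just a short kustomization along these lines:

```yaml
# Illustrative only: a minimal kustomization that places the rendered
# Karpenter manifests into the karpenterns namespace. The real patches
# used in this issue were not shared.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: karpenterns
resources:
  - karpenter.yaml   # hypothetical file containing the rendered chart output
```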
Versions:
- Chart Version: 1.0.4
- Kubernetes Version (kubectl version): Server Version: v1.28.12-eks-a18cd3a