
Karpenter 1.0.2 fails to remove stale nodeclaims leftover from 0.37.3 upgrade #6981

Open
dcherniv opened this issue Sep 11, 2024 · 8 comments
Labels: bug, lifecycle/stale, triage/needs-information

Description

Observed Behavior:
After upgrading to 1.0.2 following the documented upgrade path (https://karpenter.sh/v1.0/upgrading/v1-migration/#upgrade-procedure), the conversion webhook did not properly convert the NodeClaims. As a result, a number of claims are failing:

[
  {
    "error": "NodeClaim.karpenter.sh \"default-arm64-n8fjh\" is invalid: [spec.nodeClassRef.group: Required value, <nil>: Invalid value: \"null\": some validation rules were not checked because the object was invalid; correct the existing errors to complete validation]"
  },
  {
    "error": "NodeClaim.karpenter.sh \"default-arm64-wfcrj\" is invalid: [spec.nodeClassRef.group: Required value, <nil>: Invalid value: \"null\": some validation rules were not checked because the object was invalid; correct the existing errors to complete validation]"
  },
  {
    "error": "NodeClaim.karpenter.sh \"default-arm64-djh4m\" is invalid: [spec.nodeClassRef.group: Required value, <nil>: Invalid value: \"null\": some validation rules were not checked because the object was invalid; correct the existing errors to complete validation]"
  },
  {
    "error": "NodeClaim.karpenter.sh \"default-arm64-mckmt\" is invalid: [spec.nodeClassRef.group: Required value, <nil>: Invalid value: \"null\": some validation rules were not checked because the object was invalid; correct the existing errors to complete validation]"
  },
  {
    "error": "NodeClaim.karpenter.sh \"default-arm64-jpnwj\" is invalid: [spec.nodeClassRef.group: Required value, <nil>: Invalid value: \"null\": some validation rules were not checked because the object was invalid; correct the existing errors to complete validation]"
  },
  {
    "error": "NodeClaim.karpenter.sh \"default-arm64-9sdns\" is invalid: [spec.nodeClassRef.group: Required value, <nil>: Invalid value: \"null\": some validation rules were not checked because the object was invalid; correct the existing errors to complete validation]"
  },
  {
    "error": "NodeClaim.karpenter.sh \"default-arm64-hbf52\" is invalid: [spec.nodeClassRef.group: Required value, <nil>: Invalid value: \"null\": some validation rules were not checked because the object was invalid; correct the existing errors to complete validation]"
  }
]
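
For anyone triaging the same situation, a quick way to list which NodeClaims are missing the group (a sketch, assuming kubectl and jq are available; the resource name matches the nodeclaims.karpenter.sh CRD from the errors above):

# List NodeClaims whose nodeClassRef has no group set
kubectl get nodeclaims.karpenter.sh -o json \
  | jq -r '.items[] | select(.spec.nodeClassRef.group == null or .spec.nodeClassRef.group == "") | .metadata.name'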

These NodeClaims cannot be force-deleted, because Karpenter 1.0.2 fails to delete them due to the aforementioned error.
They also cannot be edited to remove the finalizer, because the objects are effectively immutable: any update is rejected with the same validation error.
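
For illustration, even a bare metadata patch to drop the finalizer is rejected with the same spec.nodeClassRef.group error, since the update still has to validate against the now-invalid stored object (a sketch using one of the NodeClaim names from the errors above):

# Attempt to strip the finalizers by hand; rejected with the validation error shown above
kubectl patch nodeclaim default-arm64-n8fjh --type=merge -p '{"metadata":{"finalizers":null}}'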

Expected Behavior:
Old nodeclaims left over from 0.37.3 properly cleaned up and replaced by new nodeclaims.

Reproduction Steps (Please include YAML):
Follow the upgrade path, see #6847 (comment)

Versions:

  • Chart Version: 0.37.3, 1.0.2
  • Kubernetes Version (kubectl version): 1.29+ eks
@dcherniv (Author)

Worked around this issue via #6847 (comment). It is very painful.

@jonathan-innis (Contributor)

Can you share the NodeClaim that existed prior to the conversion, as well as the config for the CRD and the deployment? This looks like it should have been defaulted by the conversion webhook, so I'm surprised that it converted over without the group being set.

jonathan-innis added the triage/needs-information label and removed the needs-triage label on Sep 12, 2024
@dcherniv (Author) commented Sep 12, 2024

I have more explanation here: #6847 (comment)
While on 1.0.2, the NodeClaims were identical; just some of them didn't have the group set, for some reason.
The NodePool manifest that was deployed at the time:

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  labels:
    argocd.argoproj.io/instance: prod-karpenter
  name: default-arm64
spec:
  disruption:
    consolidateAfter: 0s
    consolidationPolicy: WhenEmptyOrUnderutilized
  limits:
    cpu: '36'
  template:
    metadata:
      labels:
        arch: arm64
    spec:
      expireAfter: 720h
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements:
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values:
            - m7g
            - m6g
            - c7g
            - c6g
        - key: karpenter.k8s.aws/instance-cpu
          operator: In
          values:
            - '2'
            - '4'
            - '8'
            - '16'
        - key: karpenter.sh/capacity-type
          operator: In
          values:
            - on-demand
        - key: kubernetes.io/os
          operator: In
          values:
            - linux
        - key: kubernetes.io/arch
          operator: In
          values:
            - arm64
      taints:
        - effect: NoSchedule
          key: arch
          value: arm64
  weight: 50

The CRDs were from version 1.0.2 of the karpenter-crd chart, installed into the karpenter namespace.
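
As a side note, the conversion wiring on the installed CRDs can be checked like this (a sketch; .spec.conversion is a standard CRD field, and the webhook service shown should be whatever the karpenter-crd chart configured):

# Confirm the CRD uses webhook conversion and see which service it points at
kubectl get crd nodeclaims.karpenter.sh -o jsonpath='{.spec.conversion.strategy}{"\n"}'
kubectl get crd nodeclaims.karpenter.sh -o jsonpath='{.spec.conversion.webhook.clientConfig.service}{"\n"}'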

Relevant EC2NodeClass:

apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  labels:
    argocd.argoproj.io/instance: prod-karpenter
  name: default
spec:
  amiSelectorTerms:
    - alias: al2023@latest
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        deleteOnTermination: true
        encrypted: true
        volumeSize: 100Gi
        volumeType: gp3
  metadataOptions:
    httpEndpoint: enabled
    httpProtocolIPv6: disabled
    httpPutResponseHopLimit: 2
    httpTokens: required
  role: initial-eks-node-group-20230826183814285900000005
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: prod
  subnetSelectorTerms:
    - tags:
        Name: '*private*'
        karpenter.sh/discovery: prod
  tags:
    karpenter.sh/discovery: prod

@jonathan-innis (Contributor)

"just some of them didn't have the group set for some reason"

I suspect that they didn't have the group because you changed versions without enabling the conversion webhooks. This causes fields to get dropped, and the apiserver doesn't know how to hydrate the data.

Regardless, I suspect that this fix is going to solve your problem, because we will no longer rely on the spec being configured correctly for the Patch to succeed, which should unblock you from this issue.

That said, I wouldn't recommend going through the upgrade path without the conversion webhooks, because you may end up in an undefined state.
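
As a rough sanity check (a sketch; storedVersions is a standard CRD status field), you can also see which API versions still have objects persisted, since NodeClaims written before the upgrade remain at the old stored version until they are rewritten:

# Show which NodeClaim API versions still have stored objects
kubectl get crd nodeclaims.karpenter.sh -o jsonpath='{.status.storedVersions}{"\n"}'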

@jonathan-innis (Contributor)

Willing to try installing a snapshot to validate whether this fixes the issue? Here's the command for installing the snapshot version of Karpenter with the patch fix:

aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 021119463062.dkr.ecr.us-east-1.amazonaws.com

helm upgrade --install karpenter oci://021119463062.dkr.ecr.us-east-1.amazonaws.com/karpenter/snapshot/karpenter --version "0-1f88e3ae893d937267c5c28420597ea5be1d77ef" --namespace "kube-system" --create-namespace \
  --set "settings.clusterName=${CLUSTER_NAME}" \
  --set "settings.interruptionQueue=${CLUSTER_NAME}" \
  --set controller.resources.requests.cpu=1 \
  --set controller.resources.requests.memory=1Gi \
  --set controller.resources.limits.cpu=1 \
  --set controller.resources.limits.memory=1Gi \
  --wait

@ShahroZafar

We plan to go from v0.37.3 to v1.0.2 with the webhook enabled. Is this issue resolved, or should we wait for a newer patch version in v1.0?

@adawalli

Yes, I have the issue on 1.0.2; updating to 1.0.3 did not remove this error.

StaleBot

This issue has been inactive for 14 days. StaleBot will close this stale issue after 14 more days of inactivity.
