
Karpenter 1.0.2 fails to remove stale nodeclaims leftover from 0.37.3 upgrade #6981

Open
dcherniv opened this issue Sep 11, 2024 · 8 comments
Labels: bug, lifecycle/stale, triage/needs-information

Description

Observed Behavior:
After upgrading to 1.0.2 following the documented upgrade path (https://karpenter.sh/v1.0/upgrading/v1-migration/#upgrade-procedure), the conversion webhook did not properly convert the NodeClaims. As a result, a number of claims are failing:

[
  {
    "error": "NodeClaim.karpenter.sh \"default-arm64-n8fjh\" is invalid: [spec.nodeClassRef.group: Required value, <nil>: Invalid value: \"null\": some validation rules were not checked because the object was invalid; correct the existing errors to complete validation]"
  },
  {
    "error": "NodeClaim.karpenter.sh \"default-arm64-wfcrj\" is invalid: [spec.nodeClassRef.group: Required value, <nil>: Invalid value: \"null\": some validation rules were not checked because the object was invalid; correct the existing errors to complete validation]"
  },
  {
    "error": "NodeClaim.karpenter.sh \"default-arm64-djh4m\" is invalid: [spec.nodeClassRef.group: Required value, <nil>: Invalid value: \"null\": some validation rules were not checked because the object was invalid; correct the existing errors to complete validation]"
  },
  {
    "error": "NodeClaim.karpenter.sh \"default-arm64-mckmt\" is invalid: [spec.nodeClassRef.group: Required value, <nil>: Invalid value: \"null\": some validation rules were not checked because the object was invalid; correct the existing errors to complete validation]"
  },
  {
    "error": "NodeClaim.karpenter.sh \"default-arm64-jpnwj\" is invalid: [spec.nodeClassRef.group: Required value, <nil>: Invalid value: \"null\": some validation rules were not checked because the object was invalid; correct the existing errors to complete validation]"
  },
  {
    "error": "NodeClaim.karpenter.sh \"default-arm64-9sdns\" is invalid: [spec.nodeClassRef.group: Required value, <nil>: Invalid value: \"null\": some validation rules were not checked because the object was invalid; correct the existing errors to complete validation]"
  },
  {
    "error": "NodeClaim.karpenter.sh \"default-arm64-hbf52\" is invalid: [spec.nodeClassRef.group: Required value, <nil>: Invalid value: \"null\": some validation rules were not checked because the object was invalid; correct the existing errors to complete validation]"
  }
]
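
For anyone triaging the same situation, a quick way to list which NodeClaims are missing the group (a sketch, assuming kubectl and jq are available; the resource name matches the nodeclaims.karpenter.sh CRD from the errors above):

# List NodeClaims whose nodeClassRef has no group set
kubectl get nodeclaims.karpenter.sh -o json \
  | jq -r '.items[] | select(.spec.nodeClassRef.group == null or .spec.nodeClassRef.group == "") | .metadata.name'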

These NodeClaims cannot be force-deleted, because Karpenter 1.0.2 fails to delete them due to the aforementioned error.
They also cannot be edited to remove the finalizer, because the objects are effectively immutable: any update is rejected with the same validation error.
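
For illustration, even a bare metadata patch to drop the finalizer is rejected with the same spec.nodeClassRef.group error, since the update still has to validate against the now-invalid stored object (a sketch using one of the NodeClaim names from the errors above):

# Attempt to strip the finalizers by hand; rejected with the validation error shown above
kubectl patch nodeclaim default-arm64-n8fjh --type=merge -p '{"metadata":{"finalizers":null}}'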

Expected Behavior:
Old nodeclaims left over from 0.37.3 properly cleaned up and replaced by new nodeclaims.

Reproduction Steps (Please include YAML):
Follow the upgrade path, see #6847 (comment)

Versions:

  • Chart Version: 0.37.3, 1.0.2
  • Kubernetes Version (kubectl version): 1.29+ eks
@dcherniv (Author)

Worked around this issue via #6847 (comment). It is very painful.

@jonathan-innis (Contributor)

Can you share the NodeClaim that existed prior to the conversion, as well as the config for the CRD and the deployment? This looks like it should have been defaulted by the conversion webhook, so I'm surprised that it converted over without the group being set.

jonathan-innis added the triage/needs-information label and removed the needs-triage label on Sep 12, 2024
@dcherniv (Author) commented Sep 12, 2024

I have more explanation here: #6847 (comment)
While on 1.0.2, the NodeClaims were identical; just some of them didn't have the group set, for some reason.
The NodePool manifest that was deployed at the time:

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  labels:
    argocd.argoproj.io/instance: prod-karpenter
  name: default-arm64
spec:
  disruption:
    consolidateAfter: 0s
    consolidationPolicy: WhenEmptyOrUnderutilized
  limits:
    cpu: '36'
  template:
    metadata:
      labels:
        arch: arm64
    spec:
      expireAfter: 720h
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements:
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values:
            - m7g
            - m6g
            - c7g
            - c6g
        - key: karpenter.k8s.aws/instance-cpu
          operator: In
          values:
            - '2'
            - '4'
            - '8'
            - '16'
        - key: karpenter.sh/capacity-type
          operator: In
          values:
            - on-demand
        - key: kubernetes.io/os
          operator: In
          values:
            - linux
        - key: kubernetes.io/arch
          operator: In
          values:
            - arm64
      taints:
        - effect: NoSchedule
          key: arch
          value: arm64
  weight: 50

The CRDs were from version 1.0.2 of the karpenter-crd chart, installed into the karpenter namespace.
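
As a side note, the conversion wiring on the installed CRDs can be checked like this (a sketch; .spec.conversion is a standard CRD field, and the webhook service shown should be whatever the karpenter-crd chart configured):

# Confirm the CRD uses webhook conversion and see which service it points at
kubectl get crd nodeclaims.karpenter.sh -o jsonpath='{.spec.conversion.strategy}{"\n"}'
kubectl get crd nodeclaims.karpenter.sh -o jsonpath='{.spec.conversion.webhook.clientConfig.service}{"\n"}'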

Relevant EC2NodeClass:

apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  labels:
    argocd.argoproj.io/instance: prod-karpenter
  name: default
spec:
  amiSelectorTerms:
    - alias: al2023@latest
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        deleteOnTermination: true
        encrypted: true
        volumeSize: 100Gi
        volumeType: gp3
  metadataOptions:
    httpEndpoint: enabled
    httpProtocolIPv6: disabled
    httpPutResponseHopLimit: 2
    httpTokens: required
  role: initial-eks-node-group-20230826183814285900000005
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: prod
  subnetSelectorTerms:
    - tags:
        Name: '*private*'
        karpenter.sh/discovery: prod
  tags:
    karpenter.sh/discovery: prod

@jonathan-innis (Contributor)

"just some of them didn't have the group set for some reason"

I suspect that they didn't have the group because you changed versions without enabling the conversion webhooks. This causes fields to get dropped, and the apiserver doesn't know how to hydrate the data.

Regardless, I suspect that this fix is going to solve your problem, because we will no longer rely on the spec being configured correctly for the Patch to succeed, which should unblock you from this issue.

That said, I wouldn't recommend going through the upgrade path without the conversion webhooks, because you may end up in an undefined state.
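
As a rough sanity check (a sketch; storedVersions is a standard CRD status field), you can also see which API versions still have objects persisted, since NodeClaims written before the upgrade remain at the old stored version until they are rewritten:

# Show which NodeClaim API versions still have stored objects
kubectl get crd nodeclaims.karpenter.sh -o jsonpath='{.status.storedVersions}{"\n"}'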

@jonathan-innis (Contributor)

Willing to try installing a snapshot to validate whether this fixes the issue? Here's the command for installing the snapshot version of Karpenter with the patch fix:

aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 021119463062.dkr.ecr.us-east-1.amazonaws.com

helm upgrade --install karpenter oci://021119463062.dkr.ecr.us-east-1.amazonaws.com/karpenter/snapshot/karpenter --version "0-1f88e3ae893d937267c5c28420597ea5be1d77ef" --namespace "kube-system" --create-namespace \
  --set "settings.clusterName=${CLUSTER_NAME}" \
  --set "settings.interruptionQueue=${CLUSTER_NAME}" \
  --set controller.resources.requests.cpu=1 \
  --set controller.resources.requests.memory=1Gi \
  --set controller.resources.limits.cpu=1 \
  --set controller.resources.limits.memory=1Gi \
  --wait

@ShahroZafar

We plan to go from v0.37.3 to v1.0.2 with the webhook enabled. Is this issue resolved, or should we wait for a newer patch version in v1.0?

@adawalli

Yes, I have the issue on 1.0.2; updating to 1.0.3 did not remove this error.

StaleBot

This issue has been inactive for 14 days. StaleBot will close this stale issue after 14 more days of inactivity.
