Karpenter on EKS fails to remove 'uninitialized' taint from nodes #7426

Open
konigpeter opened this issue Nov 22, 2024 · 12 comments
Labels: bug (Something isn't working), needs-triage (Issues that need to be triaged)

Comments

konigpeter commented Nov 22, 2024

Description

Observed Behavior:
Karpenter provisions a new node, and it successfully joins the EKS cluster. However, the taint "node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule" is never removed, which prevents workload pods (e.g., from Deployments) from being scheduled on the node. As a result, the node only runs the default add-on pods (e.g., kube-proxy, VPC CNI) and never receives workload pods.

If the taint is manually removed from the node, everything works as expected, and the pods are scheduled without issues.
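For reference, manually clearing the taint amounts to something like the following (the node name is a placeholder):

  kubectl taint nodes <node-name> node.cloudprovider.kubernetes.io/uninitialized:NoSchedule-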

This behavior was observed in version 0.37 and persists after updating to version 1.0. It leads to availability issues, as pods remain in a Pending state despite the node being available.

Karpenter does not generate any logs about this issue, providing no insight into why the taint is not being removed.
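For anyone debugging the same symptom, the relevant state can be inspected with standard kubectl queries (a sketch; the node and NodeClaim names are placeholders):

  # Show the taints still present on the affected node
  kubectl get node <node-name> -o jsonpath='{.spec.taints}'

  # Show the conditions Karpenter reports on the corresponding NodeClaim
  kubectl describe nodeclaim <nodeclaim-name>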

K8s Events:

not all pods would schedule, xxxx/xxxx-64f49f59f4-jgdmj => would
  schedule against uninitialized nodeclaim/live-mnxjv,
  node/ip-172-16-76-204.sa-east-1.compute.internal
  xxxx/xxxx-7864b5b8-qlsqx => would schedule against
  uninitialized nodeclaim/live-mnxjv,
  node/ip-172-16-76-204.sa-east-1.compute.internal
  xxxx/xxxx-85f77756f4-wtc2s => would schedule against
  uninitialized nodeclaim/live-mnxjv,
  node/ip-172-16-76-204.sa-east-1.compute.internal

NodeClaim Condition Message:

KnownEphemeralTaint "node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule" still exists

Node Taint:

  taints:
    - effect: NoSchedule
      key: live
    - effect: NoSchedule
      key: node.cloudprovider.kubernetes.io/uninitialized
      value: 'true'

Expected Behavior:
When Karpenter provisions a new node and it successfully joins the EKS cluster, the taint "node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule" should be automatically removed once the node is fully initialized. This would allow pods from deployments to be scheduled on the node alongside the default add-on pods (e.g., kube-proxy, VPC CNI).

Nodes should not require manual intervention to remove the taint, and workload pods should be scheduled as soon as the node becomes available, ensuring no pods remain in a Pending state due to this issue.

Reproduction Steps (Please include YAML):
This issue is intermittent and does not occur with every node provisioned by Karpenter, making it difficult to consistently reproduce. However, the following steps outline a typical setup where the issue has been observed:

Deploy Karpenter on an EKS cluster.
Create a provisioner with support for Spot and On-Demand Instances.
Trigger workloads that scale the cluster and provision new nodes.
Occasionally, newly provisioned nodes will join the cluster but retain the taint "node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule", preventing workload pods from being scheduled.
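When it does occur, affected nodes can be spotted with a quick filter on the taint key (a sketch using plain kubectl):

  # List node names together with their taint keys, then filter for the uninitialized taint
  kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.taints[*].key}{"\n"}{end}' | grep uninitialized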

Versions:

  • Chart Version: v1.0.8
  • Kubernetes Version (EKS): 1.30
  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
konigpeter added the bug and needs-triage labels on Nov 22, 2024

mcorbin commented Nov 22, 2024

Hello,

I just created an issue that I think is related to this problem this morning: #7424.

For example, we currently have a NodeClaim that has been in a non-ready state (Ready False) for 40 minutes. As you described, the node itself started correctly (the DaemonSet pods are running, etc.).

A kubectl get nodeclaim returns:

general-purpose-rdvzw m6i.xlarge on-demand eu-west-3b ip-redacted.eu-west-3.compute.internal False 46m

Conditions:

  - lastTransitionTime: "2024-11-22T11:37:51Z"
    message: KnownEphemeralTaint "node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule"
      still exists
    reason: KnownEphemeralTaintsExist
    status: "False"
    type: Initialized
  - lastTransitionTime: "2024-11-22T11:37:00Z"
    message: ""
    reason: Launched
    status: "True"
    type: Launched
  - lastTransitionTime: "2024-11-22T11:37:24Z"
    message: Initialized=False
    reason: UnhealthyDependents
    status: "False"
    type: Ready
  - lastTransitionTime: "2024-11-22T11:37:24Z"
    message: ""
    reason: Registered
    status: "True"
    type: Registered

As mentioned in my issue, we also don't understand why this happens; it seems random, and for some nodes it works as expected.

The node also has the NoSchedule taint:

  taints:
  - effect: NoSchedule
    key: node.cloudprovider.kubernetes.io/uninitialized
    value: "true"

We rolled back to 0.37.6 (we were on 1.0.8), but the issue is still present. We didn't have this issue on 0.37.0 (the version we were running before upgrading), but we now have difficulty rolling back further due to the CRD changes, various webhook issues, etc.
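For anyone else hitting this, the stuck NodeClaims and the message of their Initialized condition can be listed in one command (a sketch based on the conditions shown above):

  # List each NodeClaim with the status and message of its Initialized condition
  kubectl get nodeclaims -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Initialized")].status}{"\t"}{.status.conditions[?(@.type=="Initialized")].message}{"\n"}{end}'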

mcorbin commented Nov 22, 2024

To give another example of the randomness of this behavior: I cleaned up all pending NodeClaims.

Then Karpenter started new nodes:

kubectl get nodeclaim | grep False

general-purpose-2dwpf                 c6i.2xlarge        on-demand   eu-west-3a   ip-redacted.eu-west-3.compute.internal   False   66s
general-purpose-5w89g                 m6i.xlarge         on-demand   eu-west-3a   ip-redacted.eu-west-3.compute.internal   False   66s
general-purpose-69sch                 m6i.xlarge         on-demand   eu-west-3c   ip-redacted.eu-west-3.compute.internal   False   66s
general-purpose-8m94t                 m6i.xlarge         on-demand   eu-west-3b   ip-redacted.eu-west-3.compute.internal    False   91s
general-purpose-9sk58                 m6i.xlarge         on-demand   eu-west-3c   ip-redacted.eu-west-3.compute.internal   False   66s
general-purpose-bpx7s                 m6i.xlarge         on-demand   eu-west-3c   ip-redacted.eu-west-3.compute.internal     False   6m30s
general-purpose-dw8nz                 c6i.2xlarge        on-demand   eu-west-3a   ip-redacted.eu-west-3.compute.internal    False   66s
general-purpose-h9hnx                 m6i.xlarge         on-demand   eu-west-3b   ip-redacted.eu-west-3.compute.internal   False   66s
general-purpose-hxnzg                 c6i.2xlarge        on-demand   eu-west-3a   ip-redacted.eu-west-3.compute.internal   False   66s
general-purpose-pvdzl                 m6i.xlarge         on-demand   eu-west-3c   ip-redacted.eu-west-3.compute.internal    False   66s
general-purpose-t8j9z                 m6i.xlarge         on-demand   eu-west-3c   ip-redacted.eu-west-3.compute.internal   False   66s
general-purpose-tfhgf                 m6i.xlarge         on-demand   eu-west-3c   ip-redacted.eu-west-3.compute.internal   False   66s

15 minutes later, we see that some nodes have successfully joined the cluster while others are still Ready False:

kubectl get nodeclaim | grep False

general-purpose-5w89g                 m6i.xlarge         on-demand   eu-west-3a   ip-redacted.eu-west-3.compute.internal   False   15m
general-purpose-9sk58                 m6i.xlarge         on-demand   eu-west-3c   ip-redacted.eu-west-3.compute.internal   False   15m
general-purpose-h9hnx                 m6i.xlarge         on-demand   eu-west-3b   ip-redacted.eu-west-3.compute.internal   False   15m
general-purpose-hxnzg                 c6i.2xlarge        on-demand   eu-west-3a   ip-redacted.eu-west-3.compute.internal   False   15m
general-purpose-tfhgf                 m6i.xlarge         on-demand   eu-west-3c   ip-redacted.eu-west-3.compute.internal   False   15m

silvadavitor commented

I am facing the same problem.

This issue is greatly complicating my situation.

Versions:
  • Kubernetes Version (EKS): 1.30

tainnsre commented Nov 24, 2024

Hi @silvadavitor,
Can you share your NodePool and EC2NodeClass configuration?

konigpeter (Author) commented

NodePool

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: "live"
spec:
  template:
    spec: 
      expireAfter: Never # 30 * 24h = 720h
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: kubernetes.io/os
          operator: In
          values: ["linux"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        - key: "karpenter.k8s.aws/instance-hypervisor"
          operator: In
          values: ["nitro"]
        - key: karpenter.k8s.aws/instance-size
          operator: In
          values: [large, xlarge, 2xlarge]           
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass      
        name: live
      taints:
        - key: live
          effect: NoSchedule        
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 3m      
  limits:
    cpu: "300"
    memory: 600Gi

EC2NodeClass

apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: live
spec:
  amiSelectorTerms:
    - alias: al2023@latest
  role: "EKS_Karpenter_Node_Role"
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 60Gi
        volumeType: gp3
        #iops: 10000
        encrypted: true
        deleteOnTermination: true
        #throughput: 125
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: "k8s-live"
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: "k8s-live"
  tags:
    Name: karpenter.sh/live
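Side note: since this NodePool also applies a custom live:NoSchedule taint, workload pods additionally need a matching toleration before they can land on these nodes. A quick way to check (deployment name and namespace are placeholders):

  # Verify the workload tolerates the custom "live" taint
  kubectl get deployment <your-deployment> -n <namespace> -o jsonpath='{.spec.template.spec.tolerations}'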

tainnsre commented Nov 25, 2024

Hello @mcorbin,
You can try testing with the config below (nodepool-ec2nodeclass.yaml):

apiVersion: karpenter.sh/v1beta1
# apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: nodepool-large-spot
spec:
  template:
    metadata:
      labels:
        ProjectCost: SRE
        environments: SRE-Dev
        billing-team: SRE
      annotations:
        teams/owner: "SRE"
    spec:
      requirements:
        - key: kubernetes.io/os
          operator: In
          values:
          - linux
        - key: kubernetes.io/arch
          operator: In
          values:
          - amd64
        - key: karpenter.sh/capacity-type
          operator: In
          values:
          - spot
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values:
          - t3a
          - t3
        - key: karpenter.k8s.aws/instance-size
          operator: In
          values:
          - large
      nodeClassRef:
        name: nodeclass-large-spot
  limits:
    cpu: 24
  disruption:
    budgets:
    - nodes: 10%
    consolidationPolicy: WhenEmpty
    consolidateAfter: 24h
    expireAfter: 120h
---
apiVersion: karpenter.k8s.aws/v1beta1
# apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: nodeclass-large-spot
spec:
  amiFamily: AL2 # Amazon Linux 2
  role: eksctl-KarpenterNodeRole-eks-sre-op
  blockDeviceMappings:                                                                                                                                                                                                                                                                                                                                                                                                                                                            
  - deviceName: /dev/xvda
    ebs:
      deleteOnTermination: true
      iops: 3000
      throughput: 125
      volumeSize: 30Gi
      volumeType: gp3
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: eks-sre-op
    - id: subnet-1234 #     zone: ap-southeast-1c
    - id: subnet-2456 #     zone: ap-southeast-1a
    - id: subnet-789 #     zone: ap-southeast-1b
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: eks-sre-op
    - id: sg-234
    - id: sg-45645
    - id: sg-7567
  amiSelectorTerms:
    - id: "ami-123" # EKS v1.31 AMD_AMI_ID
    - id: "ami-234" # EKS v1.31 ARM_AMI_ID
    - id: "ami-456" # EKS v1.31 GPU_AMI_ID
  tags:
    ProjectCost: SRE
    environments: SRE-Dev

Hi @konigpeter,
This is my config above; I don't use a taint. You can try removing the taint and testing on a new dev EKS cluster.
In my case, applying this config with apiVersion: karpenter.k8s.aws/v1 did not work, but applying it with apiVersion: karpenter.k8s.aws/v1beta1 worked fine.
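If the behavior really differs between v1 and v1beta1, it may be worth checking which API versions the installed Karpenter CRDs are actually serving (a sketch; these are the standard CRD names shipped with Karpenter):

  # Show which API versions each Karpenter CRD serves
  kubectl get crd nodepools.karpenter.sh -o jsonpath='{.spec.versions[*].name}{"\n"}'
  kubectl get crd nodeclaims.karpenter.sh -o jsonpath='{.spec.versions[*].name}{"\n"}'
  kubectl get crd ec2nodeclasses.karpenter.k8s.aws -o jsonpath='{.spec.versions[*].name}{"\n"}'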

mcorbin commented Nov 25, 2024

Hello,
We already tried reverting to v1beta1, without any change in behavior.

tainnsre commented Nov 25, 2024

Hello @mcorbin,
Have you tried restarting the Karpenter deployment?
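(For reference, a restart would be something like the following, assuming the chart installed a karpenter Deployment in the karpenter namespace; adjust to your install:)

  kubectl rollout restart deployment/karpenter -n karpenter
  kubectl rollout status deployment/karpenter -n karpenter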

mcorbin commented Nov 25, 2024

We'll test this. At the moment I haven't seen any pending NodeClaims since this morning, but our clusters have only just started to autoscale as the week begins.

mcorbin commented Nov 28, 2024

@konigpeter Do you still experience the issue? On our side we don't see it anymore, but we don't know why.

konigpeter (Author) commented

@mcorbin We have not experienced this issue anymore either. To ensure reliability, we ran several tests forcing 10 to 100 NodeClaims, and everything worked without errors.

That said, we are also unsure about the root cause of the problem or why it stopped occurring.

mcorbin commented Nov 28, 2024

To ensure reliability, we conducted several tests by forcing 10 to 100 NodeClaims, and everything worked without errors.

Same, we did a rollout of probably hundreds of nodes and we can't reproduce :/
