Incorrect Number of maxPods / Node Pods Capacity #6890

Closed
msvechla opened this issue Aug 28, 2024 · 31 comments
Assignees
Labels
bug Something isn't working burning Time sensitive issues

Comments

@msvechla

Description

Observed Behavior:

Since we upgraded to Karpenter v1, we have observed incorrect kubelet maxPods settings on multiple nodes. We initially only noticed the issue with m7a.medium instances, but today we also had a case with an r7a.medium instance.

The issue becomes visible when multiple pods on a node in the cluster are stuck in initializing with:

Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "850cdbed09a9f986b2370c7409fb9e5ee782846056ec7466fb13e863f6e225ad": plugin type="aws-cni" name="aws-cni" failed (add): add cmd: failed to assign an IP address to container

Checking the node, it immediately becomes obvious that too many pods have been scheduled on it, and the node is running out of IP addresses.

In the m7a.medium example, we observed multiple nodes in the same cluster (all m7a.medium) with different values specified in status.capacity.pods.

We observed nodes with 8, 58 and 29 maxPods in the cluster.

According to https://github.com/awslabs/amazon-eks-ami/blob/main/templates/shared/runtime/eni-max-pods.txt#L518 the correct number should be 8. So the nodes which had a higher number specified ran into the issue mentioned above.
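
For reference, the values in eni-max-pods.txt follow the standard VPC CNI formula maxPods = ENIs * (IPv4 addresses per ENI - 1) + 2. Below is a minimal Go sketch of that calculation; the ENI/IP counts for m7a.medium/r7a.medium are taken from the AWS instance type tables rather than from this issue:

package main

import "fmt"

// eniMaxPods computes the standard VPC CNI pod-density formula that
// eni-max-pods.txt is generated from: one IP per ENI is reserved for the ENI
// itself, plus 2 for host-networking pods (e.g. aws-node, kube-proxy).
func eniMaxPods(enis, ipsPerENI int) int {
    return enis*(ipsPerENI-1) + 2
}

func main() {
    // m7a.medium and r7a.medium both support 2 ENIs with 4 IPv4 addresses each.
    fmt.Println(eniMaxPods(2, 4)) // 8, matching eni-max-pods.txt
}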

Logging into the nodes and checking the kubelet config revealed the following:

[root@ip]# cat /etc/kubernetes/kubelet/config.json.d/00-nodeadm.conf |grep maxPods
    "maxPods": 29,
[root@ip]# cat /etc/kubernetes/kubelet/config.json |grep maxPods
  "maxPods": 8,

So it appears that the correct value is specified in /etc/kubernetes/kubelet/config.json but overridden by /etc/kubernetes/kubelet/config.json.d/00-nodeadm.conf.
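
(For context: the kubelet merges drop-in files from config.json.d over the base config.json, and the drop-in wins for any key it sets, which is why the 29 from 00-nodeadm.conf takes effect. A toy Go illustration of that "drop-in wins" merge, not the actual kubelet code:)

package main

import "fmt"

// mergeKubeletConfig is a toy illustration of layered kubelet configuration:
// the base config is applied first, then each drop-in file, so a key set in a
// drop-in overrides the same key in the base config.
func mergeKubeletConfig(base map[string]int, dropIns ...map[string]int) map[string]int {
    merged := map[string]int{}
    for k, v := range base {
        merged[k] = v
    }
    for _, dropIn := range dropIns {
        for k, v := range dropIn {
            merged[k] = v
        }
    }
    return merged
}

func main() {
    base := map[string]int{"maxPods": 8}    // /etc/kubernetes/kubelet/config.json
    dropIn := map[string]int{"maxPods": 29} // config.json.d/00-nodeadm.conf
    fmt.Println(mergeKubeletConfig(base, dropIn)["maxPods"]) // 29: the drop-in wins
}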

We use AL2023, and we do not specify any value for podsPerCore or anything similar in our Karpenter resources.

As different nodes of the same instance type ended up with different values, this could also be some kind of race condition.

Expected Behavior:

Calculated maxPods matches value in https://github.com/awslabs/amazon-eks-ami/blob/main/templates/shared/runtime/eni-max-pods.txt

Reproduction Steps (Please include YAML):

Used EC2NodeClass

apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiSelectorTerms:
  - alias: al2023@v20240807
  blockDeviceMappings:
  - deviceName: /dev/xvda
    ebs:
      deleteOnTermination: true
      encrypted: true
      volumeSize: 30Gi
      volumeType: gp3
  detailedMonitoring: true
  instanceProfile: karpenter
  kubelet:
    kubeReserved:
      cpu: 200m
      ephemeral-storage: 1Gi
      memory: 200Mi
    systemReserved:
      cpu: 100m
      ephemeral-storage: 1Gi
      memory: 200Mi
  metadataOptions:
    httpEndpoint: enabled
    httpProtocolIPv6: disabled
    httpPutResponseHopLimit: 2
    httpTokens: required
  securityGroupSelectorTerms:
  - name: karpenter
  subnetSelectorTerms:
  - tags:
      Name: private
  userData: |
    #!/bin/bash

    # https://github.com/kubernetes-sigs/aws-ebs-csi-driver/blob/master/docs/faq.md#6-minute-delays-in-attaching-volumes
    # https://github.com/kubernetes-sigs/aws-ebs-csi-driver/issues/1955
    echo -e "InhibitDelayMaxSec=45\n" >> /etc/systemd/logind.conf
    systemctl restart systemd-logind
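    # Raise the kubelet graceful node shutdown windows by rewriting config.json
    # in place, then restart kubelet to pick up the change.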
    echo "$(jq ".shutdownGracePeriod=\"400s\"" /etc/kubernetes/kubelet/config.json)" > /etc/kubernetes/kubelet/config.json
    echo "$(jq ".shutdownGracePeriodCriticalPods=\"100s\"" /etc/kubernetes/kubelet/config.json)" > /etc/kubernetes/kubelet/config.json
    systemctl restart kubelet

Versions:

  • Chart Version: v1.0.1
  • Kubernetes Version (kubectl version): v1.29.6-eks-db838b0
  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@msvechla msvechla added bug Something isn't working needs-triage Issues that need to be triaged labels Aug 28, 2024
@rschalo
Contributor

rschalo commented Aug 29, 2024

It looks like this problem is related to https://karpenter.sh/v1.0/troubleshooting/#maxpods-is-greater-than-the-nodes-supported-pod-density.

I'll point out that some of the language there needs to be updated. For example, I believe "NodePods" in Solution 2 was meant to say "NodePools", and the pod density section now directs to the EC2NodeClass kubelet config section, since that configuration moved there from NodePools in v1.

Please share an update if the problem persists after updating the kubelet spec or enabling prefix delegation.

@rschalo rschalo removed the bug Something isn't working label Aug 29, 2024
@msvechla
Author

msvechla commented Aug 29, 2024

I'm not quite sure what you mean. I posted my kubelet spec / the entire EC2NodeClass in the original post above. We are not specifying any maxPods (as described in the troubleshooting guide), so Karpenter itself must be setting an incorrect value.

Or did I misunderstand something?

We are not using prefix delegation, and according to the docs it should also not be required.

Can you share what exactly we should update in the kubelet config?

It is also weird that karpenter sets a different pod capacity for different nodes of the same instance type in the cluster, so to me this still looks like a bug.

@waihong

waihong commented Aug 30, 2024

We are encountering a similar problem that began with the upgrade to v1.0.0. We have noticed an excessive number of pods being scheduled on t3.small/t3a.small instances. Our kubelet configuration does not specify any maxPods setting either.

@iharris-luno

iharris-luno commented Aug 30, 2024

We're also seeing this issue after upgrading to v1.0.0. Around 10% of new nodes have wildly high allocatable pods (e.g. 205 for a c6a.2xlarge), whereas most of the time the calculations are correct (i.e. 44 for a c6a.2xlarge, as we have RESERVED_ENIS=1 in the karpenter controller).
We've had to hardcode maxPods:44 in our EC2NodeClass to prevent hundreds of pods getting stuck in FailedCreatePodSandBox status.
I can confirm that the affected nodes have an incorrect maxPods value in the # Karpenter Generated NodeConfig of the instance user-data. (So AL2023 / kubelet is doing what it's told, and the problem is in karpenter's maxPods calculations)
I can reproduce this issue on multiple AWS accounts / EKS clusters / Regions / Instance families, and it affects both AL2 and AL2023 AMI families, and both with and without RESERVED_ENIS set in the karpenter controller.

@iharris-luno

It appears to be related to the presence or absence of a kubelet stanza in the EC2NodeClass...

Reproduction Steps:
Create a deployment with 50 replicas, with node anti-affinity, in a nodepool which uses the following EC2NodeClass...

apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: iharris
spec:
  amiSelectorTerms:
  - alias: al2023@latest
  blockDeviceMappings:
  - deviceName: /dev/xvda
    ebs:
      encrypted: true
      kmsKeyID: <redacted>
      volumeSize: 150Gi
      volumeType: gp3
  role: karpenter-node-role.<redacted>
  securityGroupSelectorTerms:
  - tags:
      karpenter.sh/discovery: staging
  subnetSelectorTerms:
  - tags:
      karpenter.sh/discovery: staging

All 50 nodes have the correct .status.allocatable.pods - yay!

Change the EC2NodeClass to...

apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: iharris
spec:
  amiSelectorTerms:
  - alias: al2023@latest
  blockDeviceMappings:
  - deviceName: /dev/xvda
    ebs:
      encrypted: true
      kmsKeyID: <redacted>
      volumeSize: 150Gi
      volumeType: gp3
  kubelet:
    imageGCLowThresholdPercent: 65
  role: karpenter-node-role.<redacted>
  securityGroupSelectorTerms:
  - tags:
      karpenter.sh/discovery: staging
  subnetSelectorTerms:
  - tags:
      karpenter.sh/discovery: staging

Around 5-10% of the 50 nodes have an incorrect .status.allocatable.pods - boo!
(Nothing special about imageGCLowThresholdPercent, it seems to be the presence of spec.kubelet that triggers the behaviour.)

I think we need that bug label back, sorry!

@engedaam
Contributor

Can you share your NodePool? Do you have the compatibility.karpenter.sh/v1beta1-kubelet-conversion annotation set on the NodePool?

@iharris-luno

iharris-luno commented Aug 30, 2024

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  annotations:
    karpenter.sh/nodepool-hash: "15612137669406834936"
    karpenter.sh/nodepool-hash-version: v3
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"karpenter.sh/v1","kind":"NodePool","metadata":{"annotations":{},"name":"iharris"},"spec":{"disruption":{"budgets":[{"nodes":"100%"}],"consolidateAfter":"1m","consolidationPolicy":"WhenEmptyOrUnderutilized"},"limits":{"cpu":"500","memory":"2000Gi"},"template":{"metadata":{"labels":{"role":"iharris"}},"spec":{"expireAfter":"1h","nodeClassRef":{"group":"karpenter.k8s.aws","kind":"EC2NodeClass","name":"iharris"},"requirements":[{"key":"karpenter.k8s.aws/instance-category","operator":"In","values":["c","m","r"]},{"key":"karpenter.k8s.aws/instance-generation","operator":"In","values":["5","6"]},{"key":"karpenter.k8s.aws/instance-cpu","operator":"Gt","values":["7"]},{"key":"kubernetes.io/os","operator":"In","values":["linux"]},{"key":"kubernetes.io/arch","operator":"In","values":["amd64"]},{"key":"karpenter.sh/capacity-type","operator":"In","values":["on-demand"]}],"taints":[{"effect":"NoSchedule","key":"iharris","value":"true"}]}}}}
  creationTimestamp: "2024-08-29T15:29:45Z"
  generation: 4
  name: iharris
  resourceVersion: "864235779"
  uid: <redacted>
spec:
  disruption:
    budgets:
    - nodes: 100%
    consolidateAfter: 1m
    consolidationPolicy: WhenEmptyOrUnderutilized
  limits:
    cpu: "500"
    memory: 2000Gi
  template:
    metadata:
      labels:
        role: iharris
    spec:
      expireAfter: 1h
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: iharris
      requirements:
      - key: karpenter.k8s.aws/instance-category
        operator: In
        values:
        - c
        - m
        - r
      - key: karpenter.k8s.aws/instance-generation
        operator: In
        values:
        - "5"
        - "6"
      - key: karpenter.k8s.aws/instance-cpu
        operator: Gt
        values:
        - "7"
      - key: kubernetes.io/os
        operator: In
        values:
        - linux
      - key: kubernetes.io/arch
        operator: In
        values:
        - amd64
      - key: karpenter.sh/capacity-type
        operator: In
        values:
        - on-demand
      taints:
      - effect: NoSchedule
        key: iharris
        value: "true"
status:
  conditions:
  - lastTransitionTime: "2024-08-29T15:29:45Z"
    message: ""
    reason: NodeClassReady
    status: "True"
    type: NodeClassReady
  - lastTransitionTime: "2024-08-29T15:29:45Z"
    message: ""
    reason: Ready
    status: "True"
    type: Ready
  - lastTransitionTime: "2024-08-29T15:29:45Z"
    message: ""
    reason: ValidationSucceeded
    status: "True"
    type: ValidationSucceeded
  resources:
    cpu: "0"
    ephemeral-storage: "0"
    memory: "0"
    nodes: "0"
    pods: "0"

That's a new nodepool, created to test this issue. The old nodepools that were upgraded from v0.35.7 have eg a compatibility.karpenter.sh/v1beta1-nodeclass-reference: '{"name":"default"}' annotation, but none have compatibility.karpenter.sh/v1beta1-kubelet-conversion annotations.

@engedaam engedaam added the bug Something isn't working label Aug 30, 2024
@engedaam
Contributor

Can you provide all of your NodePools and EC2NodeClasses in the cluster?

@iharris-luno

Sure thing, here's the -oyaml from the cluster I'm currently testing in: issue-6890-resources.txt. I've reproduced the issue in both the pre-upgrade default, and the post-upgrade iharris ec2nc/nodepools.

@msvechla
Author

msvechla commented Sep 2, 2024

Could it be related to #6167, which was included in v0.37.0? It mentions data races, and to me this looks like a data race, as nodes of the exact same instance type have different values assigned. As part of the v1 upgrade we also updated from v0.36.2 to the latest v0.37.x.

EDIT: It's probably unrelated, as our clusters on v0.37.x have not shown this issue so far, only clusters on v1.x.

@msvechla
Author

msvechla commented Sep 3, 2024

Something else I noticed:

The NodeClaims of the affected nodes have the correct value in .status.capacity.pods; only the matching Nodes have an incorrect value for .status.capacity.pods.

@iharris-luno what instance types have been affected in your case? Also r7a.medium and m7a.medium?

@iharris-luno

We've seen the issue in c6a.2xlarge and r5a.2xlarge instances.
Good spot on the NodeClaim vs Node versions of .status.capacity.pods. However, it doesn't seem that the NodeClaims are always correct... I just found a NodeClaim with an incorrect .status.capacity.pods of 205.

@engedaam
Contributor

engedaam commented Sep 9, 2024

@iharris-luno I used your configuration and was not able to replicate the issue. Do you think you can share the Nodes and NodeClaims that were impacted by the issue?

@engedaam engedaam self-assigned this Sep 9, 2024
@caiohasouza

Hi,

I have the same issue with a t3.small instance:

nodeClaim.status.allocatable:
    Cpu:                  1930m
    Ephemeral - Storage:  35Gi
    Memory:               1418Mi
    Pods:                 11
node.status.allocatable:
    cpu:                1930m
    ephemeral-storage:  37569620724
    hugepages-1Gi:      0
    hugepages-2Mi:      0
    memory:             1483068Ki
    pods:               8

I'm using version 1.0.1, but I tested with version 1.0.2 too.

Regards

@iharris-luno

I've just spun up 2000 c6a.2xlarge nodes in batches of 50, and not one of them had an incorrect NodeClaim. (If I'd realised how rare they were, compared to incorrect Nodes, I'd have grabbed the yaml of the one I found previously!). Plenty of incorrect nodes though (225 / 2000), so here's one of them and its associated nodeclaim...
node-1.zip

@k24dizzle

Saw these values on an r7a.medium

node.status

Allocatable:
  cpu:                940m
  ephemeral-storage:  95551679124
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             7467144Ki
  pods:               58

nodeclaim.status

  Allocatable:
    Cpu:                        940m
    Ephemeral - Storage:        89Gi
    Memory:                     7134Mi
    Pods:                       8
    vpc.amazonaws.com/pod-eni:  4

@jonathan-innis
Contributor

Unsure yet if it's related, but we did track down a fix for #6987, which is available here:

aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 021119463062.dkr.ecr.us-east-1.amazonaws.com

helm upgrade --install karpenter oci://021119463062.dkr.ecr.us-east-1.amazonaws.com/karpenter/snapshot/karpenter --version "0-2f61ca341eaf5f220a0e70ee12c5d6d6c6c00438" --namespace "kube-system" --create-namespace \
  --set "settings.clusterName=${CLUSTER_NAME}" \
  --set "settings.interruptionQueue=${CLUSTER_NAME}" \
  --set controller.resources.requests.cpu=1 \
  --set controller.resources.requests.memory=1Gi \
  --set controller.resources.limits.cpu=1 \
  --set controller.resources.limits.memory=1Gi \
  --wait

(see #7013)

@jonathan-innis
Contributor

I'm moving this to burning, given the number of different issues and folks that this is impacting.

@jonathan-innis
Contributor

jonathan-innis commented Sep 16, 2024

Ok, I have some good news: I have an initial hypothesis about what's going on and it looks related to this line: https://github.com/aws/karpenter-provider-aws/blob/release-v1.0.x/pkg/providers/amifamily/resolver.go#L210.

What it seems to come down to is that this function returns a pointer which we then mutate on L221 in some cases. This would be fine if we only called this function once and the NodeClass wasn't used elsewhere throughout the code, but because we are mutating the original object rather than just reading it, we most likely break our consistent view of the object throughout the code.

From looking at the code, I could reason about the following order of operations:

  1. The first launch template has a kubeletConfiguration but doesn't have a maxPods defined, so we resolve it and then update the kubeletConfiguration in place
  2. Subsequent accesses of the kubeletConfiguration then use the same maxPods value.

This also explains why you only see this issue when you set kubeletConfig -- that's the only case where we reuse the existing pointer instead of creating a new one.

Still validating, but if that's the case, should be a pretty easy fix -- just a tough thing to see :)
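
To illustrate, here is a heavily simplified, self-contained Go sketch of the failure mode and the fix; the types and function names are hypothetical stand-ins, not the actual resolver code:

package main

import "fmt"

// KubeletConfiguration and NodeClass are simplified stand-ins for the real types.
type KubeletConfiguration struct {
    MaxPods *int32
}

type NodeClass struct {
    Kubelet *KubeletConfiguration
}

// resolveShared mimics the suspected bug: it returns the NodeClass's own pointer,
// so setting MaxPods here mutates the shared object for every later caller.
func resolveShared(nc *NodeClass, instanceMaxPods int32) *KubeletConfiguration {
    cfg := nc.Kubelet
    if cfg.MaxPods == nil {
        cfg.MaxPods = &instanceMaxPods
    }
    return cfg
}

// resolveCopied mimics the fix: resolve against a copy, so per-instance-type
// values never persist on the shared NodeClass.
func resolveCopied(nc *NodeClass, instanceMaxPods int32) *KubeletConfiguration {
    cfg := &KubeletConfiguration{}
    if nc.Kubelet != nil {
        copied := *nc.Kubelet
        cfg = &copied
    }
    if cfg.MaxPods == nil {
        cfg.MaxPods = &instanceMaxPods
    }
    return cfg
}

func main() {
    shared := &NodeClass{Kubelet: &KubeletConfiguration{}}
    fmt.Println(*resolveShared(shared, 8).MaxPods)  // 8: first resolution wins
    fmt.Println(*resolveShared(shared, 58).MaxPods) // still 8: stale value reused

    fresh := &NodeClass{Kubelet: &KubeletConfiguration{}}
    fmt.Println(*resolveCopied(fresh, 8).MaxPods)  // 8
    fmt.Println(*resolveCopied(fresh, 58).MaxPods) // 58: each call resolves independently
}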

@jonathan-innis
Contributor

jonathan-innis commented Sep 16, 2024

Confirmed, that's exactly what's happening. Added some print lines and this is what I see with the existing code (you can actually see it returning different values for the same instance type across different NodeClaims):

...
         // nolint:gosec
	// We know that it's not possible to have values that would overflow int32 here since we control
	// the maxPods values that we pass in here
	if kubeletConfig.MaxPods == nil {
		fmt.Printf("NodeClaim: %s. We should hit this every time\n", nodeClaim.Name)
		kubeletConfig.MaxPods = lo.ToPtr(int32(maxPods))
	}
	fmt.Printf("NodeClaim: %s, Generated MaxPods: %d, Used MaxPods: %d\n", nodeClaim.Name, maxPods, lo.FromPtr(kubeletConfig.MaxPods))
...
NodeClaim: nodes-default-amd64-cjrgj, Generated MaxPods: 58, Used MaxPods: 58
NodeClaim: nodes-default-amd64-cjrgj, Generated MaxPods: 234, Used MaxPods: 58
NodeClaim: nodes-default-amd64-9fqc5. We should hit this every time
NodeClaim: nodes-default-amd64-9fqc5, Generated MaxPods: 58, Used MaxPods: 58
NodeClaim: nodes-default-amd64-9fqc5, Generated MaxPods: 234, Used MaxPods: 58
NodeClaim: nodes-default-amd64-7d5tc. We should hit this every time
NodeClaim: nodes-default-amd64-7d5tc, Generated MaxPods: 58, Used MaxPods: 58
NodeClaim: nodes-default-amd64-7d5tc, Generated MaxPods: 234, Used MaxPods: 58
NodeClaim: nodes-default-amd64-cllx9. We should hit this every time
NodeClaim: nodes-default-amd64-cllx9, Generated MaxPods: 234, Used MaxPods: 234
NodeClaim: nodes-default-amd64-cllx9, Generated MaxPods: 58, Used MaxPods: 234
NodeClaim: nodes-default-amd64-wtbd9. We should hit this every time
NodeClaim: nodes-default-amd64-wtbd9, Generated MaxPods: 58, Used MaxPods: 58
NodeClaim: nodes-default-amd64-wtbd9, Generated MaxPods: 234, Used MaxPods: 58
NodeClaim: nodes-default-amd64-cj8jr. We should hit this every time
NodeClaim: nodes-default-amd64-cj8jr, Generated MaxPods: 58, Used MaxPods: 58
NodeClaim: nodes-default-amd64-cj8jr, Generated MaxPods: 234, Used MaxPods: 58

And when I change the pointer to be deep-copied.

...
	ret, err := utils.GetKubeletConfigurationWithNodeClaim(nodeClaim, nodeClass)
	if err != nil {
		return nil, fmt.Errorf("resolving kubelet configuration, %w", err)
	}
	kubeletConfig := &v1.KubeletConfiguration{}
	if ret != nil {
		kubeletConfig = ret.DeepCopy()
	}
         // nolint:gosec
	// We know that it's not possible to have values that would overflow int32 here since we control
	// the maxPods values that we pass in here
	if kubeletConfig.MaxPods == nil {
		fmt.Printf("NodeClaim: %s. We should hit this every time\n", nodeClaim.Name)
		kubeletConfig.MaxPods = lo.ToPtr(int32(maxPods))
	}
	fmt.Printf("NodeClaim: %s, Generated MaxPods: %d, Used MaxPods: %d\n", nodeClaim.Name, maxPods, lo.FromPtr(kubeletConfig.MaxPods))
...
NodeClaim: nodes-default-amd64-gbczg. We should hit this every time
NodeClaim: nodes-default-amd64-gbczg, Generated MaxPods: 58, Used MaxPods: 58
NodeClaim: nodes-default-amd64-gbczg. We should hit this every time
NodeClaim: nodes-default-amd64-gbczg, Generated MaxPods: 234, Used MaxPods: 234
NodeClaim: nodes-default-amd64-r6vqc. We should hit this every time
NodeClaim: nodes-default-amd64-r6vqc, Generated MaxPods: 58, Used MaxPods: 58
NodeClaim: nodes-default-amd64-r6vqc. We should hit this every time
NodeClaim: nodes-default-amd64-r6vqc, Generated MaxPods: 234, Used MaxPods: 234
NodeClaim: nodes-default-amd64-7p5hk. We should hit this every time
NodeClaim: nodes-default-amd64-7p5hk, Generated MaxPods: 58, Used MaxPods: 58
NodeClaim: nodes-default-amd64-7p5hk. We should hit this every time
NodeClaim: nodes-default-amd64-7p5hk, Generated MaxPods: 234, Used MaxPods: 234

@jonathan-innis
Contributor

We'll raise something and get some testing out for it tomorrow morning (PST), but for now it looks like we can actually make progress towards a patch 🎉

@jonathan-innis
Contributor

PR has been raised. You should be able to try the snapshot with

aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 021119463062.dkr.ecr.us-east-1.amazonaws.com

helm upgrade --install karpenter oci://021119463062.dkr.ecr.us-east-1.amazonaws.com/karpenter/snapshot/karpenter --version "0-cd04d65077eaed45e212e2140c0081768f3de547" --namespace "kube-system" --create-namespace \
  --set "settings.clusterName=${CLUSTER_NAME}" \
  --set "settings.interruptionQueue=${CLUSTER_NAME}" \
  --set controller.resources.requests.cpu=1 \
  --set controller.resources.requests.memory=1Gi \
  --set controller.resources.limits.cpu=1 \
  --set controller.resources.limits.memory=1Gi \
  --wait

For those willing to try -- let me know if you see the issue after the new install.

@jonathan-innis jonathan-innis self-assigned this Sep 16, 2024
@iharris-luno

Looking good! 500 nodes created so far with no maxPods issues in either node or nodeclaim resources. I'll leave it churning for a bit, just in case, but looks like the problem's fixed. 🎉 Thank you!

@jonathan-innis
Contributor

#7020 merged! So I think we are good to close this out now. We should have a patch that includes this soon! Please continue to post on this issue if you see any more issues with this, but from what I'm hearing, this appears to be resolved!

@guitmz

guitmz commented Sep 24, 2024

@jonathan-innis when is the release expected? We are facing this issue now and Karpenter can't spawn new machines. Adding/removing maxPods does not help.

@msvechla
Author

@jonathan-innis v1.0.3 was released yesterday, but it looks like this fix is still not part of the release. Is there a specific reason for that? It looks like this is affecting quite a few users.

@caiohasouza

Version 1.0.4 was released, but it doesn't include this fix either.

@aoi1

aoi1 commented Oct 4, 2024

After upgrading Karpenter to v1.0.1 we encountered this significant issue. It has a major impact on our environment, and we cannot proceed with upgrading Karpenter to v1 until it is resolved. We would appreciate it if you could let us know which release will include the fix.

@engedaam
Contributor

engedaam commented Oct 4, 2024

@caiohasouza did you upgrade to v1.0.4? The fix was included in that version: https://github.com/aws/karpenter-provider-aws/releases/tag/v1.0.4

@caiohasouza

@engedaam, I upgraded to v1.0.6 today. If the issue persists, I will update here.

Thank you!

@mohammed-nazim

Hi Team,

After upgrading to 1.0.6, I am still receiving this error:

{"level":"ERROR","time":"2024-10-08T05:47:02.546Z","logger":"controller","message":"consistency error","commit":"6174c75","controller":"nodeclaim.consistency","controllerGroup":"karpenter.sh","controllerKind":"NodeClaim","NodeClaim":{"name":"XXXXXXXXXXXXXXXXXX"},"namespace":"","name":"XXXXXXXXXXXXXXXXXX","reconcileID":"XXXXXXXXXXXXXXXXXXXXXXXXXX","error":"expected 234 of resource pods, but found 58 (24.8% of expected)"}
