
Karpenter not respecting max volumes limit attached per node on AWS #260

Closed
kant777 opened this issue Apr 1, 2023 · 8 comments
Labels
kind/support: Categorizes issue or PR as a support question.
lifecycle/stale: Denotes an issue or PR has remained open with no activity and has become stale.

Comments

kant777 commented Apr 1, 2023

Version

Kubernetes version: 1.25
Karpenter version: v0.27.1

Expected Behavior

The maximum number of volumes that can be attached per node on AWS is 26. Whenever a node already has 26 volumes attached, I expect Karpenter to spin up a new node and somehow signal the Kubernetes scheduler not to schedule any more pods with PVs onto that node. Pods without PVs can still be scheduled.

Actual Behavior

For some reason this is not happening. Instead I see "FailedAttachVolume" warnings, and the Kubernetes scheduler keeps trying to schedule new pods with PVs onto nodes that already have the maximum number of volumes attached.

Steps to Reproduce the Problem

I had 100+ Deployments with one pod each, and each pod has a PV. Please note, this is not one Deployment with 100 pods; it is 100+ Deployments where each Deployment can only have one pod with a PV.
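For anyone trying to reproduce this, here is a minimal sketch of one such Deployment plus its PVC. The names, image, StorageClass, and sizes below are illustrative assumptions, not the exact manifests from this cluster:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example-chainlet-pvc        # illustrative name
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: gp3             # assumed EBS-backed StorageClass
  resources:
    requests:
      storage: 10Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-chainlet            # illustrative name; repeated 100+ times with unique names
spec:
  replicas: 1
  selector:
    matchLabels:
      app: example-chainlet
  template:
    metadata:
      labels:
        app: example-chainlet
    spec:
      containers:
        - name: app
          image: busybox:1.36       # placeholder workload
          command: ["sleep", "3600000"]
          volumeMounts:
            - name: persistent-storage
              mountPath: /data
      volumes:
        - name: persistent-storage
          persistentVolumeClaim:
            claimName: example-chainlet-pvc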

Resource Specs and Logs

sag-sagevmltchain-1680209613070807-1          2m28s       Warning   FailedAttachVolume    pod/chainlet-5f499678d7-crd7g                      AttachVolume.Attach failed for volume "pvc-fd105d23-8aaf-4ffa-8f58-638b704faace" : rpc error: code = Internal desc = Could not attach volume "vol-0eb3cc8c7215b8dfb" to node "i-0996343a5df5c33c5": attachment of disk "vol-0eb3cc8c7215b8dfb" failed, expected device to be attached but was attaching
sag-sagevmltchain-1680209613070807-1          20m         Warning   FailedMount           pod/chainlet-5f499678d7-crd7g                      Unable to attach or mount volumes: unmounted volumes=[saga-chainlet-persistent-storage], unattached volumes=[kube-api-access-9q4p2 saga-chainlet-persistent-storage]: timed out waiting for the condition
sag-sagevmltchain-1680209613070807-1          16s         Warning   FailedMount           pod/chainlet-5f499678d7-crd7g                      Unable to attach or mount volumes: unmounted volumes=[saga-chainlet-persistent-storage], unattached volumes=[saga-chainlet-persistent-storage kube-api-access-9q4p2]: timed out waiting for the condition
sag-testtestthree-1680227647274587-1     2m11s       Warning   FailedScheduling      pod/chainlet-64ccd59b8c-vhfzs                      0/5 nodes are available: 2 node(s) exceed max volume count, 3 Insufficient memory. preemption: 0/5 nodes are available: 5 No preemption victims found for incoming pod.

Here are some more logs:

 Warning  FailedScheduling  4m5s               default-scheduler  0/5 nodes are available: 5 persistentvolumeclaim "saga-chainlet-pvc" not found. preemption: 0/5 nodes are available: 5 Preemption is not helpful for scheduling.
  Warning  FailedScheduling  4m4s               default-scheduler  0/5 nodes are available: 1 Too many pods, 1 node(s) exceed max volume count, 3 Insufficient memory. preemption: 0/5 nodes are available: 5 No preemption victims found for incoming pod.
  Normal   Nominated         4s (x3 over 4m4s)  karpenter          Pod should schedule on node: ip-172-16-88-211.us-west-2.compute.internal

And this:


sag-whalechain-1680295256825605-1            40m         Warning   FailedScheduling                 pod/chainlet-77867f88d8-hfx6f                       0/12 nodes are available: 1 Insufficient cpu, 1 node(s) had volume node affinity conflict, 3 Too many pods, 3 node(s) were unschedulable, 7 Insufficient memory. preemption: 0/12 nodes are available: 4 Preemption is not helpful for scheduling, 8 No preemption victims found for incoming pod.
sag-whalechain-1680295256825605-1            40m         Normal    Scheduled                        pod/chainlet-77867f88d8-hfx6f                       Successfully assigned saga-whalechain-1680295256825605-1/chainlet-77867f88d8-hfx6f to ip-172-16-85-212.us-west-2.compute.internal
sag-whalechain-1680295256825605-1            40m         Warning   FailedAttachVolume               pod/chainlet-77867f88d8-hfx6f                       Multi-Attach error for volume "pvc-517fda86-15de-4087-9698-ffc7cc2d8a89" Volume is already used by pod(s) chainlet-77867f88d8-lk9vb
sag-whalechain-1680295256825605-1            2m10s       Warning   FailedMount                      pod/chainlet-77867f88d8-hfx6f                       Unable to attach or mount volumes: unmounted volumes=[saga-chainlet-persistent-storage], unattached volumes=[saga-chainlet-persistent-storage kube-api-access-6mf8k]: timed out waiting for the condition
sag-whalechain-1680295256825605-1            59s         Warning   FailedAttachVolume               pod/chainlet-77867f88d8-hfx6f                       AttachVolume.Attach failed for volume "pvc-517fda86-15de-4087-9698-ffc7cc2d8a89" : rpc error: code = Internal desc = Could not attach volume "vol-08525e22c69c925c5" to node "i-0ac065a1408c8fd7b": attachment of disk "vol-08525e22c69c925c5" failed, expected device to be attached but was attaching
sag-whalechain-1680295256825605-1            13m         Warning   FailedMount                      pod/chainlet-77867f88d8-hfx6f                       Unable to attach or mount volumes: unmounted volumes=[saga-chainlet-persistent-storage], unattached volumes=[kube-api-access-6mf8k saga-chainlet-persistent-storage]: timed out waiting for the condition
sag-whalechain-1680295256825605-1            56m         Warning   FailedScheduling                 pod/chainlet-77867f88d8-lk9vb                       0/6 nodes are available: 6 persistentvolumeclaim "saga-chainlet-pvc" not found. preemption: 0/6 nodes are available: 6 Preemption is not helpful for scheduling.
sag-whalechain-1680295256825605-1            55m         Warning   FailedScheduling                 pod/chainlet-77867f88d8-lk9vb                       0/6 nodes are available: 6 Insufficient memory. preemption: 0/6 nodes are available: 6 No preemption victims found for incoming pod.
sag-whalechain-1680295256825605-1            55m         Normal    Nominated                        pod/chainlet-77867f88d8-lk9vb                       Pod should schedule on node: ip-172-16-124-229.us-west-2.compute.internal
saga-whalechain-1680295256825605-1            55m         Normal    Scheduled                        pod/chainlet-77867f88d8-lk9vb                       Successfully assigned saga-whalechain-1680295256825605-1/chainlet-77867f88d8-lk9vb to ip-172-16-124-229.us-west-2.compute.internal
sag-whalechain-1680295256825605-1            55m         Normal    SuccessfulAttachVolume           pod/chainlet-77867f88d8-lk9vb 

If you look at this one, even though it eventually reports SuccessfulAttachVolume, why even try to attach volumes to a node that already has the maximum number of volumes attached?

Here is the CSINode spec:

apiVersion: storage.k8s.io/v1
kind: CSINode
metadata:
  annotations:
    storage.alpha.kubernetes.io/migrated-plugins: kubernetes.io/aws-ebs,kubernetes.io/azure-disk,kubernetes.io/azure-file,kubernetes.io/cinder,kubernetes.io/gce-pd,kubernetes.io/vsphere-volume
  creationTimestamp: "2023-03-31T18:24:42Z"
  name: ip-172-16-87-5.ec2.internal
  ownerReferences:
  - apiVersion: v1
    kind: Node
    name: ip-172-16-87-5.ec2.internal
    uid: c1553f47-a550-45a0-8536-9b3bd8ab5710
  resourceVersion: "1615"
  uid: c8f48265-e9d8-49ac-a4ad-c86c7fbc949d
spec:
  drivers:
  - allocatable:
      count: 25
    name: ebs.csi.aws.com
    nodeID: i-05f60c635b93f5d87
    topologyKeys:
    - topology.ebs.csi.aws.com/zone
    

Here is the Provisioner spec:

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  providerRef:
    name: default
  requirements:
  - key: karpenter.k8s.aws/instance-category
    operator: In
    values:
    - c
    - m
    - r
  - key: karpenter.k8s.aws/instance-generation
    operator: Gt
    values:
    - "2"
  - key: karpenter.k8s.aws/instance-family
    operator: In
    values: [c5, m5, r5]
  # Exclude small instance sizes
  - key: karpenter.k8s.aws/instance-size
    operator: NotIn
    values: [nano, micro, small, large]
  - key: "karpenter.k8s.aws/instance-cpu"
    operator: In
    values: ["16", "32"]

  kubeletConfiguration:
    maxPods: 104

  limits:
    resources:
      cpu: "1000"
      memory: 1000Gi

  # Enables consolidation which attempts to reduce cluster cost by both removing un-needed nodes and down-sizing those
  # that can't be removed. Mutually exclusive with the ttlSecondsAfterEmpty parameter.
  consolidation:
    enabled: true

  # If omitted, the feature is disabled and nodes will never expire. If set to less time than it requires for a node
  # to become ready, the node may expire before any pods successfully start.
  ttlSecondsUntilExpired: 2592000 # 30 Days = 60 * 60 * 24 * 30 Seconds

---
apiVersion: karpenter.k8s.aws/v1alpha1
kind: AWSNodeTemplate
metadata:
  name: default
spec:
  securityGroupSelector:
    karpenter.sh/discovery: "{{ eks_cluster_name }}"
  subnetSelector:
    karpenter.sh/discovery: "{{ eks_cluster_name }}"

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
kant777 added the kind/bug label Apr 1, 2023
jonathan-innis added the burning (Time sensitive issues) label Apr 3, 2023
@jonathan-innis (Member)

Can you share your StorageClass in this issue as well?

jonathan-innis (Member) commented Apr 3, 2023

Digging into this a little further, it seems like we are hitting an issue where we expect the StorageClass .spec.provisioner to match a driver name listed in the CSINode .spec.drivers (so that we can read its allocatable count), but that isn't what's happening here, because CSI driver migration automatically translates the in-tree driver name kubernetes.io/aws-ebs to the out-of-tree driver name ebs.csi.aws.com.

As a workaround, could you try updating your StorageClass to use the ebs.csi.aws.com driver name as its provisioner and see if that fixes the issue where we aren't observing the limits?
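For reference, a minimal sketch of such a StorageClass (the class name and parameters here are illustrative assumptions; the relevant part is the provisioner field):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ebs-csi-gp3              # illustrative name
provisioner: ebs.csi.aws.com     # out-of-tree EBS CSI driver, instead of the in-tree kubernetes.io/aws-ebs
parameters:
  type: gp3                      # assumed volume type
volumeBindingMode: WaitForFirstConsumer
reclaimPolicy: Delete

Note that most StorageClass fields, including provisioner, can't be changed in place, so in practice this usually means creating a new StorageClass and pointing new PVCs at it.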

jonathan-innis (Member) commented Apr 3, 2023

This issue is compounded by the fact that these limits were sometimes not respected by the kube-scheduler in some earlier versions of Kubernetes.

r3kzi commented Apr 4, 2023

@jonathan-innis do you have a pointer to the code where you match the StorageClass .spec.provisioner against the CSINode drivers and their allocatable counts?

I'm trying to understand Karpenter a little better and it would help me a lot!

jonathan-innis (Member) commented Apr 4, 2023

@r3kzi Sure, we do the volume limit mapping here, then find the volume usage for the pods on a node here, and we detect the driver that a pod's volume is using here.

For context, we are planning to log errors when users are using in-tree drivers so that there is an alerting mechanism to tell users when Karpenter can't detect limits for auto-scaling. #267 adds this error logging.

We'll document this in our upstream docs, but our stance is that the in-tree providers have largely been deprecated by upstream Kubernetes and users should migrate to the new CSI drivers, particularly because many of these in-tree drivers are going to be removed from upstream Kubernetes in upcoming releases.

jonathan-innis added the kind/support label and removed the kind/bug label Apr 4, 2023
@github-actions (bot)

Labeled for closure due to inactivity in 10 days.

github-actions bot added the lifecycle/stale label Apr 25, 2023
@jonathan-innis (Member)

Closing this one since we made some fixes for issues related to this in #267 and #293.

jmdeal (Member) commented Feb 22, 2024

If anyone is still encountering this issue on recent versions of Karpenter, make sure you've added the following startup taint to your NodePools, as recommended by the EBS CSI driver:

startupTaints:
  - key: ebs.csi.aws.com/agent-not-ready
    effect: NoExecute

This taint will be removed by the driver once it's ready, and it prevents a race condition where the kube-scheduler is able to schedule pods onto a node before its volume limits can be discovered.
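For context, here is a minimal sketch of where that block sits in a NodePool spec (assuming the karpenter.sh/v1beta1 schema; the name, nodeClassRef, and requirements are illustrative):

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default                     # illustrative name
spec:
  template:
    spec:
      nodeClassRef:
        name: default               # illustrative EC2NodeClass reference
      # Applied at node creation and removed by the EBS CSI driver once it has
      # registered on the node and reported its attachable volume limit.
      startupTaints:
        - key: ebs.csi.aws.com/agent-not-ready
          effect: NoExecute
      requirements:
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"]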
