
Karpenter not respecting max volumes limit attached per node on AWS #260

Closed
kant777 opened this issue Apr 1, 2023 · 8 comments
Labels
kind/support: Categorizes issue or PR as a support question.
lifecycle/stale: Denotes an issue or PR has remained open with no activity and has become stale.

Comments

kant777 commented Apr 1, 2023

Version

Kubernetes version: 1.25
Karpenter version: v0.27.1

Expected Behavior

The maximum number of volumes that can be attached per node on AWS is 26. Whenever a node already has 26 volumes attached, I expect Karpenter to spin up a new node and somehow signal the Kubernetes scheduler not to schedule any more pods with PVs onto that node. Pods without PVs can still be scheduled.

Actual Behavior

For some reason this is not happening. Instead I see "FailedAttachVolume" warnings, and the Kubernetes scheduler keeps trying to schedule new pods with PVs onto nodes that already have the maximum number of volumes attached.

Steps to Reproduce the Problem

I had 100+ Deployments with one pod each, and each pod has a PV. Please note, this is not one Deployment with 100 pods; it is 100+ Deployments where each Deployment can only have one pod with a PV.
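For anyone trying to reproduce this, here is a minimal sketch of one such Deployment plus its PVC. The names, image, StorageClass, and sizes below are illustrative assumptions, not the exact manifests from this cluster:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example-chainlet-pvc        # illustrative name
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: gp3             # assumed EBS-backed StorageClass
  resources:
    requests:
      storage: 10Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-chainlet            # illustrative name; repeated 100+ times with unique names
spec:
  replicas: 1
  selector:
    matchLabels:
      app: example-chainlet
  template:
    metadata:
      labels:
        app: example-chainlet
    spec:
      containers:
        - name: app
          image: busybox:1.36       # placeholder workload
          command: ["sleep", "3600000"]
          volumeMounts:
            - name: persistent-storage
              mountPath: /data
      volumes:
        - name: persistent-storage
          persistentVolumeClaim:
            claimName: example-chainlet-pvc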

Resource Specs and Logs

sag-sagevmltchain-1680209613070807-1          2m28s       Warning   FailedAttachVolume    pod/chainlet-5f499678d7-crd7g                      AttachVolume.Attach failed for volume "pvc-fd105d23-8aaf-4ffa-8f58-638b704faace" : rpc error: code = Internal desc = Could not attach volume "vol-0eb3cc8c7215b8dfb" to node "i-0996343a5df5c33c5": attachment of disk "vol-0eb3cc8c7215b8dfb" failed, expected device to be attached but was attaching
sag-sagevmltchain-1680209613070807-1          20m         Warning   FailedMount           pod/chainlet-5f499678d7-crd7g                      Unable to attach or mount volumes: unmounted volumes=[saga-chainlet-persistent-storage], unattached volumes=[kube-api-access-9q4p2 saga-chainlet-persistent-storage]: timed out waiting for the condition
sag-sagevmltchain-1680209613070807-1          16s         Warning   FailedMount           pod/chainlet-5f499678d7-crd7g                      Unable to attach or mount volumes: unmounted volumes=[saga-chainlet-persistent-storage], unattached volumes=[saga-chainlet-persistent-storage kube-api-access-9q4p2]: timed out waiting for the condition
sag-testtestthree-1680227647274587-1     2m11s       Warning   FailedScheduling      pod/chainlet-64ccd59b8c-vhfzs                      0/5 nodes are available: 2 node(s) exceed max volume count, 3 Insufficient memory. preemption: 0/5 nodes are available: 5 No preemption victims found for incoming pod.

Here are some more logs:

 Warning  FailedScheduling  4m5s               default-scheduler  0/5 nodes are available: 5 persistentvolumeclaim "saga-chainlet-pvc" not found. preemption: 0/5 nodes are available: 5 Preemption is not helpful for scheduling.
  Warning  FailedScheduling  4m4s               default-scheduler  0/5 nodes are available: 1 Too many pods, 1 node(s) exceed max volume count, 3 Insufficient memory. preemption: 0/5 nodes are available: 5 No preemption victims found for incoming pod.
  Normal   Nominated         4s (x3 over 4m4s)  karpenter          Pod should schedule on node: ip-172-16-88-211.us-west-2.compute.internal

And this:


sag-whalechain-1680295256825605-1            40m         Warning   FailedScheduling                 pod/chainlet-77867f88d8-hfx6f                       0/12 nodes are available: 1 Insufficient cpu, 1 node(s) had volume node affinity conflict, 3 Too many pods, 3 node(s) were unschedulable, 7 Insufficient memory. preemption: 0/12 nodes are available: 4 Preemption is not helpful for scheduling, 8 No preemption victims found for incoming pod.
sag-whalechain-1680295256825605-1            40m         Normal    Scheduled                        pod/chainlet-77867f88d8-hfx6f                       Successfully assigned saga-whalechain-1680295256825605-1/chainlet-77867f88d8-hfx6f to ip-172-16-85-212.us-west-2.compute.internal
sag-whalechain-1680295256825605-1            40m         Warning   FailedAttachVolume               pod/chainlet-77867f88d8-hfx6f                       Multi-Attach error for volume "pvc-517fda86-15de-4087-9698-ffc7cc2d8a89" Volume is already used by pod(s) chainlet-77867f88d8-lk9vb
sag-whalechain-1680295256825605-1            2m10s       Warning   FailedMount                      pod/chainlet-77867f88d8-hfx6f                       Unable to attach or mount volumes: unmounted volumes=[saga-chainlet-persistent-storage], unattached volumes=[saga-chainlet-persistent-storage kube-api-access-6mf8k]: timed out waiting for the condition
sag-whalechain-1680295256825605-1            59s         Warning   FailedAttachVolume               pod/chainlet-77867f88d8-hfx6f                       AttachVolume.Attach failed for volume "pvc-517fda86-15de-4087-9698-ffc7cc2d8a89" : rpc error: code = Internal desc = Could not attach volume "vol-08525e22c69c925c5" to node "i-0ac065a1408c8fd7b": attachment of disk "vol-08525e22c69c925c5" failed, expected device to be attached but was attaching
sag-whalechain-1680295256825605-1            13m         Warning   FailedMount                      pod/chainlet-77867f88d8-hfx6f                       Unable to attach or mount volumes: unmounted volumes=[saga-chainlet-persistent-storage], unattached volumes=[kube-api-access-6mf8k saga-chainlet-persistent-storage]: timed out waiting for the condition
sag-whalechain-1680295256825605-1            56m         Warning   FailedScheduling                 pod/chainlet-77867f88d8-lk9vb                       0/6 nodes are available: 6 persistentvolumeclaim "saga-chainlet-pvc" not found. preemption: 0/6 nodes are available: 6 Preemption is not helpful for scheduling.
sag-whalechain-1680295256825605-1            55m         Warning   FailedScheduling                 pod/chainlet-77867f88d8-lk9vb                       0/6 nodes are available: 6 Insufficient memory. preemption: 0/6 nodes are available: 6 No preemption victims found for incoming pod.
sag-whalechain-1680295256825605-1            55m         Normal    Nominated                        pod/chainlet-77867f88d8-lk9vb                       Pod should schedule on node: ip-172-16-124-229.us-west-2.compute.internal
saga-whalechain-1680295256825605-1            55m         Normal    Scheduled                        pod/chainlet-77867f88d8-lk9vb                       Successfully assigned saga-whalechain-1680295256825605-1/chainlet-77867f88d8-lk9vb to ip-172-16-124-229.us-west-2.compute.internal
sag-whalechain-1680295256825605-1            55m         Normal    SuccessfulAttachVolume           pod/chainlet-77867f88d8-lk9vb 

If you look at this one, even though it eventually reports SuccessfulAttachVolume, why even try to attach volumes to a node that already has the maximum number of volumes attached?

Here is the CSINode spec:

apiVersion: storage.k8s.io/v1
kind: CSINode
metadata:
  annotations:
    storage.alpha.kubernetes.io/migrated-plugins: kubernetes.io/aws-ebs,kubernetes.io/azure-disk,kubernetes.io/azure-file,kubernetes.io/cinder,kubernetes.io/gce-pd,kubernetes.io/vsphere-volume
  creationTimestamp: "2023-03-31T18:24:42Z"
  name: ip-172-16-87-5.ec2.internal
  ownerReferences:
  - apiVersion: v1
    kind: Node
    name: ip-172-16-87-5.ec2.internal
    uid: c1553f47-a550-45a0-8536-9b3bd8ab5710
  resourceVersion: "1615"
  uid: c8f48265-e9d8-49ac-a4ad-c86c7fbc949d
spec:
  drivers:
  - allocatable:
      count: 25
    name: ebs.csi.aws.com
    nodeID: i-05f60c635b93f5d87
    topologyKeys:
    - topology.ebs.csi.aws.com/zone
    

Here is the Provisioner spec:

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  providerRef:
    name: default
  requirements:
  - key: karpenter.k8s.aws/instance-category
    operator: In
    values:
    - c
    - m
    - r
  - key: karpenter.k8s.aws/instance-generation
    operator: Gt
    values:
    - "2"
  - key: karpenter.k8s.aws/instance-family
    operator: In
    values: [c5, m5, r5]
  # Exclude small instance sizes
  - key: karpenter.k8s.aws/instance-size
    operator: NotIn
    values: [nano, micro, small, large]
  - key: "karpenter.k8s.aws/instance-cpu"
    operator: In
    values: ["16", "32"]

  kubeletConfiguration:
    maxPods: 104

  limits:
    resources:
      cpu: "1000"
      memory: 1000Gi

  # Enables consolidation which attempts to reduce cluster cost by both removing un-needed nodes and down-sizing those
  # that can't be removed. Mutually exclusive with the ttlSecondsAfterEmpty parameter.
  consolidation:
    enabled: true

  # If omitted, the feature is disabled and nodes will never expire. If set to less time than it requires for a node
  # to become ready, the node may expire before any pods successfully start.
  ttlSecondsUntilExpired: 2592000 # 30 Days = 60 * 60 * 24 * 30 Seconds

---
apiVersion: karpenter.k8s.aws/v1alpha1
kind: AWSNodeTemplate
metadata:
  name: default
spec:
  securityGroupSelector:
    karpenter.sh/discovery: "{{ eks_cluster_name }}"
  subnetSelector:
    karpenter.sh/discovery: "{{ eks_cluster_name }}"

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
kant777 added the kind/bug label Apr 1, 2023
jonathan-innis added the burning (Time sensitive issues) label Apr 3, 2023
@jonathan-innis (Member)

Can you share your StorageClass in this issue as well?

jonathan-innis (Member) commented Apr 3, 2023

Digging into this a little further, it seems like we are hitting an issue where we expect the StorageClass .spec.provisioner to match a driver name listed in the CSINode .spec.drivers (so that we can read its allocatable count), but that isn't what's happening here, because CSI driver migration automatically translates the in-tree driver name kubernetes.io/aws-ebs to the out-of-tree driver name ebs.csi.aws.com.

As a workaround, could you try updating your StorageClass to use the ebs.csi.aws.com driver name as its provisioner and see if that fixes the issue where we aren't observing the limits?
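For reference, a minimal sketch of such a StorageClass (the class name and parameters here are illustrative assumptions; the relevant part is the provisioner field):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ebs-csi-gp3              # illustrative name
provisioner: ebs.csi.aws.com     # out-of-tree EBS CSI driver, instead of the in-tree kubernetes.io/aws-ebs
parameters:
  type: gp3                      # assumed volume type
volumeBindingMode: WaitForFirstConsumer
reclaimPolicy: Delete

Note that most StorageClass fields, including provisioner, can't be changed in place, so in practice this usually means creating a new StorageClass and pointing new PVCs at it.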

jonathan-innis (Member) commented Apr 3, 2023

This issue is compounded by the fact that these limits were sometimes not respected by the kube-scheduler in some earlier versions of Kubernetes.

r3kzi commented Apr 4, 2023

@jonathan-innis do you have a pointer to the code where you match the StorageClass .spec.provisioner against the CSINode drivers and their allocatable counts?

I'm trying to understand Karpenter a little better and it would help me a lot!

jonathan-innis (Member) commented Apr 4, 2023

@r3kzi Sure, we do the volume limit mapping here, then find the volume usage for the pods on a node here, and we detect the driver that a pod's volume is using here.

For context, we are planning to log errors when users are using in-tree drivers so that there is an alerting mechanism to tell users when Karpenter can't detect limits for auto-scaling. #267 adds this error logging.

We'll document this in our upstream docs, but our stance is that the in-tree providers have largely been deprecated by upstream Kubernetes and users should migrate to the new CSI drivers, particularly because many of these in-tree drivers are going to be removed from upstream Kubernetes in upcoming releases.

jonathan-innis added the kind/support label and removed the kind/bug label Apr 4, 2023
@github-actions (bot)

Labeled for closure due to inactivity in 10 days.

github-actions bot added the lifecycle/stale label Apr 25, 2023
@jonathan-innis (Member)

Closing this one since we made some fixes for issues related to this in #267 and #293.

jmdeal (Member) commented Feb 22, 2024

If anyone is still encountering this issue on recent versions of Karpenter, make sure you've added the following startup taint to your NodePools, as recommended by the EBS CSI driver:

startupTaints:
  - key: ebs.csi.aws.com/agent-not-ready
    effect: NoExecute

This taint will be removed by the driver once it's ready, and it prevents a race condition where the kube-scheduler is able to schedule pods onto a node before its volume limits can be discovered.
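For context, here is a minimal sketch of where that block sits in a NodePool spec (assuming the karpenter.sh/v1beta1 schema; the name, nodeClassRef, and requirements are illustrative):

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default                     # illustrative name
spec:
  template:
    spec:
      nodeClassRef:
        name: default               # illustrative EC2NodeClass reference
      # Applied at node creation and removed by the EBS CSI driver once it has
      # registered on the node and reported its attachable volume limit.
      startupTaints:
        - key: ebs.csi.aws.com/agent-not-ready
          effect: NoExecute
      requirements:
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"]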
