
Karpenter launches more EC2 instances than required #1291

Closed
kotlovs opened this issue Feb 8, 2022 · 5 comments
Assignees
Labels
feature New feature or request

Comments

@kotlovs

kotlovs commented Feb 8, 2022

Version

Karpenter: v0.5.6

Kubernetes: v1.19

Expected Behavior

Karpenter launches exactly as many EC2 instances as required by the pending pods.

Actual Behavior

I used Karpenter to run Spark applications on AWS EKS. I ran 3 Spark applications simultaneously, each of them requiring 14 executor pods (42 executor pods in total).
As a result, 5 * 4xlarge instances and 11 * 8xlarge instances were launched, which is significantly more than these pods required.
The total capacity of the instances was 5 * 2 + 11 * 4 = 54 pods, but the apps together required only 42 executor pods (about 28% overprovisioning).
As a result, some of the instances were only half filled with pods.

Resource Specs and Logs

The provisioner was created specifically for executors of Spark applications:

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: karpenter-team-common-1b-spark-executors
spec:
  ttlSecondsAfterEmpty: 30
  taints:
    - key: purpose
      value: karpenter-team-common-1b-spark-executors
      effect: NoSchedule
  labels:
    purpose: karpenter-team-common-1b-spark-executors
  requirements:
    - key: "node.kubernetes.io/instance-type"
      operator: In
      values: ["r5n.8xlarge", "r5.8xlarge", "r5a.8xlarge", "r5n.4xlarge", "r5.4xlarge", "r5a.4xlarge"]
    - key: "topology.kubernetes.io/zone"
      operator: In
      values: ["eu-central-1b"]
    - key: "karpenter.sh/capacity-type" # If not included, the webhook for the AWS cloud provider will default to on-demand
      operator: In
      values: ["spot"]
  provider:
    securityGroupSelector:
      Name: eks-team-platform-node
    subnetSelector:
      kubernetes.io/cluster/streaming-kube: '*'
    launchTemplate: KarpenterCustomLaunchTemplate

All Spark applications have identical settings for their executor pods, such that a 4xlarge instance can hold 2 executor pods and an 8xlarge can hold 4.

apiVersion: v1
kind: Pod
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: purpose
            operator: In
            values:
            - karpenter-team-common-1b-spark-executors
  containers:
  - resources:
      limits:
        memory: 58Gi
      requests:
        cpu: 7500m
        memory: 58Gi
  .....
  tolerations:
  - effect: NoSchedule
    key: purpose
    operator: Equal
    value: karpenter-team-common-1b-spark-executors
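As a rough sanity check of the packing and overprovisioning numbers above (assuming standard r5-family sizes: a 4xlarge has 16 vCPU / 128 GiB and an 8xlarge has 32 vCPU / 256 GiB, and ignoring kube/system reserved overhead):

```python
# Rough capacity check for the numbers reported above. Instance sizes are
# standard r5-family specs; reserved overhead on each node is ignored.
pod_cpu, pod_mem = 7.5, 58          # executor requests: 7500m CPU, 58Gi memory

def pods_per_node(cpu, mem):
    # How many executors fit by CPU and by memory; the tighter limit wins.
    return min(int(cpu // pod_cpu), int(mem // pod_mem))

per_4xl = pods_per_node(16, 128)    # -> 2
per_8xl = pods_per_node(32, 256)    # -> 4

capacity = 5 * per_4xl + 11 * per_8xl   # 5 x 4xlarge + 11 x 8xlarge
needed = 42
print(capacity, f"{(capacity - needed) / needed:.0%}")  # 54, 29%
```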

EC2 instances are created using a custom launch template to increase the disk size:

AWSTemplateFormatVersion: '2010-09-09'
Resources:
  MyLaunchTemplate:
    Type: AWS::EC2::LaunchTemplate
    Properties:
      LaunchTemplateData:
        IamInstanceProfile:
          Name: KarpenterNodeInstanceProfile-streaming-kube
        ImageId: ami-04ea0b353c9bfb834
        UserData: !Base64 >
            #!/bin/bash -xe
             exec > >(tee /var/log/user-data.log|logger -t user-data -s 2>/dev/console) 2>&1
             /etc/eks/bootstrap.sh 'streaming-kube' \
                --apiserver-endpoint 'https://...' \
                --b64-cluster-ca '...Qo=' \
                --kubelet-extra-args '--node-labels=karpenter.sh/capacity-type=spot,karpenter.sh/provisioner-name=karpenter-team-common-1b-spark-executors,logging=filebeat-spark-kafka,purpose=karpenter-team-common-1b-spark-executors --register-with-taints=purpose=karpenter-team-common-1b-spark-executors:NoSchedule'
        BlockDeviceMappings:
          - Ebs:
              VolumeSize: 300
              VolumeType: gp3
              DeleteOnTermination: true
            DeviceName: /dev/xvda
        SecurityGroupIds:
          - sg-03819708b39479710
        MetadataOptions:
          HttpEndpoint: enabled
          HttpProtocolIpv6: disabled
          HttpPutResponseHopLimit: "2"
          HttpTokens: required
      LaunchTemplateName: KarpenterCustomLaunchTemplate

Another, similar case:

The Provisioner instance types were defined as ["r5dn.2xlarge", "r5.2xlarge", "r5.4xlarge", "r5.8xlarge"].
The pods additionally had in their spec:

affinity:
      nodeAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 1
          preference:
            matchExpressions:
            - key: "node.kubernetes.io/instance-type"
              operator: In
              values: ["r5dn.2xlarge"]

One Spark app was started, requiring 10 executors. Karpenter launched 10 EC2 instances (r5dn.2xlarge) for them.
Then I killed one of the executor pods; a replacement executor was created on the same instance.
But at the same time, Karpenter launched a new EC2 instance, which was left unused and was deleted after ttlSecondsAfterEmpty.
Expected behavior: Karpenter should not create unused instances.

@kotlovs kotlovs added the bug Something isn't working label Feb 8, 2022
@felix-zhe-huang
Contributor

Thank you very much for bringing up this problem. Can you also share the Karpenter logs? Meanwhile, I will try to recreate it.
We recently updated the scheduling and requirement logic. You could also try the latest v0.6.1; it may solve your problem.

@felix-zhe-huang felix-zhe-huang added the burning Time sensitive issues label Feb 8, 2022
@felix-zhe-huang felix-zhe-huang self-assigned this Feb 8, 2022
@felix-zhe-huang
Contributor

felix-zhe-huang commented Feb 8, 2022

I tried your configurations and the launch template. In my environment, Karpenter launches 11 * 8xlarge instances. I am wondering whether the 5 * 4xlarge and 11 * 8xlarge instances were launched at the same time.

I am just thinking out loud here. There is a possibility that some constraints are preventing pods from being scheduled by the k8s scheduler even when there are nodes with enough spare resources. That would make the pods unschedulable and trigger Karpenter to launch new nodes. This also seems to be what happened in your second case, for a brief moment.

Do you run any DaemonSets in your cluster?

@kotlovs
Author

kotlovs commented Feb 9, 2022

Thanks for looking into this problem! It seems you are right.
I just tried another, simpler example on version v0.6.1: I started 1 Spark application with 10 pods, and only 4xlarge instances were allowed in the Provisioner. Karpenter launched one more 4xlarge instance than needed (2 pods got individual 4xlarge instances).

It looks like the reason is this:
Spark creates executor pods not quite simultaneously, but with some time lag (first 1 executor, then several more).
In this example, Spark created the first executor, and Karpenter launched an instance for it and bound the pod to it.

2022-02-08T21:33:27.050Z        INFO    controller.provisioning Batched 1 pods in 1.000908884s  {"commit": "df57892", "provisioner": "karpenter-team-common-1b-spark-executors"}
2022-02-08T21:33:27.169Z        DEBUG   controller.provisioning Discovered subnets: [subnet-0b50cd90b4406b78b (eu-central-1a) subnet-051e4c5a7359cd08d (eu-central-1b) subnet-082db90dc89a73a03 (eu-central-1b) subnet-0486c00572613d945 (eu-central-1a) subnet-0146c945ea5a8656e (eu-central-1c) subnet-02abb75561da709a6 (eu-central-1c)]  {"commit": "df57892", "provisioner": "karpenter-team-common-1b-spark-executors"}
2022-02-08T21:33:27.173Z        INFO    controller.provisioning Computed packing of 1 node(s) for 1 pod(s) with instance type option(s) [r5n.4xlarge r5a.4xlarge r5.4xlarge]    {"commit": "df57892", "provisioner": "karpenter-team-common-1b-spark-executors"}
2022-02-08T21:33:29.133Z        INFO    controller.provisioning Launched instance: i-08636c82766872fa6, hostname: ip-10-20-107-101.eu-central-1.compute.internal, type: r5a.4xlarge, zone: eu-central-1b, capacityType: spot {"commit": "df57892", "provisioner": "karpenter-team-common-1b-spark-executors"}
2022-02-08T21:33:29.152Z        INFO    controller.provisioning Bound 1 pod(s) to node ip-10-20-107-101.eu-central-1.compute.internal   {"commit": "df57892", "provisioner": "karpenter-team-common-1b-spark-executors"}

Then, after a short time, Spark created the second executor. It could not be assigned to instance #1 because that instance was not yet ready. In the events of the second pod, I saw:

default-scheduler  0/92 nodes are available: 1 node(s) had taint {karpenter.sh/not-ready: }, that the pod didn't tolerate, ...

And Karpenter launched a new instance for it:

2022-02-08T21:34:11.120Z        INFO    controller.provisioning Batched 1 pods in 1.000859529s  {"commit": "df57892", "provisioner": "karpenter-team-common-1b-spark-executors"}
2022-02-08T21:34:11.124Z        INFO    controller.provisioning Computed packing of 1 node(s) for 1 pod(s) with instance type option(s) [r5n.4xlarge r5a.4xlarge r5.4xlarge]    {"commit": "df57892", "provisioner": "karpenter-team-common-1b-spark-executors"}
2022-02-08T21:34:13.247Z        INFO    controller.provisioning Launched instance: i-0fcc6bd910d4872c4, hostname: ip-10-20-110-241.eu-central-1.compute.internal, type: r5a.4xlarge, zone: eu-central-1b, capacityType: spot {"commit": "df57892", "provisioner": "karpenter-team-common-1b-spark-executors"}
2022-02-08T21:34:13.261Z        INFO    controller.provisioning Bound 1 pod(s) to node ip-10-20-110-241.eu-central-1.compute.internal   {"commit": "df57892", "provisioner": "karpenter-team-common-1b-spark-executors"}

Next, Spark created the remaining pods almost simultaneously. They also could not be assigned to the first 2 instances due to their 'not-ready' status, but this time Karpenter processed them in a batch and did not create redundant instances.

2022-02-08T21:34:14.448Z        INFO    controller.provisioning Computed packing of 4 node(s) for 8 pod(s) with instance type option(s) [r5n.4xlarge r5a.4xlarge r5.4xlarge]    {"commit": "df57892", "provisioner": "karpenter-team-common-1b-spark-executors"}
2022-02-08T21:34:16.567Z        INFO    controller.provisioning Launched instance: i-0b1a572a0a7f6f50e, hostname: ip-10-20-121-40.eu-central-1.compute.internal, type: r5a.4xlarge, zone: eu-central-1b, capacityType: spot  {"commit": "df57892", "provisioner": "karpenter-team-common-1b-spark-executors"}
2022-02-08T21:34:16.567Z        INFO    controller.provisioning Launched instance: i-0037ca062b0370680, hostname: ip-10-20-113-138.eu-central-1.compute.internal, type: r5a.4xlarge, zone: eu-central-1b, capacityType: spot {"commit": "df57892", "provisioner": "karpenter-team-common-1b-spark-executors"}
2022-02-08T21:34:16.567Z        INFO    controller.provisioning Launched instance: i-0de3ea592bb351b2e, hostname: ip-10-20-99-208.eu-central-1.compute.internal, type: r5a.4xlarge, zone: eu-central-1b, capacityType: spot  {"commit": "df57892", "provisioner": "karpenter-team-common-1b-spark-executors"}
2022-02-08T21:34:16.567Z        INFO    controller.provisioning Launched instance: i-032920f0cb6b132e1, hostname: ip-10-20-118-14.eu-central-1.compute.internal, type: r5a.4xlarge, zone: eu-central-1b, capacityType: spot  {"commit": "df57892", "provisioner": "karpenter-team-common-1b-spark-executors"}
2022-02-08T21:34:16.583Z        INFO    controller.provisioning Bound 2 pod(s) to node ip-10-20-121-40.eu-central-1.compute.internal    {"commit": "df57892", "provisioner": "karpenter-team-common-1b-spark-executors"}
2022-02-08T21:34:16.611Z        INFO    controller.provisioning Bound 2 pod(s) to node ip-10-20-113-138.eu-central-1.compute.internal   {"commit": "df57892", "provisioner": "karpenter-team-common-1b-spark-executors"}
2022-02-08T21:34:16.631Z        INFO    controller.provisioning Bound 2 pod(s) to node ip-10-20-99-208.eu-central-1.compute.internal    {"commit": "df57892", "provisioner": "karpenter-team-common-1b-spark-executors"}
2022-02-08T21:34:16.653Z        INFO    controller.provisioning Bound 2 pod(s) to node ip-10-20-118-14.eu-central-1.compute.internal    {"commit": "df57892", "provisioner": "karpenter-team-common-1b-spark-executors"}
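The effect of pod arrival timing on the node count can be illustrated with a toy sketch. The one-second window comes from the "Batched ... in 1.000...s" log lines; everything else here is a simplified assumption, not Karpenter's actual code:

```python
# Toy illustration: pods that arrive within one batching window are packed
# together onto shared nodes; a pod that arrives alone gets its own node.
import math

WINDOW = 1.0        # seconds; batching window (approximated from the logs)
PODS_PER_NODE = 2   # executors that fit on one 4xlarge (from the report)

def nodes_launched(arrival_times):
    """Group pod arrivals into batches <= WINDOW apart, then pack each batch."""
    batches, current = [], [arrival_times[0]]
    for t in arrival_times[1:]:
        if t - current[-1] <= WINDOW:
            current.append(t)
        else:
            batches.append(current)
            current = [t]
    batches.append(current)
    return sum(math.ceil(len(b) / PODS_PER_NODE) for b in batches)

# 1 executor at t=0, 1 more ~44s later, then 8 together (as in the logs):
print(nodes_launched([0.0, 44.0] + [46.0] * 8))   # 6 nodes for 10 pods
# If all 10 had arrived inside one window, 5 nodes would have sufficed:
print(nodes_launched([0.0] * 10))                  # 5
```

This reproduces the observed outcome: two lone arrivals each triggered a dedicated instance, while the batched eight were packed two per node.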

I was able to remove this overprovisioning by adding to the pod spec:

tolerations:
    - effect: NoSchedule
      key: karpenter.sh/not-ready
      operator: Exists
    - effect: NoSchedule
      key: node.kubernetes.io/unreachable
      operator: Exists

But this doesn't seem to be the best solution.
Could Karpenter take into account the free capacity of previously launched instances and, if possible, bind new pods to them?
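The requested behavior, checking in-flight capacity before launching a new node, could look roughly like this first-fit sketch (purely illustrative; the node/pod shapes and function names are invented here, not Karpenter's API):

```python
# Illustrative first-fit: before launching a node for a pending pod, try to
# place it on an already-launched (possibly still not-ready) node with spare
# capacity. All names and structures here are made up for the sketch.
from dataclasses import dataclass

@dataclass
class Node:
    cpu_free: float
    mem_free: float

def place_or_launch(pods, nodes, new_node_cpu=16.0, new_node_mem=128.0):
    """Return the number of new nodes launched for `pods` given existing `nodes`."""
    launched = 0
    for cpu, mem in pods:
        target = next((n for n in nodes
                       if n.cpu_free >= cpu and n.mem_free >= mem), None)
        if target is None:
            target = Node(new_node_cpu, new_node_mem)
            nodes.append(target)
            launched += 1
        target.cpu_free -= cpu
        target.mem_free -= mem
    return launched

# One executor is already bound to a half-full 4xlarge that is still booting.
# A second executor arrives; it fits on the in-flight node, so nothing launches:
existing = [Node(16.0 - 7.5, 128.0 - 58.0)]
print(place_or_launch([(7.5, 58.0)], existing))   # 0
```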

Do you run any DaemonSets in your cluster?

Yes, there are a couple of DaemonSets, but they require few resources and, in theory, should not prevent the executor pods from launching.

@ellistarn
Contributor

After discussion, I think this is a duplicate of #1044.

@felix-zhe-huang felix-zhe-huang added feature New feature or request and removed bug Something isn't working burning Time sensitive issues labels Feb 9, 2022
@felix-zhe-huang
Contributor

Can Karpenter take into account the availability of free space in previously launched instances and, if possible, bind new pods to them?

Yes, we are tracking this feature request in #1044.

Closing this since it is a duplicate. Please reopen if it is not.
