
Karpenter cannot fit workload on instance type where it should fit #1306

Closed
nonoswz opened this issue Feb 9, 2022 · 7 comments
Labels
bug Something isn't working, lifecycle/stale
nonoswz commented Feb 9, 2022

Version

Karpenter: v0.6.1

Kubernetes: v1.20+

Expected Behavior

I expect Karpenter to be able to schedule a deployment on an instance type where the workload (resources) fits.

Actual Behavior

I am trying to switch from ASG-managed nodes to Karpenter. Currently it fails to fit one of our deployments (prometheus) on the same instance type it previously ran on in one of the ASG nodes (r5.12xlarge).

Our Prometheus deployment requests around 350GiB of memory and 40 CPUs, while an r5.12xlarge has 48 vCPUs and 384 GiB of memory per the AWS docs.
Extract of the prometheus pod spec:

    resources:
      limits:
        memory: 350Gi
      requests:
        cpu: "40"
        memory: 350Gi

Karpenter fails to run it on this specific instance type saying it won't fit.

2022-02-09T20:11:34.749Z	ERROR	controller.provisioning	Failed to compute packing, pod(s) [monitoring/prometheus-infrastructure-0] did not fit in instance type option(s) [r5.12xlarge]	{"commit": "df57892", "provisioner": "prometheus"}

Notes:

  • We also have multiple daemonsets running on each node (shown below in the node description), in case they factor into the calculation of whether the workload can fit.
  • I tried a bigger instance type (r5.16xlarge) and it works fine. We can use bigger nodes as a workaround, but ultimately it would be good to get the smaller instance type working.

Steps to Reproduce the Problem

  1. Create a deployment requesting 40 vCPU and 350Gi of memory.
  2. Create a provisioner with only one instance type: r5.12xlarge.
  3. Scale the deployment up to 1 replica and see whether Karpenter can fit it on an r5.12xlarge instance.

Resource Specs and Logs

Pod spec (prometheus, I included relevant part only)

spec:
  containers:
   # container 1
    ....
    resources:
      limits:
        memory: 350Gi
      requests:
        cpu: "40"
        memory: 350Gi
   # container 2
    .....
    resources:
      limits:
        cpu: 100m
        memory: 25Mi
      requests:
        cpu: 100m
        memory: 25Mi
   # container 3
   ......
    resources:
      limits:
        cpu: 100m
        memory: 25Mi
      requests:
        cpu: 100m
        memory: 25Mi
  nodeSelector:
    group: prometheus
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Always
  schedulerName: default-scheduler
  terminationGracePeriodSeconds: 600
  tolerations:
  - effect: NoSchedule
    key: dedicated
    operator: Equal
    value: prometheus
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300

ASG managed node running prometheus, showing prometheus is able to fit on r5.12xlarge

Name:              ********
Roles:              <none>
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=r5.12xlarge
                    beta.kubernetes.io/os=linux
                    failure-domain.beta.kubernetes.io/region=us-east-1
                    failure-domain.beta.kubernetes.io/zone=us-east-1b
                    group=prometheus
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=******
                    kubernetes.io/os=linux
                    node.kubernetes.io/instance-type=r5.12xlarge
                    topology.ebs.csi.aws.com/zone=us-east-1b
                    topology.kubernetes.io/region=us-east-1
                    topology.kubernetes.io/zone=us-east-1b
Annotations:        
                    csi.volume.kubernetes.io/nodeid: {"ebs.csi.aws.com":"*********"}
                    node.alpha.kubernetes.io/ttl: 0
                    projectcalico.org/IPv4Address: **********
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Wed, 09 Feb 2022 15:11:45 -0500
Taints:             dedicated=prometheus:NoSchedule
Unschedulable:      false
Capacity:
  attachable-volumes-aws-ebs:  25
  cpu:                         48
  ephemeral-storage:           20959212Ki
  hugepages-1Gi:               0
  hugepages-2Mi:               0
  memory:                      391804104Ki
  pods:                        234
Allocatable:
  attachable-volumes-aws-ebs:  25
  cpu:                         47750m
  ephemeral-storage:           18241637770
  hugepages-1Gi:               0
  hugepages-2Mi:               0
  memory:                      400241034854400m
  pods:                        234
System Info:
  OS Image:                   Amazon Linux 2
  Operating System:           linux
  Architecture:               amd64
Non-terminated Pods:          (11 in total)
  Namespace     Name                         CPU Requests  CPU Limits  Memory Requests  Memory Limits   AGE
  ---------     ----                         ------------  ----------  ---------------  -------------   ---
  fluentd       ***********                  400m (0%)     3 (6%)      6128Mi (1%)      7024Mi (1%)     2m28s
  kube-system   aws-node-kr6qd               10m (0%)      0 (0%)      0 (0%)           0 (0%)          2m28s
  kube-system   calico-node-pfczj            20m (0%)      0 (0%)      32Mi (0%)        0 (0%)          2m28s
  kube-system   ebs-csi-node-vxjd2           0 (0%)        0 (0%)      0 (0%)           0 (0%)          2m28s
  kube-system   kube-proxy-48trm             100m (0%)     0 (0%)      0 (0%)           0 (0%)          2m28s
  kube-system   node-local-dns-f2jdl         100m (0%)     1 (2%)      100Mi (0%)       1Gi (0%)        2m28s
  logging       ***********                  1300m (2%)    2 (4%)      4224Mi (1%)      5Gi (1%)        2m28s
  monitoring    ***********                  600m (1%)     1 (2%)      800Mi (0%)       0 (0%)          2m28s
  monitoring    ***********                  200m (0%)     200m (0%)   768Mi (0%)       768Mi (0%)      2m28s
  monitoring    ***********                  110m (0%)     220m (0%)   50Mi (0%)        90Mi (0%)       2m28s
  monitoring    prometheus-infrastructure-0  40200m (84%)  200m (0%)   358450Mi (93%)   358450Mi (93%)  4m34s
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                    Requests        Limits
  --------                    --------        ------
  cpu                         43040m (90%)    7620m (15%)
  memory                      370552Mi (97%)  372476Mi (97%)
  ephemeral-storage           0 (0%)          0 (0%)
  hugepages-1Gi               0 (0%)          0 (0%)
  hugepages-2Mi               0 (0%)          0 (0%)
  attachable-volumes-aws-ebs  0               0

Karpenter logs

2022-02-09T20:11:28.741Z	INFO	controller.provisioning	Waiting for unschedulable pods	{"commit": "df57892", "provisioner": "prometheus"}
2022-02-09T20:11:34.743Z	INFO	controller.provisioning	Batched 1 pods in 1.000548092s	{"commit": "df57892", "provisioner": "prometheus"}
2022-02-09T20:11:34.749Z	ERROR	controller.provisioning	Failed to compute packing, pod(s) [monitoring/prometheus-infrastructure-0] did not fit in instance type option(s) [r5.12xlarge]	{"commit": "df57892", "provisioner": "prometheus"}

Prometheus provisioner spec

spec:
  kubeletConfiguration: {}
  labels:
    group: prometheus
  limits: {}
  provider:
    apiVersion: extensions.karpenter.sh/v1alpha1
    instanceProfile: ***********
    kind: AWS
    launchTemplate: ***********
    securityGroupSelector:
      kubernetes.io/cluster/***********: '*'
    subnetSelector:
      Name: '*private*'
  requirements:
  - key: karpenter.sh/capacity-type
    operator: In
    values:
    - on-demand
  - key: node.kubernetes.io/instance-type
    operator: In
    values:
    - r5.12xlarge
  - key: topology.kubernetes.io/zone
    operator: In
    values:
    - us-east-1a
    - us-east-1b
    - us-east-1c
  - key: kubernetes.io/arch
    operator: In
    values:
    - amd64
  taints:
  - effect: NoSchedule
    key: dedicated
    value: prometheus
  ttlSecondsAfterEmpty: 30
@nonoswz nonoswz added the bug Something isn't working label Feb 9, 2022
@felix-zhe-huang felix-zhe-huang self-assigned this Feb 9, 2022
@bwagner5 bwagner5 assigned bwagner5 and unassigned felix-zhe-huang Feb 9, 2022

bwagner5 commented Feb 9, 2022

I printed out our node overhead calculation (the amount of space reserved for system-level kube resources):

[r5.12xlarge - Total]
cpu = 48,000m
memory = 393,216Mi

[r5.12xlarge - Allocatable]
cpu = 47,750m
memory = 381,700Mi

[r5.12xlarge - Overhead]
cpu = 290m 
memory = 3,029Mi 

[r5.12xlarge - Other things on the node you pasted]
cpu = 2,840
memory = 12,102Mi

[r5.12xlarge - Available for the Pod]
cpu = 44,620m
memory = 366,569Mi

[Needed for Prometheus Pod]
cpu = 40,200m
memory = 358,450Mi

So it does look like the pod should fit. I'll have to dig into this some more.

** My calcs were slightly off, but corrected the math and it still seems it should fit.
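The arithmetic above can be re-derived with a quick sketch. This is just the math from this comment, not Karpenter's actual packing code; all figures come from the numbers posted above.

```python
# Re-derive the numbers above: available = allocatable - overhead - other pods.
# Figures are taken directly from this comment; this is not Karpenter's code.

allocatable = {"cpu_m": 47_750, "mem_Mi": 381_700}   # node allocatable
overhead    = {"cpu_m": 290,    "mem_Mi": 3_029}     # system/kube reservation
other_pods  = {"cpu_m": 2_840,  "mem_Mi": 12_102}    # daemonsets etc. on the node

pod = {"cpu_m": 40_200, "mem_Mi": 358_450}           # prometheus-infrastructure-0

available = {r: allocatable[r] - overhead[r] - other_pods[r] for r in allocatable}
print(available)  # {'cpu_m': 44620, 'mem_Mi': 366569}

fits = all(pod[r] <= available[r] for r in available)
print(fits)  # True -> the pod should fit
```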

@ellistarn ellistarn added burning Time sensitive issues bug Something isn't working and removed bug Something isn't working labels Feb 10, 2022
@bwagner5

Oh, we also dedicate a certain portion of memory to the VM resource (7.5% of the total memory of the instance).

With that in mind,

[r5.12xlarge - Total]
cpu = 48,000m
memory = 393,216Mi

We subtract 7.5% of the memory, bringing it down to:

memory = 393,216 * (1 - 0.075) = 363,724Mi

and then we subtract the CPU and memory overhead, which mainly accounts for the resources the kubelet needs based on the number of vCPUs and ENIs the instance has (https://github.com/bottlerocket-os/bottlerocket#kubernetes-settings).

Since the full request of your pods is:
cpu = 43,040m
memory = 370,552Mi

the memory request overflows our bin-packing by 6,828Mi.

The 7.5% memory overhead may be a little aggressive on our part. We want to be a little cautious so that pods don't get OOM Killed.
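The effect of that reservation on this node can be sketched as follows. The figures are the ones posted above; the 7.5% factor models the behavior described in this comment, not Karpenter's source.

```python
# Apply the 7.5% VM memory reservation described above to an r5.12xlarge.

total_mem_mi = 393_216                            # instance total memory
usable_mem_mi = int(total_mem_mi * (1 - 0.075))   # after the 7.5% reservation
print(usable_mem_mi)  # 363724

requested_mem_mi = 370_552                        # sum of all pod memory requests
overflow_mi = requested_mem_mi - usable_mem_mi
print(overflow_mi)  # 6828 -> the pods no longer fit
```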

@bwagner5

I'm going to try to remove this 7.5% memory overhead tomorrow and replace it with a static memory overhead. I believe it was only there to protect packing onto really small instance types (micros and nanos).
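To illustrate why a flat reservation scales better on large instances than a percentage one: the 1 GiB static value below is a hypothetical placeholder, not a number from this thread.

```python
# Compare the percentage-based memory reservation with a hypothetical flat one.

PCT = 0.075          # current behavior described in this thread
STATIC_MI = 1_024    # hypothetical flat reservation (assumed value)

for name, total_mi in [("t3.nano", 512), ("r5.12xlarge", 393_216)]:
    pct_mi = total_mi * PCT
    print(f"{name}: percentage reserves {pct_mi:.0f}Mi, static reserves {STATIC_MI}Mi")
```

On a 384 GiB instance the 7.5% rule reserves roughly 29 GiB, while a flat value would cost the same small amount regardless of instance size.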


nonoswz commented Feb 11, 2022

Thanks for the information and the fixes. I believe the information about the memory overhead, and how Karpenter computes whether a workload can be placed on a node, would be worth having in the Karpenter docs.

@bwagner5

Thanks for the information and the fixes. I believe the information about the memory overhead, and how Karpenter computes whether a workload can be placed on a node, would be worth having in the Karpenter docs.

That's an excellent idea. I'll open a docs issue to add that content.


github-actions bot commented May 9, 2022

Labeled for closure due to inactivity in 10 days.


dewjam commented May 10, 2022

Closing this out in favor of #1329 .

6 participants