
Provisioner is creating some arm64 nodes even with kubernetes.io/arch in ["amd64"] requirement #1540

Closed
armujahid opened this issue Mar 18, 2022 · 6 comments · Fixed by #1543
Labels: bug (Something isn't working), burning (Time sensitive issues)

armujahid (Contributor) commented Mar 18, 2022

Version

Karpenter: v0.7.1

Kubernetes: v1.21.5
Version: version.Info{Major:"1", Minor:"21+", GitVersion:"v1.21.5-eks-bc4871b", GitCommit:"5236faf39f1b7a7dabea8df12726f25608131aa9", GitTreeState:"clean", BuildDate:"2021-10-29T23:32:16Z", GoVersion:"go1.16.8", Compiler:"gc", Platform:"linux/amd64"}

Expected Behavior

Karpenter should only provision amd64 nodes, since kubernetes.io/arch is restricted to ["amd64"] in the provisioner requirements.

Actual Behavior

Karpenter is adding some arm64 nodes, which causes "standard_init_linux.go:228: exec user process caused: exec format error" for some pods because of architecture incompatibility.
Also, kubectl get nodes -l kubernetes.io/arch=arm64 returns some nodes created by Karpenter.

Steps to Reproduce the Problem

  1. Create an EKS cluster using https://karpenter.sh/v0.7.1/getting-started/getting-started-with-eksctl/
  2. Apply this provisioner:
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  requirements:
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["spot"]
    - key: kubernetes.io/arch
      operator: In
      values: ["amd64"]
  limits:
    resources:
      cpu: 1000
  provider:
    subnetSelector:
      karpenter.sh/discovery: ${CLUSTER_NAME}
    securityGroupSelector:
      # karpenter.sh/discovery returns multiple security groups which cause conflict with ingress controller
      # karpenter.sh/discovery: ${CLUSTER_NAME}
      kubernetes.io/cluster/${CLUSTER_NAME}: owned
  ttlSecondsAfterEmpty: 30
  3. Run some pods to force Karpenter to add nodes (one way is sketched below).
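For example, a minimal deployment along the lines of the getting-started guide; the deployment name, pause image, and CPU request here are just one way to create pending pods that trigger a scale-up:

kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inflate
spec:
  replicas: 0
  selector:
    matchLabels:
      app: inflate
  template:
    metadata:
      labels:
        app: inflate
    spec:
      containers:
        - name: inflate
          image: public.ecr.aws/eks-distro/kubernetes/pause:3.2
          resources:
            requests:
              cpu: 1
EOF

# Scale up so the pending pods force Karpenter to launch a node.
kubectl scale deployment inflate --replicas 5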

Resource Specs and Logs

The provisioner spec is provided above.

Logs:

2022-03-19T13:44:18.523Z	INFO	controller.provisioning	Batched 1 pods in 1.000971953s	{"commit": "4b14787", "provisioner": "default"}
2022-03-19T13:44:18.770Z	DEBUG	controller.provisioning	Discovered subnets: [subnet-019e68fa57e9a2ca3 (ap-southeast-1b) subnet-0301cc1a0a9fde185 (ap-southeast-1b) subnet-058ed4ba496f28c4e (ap-southeast-1c) subnet-0733e641a57b135b2 (ap-southeast-1c) subnet-0fc878db1edb5d8e9 (ap-southeast-1a) subnet-061b9463265420a6f (ap-southeast-1a)]	{"commit": "4b14787", "provisioner": "default"}
2022-03-19T13:44:18.776Z	DEBUG	controller.provisioning	Excluding instance type t3.nano because there are not enough resources for kubelet and system overhead	{"commit": "4b14787", "provisioner": "default"}
2022-03-19T13:44:18.782Z	DEBUG	controller.provisioning	Excluding instance type t4g.nano because there are not enough resources for kubelet and system overhead	{"commit": "4b14787", "provisioner": "default"}
2022-03-19T13:44:18.794Z	DEBUG	controller.provisioning	Excluding instance type t3a.nano because there are not enough resources for kubelet and system overhead	{"commit": "4b14787", "provisioner": "default"}
2022-03-19T13:44:18.822Z	INFO	controller.provisioning	Computed packing of 1 node(s) for 1 pod(s) with instance type option(s) [m1.small c6gd.medium c6gn.medium c6g.medium a1.medium m1.medium m3.medium m6gd.medium m6g.medium r6g.medium r6gd.medium c1.medium t3a.small t3.small t4g.small c4.large c3.large]	{"commit": "4b14787", "provisioner": "default"}
2022-03-19T13:44:18.911Z	DEBUG	controller.provisioning	Discovered security groups: [sg-0bfdbf9b25fb3c6a6]	{"commit": "4b14787", "provisioner": "default"}
2022-03-19T13:44:18.914Z	DEBUG	controller.provisioning	Discovered kubernetes version 1.21	{"commit": "4b14787", "provisioner": "default"}
2022-03-19T13:44:18.959Z	DEBUG	controller.provisioning	Discovered ami-0a620d8210b5d94ac for query /aws/service/eks/optimized-ami/1.21/amazon-linux-2/recommended/image_id	{"commit": "4b14787", "provisioner": "default"}
2022-03-19T13:44:19.001Z	DEBUG	controller.provisioning	Discovered ami-013ca3e2e14b35693 for query /aws/service/eks/optimized-ami/1.21/amazon-linux-2-arm64/recommended/image_id	{"commit": "4b14787", "provisioner": "default"}
2022-03-19T13:44:19.046Z	DEBUG	controller.provisioning	Discovered launch template Karpenter-arm-karpenter-demo-13741680538296305221	{"commit": "4b14787", "provisioner": "default"}
2022-03-19T13:44:19.189Z	DEBUG	controller.provisioning	Created launch template, Karpenter-arm-karpenter-demo-3170135309939700436	{"commit": "4b14787", "provisioner": "default"}
2022-03-19T13:44:21.415Z	INFO	controller.provisioning	Launched instance: i-03f829b84b0ab2518, hostname: ip-192-168-173-182.ap-southeast-1.compute.internal, type: t4g.small, zone: ap-southeast-1b, capacityType: spot	{"commit": "4b14787", "provisioner": "default"}
2022-03-19T13:44:21.431Z	INFO	controller.provisioning	Bound 1 pod(s) to node ip-192-168-173-182.ap-southeast-1.compute.internal	{"commit": "4b14787", "provisioner": "default"}
armujahid added the bug label Mar 18, 2022
dewjam self-assigned this Mar 18, 2022
dewjam (Contributor) commented Mar 18, 2022

Hey @armujahid I'm trying to reproduce this now. In the meantime, can you share some details about your cluster?

Is this a new cluster you've built just for testing out Karpenter? Or is this an existing cluster with other workloads?

Can you enable debug logging and supply some logs showing the problem?
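(For reference, around v0.7 debug logging was controlled through the config-logging ConfigMap in the karpenter namespace; the exact key layout may differ by version, so treat this as a sketch:)

kubectl edit configmap config-logging -n karpenter
# Set "level": "debug" in the zap-logger-config entry; a restart of the
# karpenter deployment may be needed for the change to take effect:
kubectl rollout restart deployment karpenter -n karpenter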

armujahid (Contributor, Author) commented

It's a newly created cluster that I set up by following the eksctl Karpenter getting-started guide (linked in step 1).
It is running a few of our containers published in a private ECR registry.
I will also share logs here.
The issue could also be caused by the architecture of the container image. Although we mostly use amd64 hardware, some devs also have Apple arm64 hardware (which could push arm64 images by default?).
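(For reference, a quick way to check which architecture(s) an image was pushed with, and to force an amd64 build on Apple Silicon; the account ID, region, and image name below are placeholders, not our actual repo:)

# Log in to the private ECR registry (placeholder account and region).
aws ecr get-login-password --region ap-southeast-1 | \
  docker login --username AWS --password-stdin 123456789012.dkr.ecr.ap-southeast-1.amazonaws.com

# Show the architecture(s) carried by the pushed image's manifest.
docker manifest inspect 123456789012.dkr.ecr.ap-southeast-1.amazonaws.com/my-app:latest | grep architecture

# Build and push an explicitly amd64 image from an arm64 Mac.
docker buildx build --platform linux/amd64 \
  -t 123456789012.dkr.ecr.ap-southeast-1.amazonaws.com/my-app:latest --push .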

dewjam (Contributor) commented Mar 18, 2022

Hey @armujahid I am able to reproduce the problem in my test cluster. From what we can tell v0.7.1 is not filtering out instance types based on architecture when identifying feasible instance type options.

Here's an example log from v0.6.5 showing instance type options:

2022-03-18T19:49:16.111Z	INFO	controller.provisioning	Computed packing of 1 node(s) for 1 pod(s) with instance type option(s) [m1.small m1.medium m3.medium c1.medium t3a.small t3.small c4.large c3.large c6a.large c5.large t3a.medium c5d.large t3.medium c5ad.large c6i.large c5a.large c5n.large m3.large]	{"commit": "7b5afee", "provisioner": "default"}

Here's an example log from v0.7.1 with the same workload and provisioner spec:

2022-03-18T19:33:45.888Z	INFO	controller.provisioning	Computed packing of 1 node(s) for 1 pod(s) with instance type option(s) [m1.small c6g.medium a1.medium c6gd.medium c6gn.medium m1.medium m3.medium m6g.medium m6gd.medium r6g.medium r6gd.medium c1.medium t3a.small t3.small t4g.small c4.large c3.large]	{"commit": "4b14787", "provisioner": "default"}

(notice the Graviton instances are in the list)

We are working on a fix for this now. Thank you for bringing this issue to our attention!
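(One way to confirm the architecture of the listed instance types is the AWS CLI; for example, t4g.small should report arm64 and c5.large x86_64:)

aws ec2 describe-instance-types --instance-types t4g.small c5.large \
  --query 'InstanceTypes[].[InstanceType, ProcessorInfo.SupportedArchitectures[0]]' \
  --output text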

dewjam (Contributor) commented Mar 18, 2022

As a note, you can temporarily work around this by manually specifying instance types in the Provisioner requirements, such as:

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  ttlSecondsAfterEmpty: 60
  requirements:
  - key: node.kubernetes.io/instance-type
    operator: In
    values: ["c5.large", "c5.xlarge"]

rtripat added the burning label Mar 18, 2022
tzneal added a commit to tzneal/karpenter that referenced this issue Mar 18, 2022
tzneal added a commit to tzneal/karpenter that referenced this issue Mar 18, 2022
armujahid (Contributor, Author) commented

Thanks for the quick fix. I will test it with v0.7.2. I have also added logs. It was indeed provisioning Graviton instances (t4g.small in my case).

armujahid (Contributor, Author) commented

With v0.7.2 it's working fine :)
Logs:

2022-03-19T14:07:17.529Z	INFO	controller.provisioning	Batched 1 pods in 1.000508466s	{"commit": "c9e015e", "provisioner": "default"}
2022-03-19T14:07:17.533Z	DEBUG	controller.provisioning	Excluding instance type t3a.nano because there are not enough resources for kubelet and system overhead	{"commit": "c9e015e", "provisioner": "default"}
2022-03-19T14:07:17.535Z	DEBUG	controller.provisioning	Excluding instance type t3.nano because there are not enough resources for kubelet and system overhead	{"commit": "c9e015e", "provisioner": "default"}
2022-03-19T14:07:17.541Z	INFO	controller.provisioning	Computed packing of 1 node(s) for 1 pod(s) with instance type option(s) [m1.small m1.medium m3.medium c1.medium t3.small t3a.small c4.large c3.large c5d.large c6i.large t3a.medium c5ad.large c5.large t3.medium c5a.large c5n.large m1.large m3.large]	{"commit": "c9e015e", "provisioner": "default"}
2022-03-19T14:07:17.689Z	DEBUG	controller.provisioning	Discovered security groups: [sg-0bfdbf9b25fb3c6a6]	{"commit": "c9e015e", "provisioner": "default"}
2022-03-19T14:07:17.695Z	DEBUG	controller.provisioning	Discovered kubernetes version 1.21	{"commit": "c9e015e", "provisioner": "default"}
2022-03-19T14:07:17.744Z	DEBUG	controller.provisioning	Discovered ami-0a620d8210b5d94ac for query /aws/service/eks/optimized-ami/1.21/amazon-linux-2/recommended/image_id	{"commit": "c9e015e", "provisioner": "default"}
2022-03-19T14:07:17.869Z	DEBUG	controller.provisioning	Created launch template, Karpenter-arm-karpenter-demo-13741680538296305221	{"commit": "c9e015e", "provisioner": "default"}
2022-03-19T14:07:20.406Z	INFO	controller.provisioning	Launched instance: i-046fdb82e98d01ca7, hostname: ip-192-168-139-88.ap-southeast-1.compute.internal, type: t3.medium, zone: ap-southeast-1a, capacityType: spot	{"commit": "c9e015e", "provisioner": "default"}
2022-03-19T14:07:20.444Z	INFO	controller.provisioning	Bound 1 pod(s) to node ip-192-168-139-88.ap-southeast-1.compute.internal	{"commit": "c9e015e", "provisioner": "default"}

Note that I had to delete and recreate my provisioner after upgrading from 0.7.1 to 0.7.2 via Helm:

helm upgrade -n karpenter karpenter karpenter/karpenter --version 0.7.2
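(The recreate step was roughly the following, assuming the provisioner spec above is saved as provisioner.yaml:)

kubectl delete provisioner default
kubectl apply -f provisioner.yaml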
