
Karpenter Node NotReady when provided with extra kubelet args #5043

Closed
hitsub2 opened this issue Nov 7, 2023 · 6 comments
Labels
question Issues that are support related questions

Comments


hitsub2 commented Nov 7, 2023

Description

Observed Behavior:
When the following extra kubelet args are provided, some nodes (2 out of 400) never become Ready, and Karpenter cannot disrupt them, leaving them stuck indefinitely.

Extra kubelet config:

--cpu-manager-policy=static --enforce-node-allocatable=pods,kube-reserved,system-reserved --system-reserved-cgroup=/system.slice --kube-reserved-cgroup=/system.slice
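As a quick sanity check (a hypothetical helper, not part of the original report), a long flag string like the one above can be split one flag per line so duplicated or malformed flags stand out:

```shell
#!/bin/sh
# Hypothetical helper: print each kubelet flag on its own line so that
# duplicated or mistyped flags in a long string are easy to spot.
ARGS="--cpu-manager-policy=static --enforce-node-allocatable=pods,kube-reserved,system-reserved --system-reserved-cgroup=/system.slice --kube-reserved-cgroup=/system.slice"
# Intentionally unquoted: word splitting turns each flag into one printf argument.
printf '%s\n' $ARGS | sort
```

Sorting makes adjacent duplicates visible at a glance when the list grows.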

EC2NodeClass (nodeclass.yaml):

apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: dc-spark-ue1-prod-memory
spec:
  amiFamily: AL2
  amiSelectorTerms:
  - id: ami-0c97930d0d19e564a
  blockDeviceMappings:
  - deviceName: /dev/xvda
    ebs:
      deleteOnTermination: true
      encrypted: false
      iops: 3000
      throughput: 125
      volumeSize: 200Gi
      volumeType: gp3
  detailedMonitoring: false
  metadataOptions:
    httpEndpoint: enabled
    httpProtocolIPv6: disabled
    httpPutResponseHopLimit: 2
    httpTokens: required
  role: KarpenterNodeRole-dongdgy-karpenter-demo
  securityGroupSelectorTerms:
  - id: sg-03b6a8b2900572e14
  subnetSelectorTerms:
  - id: subnet-041c9f82b633f50ca
  tags:
    Name: dc-spark-ue1-prod-memory
    billing_entry: data_engineering
    billing_group: bigdata
    billing_service: spark
    workload: general
  userData: |
    #!/bin/bash
    set -o xtrace
    mkdir -p /sys/fs/cgroup/cpuset/system.slice && mkdir -p /sys/fs/cgroup/hugetlb/system.slice
    /etc/eks/bootstrap.sh dc-spark-ue1-prod --kubelet-extra-args '--node-labels=billing_service=spark,lifecycle=Ec2Spot,billing_group=bigdata,billing_entry=data_engineering,workload=general,node_group_name=dc-spark-ue1-prod-memory,NAME=dc-spark-ue1-prod,env=prod --cpu-manager-policy=static --enforce-node-allocatable=pods,kube-reserved,system-reserved --system-reserved-cgroup=/system.slice --kube-reserved-cgroup=/system.slice --kube-reserved=cpu=500m,memory=1Gi,ephemeral-storage=2Gi --system-reserved=cpu=500m,memory=1Gi,ephemeral-storage=2Gi'

kubelet error log


Nov 02 07:55:34 ip-172-17-182-225.ec2.internal kubelet[4330]: E1102 07:55:34.137963    4330 kubelet_node_status.go:539] "Error updating node status, will retry" err="error getting node \"ip-172-17-182-225.ec2.internal\": Get \"https://123456.gr7.us-east-1.eks.amazonaws.com/api/v1/nodes/ip-172-17-182-225.ec2.internal?resourceVersion=0&timeout=10s\": tls: failed to verify certificate: x509: certificate signed by unknown authority"
Nov 02 07:55:34 ip-172-17-182-225.ec2.internal kubelet[4330]: E1102 07:55:34.142093    4330 kubelet_node_status.go:539] "Error updating node status, will retry" err="error getting node \"ip-172-17-182-225.ec2.internal\": Get \"https://123456.gr7.us-east-1.eks.amazonaws.com/api/v1/nodes/ip-172-17-182-225.ec2.internal?timeout=10s\": tls: failed to verify certificate: x509: certificate signed by unknown authority"
Nov 02 07:55:34 ip-172-17-182-225.ec2.internal kubelet[4330]: E1102 07:55:34.146351    4330 kubelet_node_status.go:539] "Error updating node status, will retry" err="error getting node \"ip-172-17-182-225.ec2.internal\": Get \"https://123456.gr7.us-east-1.eks.amazonaws.com/api/v1/nodes/ip-172-17-182-225.ec2.internal?timeout=10s\": tls: failed to verify certificate: x509: certificate signed by unknown authority"
Nov 02 07:55:34 ip-172-17-182-225.ec2.internal kubelet[4330]: E1102 07:55:34.150486    4330 kubelet_node_status.go:539] "Error updating node status, will retry" err="error getting node \"ip-172-17-182-225.ec2.internal\": Get \"https://123456.gr7.us-east-1.eks.amazonaws.com/api/v1/nodes/ip-172-17-182-225.ec2.internal?timeout=10s\": tls: failed to verify certificate: x509: certificate signed by unknown authority"
Nov 02 07:55:34 ip-172-17-182-225.ec2.internal kubelet[4330]: E1102 07:55:34.155528    4330 kubelet_node_status.go:539] "Error updating node status, will retry" err="error getting node \"ip-172-17-182-225.ec2.internal\": Get \"https://123456.gr7.us-east-1.eks.amazonaws.com/api/v1/nodes/ip-172-17-182-225.ec2.internal?timeout=10s\": tls: failed to verify certificate: x509: certificate signed by unknown authority"
Nov 02 07:55:34 ip-172-17-182-225.ec2.internal kubelet[4330]: E1102 07:55:34.155545    4330 kubelet_node_status.go:526] "Unable to update node status" err="update node status exceeds retry count"
Nov 02 07:55:40 ip-172-17-182-225.ec2.internal kubelet[4330]: E1102 07:55:40.396985    4330 controller.go:144] failed to ensure lease exists, will retry in 7s, error: Get "https://123456.gr7.us-east-1.eks.amazonaws.com/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/ip-172-17-182-225.ec2.internal?timeout=10s": tls: failed to verify certificate: x509: certificate signed by unknown authority
Nov 02 07:55:40 ip-172-17-182-225.ec2.internal kubelet[4330]: E1102 07:55:40.406468    4330 event.go:276] Unable to write event: '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"ip-172-17-182-225.ec2.internal.1793bf338cd27e84", GenerateName:"", Namespace:"default", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), DeletionTimestamp:<nil>, DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ManagedFields:[]v1.ManagedFieldsEntry(nil)}, InvolvedObject:v1.ObjectReference{Kind:"Node", Namespace:"", Name:"ip-172-17-182-225.ec2.internal", UID:"ip-172-17-182-225.ec2.internal", APIVersion:"", ResourceVersion:"", FieldPath:""}, Reason:"Starting", Message:"Starting kubelet.", Source:v1.EventSource{Component:"kubelet", Host:"ip-172-17-182-225.ec2.internal"}, FirstTimestamp:time.Date(2023, time.November, 2, 7, 55, 12, 575651460, time.Local), LastTimestamp:time.Date(2023, time.November, 2, 7, 55, 12, 575651460, time.Local), Count:1, Type:"Normal", EventTime:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"", ReportingInstance:""}': 'Post "https://123456.gr7.us-east-1.eks.amazonaws.com/api/v1/namespaces/default/events": tls: failed to verify certificate: x509: certificate signed by unknown authority'(may retry after sleeping)
Nov 02 07:55:41 ip-172-17-182-225.ec2.internal kubelet[4330]: I1102 07:55:41.523852    4330 kubelet_resources.go:45] "Allocatable" allocatable=map[attachable-volumes-aws-ebs:{i:{value:25 scale:0} d:{Dec:<nil>} s:25 Format:DecimalSI} cpu:{i:{value:7 scale:0} d:{Dec:<nil>} s:7 Format:DecimalSI} ephemeral-storage:{i:{value:188967217652 scale:0} d:{Dec:<nil>} s:188967217652 Format:DecimalSI} hugepages-1Gi:{i:{value:0 scale:0} d:{Dec:<nil>} s:0 Format:DecimalSI} hugepages-2Mi:{i:{value:0 scale:0} d:{Dec:<nil>} s:0 Format:DecimalSI} memory:{i:{value:64331677696 scale:0} d:{Dec:<nil>} s: Format:BinarySI} pods:{i:{value:58 scale:0} d:{Dec:<nil>} s:58 Format:DecimalSI}]
Nov 02 07:55:41 ip-172-17-182-225.ec2.internal kubelet[4330]: I1102 07:55:41.860071    4330 kubelet.go:2156] "SyncLoop (PLEG): event for pod" pod="kube-admin/collector-wnt57" event=&{ID:d2978f3b-836b-451a-ad25-baf6dcd98f70 Type:ContainerStarted Data:6bc832bd393af94de8f6a7455e60f333a390c286d5b1f474f16e559422a47c85}
Nov 02 07:55:41 ip-172-17-182-225.ec2.internal kubelet[4330]: I1102 07:55:41.860276    4330 kubelet_pods.go:897] "Unable to retrieve pull secret, the image pull may not succeed." pod="kube-admin/collector-wnt57" secret="" err="secret \"default-secret\" not found"
Nov 02 07:55:42 ip-172-17-182-225.ec2.internal kubelet[4330]: I1102 07:55:42.861650    4330 kubelet_pods.go:897] "Unable to retrieve pull secret, the image pull may not succeed." pod="kube-admin/collector-wnt57" secret="" err="secret \"default-secret\" not found"
Nov 02 07:55:42 ip-172-17-182-225.ec2.internal kubelet[4330]: I1102 07:55:42.978675    4330 state_mem.go:80] "Updated desired CPUSet" podUID="d2978f3b-836b-451a-ad25-baf6dcd98f70" containerName="collector" cpuSet="0-7"
Nov 02 07:55:44 ip-172-17-182-225.ec2.internal kubelet[4330]: E1102 07:55:44.322831    4330 kubelet_node_status.go:539] "Error updating node status, will retry" err="error getting node \"ip-172-17-182-225.ec2.internal\": Get \"https://123456.gr7.us-east-1.eks.amazonaws.com/api/v1/nodes/ip-172-17-182-225.ec2.internal?resourceVersion=0&timeout=10s\": tls: failed to verify certificate: x509: certificate signed by unknown authority"
Nov 02 07:55:44 ip-172-17-182-225.ec2.internal kubelet[4330]: E1102 07:55:44.327170    4330 kubelet_node_status.go:539] "Error updating node status, will retry" err="error getting node \"ip-172-17-182-225.ec2.internal\": Get \"https://123456.gr7.us-east-1.eks.amazonaws.com/api/v1/nodes/ip-172-17-182-225.ec2.internal?timeout=10s\": tls: failed to verify certificate: x509: certificate signed by unknown authority"
Nov 02 07:55:44 ip-172-17-182-225.ec2.internal kubelet[4330]: E1102 07:55:44.330937    4330 kubelet_node_status.go:539] "Error updating node status, will retry" err="error getting node \"ip-172-17-182-225.ec2.internal\": Get \"https://123456.gr7.us-east-1.eks.amazonaws.com/api/v1/nodes/ip-172-17-182-225.ec2.internal?timeout=10s\": tls: failed to verify certificate: x509: certificate signed by unknown authority"
Nov 02 07:55:44 ip-172-17-182-225.ec2.internal kubelet[4330]: E1102 07:55:44.334953    4330 kubelet_node_status.go:539] "Error updating node status, will retry" err="error getting node \"ip-172-17-182-225.ec2.internal\": Get \"https://123456.gr7.us-east-1.eks.amazonaws.com/api/v1/nodes/ip-172-17-182-225.ec2.internal?timeout=10s\": tls: failed to verify certificate: x509: certificate signed by unknown authority"
Nov 02 07:55:44 ip-172-17-182-225.ec2.internal kubelet[4330]: E1102 07:55:44.338930    4330 kubelet_node_status.go:539] "Error updating node status, will retry" err="error getting node \"ip-172-17-182-225.ec2.internal\": Get \"https://123456.gr7.us-east-1.eks.amazonaws.com/api/v1/nodes/ip-172-17-182-225.ec2.internal?timeout=10s\": tls: failed to verify certificate: x509: certificate signed by unknown authority"
Nov 02 07:55:44 ip-172-17-182-225.ec2.internal kubelet[4330]: E1102 07:55:44.338945    4330 kubelet_node_status.go:526] "Unable to update node status" err="update node status exceeds retry count"
Nov 02 07:55:47 ip-172-17-182-225.ec2.internal kubelet[4330]: E1102 07:55:47.403782    4330 controller.go:144] failed to ensure lease exists, will retry in 7s, error: Get "https://123456.gr7.us-east-1.eks.amazonaws.com/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/ip-172-17-182-225.ec2.internal?timeout=10s": tls: failed to verify certificate: x509: certificate signed by unknown authority
Nov 02 11:09:58 ip-172-17-182-225.ec2.internal kubelet[4330]: I1102 11:09:58.521341    4330 log.go:194] http: TLS handshake error from 172.29.73.185:36194: no serving certificate available for the kubelet


Expected Behavior:
All nodes should become Ready; if NotReady nodes do come up, Karpenter should recycle or disrupt them.
Reproduction Steps (Please include YAML):

Versions:

  • Karpenter Version: v0.32.1
  • Kubernetes Version (kubectl version): 1.25
@hitsub2 hitsub2 added the bug Something isn't working label Nov 7, 2023

hitsub2 commented Nov 7, 2023

After changing amiFamily from AL2 to Custom, there no longer seem to be any NotReady nodes. So my question: what is the behavior when kubelet config is provided via user data? Is the user data executed twice, and is that what caused this bug?
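For context on the merge question, a rough sketch of Karpenter's documented AL2 behavior (not verified against this cluster): with amiFamily: AL2, Karpenter combines the EC2NodeClass userData with its own generated bootstrap into a multi-part MIME user-data document, with the custom script placed first. Since the userData above already calls /etc/eks/bootstrap.sh itself, bootstrap can effectively run twice:

```text
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="BOUNDARY"

--BOUNDARY
Content-Type: text/x-shellscript; charset="us-ascii"

#!/bin/bash
# user-supplied script from the EC2NodeClass (runs first),
# here already calling /etc/eks/bootstrap.sh itself

--BOUNDARY
Content-Type: text/x-shellscript; charset="us-ascii"

#!/bin/bash
# Karpenter-generated bootstrap (runs second),
# which also invokes /etc/eks/bootstrap.sh
--BOUNDARY--
```

If that double invocation were the culprit, either dropping the bootstrap.sh call from userData or switching to amiFamily: Custom (which leaves userData untouched) would avoid it.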


engedaam commented Nov 7, 2023

This seems like a duplicate of Node repair. Since most of the nodes (398/400) became Ready, a transient error was likely the problem in this case.

@engedaam engedaam added question Issues that are support related questions and removed bug Something isn't working labels Nov 7, 2023

hitsub2 commented Nov 8, 2023

It is Karpenter's responsibility to do the node repair, but I am just wondering why this happens. Is it due to the user data running twice?

@engedaam
Contributor

I suspect it's not due to userData, as most of the nodes are Ready.

@engedaam
Contributor

Closing as a duplicate of kubernetes-sigs/karpenter#750

@sidewinder12s
Contributor

@hitsub2 Just wondering, were you following a guide or something else for working with these flags?

--enforce-node-allocatable=pods,kube-reserved,system-reserved --system-reserved-cgroup=/system.slice --kube-reserved-cgroup=/system.slice
