
Adding CriticalAddonsOnly taint doesn't allow cluster to start #487

Closed

ChristopherEdwards opened this issue Oct 14, 2020 · 9 comments


ChristopherEdwards commented Oct 14, 2020

When installing a new cluster on a clean Amazon Linux 2 instance, setting:

node-taint:
  - "CriticalAddonsOnly=true:NoExecute"

in config.yaml results in a cluster node that never becomes ready.

I'm installing via the tarball installer and can provide details if required.

config.yaml:

token: ${rancher-token}
node-taint:
  - "CriticalAddonsOnly=true:NoExecute"
tls-san:
  - "${control-plane-dns}"
@ChristopherEdwards (Author)

get pods output:

NAMESPACE     NAME                                                  READY   STATUS    RESTARTS   AGE
kube-system   etcd-ip-xxx-xxx-xxx-xxx.ec2.internal                      1/1     Running   0          4m31s
kube-system   helm-install-rke2-canal-z4mcr                         0/1     Pending   0          5m18s
kube-system   helm-install-rke2-coredns-qhldn                       0/1     Pending   0          5m18s
kube-system   helm-install-rke2-ingress-nginx-2w45l                 0/1     Pending   0          5m18s
kube-system   helm-install-rke2-kube-proxy-767j4                    0/1     Pending   0          5m18s
kube-system   helm-install-rke2-metrics-server-htmp4                0/1     Pending   0          5m18s
kube-system   kube-apiserver-ip-xxx-xxx-xxx-xxx.ec2.internal            1/1     Running   0          4m6s
kube-system   kube-controller-manager-ip-xxx-xxx-xxx-xxx.ec2.internal   1/1     Running   0          4m10s
kube-system   kube-scheduler-ip-xxx-xxx-xxx-xxx.ec2.internal            1/1     Running   0          4m29s

All pending helm-install-rke2 pods are failing with:

  Warning  FailedScheduling  8m24s  default-scheduler  0/1 nodes are available: 1 node(s) had taint {CriticalAddonsOnly: true}, that the pod didn't tolerate.
  Warning  FailedScheduling  8m24s  default-scheduler  0/1 nodes are available: 1 node(s) had taint {CriticalAddonsOnly: true}, that the pod didn't tolerate.
  Warning  FailedScheduling  6m46s  default-scheduler  0/2 nodes are available: 1 node(s) didn't match node selector, 1 node(s) had taint {CriticalAddonsOnly: true}, that the pod didn't tolerate.

get nodes output:

NAME                          STATUS     ROLES         AGE     VERSION
ip-xxx-xxx-xxx-xxx.ec2.internal   NotReady   <none>        7m36s   v1.18.9+rke2r1
ip-xxx-xxx-xxx-xxx.ec2.internal   NotReady   etcd,master   9m30s   v1.18.9+rke2r1

@joshrwolf (Contributor)

Can confirm this issue. It appears that while the system charts themselves carry tolerations for the taint, the helm-install jobs that deploy them do not, resulting in what @ChristopherEdwards is showing above.
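
For context, a pod can schedule onto (and remain on) a node carrying this taint only if it declares a matching toleration. A minimal sketch of the toleration the helm-install job pods were missing (illustrative, not the actual chart contents):

# Toleration needed in the job's pod spec to get past CriticalAddonsOnly=true:NoExecute.
# "operator: Exists" with no effect specified matches any value and any effect for this key.
tolerations:
  - key: CriticalAddonsOnly
    operator: Exists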


innobead commented Dec 9, 2020

@c3y1huang please help with this issue: the pod created by the job does not have the correct tolerations, which leaves the helm installation pending.

Please review the code below (helm-controller) to see how to make CriticalAddonsOnly workable in this case.

https://github.com/k3s-io/k3s/blob/15d03c5930e37cb7aad00c65486902fb66dc744a/vendor/github.com/k3s-io/helm-controller/pkg/helm/controller.go#L209-L209
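
In practice the fix would be for the controller to carry the toleration through to the Job's pod template, so the generated manifest ends up shaped roughly like this (a hedged sketch, not the controller's actual output; names are illustrative):

apiVersion: batch/v1
kind: Job
metadata:
  name: helm-install-rke2-canal
  namespace: kube-system
spec:
  template:
    spec:
      tolerations:
        - key: CriticalAddonsOnly   # tolerate the taint so the install job can run
          operator: Exists
      containers:
        - name: helm
          image: rancher/klipper-helm   # illustrative; the actual image is set by the controller
      restartPolicy: OnFailure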

cc @jenting @cclhsu


davidnuzik commented Dec 16, 2020

Assigning to you @brandond as per our discussion in last sprint (yesterday, 12/15). You will take point on this issue. Review the PR that @c3y1huang proposed (thank you for the PR) and ensure it looks good / provide feedback. We should work towards getting this into a January release if possible. I've set the 1.19.6 milestone for mid-January.


brandond commented Jan 8, 2021

This should be fixed in the above-linked PR:

[root@centos01 ~]# kubectl describe pod -n kube-system   helm-install-rke2-canal-vqgfh
Name:         helm-install-rke2-canal-vqgfh
Namespace:    kube-system
Priority:     0
Node:         centos01.lan.khaus/10.0.1.137
Start Time:   Fri, 08 Jan 2021 13:41:20 -0800
Labels:       controller-uid=191e6541-893b-4ce2-a786-7f954a1e71e5
              helmcharts.helm.cattle.io/chart=rke2-canal
              job-name=helm-install-rke2-canal
Annotations:  helmcharts.helm.cattle.io/configHash: SHA256=D54E55CAEE0F4088268827DA9824C7A101199892C9F0D3CE6E991DA685802A46
              kubernetes.io/psp: global-unrestricted-psp
Status:       Succeeded
...
Tolerations:     CriticalAddonsOnly op=Exists
                 node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule
                 node.kubernetes.io/not-ready:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s


ShylajaDevadiga commented Jan 12, 2021

Using master build rke2 version v1.19.5-dev+5c837fe7
On a single-node cluster, helm-install-rke2-ingress-nginx and helm-install-rke2-metrics-server failed as expected, since those charts do not tolerate the taint.
The rke2-coredns failure is fixed in rancher/rke2-charts#40.

$ kubectl describe pod -n kube-system helm-install-rke2-canal-nbb8p  |grep -i critical
Tolerations:     CriticalAddonsOnly op=Exists
$ kubectl describe pod -n kube-system helm-install-rke2-coredns-cgttf   |grep -i critical
Tolerations:     CriticalAddonsOnly op=Exists

 $ kubectl describe pod -n kube-system helm-install-rke2-ingress-nginx-m2qdl |grep -i critical
  Warning  FailedScheduling  2m7s  default-scheduler  0/1 nodes are available: 1 node(s) had taint {CriticalAddonsOnly: true}, that the pod didn't tolerate.
  Warning  FailedScheduling  2m7s  default-scheduler  0/1 nodes are available: 1 node(s) had taint {CriticalAddonsOnly: true}, that the pod didn't tolerate.
  
 $ kubectl describe pod -n kube-system helm-install-rke2-metrics-server-96tzr |grep -i critical
  Warning  FailedScheduling  23m   default-scheduler  0/1 nodes are available: 1 node(s) had taint {CriticalAddonsOnly: true}, that the pod didn't tolerate.
  Warning  FailedScheduling  23m   default-scheduler  0/1 nodes are available: 1 node(s) had taint {CriticalAddonsOnly: true}, that the pod didn't tolerate.
  
  $ kubectl describe pod -n kube-system rke2-coredns-rke2-coredns-bbf9475cb-n6hnw |grep -i critical
Priority Class Name:  system-cluster-critical
                      scheduler.alpha.kubernetes.io/critical-pod: 
                      scheduler.alpha.kubernetes.io/tolerations: [{"key":"CriticalAddonsOnly", "operator":"Exists"}]
  Warning  FailedScheduling  19m   default-scheduler  0/1 nodes are available: 1 node(s) had taint {CriticalAddonsOnly: true}, that the pod didn't tolerate.
  Warning  FailedScheduling  19m   default-scheduler  0/1 nodes are available: 1 node(s) had taint {CriticalAddonsOnly: true}, that the pod didn't tolerate.
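
Note: the scheduler.alpha.kubernetes.io/tolerations annotation shown on the coredns pod above is a long-deprecated alpha mechanism that current schedulers ignore; a toleration only takes effect when declared in the pod spec, which is presumably what the chart fix in rancher/rke2-charts#40 moves to. The spec-level form, for reference:

# Tolerations belong in the pod spec; the alpha annotation form is ignored.
spec:
  tolerations:
    - key: CriticalAddonsOnly
      operator: Exists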

On a two-node cluster (rke2 version v1.19.5-dev+5c837fe7) with one control-plane node and one worker node:

$ cat /etc/rancher/rke2/config.yaml
node-taint:
  - "CriticalAddonsOnly=true:NoExecute"

$ kubectl get nodes
NAME               STATUS   ROLES         AGE     VERSION
ip-172-31-26-237   Ready    <none>        5m31s   v1.19.5-dev+5c837fe7
ip-172-31-31-92    Ready    etcd,master   96m     v1.19.5-dev+5c837fe7

$ kubectl get pods -A
NAMESPACE     NAME                                                 READY   STATUS      RESTARTS   AGE
kube-system   etcd-ip-172-31-31-92                                 1/1     Running     0          95m
kube-system   helm-install-rke2-canal-msrnz                        0/1     Completed   0          96m
kube-system   helm-install-rke2-coredns-l76v2                      0/1     Completed   0          96m
kube-system   helm-install-rke2-ingress-nginx-79wj2                0/1     Completed   0          96m
kube-system   helm-install-rke2-kube-proxy-42jwn                   0/1     Completed   0          96m
kube-system   helm-install-rke2-metrics-server-xjbls               0/1     Completed   0          96m
kube-system   kube-apiserver-ip-172-31-31-92                       1/1     Running     0          95m
kube-system   kube-controller-manager-ip-172-31-31-92              1/1     Running     0          96m
kube-system   kube-proxy-4t94k                                     1/1     Running     0          96m
kube-system   kube-proxy-t69bp                                     1/1     Running     0          5m39s
kube-system   kube-scheduler-ip-172-31-31-92                       1/1     Running     0          96m
kube-system   rke2-canal-4jb2p                                     2/2     Running     0          96m
kube-system   rke2-canal-ls59f                                     2/2     Running     0          5m39s
kube-system   rke2-coredns-rke2-coredns-bbf9475cb-ht7r9            1/1     Running     0          96m
kube-system   rke2-ingress-nginx-controller-54946dd48f-r5jsg       1/1     Running     0          4m45s
kube-system   rke2-ingress-nginx-default-backend-5795954f8-rkl8h   1/1     Running     0          4m45s
kube-system   rke2-metrics-server-5f9b5757dc-zr67r                 1/1     Running     0          4m45s

$ kubectl describe pod -n kube-system   helm-install-rke2-coredns-l76v2 |tail -6
Tolerations:     CriticalAddonsOnly op=Exists
                 node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule
                 node.kubernetes.io/not-ready:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:          <none>

@ShylajaDevadiga (Contributor)

Validated on rke2 version v1.19.7-rc1+rke2r1; rke2-coredns is in the Running state on a single-node cluster.

ubuntu@ip-172-31-6-18:~$ kubectl get nodes
NAME             STATUS   ROLES         AGE     VERSION
ip-172-31-6-18   Ready    etcd,master   8m30s   v1.19.7-rc1+rke2r1
ubuntu@ip-172-31-6-18:~$ kubectl get pods -A
NAMESPACE     NAME                                         READY   STATUS      RESTARTS   AGE
kube-system   etcd-ip-172-31-6-18                          1/1     Running     0          7m36s
kube-system   helm-install-rke2-canal-5p4jc                0/1     Completed   0          8m24s
kube-system   helm-install-rke2-coredns-d6z4w              0/1     Completed   0          8m24s
kube-system   helm-install-rke2-ingress-nginx-s4n49        0/1     Pending     0          8m24s
kube-system   helm-install-rke2-kube-proxy-bvtds           0/1     Completed   0          8m24s
kube-system   helm-install-rke2-metrics-server-x9p4c       0/1     Pending     0          8m24s
kube-system   kube-apiserver-ip-172-31-6-18                1/1     Running     0          7m40s
kube-system   kube-controller-manager-ip-172-31-6-18       1/1     Running     0          7m36s
kube-system   kube-proxy-t9stq                             1/1     Running     0          8m10s
kube-system   kube-scheduler-ip-172-31-6-18                1/1     Running     0          7m28s
kube-system   rke2-canal-2mm8j                             2/2     Running     0          8m10s
kube-system   rke2-coredns-rke2-coredns-6cd96645d6-h8fvf   1/1     Running     0          8m8s
ubuntu@ip-172-31-6-18:~$ 

@gawsoftpl

This issue still occurs during cluster creation with 3x control-plane + etcd nodes. If you want to create such a cluster, you have to taint the nodes after creation (see k3s-io/k3s#6383).
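
For reference, tainting a node after creation produces the same entry on the node object that the node-taint config option sets at registration (a sketch of the resulting node spec, for illustration only):

# Resulting taint on the node object (illustrative)
spec:
  taints:
    - key: CriticalAddonsOnly
      value: "true"
      effect: NoExecute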


brandond commented Jun 15, 2024

@gawsoftpl you're on the wrong repo. This issue is regarding rke2. If you have a problem with k3s, please open an issue over there.

rancher locked this issue as resolved and limited conversation to collaborators on Jun 15, 2024.