Control Plane NoSchedule Taint Missing After Upgrade #10217

Closed
tman5 opened this issue Jun 14, 2023 · 7 comments · Fixed by #10464
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@tman5

tman5 commented Jun 14, 2023

Environment:

  • Cloud provider or hardware configuration: on-prem

  • OS (printf "$(uname -srm)\n$(cat /etc/os-release)\n"): Rocky 8.7

  • Version of Ansible (ansible --version): 2.12

  • Version of Python (python --version): 3.11.3

Kubespray version (commit) (git rev-parse --short HEAD):
0955df2ec

Network plugin used:
calico

Full inventory with variables (ansible -i inventory/sample/inventory.ini all -m debug -a "var=hostvars[inventory_hostname]"):

[kube_control_plane:children]
kube-control

[etcd:children]
kube-control

[kube_node:children]
kube-worker

[calico_rr]

[k8s_cluster:children]
kube_control_plane
kube_node
calico_rr

Anything else we need to know: After upgrading from 1.24.7 -> 1.25.10, the taint is not applied to the control plane nodes. It appears in the kubeadm config file, but it is not set on the nodes themselves. This taint is missing:
node-role.kubernetes.io/control-plane:NoSchedule

$ cat /etc/kubeadm-config.yaml
/etc/kubernetes/kubeadm-config.yaml 
apiVersion: kubeadm.k8s.io/v1beta3
kind: InitConfiguration
localAPIEndpoint:
  advertiseAddress: <address>
  bindPort: 6443
nodeRegistration:
  name: kube-control-01
  taints:
  - effect: NoSchedule
    key: node-role.kubernetes.io/master
  - effect: NoSchedule
    key: node-role.kubernetes.io/control-plane
  criSocket: unix:///var/run/containerd/containerd.sock

I should also note that running the cluster.yaml playbook afterward does not fix this issue, nor does running the upgrade.yaml playbook. The only "fix" is to manually apply the taint afterward.
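
The manual workaround can be scripted. Below is a minimal sketch, not part of Kubespray. It assumes you have the parsed output of `kubectl get nodes -o json` and that control plane nodes can be identified by the `node-role.kubernetes.io/control-plane` label. The function names and the sample data (including the node names `kube-control-02` and `kube-worker-01`) are made up for illustration.

```python
CONTROL_PLANE_KEY = "node-role.kubernetes.io/control-plane"


def nodes_missing_taint(node_list):
    """Return names of control plane nodes lacking the NoSchedule taint.

    node_list is the parsed output of `kubectl get nodes -o json`.
    Assumption: control plane nodes carry the CONTROL_PLANE_KEY label.
    """
    missing = []
    for node in node_list.get("items", []):
        labels = node["metadata"].get("labels", {})
        if CONTROL_PLANE_KEY not in labels:
            continue  # not a control plane node
        taints = node.get("spec", {}).get("taints") or []
        has_taint = any(
            t.get("key") == CONTROL_PLANE_KEY and t.get("effect") == "NoSchedule"
            for t in taints
        )
        if not has_taint:
            missing.append(node["metadata"]["name"])
    return missing


def taint_command(node_name):
    """Build the kubectl argv that re-applies the taint to one node."""
    return ["kubectl", "taint", "nodes", node_name,
            f"{CONTROL_PLANE_KEY}:NoSchedule"]


# Hypothetical sample standing in for real `kubectl get nodes -o json` output.
sample = {"items": [
    {"metadata": {"name": "kube-control-01",
                  "labels": {CONTROL_PLANE_KEY: ""}},
     "spec": {}},
    {"metadata": {"name": "kube-control-02",
                  "labels": {CONTROL_PLANE_KEY: ""}},
     "spec": {"taints": [{"key": CONTROL_PLANE_KEY, "effect": "NoSchedule"}]}},
    {"metadata": {"name": "kube-worker-01", "labels": {}},
     "spec": {}},
]}

# Print the commands rather than running them, so they can be reviewed first.
for name in nodes_missing_taint(sample):
    print(" ".join(taint_command(name)))
```

With the sample data, only `kube-control-01` is reported, since `kube-control-02` already has the taint and `kube-worker-01` is not a control plane node.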

See this issue #9578

tman5 added the kind/bug label Jun 14, 2023
@leonbartlett

leonbartlett commented Jun 14, 2023

I have the same issue upgrading from 1.24.6 -> 1.25.6.

@yankay
Member

yankay commented Jun 20, 2023

Thanks @tman5 for the bug report.

Would you be able to provide a PR to fix it? :-)

@tman5
Author

tman5 commented Jun 30, 2023

More info: on new clusters on 1.25 the taints are applied correctly. I don't know yet whether this issue also occurs going from 1.25 -> 1.26 or only on clusters upgraded from 1.24 -> 1.25.

@tman5
Author

tman5 commented Mar 19, 2024

FYI, I just upgraded Kubespray 2.21 -> 2.22 -> 2.23 -> 2.24, upgrading the k8s version along the way, and the control plane taint was not re-applied.

@rptaylor
Contributor

I don't really understand how exactly #10464 can fix this, but I patched it in to Kubespray 2.21 and then the issue did not happen when upgrading anymore.

@unai-ttxu
Contributor

unai-ttxu commented Jul 18, 2024

> I don't really understand how exactly #10464 can fix this, but I patched it in to Kubespray 2.21 and then the issue did not happen when upgrading anymore.

Hi @rptaylor!

Did you patch #10464 or #10532 into Kubespray 2.21?

The missing taint after the upgrade traces back to this commit, which was introduced in Kubernetes v1.25, the default Kubernetes version of Kubespray 2.21.

Because of this commit, kubeadm removes the legacy taint node-role.kubernetes.io/master during the upgrade to v1.25. So on a cluster whose control-plane nodes carry only this legacy taint, rather than both node-role.kubernetes.io/master and node-role.kubernetes.io/control-plane, we need to ensure that node-role.kubernetes.io/control-plane is set before the upgrade. #10532 fixes this by adding the new taint just before the upgrade, so that the control-plane nodes keep it afterward.
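
To make the ordering problem concrete, here is a toy model of the behavior described in this comment. It is not kubeadm's or Kubespray's actual code. The function names are hypothetical, taints are modeled as simple `(key, effect)` pairs, and the model only assumes what is stated here: the upgrade drops the legacy taint and adds nothing in its place.

```python
LEGACY_KEY = "node-role.kubernetes.io/master"
NEW_KEY = "node-role.kubernetes.io/control-plane"


def upgrade_to_1_25(taints):
    """Toy model of the upgrade: the legacy taint is dropped, nothing is added."""
    return {t for t in taints if t != (LEGACY_KEY, "NoSchedule")}


def ensure_new_taint(taints):
    """Toy model of the pre-upgrade step: add the new taint if it is absent."""
    return set(taints) | {(NEW_KEY, "NoSchedule")}


# A control-plane node that only carries the legacy taint.
legacy_only = {(LEGACY_KEY, "NoSchedule")}

# Without the pre-upgrade step, the node ends up with no taints at all.
print(upgrade_to_1_25(legacy_only))
# prints: set()

# With the pre-upgrade step, the new taint survives the upgrade.
print(upgrade_to_1_25(ensure_new_taint(legacy_only)))
# prints: {('node-role.kubernetes.io/control-plane', 'NoSchedule')}
```

The point of the model is only the order of operations: the new taint has to be present before the legacy one is removed.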

This fix was only backported to Kubespray 2.23, but looking at it now, it makes sense to backport it to 2.22 and 2.21 too, since these versions can be used to upgrade Kubernetes to 1.25.

@rptaylor
Contributor

@unai-ttxu thanks for the extra details!

It seems crazy that k8s 1.25 would automatically remove the old taint without also automatically adding the new taint... !

In my environment when I hit this bug upgrading to kubespray 2.21, the /etc/kubernetes/kubeadm-config.yaml file for the master nodes looked correct:

nodeRegistration:
  name: cluster-dev-k8s-master-1
  taints:
  - effect: NoSchedule
    key: node-role.kubernetes.io/master
  - effect: NoSchedule
    key: node-role.kubernetes.io/control-plane

This should have caused 'control-plane' to be added whether or not 'master' was ignored. Very odd.

But I patched https://github.com/kubernetes-sigs/kubespray/pull/10464/files#diff-2510b9cc3e44d8d6e2cc83bd5b60ba888f278a70f1a87ba4df53a2d6f881fcae into my branch, which removes "master" from the kubeadm config, and that fixed it.
It looks like #10532 should be a nicer, easier way to fix it, thanks for that!
I agree, backporting important fixes is very helpful so people don't need to rediscover known issues.
