
Migrate node-role.kubernetes.io/master to node-role.kubernetes.io/control-plane #10464

Merged

Conversation

unai-ttxu
Contributor

@unai-ttxu unai-ttxu commented Sep 21, 2023


What type of PR is this?

Uncomment only one /kind <> line, hit enter to put that in a new line, and remove leading whitespaces from that line:

/kind api-change
/kind bug
/kind cleanup
/kind design
/kind documentation
/kind failing-test
/kind feature
/kind flake

What this PR does / why we need it:
Since kubeadm 1.25, legacy control-plane taints are deleted:

kubernetes/kubernetes@ddd046f

This PR aligns the control-plane taint definitions and ensures that taints are not lost during upgrades.
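
For illustration, after this change the taint block in the kubeadm config templates renders to roughly the following (a sketch; the real values are templated by Kubespray):

apiVersion: kubeadm.k8s.io/v1beta3
kind: InitConfiguration
nodeRegistration:
  taints:
    # only the current control-plane taint remains; the legacy master taint is gone
    - effect: NoSchedule
      key: node-role.kubernetes.io/control-plane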

Which issue(s) this PR fixes:
Fixes #10217

Special notes for your reviewer:

If you have any doubts, feel free to ask 😄

Does this PR introduce a user-facing change?:

Migrate node-role.kubernetes.io/master to node-role.kubernetes.io/control-plane

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Sep 21, 2023
@k8s-ci-robot
Contributor

Hi @unai-ttxu. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Sep 21, 2023
@yankay
Member

yankay commented Sep 22, 2023

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Sep 22, 2023
@floryut floryut added the kind/deprecation Categorizes issue or PR as related to a feature/enhancement marked for deprecation. label Sep 22, 2023
Member

@floryut floryut left a comment


@unai-ttxu Thank you 👍

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 22, 2023
@mzaian
Contributor

mzaian commented Sep 25, 2023

That could be a breaking change for apps/deployments that rely on the old taint. We need to make sure to add this to the release notes.

@unai-ttxu
Contributor Author

unai-ttxu commented Sep 25, 2023

That could be a breaking change for apps/deployments that rely on the old taint. We need to make sure to add this to the release notes.

Since we are removing the old taint and keeping the new one, it shouldn't break any apps/deployments.

Now, when both taints are set, apps/deployments should have both tolerations configured:

tolerations:
- effect: NoSchedule
  key: node-role.kubernetes.io/master
- effect: NoSchedule
  key: node-role.kubernetes.io/control-plane

So, this change will not break these apps/deployments. Of course, we will eventually need to clean up this code and remove the toleration for the node-role.kubernetes.io/master taint, but that is not required to keep apps/deployments running.
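
Once the legacy taint is gone everywhere, that toleration block can be reduced to just the following (a sketch of the eventual cleanup):

tolerations:
# the legacy master toleration is dropped; only the current taint is tolerated
- effect: NoSchedule
  key: node-role.kubernetes.io/control-plane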

@MrFreezeex
Member

MrFreezeex commented Oct 11, 2023

Hi @unai-ttxu, thanks for the cleanup, much appreciated! Could you also update this in the rest of kubespray as well? Like those places:

❯ grep -rin node-role | grep -i master
kubernetes-apps/csi_driver/vsphere/templates/vsphere-csi-controller-deployment.yml.j2:38:          key: node-role.kubernetes.io/master
kubernetes-apps/csi_driver/azuredisk/templates/azure-csi-azuredisk-controller.yml.j2:23:        - key: "node-role.kubernetes.io/master"
kubernetes-apps/ansible/templates/coredns-deployment.yml.j2:35:        - key: node-role.kubernetes.io/master
kubernetes-apps/ansible/templates/dns-autoscaler.yml.j2:46:          key: node-role.kubernetes.io/master
kubernetes-apps/ansible/templates/dashboard.yml.j2:228:        - key: node-role.kubernetes.io/master
kubernetes-apps/ansible/templates/dashboard.yml.j2:335:        - key: node-role.kubernetes.io/master
kubernetes-apps/metallb/defaults/main.yml:13:    key: node-role.kubernetes.io/master
kubernetes-apps/metrics_server/templates/metrics-server-deployment.yaml.j2:92:        - key: node-role.kubernetes.io/master
kubernetes-apps/external_cloud_controller/openstack/templates/external-openstack-cloud-controller-manager-ds.yml.j2:34:      - key: node-role.kubernetes.io/master
kubernetes-apps/external_cloud_controller/huaweicloud/templates/external-huawei-cloud-controller-manager-ds.yml.j2:38:      - key: node-role.kubernetes.io/master
kubernetes-apps/external_cloud_controller/hcloud/templates/external-hcloud-cloud-controller-manager-ds.yml.j2:30:        - key: "node-role.kubernetes.io/master"
kubernetes-apps/external_cloud_controller/hcloud/templates/external-hcloud-cloud-controller-manager-ds-with-networks.yml.j2:28:        - key: "node-role.kubernetes.io/master"
kubernetes-apps/external_cloud_controller/vsphere/templates/external-vsphere-cloud-controller-manager-ds.yml.j2:34:      - key: node-role.kubernetes.io/master
kubernetes-apps/cloud_controller/oci/templates/oci-cloud-provider.yml.j2:44:      - key: node-role.kubernetes.io/master
kubernetes-apps/policy_controller/calico/templates/calico-kube-controllers.yml.j2:31:        - key: node-role.kubernetes.io/master
kubernetes/control-plane/tasks/kubeadm-setup.yml:246:    - "node-role.kubernetes.io/master:NoSchedule-"
kubernetes/control-plane/templates/kubeadm-controlplane.v1beta3.yaml.j2:25:    key: node-role.kubernetes.io/master
kubernetes/control-plane/templates/kubeadm-config.v1beta3.yaml.j2:22:    key: node-role.kubernetes.io/master
network_plugin/calico/templates/calico-typha.yml.j2:64:        - key: node-role.kubernetes.io/master
network_plugin/calico/templates/calico-apiserver.yml.j2:96:        key: node-role.kubernetes.io/master

@unai-ttxu
Contributor Author

unai-ttxu commented Oct 13, 2023

Hi @unai-ttxu, thanks for the cleanup, much appreciated! Could you also update this in the rest of kubespray as well? Like those places:

❯ grep -rin node-role | grep -i master
kubernetes-apps/csi_driver/vsphere/templates/vsphere-csi-controller-deployment.yml.j2:38:          key: node-role.kubernetes.io/master
kubernetes-apps/csi_driver/azuredisk/templates/azure-csi-azuredisk-controller.yml.j2:23:        - key: "node-role.kubernetes.io/master"
kubernetes-apps/ansible/templates/coredns-deployment.yml.j2:35:        - key: node-role.kubernetes.io/master
kubernetes-apps/ansible/templates/dns-autoscaler.yml.j2:46:          key: node-role.kubernetes.io/master
kubernetes-apps/ansible/templates/dashboard.yml.j2:228:        - key: node-role.kubernetes.io/master
kubernetes-apps/ansible/templates/dashboard.yml.j2:335:        - key: node-role.kubernetes.io/master
kubernetes-apps/metallb/defaults/main.yml:13:    key: node-role.kubernetes.io/master
kubernetes-apps/metrics_server/templates/metrics-server-deployment.yaml.j2:92:        - key: node-role.kubernetes.io/master
kubernetes-apps/external_cloud_controller/openstack/templates/external-openstack-cloud-controller-manager-ds.yml.j2:34:      - key: node-role.kubernetes.io/master
kubernetes-apps/external_cloud_controller/huaweicloud/templates/external-huawei-cloud-controller-manager-ds.yml.j2:38:      - key: node-role.kubernetes.io/master
kubernetes-apps/external_cloud_controller/hcloud/templates/external-hcloud-cloud-controller-manager-ds.yml.j2:30:        - key: "node-role.kubernetes.io/master"
kubernetes-apps/external_cloud_controller/hcloud/templates/external-hcloud-cloud-controller-manager-ds-with-networks.yml.j2:28:        - key: "node-role.kubernetes.io/master"
kubernetes-apps/external_cloud_controller/vsphere/templates/external-vsphere-cloud-controller-manager-ds.yml.j2:34:      - key: node-role.kubernetes.io/master
kubernetes-apps/cloud_controller/oci/templates/oci-cloud-provider.yml.j2:44:      - key: node-role.kubernetes.io/master
kubernetes-apps/policy_controller/calico/templates/calico-kube-controllers.yml.j2:31:        - key: node-role.kubernetes.io/master
kubernetes/control-plane/tasks/kubeadm-setup.yml:246:    - "node-role.kubernetes.io/master:NoSchedule-"
kubernetes/control-plane/templates/kubeadm-controlplane.v1beta3.yaml.j2:25:    key: node-role.kubernetes.io/master
kubernetes/control-plane/templates/kubeadm-config.v1beta3.yaml.j2:22:    key: node-role.kubernetes.io/master
network_plugin/calico/templates/calico-typha.yml.j2:64:        - key: node-role.kubernetes.io/master
network_plugin/calico/templates/calico-apiserver.yml.j2:96:        key: node-role.kubernetes.io/master

Sure I could, but I don't know if we should. If we decide to remove these tolerations, those components won't be able to deploy if the node-role.kubernetes.io/master taint is configured.

Since the minimum required version of Kubernetes is v1.26, I think this won't cause any problems, but, as @mzaian said, if we make these changes we should add this to the release notes.

What do you think?

Member

@MrFreezeex MrFreezeex left a comment


Sure I could, but I don't know if we should. If we decide to remove these tolerations, those components won't be able to deploy if the node-role.kubernetes.io/master taint is configured.

I thought this wouldn't be a problem with the current code you added to set the taint, but actually I have a few comments on this.

Since the minimum required version of Kubernetes is v1.26, I think this won't cause any problems, but, as @mzaian said, if we make these changes we should add this to the release notes.

In any case it will be in the release notes, but we should probably make it very visible, yeah.

@@ -11,6 +11,15 @@
--timeout={{ upgrade_post_cilium_wait_timeout }}
delegate_to: "{{ groups['kube_control_plane'][0] }}"

- name: Ensure control-plane taints after upgrade
Member


This code has two problems: you taint every node, even kube_node nodes, which is very wrong, and even for control-plane nodes the user might want to leave them untainted. I think the correct behavior should be:

  • Adding the new taint to every node in kube_control_plane that already has the master taint
  • Removing the master taint from kube_control_plane nodes

And doing it somewhat early would be a plus as well, so that we can upgrade the tolerations of the apps we install with the new manifests directly; see the sketch below. WDYT?
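
Something roughly like this, as a sketch of the behavior I mean (task names and the jsonpath check are illustrative, not actual PR code):

- name: Check which control-plane nodes still carry the legacy master taint
  command: >-
    {{ bin_dir }}/kubectl get node {{ item }}
    -o jsonpath={.spec.taints[*].key}
  loop: "{{ groups['kube_control_plane'] }}"
  register: node_taint_keys
  changed_when: false
  delegate_to: "{{ groups['kube_control_plane'][0] }}"

- name: Add the control-plane taint where the legacy master taint is present
  command: >-
    {{ bin_dir }}/kubectl taint node {{ item.item }}
    node-role.kubernetes.io/control-plane=:NoSchedule --overwrite
  loop: "{{ node_taint_keys.results }}"
  when: "'node-role.kubernetes.io/master' in item.stdout"
  delegate_to: "{{ groups['kube_control_plane'][0] }}"

- name: Remove the legacy master taint from control-plane nodes
  command: >-
    {{ bin_dir }}/kubectl taint node {{ item.item }}
    node-role.kubernetes.io/master:NoSchedule-
  loop: "{{ node_taint_keys.results }}"
  when: "'node-role.kubernetes.io/master' in item.stdout"
  delegate_to: "{{ groups['kube_control_plane'][0] }}"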

Contributor Author

@unai-ttxu unai-ttxu Oct 13, 2023


This task is skipped for every kube_node; notice the not in the second conditional. So this task won't taint every kube_node; it only taints kube_control_plane nodes, except when a node is both kube_control_plane and kube_node. In that case the taint should not be set, in order to allow the node to run all types of workloads, since it is both a worker and a control plane.
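
Concretely, the guard boils down to something like this (a sketch, not the literal task from the PR):

when:
  # only plain control-plane nodes get the taint...
  - inventory_hostname in groups['kube_control_plane']
  # ...and combined control-plane/worker nodes are skipped
  - inventory_hostname not in groups['kube_node']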

If a node is only in kube_control_plane, the user should not have the choice to leave it untainted, since that is the default configuration for k8s control planes. If the user wants a control plane that can run any type of workload, without any toleration configuration, they should add it to both the kube_control_plane and kube_node groups. This is already documented in https://github.com/kubernetes-sigs/kubespray/blob/master/docs/ansible.md#inventory.

If you want the server to act both as control-plane and node, the server must be defined on both groups kube_control_plane and kube_node. If you want a standalone and unschedulable control plane, the server must be defined only in the kube_control_plane and not kube_node.

Finally, I don't think there is any need to do it early in order to allow toleration adaptations. As I said in #10464 (comment), this taint migration won't break any deployment that defines both tolerations.

After the upgrade, users will be able to update their deployments to remove the legacy toleration, which is no longer needed; this does not have to happen during the upgrade itself.

Member


This task is skipped for every kube_node; notice the not in the second conditional.

Ah yes, indeed, my bad!

If a node is only in kube_control_plane, the user should not have the choice to leave it untainted, since that is the default configuration for k8s control planes. If the user wants a control plane that can run any type of workload, without any toleration configuration, they should add it to both the kube_control_plane and kube_node groups. This is already documented in https://github.com/kubernetes-sigs/kubespray/blob/master/docs/ansible.md#inventory.

Yes, but I am a bit concerned by this because some users might have untainted a node outside of Kubespray's scope, and I am not sure if we should force the taints on users, basically.

Finally, I don't think there is any need to do it early in order to allow toleration adaptations. As I said in #10464 (comment), this taint migration won't break any deployment that defines both tolerations.

After the upgrade, users will be able to update their deployments to remove the legacy toleration, which is no longer needed; this does not have to happen during the upgrade itself.

I know, but doing this early would allow us to safely remove our legacy taint within one Kubespray version. If we don't do it now, we will have to wait one or maybe two Kubespray releases, and this might get lost in the process. So I think that adding this taint early and removing the legacy taint everywhere else in Kubespray would very much help.

Contributor Author


Yes, but I am a bit concerned by this because some users might have untainted a node outside of Kubespray's scope, and I am not sure if we should force the taints on users, basically.

The thing is that kubeadm removes the legacy node-role.kubernetes.io/master taint during the upgrade to v1.25. So for a cluster where control-plane nodes have only this taint and not both (legacy and current), we would need to ensure that node-role.kubernetes.io/control-plane is set before the upgrade. Do you think that adding the node-role.kubernetes.io/control-plane taint before the upgrade, when the node-role.kubernetes.io/master taint is set, is a better approach?

I think it could be nice in order to cover some corner cases (like users who have untainted a node outside of Kubespray's scope), although IMHO those corner cases should not need to be covered, as such users are not following the Kubespray guidelines for configuring a shared control-plane/worker node.

I just decided to set the taint right before uncordoning the node in order to ensure idempotency and code readability. But if you think covering these corner cases is important enough, I can check how to implement it as cleanly as it is now.

Member

@MrFreezeex MrFreezeex Oct 13, 2023


The thing is that kubeadm removes the legacy node-role.kubernetes.io/master taint during the upgrade to v1.25. So for a cluster where control-plane nodes have only this taint and not both (legacy and current), we would need to ensure that node-role.kubernetes.io/control-plane is set before the upgrade. Do you think that adding the node-role.kubernetes.io/control-plane taint before the upgrade, when the node-role.kubernetes.io/master taint is set, is a better approach?

Kubernetes 1.25 is no longer supported on master, so if the master taint was already removed in that version we shouldn't need to worry about it, and we can get rid of it everywhere. Considering that we don't support 1.25, though, I am not sure where we should add the control-plane taint 🤔. I see two possible paths moving forward:

  • Backport the code enforcing the control-plane taint to Kubespray 2.23.0, and add a note to Kubespray 2.24.0 mentioning that users who did a 1.25 upgrade and didn't upgrade their cluster once with the new 2.23 version (2.23.1, probably?) should check their taints, and that the control-plane taint should be re-added manually
  • Systematically enforce the control-plane taint like you are doing in this PR, but that's technically a change of behavior versus what we used to do: before, we didn't enforce it on upgrade, so people could remove the taint outside of Kubespray and it would stay removed even when reapplying Kubespray, AFAIU

Contributor Author

@unai-ttxu unai-ttxu Oct 13, 2023


Since the aim of this PR is to fix #10217, I think we can do the following:

  • master: keep just the changes to the kubeadm configuration files that remove the legacy taint in fresh installations. We can also include the changes suggested in #10464 (comment), since every cluster installed using the future 2.24.X won't have the legacy taint, and every cluster upgraded using the future 2.24.X will already be on k8s 1.25.

  • release-2.23: include the option of enforcing the control-plane taint, to avoid losing all control-plane taints when upgrading to k8s 1.25. Here we can tune this PR to avoid the behavior change of force-tainting nodes during the upgrade. Also, as you said, I think it is a good idea to tell users that have already upgraded their clusters to k8s v1.25 to check their control-plane taints, since they may already be missing, and the best approach is to delegate to them the operation of re-adding the taint manually.

What do you think about this?

Thanks for the feedback 😄
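
For the release-2.23 side, I imagine the opt-in looking something like this in the role defaults (the variable name is hypothetical, just to show the shape, not what the backport will actually use):

# Opt-in guard: re-apply the control-plane NoSchedule taint around the
# v1.25 upgrade so it is not lost when kubeadm drops the legacy master taint
control_plane_taint_enforce: true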

Member

@MrFreezeex MrFreezeex Oct 13, 2023


That would be perfect, thank you 🙏.

And thanks for bearing with me; I admit that I kind of multitasked with other topics and didn't get every detail in my previous answers. We do have a nice solution now though 🤝! :D

Contributor Author


That would be perfect, thank you 🙏.

And thanks for bearing with me; I admit that I kind of multitasked with other topics and didn't get every detail in my previous answers. We do have a nice solution now though 🤝! :D

I'll try to make these changes next week. I'll ping you as soon as they are ready.

No problem, I appreciate the feedback. I also multitask more than I want to, I know the feeling 😄

@unai-ttxu unai-ttxu force-pushed the fix/taints-control-plane branch from 647fa3e to a1c09a0 Compare October 17, 2023 17:01
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Oct 17, 2023
@unai-ttxu
Contributor Author

@MrFreezeex I have updated this PR and created #10532

If you have any doubts, let me know 😄

Member

@MrFreezeex MrFreezeex left a comment


And thanks again 👍, this should get merged when the CI succeeds.
/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 17, 2023
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: floryut, MrFreezeex, unai-ttxu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit 228efcb into kubernetes-sigs:master Oct 17, 2023
59 checks passed
guytet pushed a commit to guytet/kubespray that referenced this pull request Oct 30, 2023
Migrate node-role.kubernetes.io/master to node-role.kubernetes.io/control-plane (kubernetes-sigs#10464)

* Migrate node-role.kubernetes.io/master to node-role.kubernetes.io/control-plane

* Migrate node-role.kubernetes.io/master to node-role.kubernetes.io/control-plane

* Migrate node-role.kubernetes.io/master to node-role.kubernetes.io/control-plane
@yankay yankay mentioned this pull request Dec 15, 2023
pedro-peter pushed a commit to pedro-peter/kubespray that referenced this pull request May 8, 2024
Migrate node-role.kubernetes.io/master to node-role.kubernetes.io/control-plane (kubernetes-sigs#10464)

* Migrate node-role.kubernetes.io/master to node-role.kubernetes.io/control-plane

* Migrate node-role.kubernetes.io/master to node-role.kubernetes.io/control-plane

* Migrate node-role.kubernetes.io/master to node-role.kubernetes.io/control-plane