
Upgrade from 1.20 to 1.21 is failing #1568

Closed
CecileRobertMichon opened this issue Jul 29, 2021 · 5 comments
Labels
kind/bug: Categorizes issue or PR as related to a bug.
priority/critical-urgent: Highest priority. Must be actively worked on as someone's top priority right now.

Comments

@CecileRobertMichon (Contributor) commented Jul 29, 2021

/kind bug

What steps did you take and what happened:

I was able to repro locally. Newly upgraded nodes are not becoming Ready:

NAME                                   STATUS     ROLES                  AGE     VERSION
default-template-control-plane-8dhn9   Ready      control-plane,master   17m     v1.20.9
default-template-md-0-hbl5n            Ready      <none>                 15m     v1.20.9
default-template-md-0-qlrqc            NotReady   <none>                 6m49s   v1.21.2
default-template-md-0-vlvnd            Ready      <none>                 14m     v1.20.9

#1557 (comment)

The CAPI e2e job (https://testgrid.k8s.io/sig-cluster-lifecycle-cluster-api-provider-azure#capz-periodic-capi-e2e-main) has been broken for the past week because it tries to upgrade directly from 1.19 to 1.21, which skips a minor version and is not allowed.

What did you expect to happen:

Anything else you would like to add:

Environment:

  • cluster-api-provider-azure version:
  • Kubernetes version: (use kubectl version):
  • OS (e.g. from /etc/os-release):
@k8s-ci-robot added the kind/bug label on Jul 29, 2021
@CecileRobertMichon added the priority/critical-urgent label on Jul 29, 2021
@CecileRobertMichon changed the title from "Upgrade from 1.19 to 1.20 and 1.20 to 1.21 is failing" to "Upgrade from 1.20 to 1.21 is failing" on Jul 29, 2021
@CecileRobertMichon (Contributor, Author)

Initial investigation details:

Nodes are staying NotReady because the CNI pods are failing to initialize. Kubelet events show:

  Warning  FailedCreatePodSandBox  99s (x39 over 9m50s)  kubelet            (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create containerd task: failed to create shim: OCI runtime create failed: expected cgroupsPath to be of format "slice:prefix:name" for systemd cgroups, got "/kubepods/burstable/podc9e0e61f-7279-47f4-a185-1f201e49c62c/0647a728fffda52c0adf7e9512745f8f8afea16aa15351b08caa0114ebe154d0" instead: unknown
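
For anyone reproducing, the same events can be pulled from the workload cluster without SSHing into the node; a minimal check, assuming kubectl access via the same kubeconfig (the pod name below is a placeholder):

# List sandbox-creation failures across all namespaces; the reason field
# matches the event shown above.
kubectl --kubeconfig kubeconfig get events -A --field-selector reason=FailedCreatePodSandBox

# Or describe one of the stuck CNI pods directly (placeholder name):
kubectl --kubeconfig kubeconfig -n kube-system describe pod calico-node-xxxxx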

This is due to kubeadm defaulting to systemd cgroups starting in 1.21. The 1.21 image used by the new node was built by image-builder and has containerd set up for systemd cgroups (kubernetes-sigs/image-builder#471), but because KCP is still on version 1.20, the generated kubelet config is the old one and doesn't include the systemd cgroup driver change from https://github.com/kubernetes-sigs/cluster-api/pull/4236/files.
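
Concretely, the mismatch is visible on the NotReady node; a sketch assuming the default containerd and kubelet config paths (illustrative, not copied from the failing image):

# containerd on the 1.21 image is configured for systemd cgroups
# (the setting added by kubernetes-sigs/image-builder#471):
grep -B1 SystemdCgroup /etc/containerd/config.toml
#   [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
#     SystemdCgroup = true

# ...while the kubelet config rendered from the 1.20 KCP has no cgroupDriver
# field, so kubelet defaults to cgroupfs and hands the runtime cgroupfs-style
# paths like "/kubepods/burstable/...", producing the error above:
grep cgroupDriver /var/lib/kubelet/config.yaml || echo "cgroupDriver not set (kubelet defaults to cgroupfs)"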

Confirmed this only happens if worker nodes are upgraded before the control plane, which is not recommended.
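
For reference, the supported order with Cluster API is to bump the control plane first and roll the workers only after it finishes; a sketch using the object names from the repro output above (names and version are illustrative):

# 1. Upgrade the control plane first:
kubectl patch kubeadmcontrolplane default-template-control-plane --type merge -p '{"spec":{"version":"v1.21.2"}}'

# 2. Wait until all control plane machines report the new version:
kubectl get machines -l cluster.x-k8s.io/control-plane

# 3. Only then bump the workers:
kubectl patch machinedeployment default-template-md-0 --type merge -p '{"spec":{"template":{"spec":{"version":"v1.21.2"}}}}'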

@CecileRobertMichon (Contributor, Author)

Also confirmed I cannot repro this for 1.19 -> 1.20, only for 1.20 -> 1.21:

k --kubeconfig kubeconfig get nodes
NAME                                  STATUS                        ROLES    AGE    VERSION
capi-quickstart-control-plane-fft28   Ready                         master   13m    v1.19.13
capi-quickstart-md-0-8tn79            Ready                         <none>   12m    v1.19.13
capi-quickstart-md-0-9f67z            Ready                         <none>   12m    v1.19.13
capi-quickstart-md-0-gxc7q            Ready                         <none>   7m4s   v1.20.9

@CecileRobertMichon (Contributor, Author)

kubernetes-sigs/cluster-api#4896 tracks changing the tests to not upgrade worker nodes before the control plane.

@CecileRobertMichon (Contributor, Author)

/close

this only affects 1.20 -> 1.21 upgrades where the worker nodes get upgraded before the control plane, which is against k8s upgrade recommendations (https://kubernetes.io/docs/tasks/administer-cluster/cluster-upgrade/)

@k8s-ci-robot (Contributor)

@CecileRobertMichon: Closing this issue.

In response to this:

/close

this only affects 1.20 -> 1.21 upgrades where the worker nodes get upgraded before the control plane, which is against k8s upgrade recommendations (https://kubernetes.io/docs/tasks/administer-cluster/cluster-upgrade/)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
