
Upgrade from 1.20 to 1.21 is failing #1568

Closed
CecileRobertMichon opened this issue Jul 29, 2021 · 5 comments
Labels
kind/bug: Categorizes issue or PR as related to a bug.
priority/critical-urgent: Highest priority. Must be actively worked on as someone's top priority right now.

Comments

@CecileRobertMichon (Contributor) commented Jul 29, 2021

/kind bug

What steps did you take and what happened:

I was able to repro locally. Newly upgraded nodes are not becoming Ready:

NAME                                   STATUS     ROLES                  AGE     VERSION
default-template-control-plane-8dhn9   Ready      control-plane,master   17m     v1.20.9
default-template-md-0-hbl5n            Ready      <none>                 15m     v1.20.9
default-template-md-0-qlrqc            NotReady   <none>                 6m49s   v1.21.2
default-template-md-0-vlvnd            Ready      <none>                 14m     v1.20.9

#1557 (comment)

The CAPI e2e job (https://testgrid.k8s.io/sig-cluster-lifecycle-cluster-api-provider-azure#capz-periodic-capi-e2e-main) has been broken for the past week because it tries to upgrade directly from 1.19 to 1.21, which skips a minor version and is not allowed.

What did you expect to happen:

Anything else you would like to add:

Environment:

  • cluster-api-provider-azure version:
  • Kubernetes version: (use kubectl version):
  • OS (e.g. from /etc/os-release):
@k8s-ci-robot added the kind/bug label on Jul 29, 2021
@CecileRobertMichon added the priority/critical-urgent label on Jul 29, 2021
@CecileRobertMichon changed the title from "Upgrade from 1.19 to 1.20 and 1.20 to 1.21 is failing" to "Upgrade from 1.20 to 1.21 is failing" on Jul 29, 2021
@CecileRobertMichon (Contributor, Author)

Initial investigation details:

Nodes are staying NotReady because the CNI pods are failing to initialize. Kubelet events show:

  Warning  FailedCreatePodSandBox  99s (x39 over 9m50s)  kubelet            (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create containerd task: failed to create shim: OCI runtime create failed: expected cgroupsPath to be of format "slice:prefix:name" for systemd cgroups, got "/kubepods/burstable/podc9e0e61f-7279-47f4-a185-1f201e49c62c/0647a728fffda52c0adf7e9512745f8f8afea16aa15351b08caa0114ebe154d0" instead: unknown
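
For anyone reproducing, the same events can be pulled from the workload cluster without SSHing into the node; a minimal check, assuming kubectl access via the same kubeconfig (the pod name below is a placeholder):

# List sandbox-creation failures across all namespaces; the reason field
# matches the event shown above.
kubectl --kubeconfig kubeconfig get events -A --field-selector reason=FailedCreatePodSandBox

# Or describe one of the stuck CNI pods directly (placeholder name):
kubectl --kubeconfig kubeconfig -n kube-system describe pod calico-node-xxxxx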

This is due to kubeadm defaulting to systemd cgroups starting in 1.21. The 1.21 image used by the new node was built by image-builder and has containerd set up for systemd cgroups (kubernetes-sigs/image-builder#471), but because KCP is still on version 1.20, the generated kubelet config is the old one and doesn't include the systemd cgroup driver change from https://github.com/kubernetes-sigs/cluster-api/pull/4236/files.
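
Concretely, the mismatch is visible on the NotReady node; a sketch assuming the default containerd and kubelet config paths (illustrative, not copied from the failing image):

# containerd on the 1.21 image is configured for systemd cgroups
# (the setting added by kubernetes-sigs/image-builder#471):
grep -B1 SystemdCgroup /etc/containerd/config.toml
#   [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
#     SystemdCgroup = true

# ...while the kubelet config rendered from the 1.20 KCP has no cgroupDriver
# field, so kubelet defaults to cgroupfs and hands the runtime cgroupfs-style
# paths like "/kubepods/burstable/...", producing the error above:
grep cgroupDriver /var/lib/kubelet/config.yaml || echo "cgroupDriver not set (kubelet defaults to cgroupfs)"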

Confirmed this only happens if worker nodes are upgraded before the control plane, which is not recommended.
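
For reference, the supported order with Cluster API is to bump the control plane first and roll the workers only after it finishes; a sketch using the object names from the repro output above (names and version are illustrative):

# 1. Upgrade the control plane first:
kubectl patch kubeadmcontrolplane default-template-control-plane --type merge -p '{"spec":{"version":"v1.21.2"}}'

# 2. Wait until all control plane machines report the new version:
kubectl get machines -l cluster.x-k8s.io/control-plane

# 3. Only then bump the workers:
kubectl patch machinedeployment default-template-md-0 --type merge -p '{"spec":{"template":{"spec":{"version":"v1.21.2"}}}}'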

@CecileRobertMichon (Contributor, Author)

Also confirmed I cannot repro this for 1.19 -> 1.20, only for 1.20 -> 1.21:

k --kubeconfig kubeconfig get nodes
NAME                                  STATUS                        ROLES    AGE    VERSION
capi-quickstart-control-plane-fft28   Ready                         master   13m    v1.19.13
capi-quickstart-md-0-8tn79            Ready                         <none>   12m    v1.19.13
capi-quickstart-md-0-9f67z            Ready                         <none>   12m    v1.19.13
capi-quickstart-md-0-gxc7q            Ready                         <none>   7m4s   v1.20.9

@CecileRobertMichon (Contributor, Author)

kubernetes-sigs/cluster-api#4896 tracks changing the tests to not upgrade worker nodes before the control plane.

@CecileRobertMichon (Contributor, Author)

/close

this only affects 1.20 -> 1.21 upgrades where the worker nodes get upgraded before the control plane, which is against k8s upgrade recommendations (https://kubernetes.io/docs/tasks/administer-cluster/cluster-upgrade/)

@k8s-ci-robot (Contributor)

@CecileRobertMichon: Closing this issue.

In response to this:

/close

this only affects 1.20 -> 1.21 upgrades where the worker nodes get upgraded before the control plane, which is against k8s upgrade recommendations (https://kubernetes.io/docs/tasks/administer-cluster/cluster-upgrade/)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
