
Calico + etcd mayhem when upgrading to v2.23.0 #10436

Closed
olevitt opened this issue Sep 14, 2023 · 2 comments · Fixed by #10438

Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

olevitt (Contributor) commented Sep 14, 2023

Hello,

When upgrading to v2.23.0 with Calico in etcd mode, every calico-node pod has its configuration set to the same node name (the first control plane node), resulting in IP allocation mayhem: every new pod gets an IP from the first control plane's IP block, breaking the network for any new pod (existing pods are fine).
This seems to be due to #10177, which makes the install-cni init container of calico-node pull its configuration from a single ConfigMap that has the first control plane's name set in stone (4f85b75#diff-91635da451087a93ab261ec90f794c825a5d584d12562fc94d183c50f63d81c3R43), instead of having it parametrized by node name, which is the case in kdd mode (4f85b75#diff-91635da451087a93ab261ec90f794c825a5d584d12562fc94d183c50f63d81c3R38) and was the case in etcd mode before that PR, when the config was pulled from a config file on each host.

One workaround is to edit the calico-config ConfigMap (namespace kube-system), replacing the hard-coded nodename with "nodename": "__KUBERNETES_NODE_NAME__", and then adding

        - name: KUBERNETES_NODE_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName

to the env of the install-cni init container of the calico-node DaemonSet (just like it's done for kdd mode, see here:

{% if calico_datastore == "kdd" %}
# Set the hostname based on the k8s node name.
- name: KUBERNETES_NODE_NAME
  valueFrom:
    fieldRef:
      fieldPath: spec.nodeName
{% endif %}
).
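For reference, here is a minimal sketch of applying the ConfigMap half of the workaround with kubectl. The "node1" value is a hypothetical placeholder for whatever control plane name is baked into your ConfigMap, and the surrounding JSON keys depend on your Kubespray/Calico version:

    # Inspect the CNI config currently embedded in the ConfigMap:
    kubectl -n kube-system get configmap calico-config -o yaml

    # Open it for editing and replace the hard-coded node name, e.g.
    #   "nodename": "node1"  ->  "nodename": "__KUBERNETES_NODE_NAME__"
    # ("node1" stands in for your first control plane's name):
    kubectl -n kube-system edit configmap calico-config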
The calico-node pods will then restart, the install-cni init container will write the correct nodename to the config file on each node, and voilà. New pods will be fine, but any pod created while the bug was active will have to be deleted so that it gets a new (correct) IP.
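A hedged sketch of verification and cleanup follows; the config path is Calico's default CNI config location (it may differ on your hosts), and the pod/namespace names are placeholders:

    # On each node, confirm the CNI config now carries that node's own name
    # (Calico's default CNI config path; adjust if yours differs):
    grep nodename /etc/cni/net.d/10-calico.conflist

    # Pods that received an IP from the wrong block while the bug was active
    # must be recreated; deleting them gets them rescheduled with a correct IP:
    kubectl delete pod <pod-name> -n <namespace>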

We will submit a PR to fix this, but if you encounter this issue in the meantime, please try this workaround; it worked for us 🥳

olevitt added the kind/bug label Sep 14, 2023
mzaian (Contributor) commented Sep 14, 2023

/assign @olevitt

jonathansloman commented

Hi - will this fix also be added to a 2.23.2 release? The issue is preventing us from upgrading our cluster, and the Kubespray documentation specifies not to skip releases when upgrading (i.e., we shouldn't go from 2.22 directly to 2.24), so we need a working 2.23 to give us an upgrade path.

Thank you.
