Calico does not start after scale #7495

oguera · 2021-04-12T20:48:51Z

Environment: hardware configuration:

The cluster is running on multiple VMs; the scaled node has the following spec:

Capacity - CPU: 4, Memory: 16040Mi, Pods: 110
Allocatable - CPU: 3900m, Memory: 15684Mi, Pods: 110
Addresses - InternalIP: *** (same as external ip?)

OS - linux (amd64)
OS Image - Ubuntu 18.04.5 LTS
Kernel version - 4.15.0-141-generic
Container runtime - docker://19.3.14
Kubelet version - v1.19.7
Labels
-beta.kubernetes.io/arch=amd64
-beta.kubernetes.io/os=linux
-kubernetes.io/arch=amd64
-kubernetes.io/hostname=node4
-kubernetes.io/os=linux
Annotations
-kubeadm.alpha.kubernetes.io/cri-socket=/var/run/dockershim.sock
-node.alpha.kubernetes.io/ttl=0
-volumes.kubernetes.io/controller-managed-attach-detach=true
Conditions - Ready

OS (printf "$(uname -srm)\n$(cat /etc/os-release)\n") of the host machine:

Linux 4.15.0-106-generic x86_64
NAME="Ubuntu"
VERSION="18.04.4 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.4 LTS"
VERSION_ID="18.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic

Version of Ansible (ansible --version):

ansible 2.9.6
  config file = /etc/ansible/ansible.cfg
  configured module search path = ['/home/$USER/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/local/lib/python3.6/dist-packages/ansible
  executable location = /usr/local/bin/ansible
  python version = 3.6.9 (default, Apr 18 2020, 01:56:04) [GCC 8.4.0]

Version of Python (python --version):

Python 2.7.17

Kubespray version (commit) (git rev-parse --short HEAD):
Tag v2.15.1

Network plugin used:
Calico

Full inventory with variables (ansible -i inventory/sample/inventory.ini all -m debug -a "var=hostvars[inventory_hostname]"):

Running this command exposes a lot of sensitive values. If needed, I can provide the relevant part.
Command used to invoke ansible:
ansible-playbook -i inventory/$INVENTORY/hosts.yaml --become -u $REMOTE_USER scale.yml

Output of ansible run:

Anything else do we need to know:

** Description **
I added a new node (node4) to the cluster by adding it to the existing hosts.yaml in my inventory and then ran the scale.yml playbook:

all:
  hosts:
    node1:
      ansible_host: $IP_1
      ip: $IP_1
      access_ip: $IP_1
    node2:
      ansible_host: $IP_2
      ip: $IP_2
      access_ip: $IP_2
    node3:
      ansible_host: $IP_3
      ip: $IP_3
      access_ip: $IP_3
    node4:
      ansible_host: $IP_4
      ip: $IP_4
      access_ip: $IP_4
  children:
    kube-master:
      hosts:
        node1:
        node2:
    kube-node:
      hosts:
        node1:
        node2:
        node3:
        node4:
    etcd:
      hosts:
        node1:
        node2:
        node3:
    k8s-cluster:
      children:
        kube-master:
        kube-node:
    calico-rr:
      hosts: { }

The scale playbook ran without any failed tasks. Afterwards, no pods from the kube-system namespace were initialized on the node. I applied upgrade-cluster.yml and restarted the node. After that, pods from the kube-system namespace are running.

I've got the following pods running:

kube-proxy ✔️
nodelocaldns ✔️
nginx-proxy ✔️
calico-node ⚠️
coredns ⚠️

coredns is not running because calico-node is not running.

The only event I get from calico-node is:

Nameserver limits were exceeded, some nameservers have been omitted, the applied nameserver line is: $IP_FROM_MY_THIRD_NODE+IP_V6

The log from the calico-node pod:

[INFO][9] startup/startup.go 376: Early log level set to info
[INFO][9] startup/startup.go 392: Using NODENAME environment for node name
[INFO][9] startup/startup.go 404: Determined node name: node4
Calico node failed to start

And from the install-cni pod

time="2021-04-12T20:18:25Z" level=info msg="Running as a Kubernetes pod" source="install.go:140"
time="2021-04-12T20:18:25Z" level=info msg="Installed /host/opt/cni/bin/bandwidth"
time="2021-04-12T20:18:26Z" level=info msg="Installed /host/opt/cni/bin/calico"
time="2021-04-12T20:18:26Z" level=info msg="Installed /host/opt/cni/bin/calico-ipam"
time="2021-04-12T20:18:26Z" level=info msg="Installed /host/opt/cni/bin/flannel"
time="2021-04-12T20:18:26Z" level=info msg="Installed /host/opt/cni/bin/host-local"
time="2021-04-12T20:18:26Z" level=info msg="Installed /host/opt/cni/bin/install"
time="2021-04-12T20:18:26Z" level=info msg="Installed /host/opt/cni/bin/loopback"
time="2021-04-12T20:18:26Z" level=info msg="Installed /host/opt/cni/bin/portmap"
time="2021-04-12T20:18:26Z" level=info msg="Installed /host/opt/cni/bin/tuning"
time="2021-04-12T20:18:26Z" level=info msg="Wrote Calico CNI binaries to /host/opt/cni/bin\n"
time="2021-04-12T20:18:26Z" level=info msg="CNI plugin version: v3.16.9\n"
time="2021-04-12T20:18:26Z" level=info msg="/host/secondary-bin-dir is not writeable, skipping"
time="2021-04-12T20:18:26Z" level=info msg="Using CNI config template from CNI_NETWORK_CONFIG_FILE" source="install.go:323"
time="2021-04-12T20:18:26Z" level=info msg="Created /host/etc/cni/net.d/10-calico.conflist"
time="2021-04-12T20:18:26Z" level=info msg="Done configuring CNI. Sleep= false"
{
"name": "cni0",
"cniVersion":"0.3.1",

Any advice on how to get the calico-node running in order to use my new node?

Thanks in advance!

oguera · 2021-04-13T12:11:56Z

After logging not only the INFO log but also the ERROR log:

ERROR: Error accessing the Calico datastore: could not initialize etcdv3 client: open /calico-secrets/cert.crt: no such file or directory

But I have no clue how to solve that, tbh. Any hint is appreciated :)

floryut · 2021-04-13T12:23:29Z

could not initialize etcdv3 client: ope

Looks like this projectcalico/calico#4313

champtar · 2021-04-13T12:37:41Z

I bet it's our autodetection of calico backend that is broken, thus etcd certs are not copied on the new node
Maybe you need to force calico_datastore: etcd (if you are using calico etcd of course)

oguera · 2021-04-13T12:38:00Z

Hi @floryut
After searching the kubespray issues for the log line, I found this issue: #7156

It turns out that I initialized the cluster with calico_datastore: "etcd" (the default at the time of creation) and that the bug from the linked issue still exists. The scale.yml playbook does not seem to respect the existing datastore setting and sets it to kdd (the current default) for the scaled node. After setting the value explicitly in inventory/$CLUSTER/group_vars/k8s-cluster/k8s-net-calico.yml and running the scale.yml playbook again, the issue was resolved and the node is running now.
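
For anyone hitting the same problem, here is a minimal sketch of the explicit setting, assuming the cluster was originally deployed with the etcd datastore. The file path is the one from my inventory layout; the comments are only my reading of the behaviour described above.

```yaml
# inventory/$CLUSTER/group_vars/k8s-cluster/k8s-net-calico.yml
# Pin the datastore explicitly so scale.yml does not fall back to the
# current default ("kdd") on the newly added node.
# Only use "etcd" if the existing cluster was created with the etcd datastore.
calico_datastore: "etcd"
```

With that in place, I reran the same scale command as before (ansible-playbook -i inventory/$INVENTORY/hosts.yaml --become -u $REMOTE_USER scale.yml).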

floryut · 2021-04-13T12:43:37Z

Then maybe this is fixed now that #7449 is merged?

oguera · 2021-04-13T12:49:26Z

Yep. Thanks for the nice work :)

floryut · 2021-04-13T13:36:48Z

Good to hear this works. Closing; please reopen if there is anything to add 👍

liupeng0518 · 2021-04-14T02:28:59Z

@floryut
Should we modify these commented out vars to kdd? Otherwise it will easily cause misunderstandings.
k8s-net-calico.yml:

# Choose data store type for calico: "etcd" or "kdd" (kubernetes datastore)
# calico_datastore: "etcd"

oguera · 2021-04-14T07:34:19Z

As a user I would +1 that. To me it reads like "change the calico_datastore, but the default will be etcd if you don't change it or leave it commented out".

floryut · 2021-04-14T07:40:33Z

Totally agree, that's way too weird to be left like that. We should uncomment it and set it to kdd.

oguera added the kind/bug Categorizes issue or PR as related to a bug. label Apr 12, 2021

oguera mentioned this issue Apr 13, 2021

Calico Node stuck in CrashLoopBackOff projectcalico/calico#4528

Closed

floryut closed this as completed Apr 13, 2021

liupeng0518 mentioned this issue Apr 17, 2021

Problems to add a new node using calico #7519

Closed

liupeng0518 mentioned this issue Apr 27, 2021

Modify the calico datastore commented config info #7558

Merged
