Calico does not start after scale #7495

Closed
oguera opened this issue Apr 12, 2021 · 10 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@oguera

oguera commented Apr 12, 2021

Environment: hardware configuration:

The cluster is running on multiple VMs; the scaled node has the following spec:

Capacity - CPU: 4, Memory: 16040Mi, Pods: 110
Allocatable - CPU: 3900m, Memory: 15684Mi, Pods: 110
Addresses - InternalIP: *** (same as external ip?)

OS - linux (amd64)
OS Image - Ubuntu 18.04.5 LTS
Kernel version - 4.15.0-141-generic
Container runtime - docker://19.3.14
Kubelet version - v1.19.7
Labels
-beta.kubernetes.io/arch=amd64
-beta.kubernetes.io/os=linux
-kubernetes.io/arch=amd64
-kubernetes.io/hostname=node4
-kubernetes.io/os=linux
Annotations
-kubeadm.alpha.kubernetes.io/cri-socket=/var/run/dockershim.sock
-node.alpha.kubernetes.io/ttl=0
-volumes.kubernetes.io/controller-managed-attach-detach=true
Conditions - Ready
  • OS (printf "$(uname -srm)\n$(cat /etc/os-release)\n") of the host machine:
Linux 4.15.0-106-generic x86_64
NAME="Ubuntu"
VERSION="18.04.4 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.4 LTS"
VERSION_ID="18.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic
  • Version of Ansible (ansible --version):
ansible 2.9.6
  config file = /etc/ansible/ansible.cfg
  configured module search path = ['/home/$USER/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/local/lib/python3.6/dist-packages/ansible
  executable location = /usr/local/bin/ansible
  python version = 3.6.9 (default, Apr 18 2020, 01:56:04) [GCC 8.4.0]
  • Version of Python (python --version):
Python 2.7.17

Kubespray version (commit) (git rev-parse --short HEAD):
Tag v2.15.1

Network plugin used:
Calico

Full inventory with variables (ansible -i inventory/sample/inventory.ini all -m debug -a "var=hostvars[inventory_hostname]"):

Running this command exposes a lot of sensitive values. If needed, I can provide the relevant part.
Command used to invoke ansible:
ansible-playbook -i inventory/$INVENTORY/hosts.yaml --become -u $REMOTE_USER scale.yml

Output of ansible run:

Anything else do we need to know:

** Description **
I added a new node (node4) to the cluster by adding it to the existing hosts.yaml in my inventory and ran the scale.yml playbook:

all:
  hosts:
    node1:
      ansible_host: $IP_1
      ip: $IP_1
      access_ip: $IP_1
    node2:
      ansible_host: $IP_2
      ip: $IP_2
      access_ip: $IP_2
    node3:
      ansible_host: $IP_3
      ip: $IP_3
      access_ip: $IP_3
    node4:
      ansible_host: $IP_4
      ip: $IP_4
      access_ip: $IP_4
  children:
    kube-master:
      hosts:
        node1:
        node2:
    kube-node:
      hosts:
        node1:
        node2:
        node3:
        node4:
    etcd:
      hosts:
        node1:
        node2:
        node3:
    k8s-cluster:
      children:
        kube-master:
        kube-node:
    calico-rr:
      hosts: { }

The scale playbook ran without any failed tasks. After scaling, no pods from the kube-system namespace were initialized on the node. I then applied upgrade-cluster.yml and restarted the node. After that, pods from the kube-system namespace are running.

I've got the following pods running:

  • kube-proxy ✔️
  • nodelocaldns ✔️
  • nginx-proxy ✔️
  • calico-node ⚠️
  • coredns ⚠️

coredns is not running because calico-node is not running.

The only event I get from calico-node is:

Nameserver limits were exceeded, some nameservers have been omitted, the applied nameserver line is: $IP_FROM_MY_THIRD_NODE+IP_V6

The log from the calico-node pod:

[INFO][9] startup/startup.go 376: Early log level set to info
[INFO][9] startup/startup.go 392: Using NODENAME environment for node name
[INFO][9] startup/startup.go 404: Determined node name: node4
Calico node failed to start

And from the install-cni pod

time="2021-04-12T20:18:25Z" level=info msg="Running as a Kubernetes pod" source="install.go:140"
time="2021-04-12T20:18:25Z" level=info msg="Installed /host/opt/cni/bin/bandwidth"
time="2021-04-12T20:18:26Z" level=info msg="Installed /host/opt/cni/bin/calico"
time="2021-04-12T20:18:26Z" level=info msg="Installed /host/opt/cni/bin/calico-ipam"
time="2021-04-12T20:18:26Z" level=info msg="Installed /host/opt/cni/bin/flannel"
time="2021-04-12T20:18:26Z" level=info msg="Installed /host/opt/cni/bin/host-local"
time="2021-04-12T20:18:26Z" level=info msg="Installed /host/opt/cni/bin/install"
time="2021-04-12T20:18:26Z" level=info msg="Installed /host/opt/cni/bin/loopback"
time="2021-04-12T20:18:26Z" level=info msg="Installed /host/opt/cni/bin/portmap"
time="2021-04-12T20:18:26Z" level=info msg="Installed /host/opt/cni/bin/tuning"
time="2021-04-12T20:18:26Z" level=info msg="Wrote Calico CNI binaries to /host/opt/cni/bin\n"
time="2021-04-12T20:18:26Z" level=info msg="CNI plugin version: v3.16.9\n"
time="2021-04-12T20:18:26Z" level=info msg="/host/secondary-bin-dir is not writeable, skipping"
time="2021-04-12T20:18:26Z" level=info msg="Using CNI config template from CNI_NETWORK_CONFIG_FILE" source="install.go:323"
time="2021-04-12T20:18:26Z" level=info msg="Created /host/etc/cni/net.d/10-calico.conflist"
time="2021-04-12T20:18:26Z" level=info msg="Done configuring CNI. Sleep= false"
{
"name": "cni0",
"cniVersion":"0.3.1",

Any advice on how to get the calico-node running in order to use my new node?

Thanks in advance!

oguera added the kind/bug label Apr 12, 2021
@oguera
Author

oguera commented Apr 13, 2021

After logging not only the INFO log but also the ERROR log:

ERROR: Error accessing the Calico datastore: could not initialize etcdv3 client: open /calico-secrets/cert.crt: no such file or directory

But I have no clue how to solve that, to be honest. Any hint is appreciated :)
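
The error says calico-node cannot open the etcd client certificate at /calico-secrets/cert.crt inside the container. A minimal sketch for checking, on the new node, whether the host directory backing that mount actually contains the file. The function name `missing_cert_files` is hypothetical, and only `cert.crt` is confirmed by the log; the host directory and any further file names depend on your calico-node DaemonSet and have to be supplied by you.

```python
from pathlib import Path
from typing import Iterable, List


def missing_cert_files(cert_dir: str, names: Iterable[str] = ("cert.crt",)) -> List[str]:
    """Return the entries of `names` that are not regular files in `cert_dir`.

    `cert_dir` is the host directory mounted at /calico-secrets (an input you
    must look up in the DaemonSet spec; it is not taken from the log above).
    """
    base = Path(cert_dir)
    return [name for name in names if not (base / name).is_file()]
```

An empty list means every expected file is present. A non-empty list on the scaled node, compared with an empty one on a working node, would suggest that the certificates were never copied during scale.yml.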

@floryut
Member

floryut commented Apr 13, 2021

could not initialize etcdv3 client: ope

Looks like this projectcalico/calico#4313

@champtar
Contributor

I bet it's our autodetection of calico backend that is broken, thus etcd certs are not copied on the new node
Maybe you need to force calico_datastore: etcd (if you are using calico etcd of course)

@oguera
Author

oguera commented Apr 13, 2021

Hi @floryut
After searching the kubespray issues for the log line, I found this issue: #7156

It turns out that I initialized the cluster with calico_datastore: "etcd" (the default at the time of creation) and that the bug from the linked issue still exists. scale.yml does not seem to respect the existing datastore setting and sets it to kdd (the current default) for the scaled node. After I set the value explicitly in inventory/$CLUSTER/group_vars/k8s-cluster/k8s-net-calico.yml and ran the scale.yml playbook again, the issue was resolved and the node is running now.
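
The pitfall described here is that a commented-out `calico_datastore` line in the group vars file has no effect, so the role default applies. A minimal sketch for checking whether the variable is explicitly set before running scale.yml. The function name `explicit_calico_datastore` is hypothetical, and the line-based regex is a simplification that assumes a top-level, single-line key rather than full YAML parsing.

```python
import re
from typing import Optional

# Matches an active (uncommented) top-level "calico_datastore: value" line,
# with optional quotes around the value and an optional trailing comment.
_ACTIVE = re.compile(
    r'^calico_datastore:\s*["\']?([A-Za-z0-9_-]+)["\']?\s*(?:#.*)?$'
)


def explicit_calico_datastore(vars_text: str) -> Optional[str]:
    """Return the explicitly set calico_datastore value from the text of a
    group vars file, or None if the key is absent or only commented out."""
    value = None
    for line in vars_text.splitlines():
        match = _ACTIVE.match(line.rstrip())
        if match:
            value = match.group(1)  # keep the last active occurrence
    return value
```

Usage: read inventory/$CLUSTER/group_vars/k8s-cluster/k8s-net-calico.yml and pass its contents to the function. A result of `None` means the playbook falls back to its default (kdd, according to this thread), so a cluster created with etcd needs the value set explicitly.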

@floryut
Member

floryut commented Apr 13, 2021

> Hi @floryut
> After searching the kubespray issues for the log line, I found this issue: #7156
>
> It turns out that I initialized the cluster with calico_datastore: "etcd" (the default at the time of creation) and that the bug from the linked issue still exists. scale.yml does not seem to respect the existing datastore setting and sets it to kdd (the current default) for the scaled node. After I set the value explicitly in inventory/$CLUSTER/group_vars/k8s-cluster/k8s-net-calico.yml and ran the scale.yml playbook again, the issue was resolved and the node is running now.

Then maybe this is fixed now that #7449 is merged?

@oguera
Author

oguera commented Apr 13, 2021

Yep. Thanks for the nice work :)

@floryut
Member

floryut commented Apr 13, 2021

Good if this works, closing, please reopen if anything to add 👍

floryut closed this as completed Apr 13, 2021

@liupeng0518
Member

@floryut
Should we modify these commented out vars to kdd? Otherwise it will easily cause misunderstandings.
k8s-net-calico.yml:

# Choose data store type for calico: "etcd" or "kdd" (kubernetes datastore)
# calico_datastore: "etcd"

@oguera
Author

oguera commented Apr 14, 2021

As a user I would +1 that. To me it reads like "change the calico_datastore, but the default will be etcd if you don't change it or leave it commented out".

@floryut
Member

floryut commented Apr 14, 2021

Totally agree, that's way too weird to be left like that. We should uncomment it and set it to kdd.
