Cilium: kubelet does not detect cilium CNI #10887

Closed
RubenMakandra opened this issue Feb 5, 2024 · 9 comments · Fixed by #10966
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@RubenMakandra

RubenMakandra commented Feb 5, 2024

What happened?

I used kubespray v2.24 with

kube_network_plugin: cilium  
cilium_version: "v1.14.0"
kube_version: v1.27.7

I did not set any other cilium related variable.
The playbook finished successfully, but all nodes (3 control plane, 2 worker) stayed NotReady.
Relevant output of kubectl describe node control-plane-1:

Conditions:
  Type                 Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----                 ------  -----------------                 ------------------                ------                       -------
  NetworkUnavailable   False   Mon, 05 Feb 2024 17:20:32 +0100   Mon, 05 Feb 2024 17:20:32 +0100   CiliumIsUp                   Cilium is running on this node
  MemoryPressure       False   Mon, 05 Feb 2024 17:25:33 +0100   Mon, 05 Feb 2024 17:18:27 +0100   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure         False   Mon, 05 Feb 2024 17:25:33 +0100   Mon, 05 Feb 2024 17:18:27 +0100   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure          False   Mon, 05 Feb 2024 17:25:33 +0100   Mon, 05 Feb 2024 17:18:27 +0100   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready                False   Mon, 05 Feb 2024 17:25:33 +0100   Mon, 05 Feb 2024 17:18:27 +0100   KubeletNotReady              container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized

The same message appeared when querying the kubelet logs directly on the node.
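
For anyone triaging the same symptom across several nodes, the Ready condition can be pulled out of the JSON that kubectl get node <name> -o json returns. The following is only a sketch: the helper name not_ready_reason is hypothetical, and the sample document is trimmed to the fields used here, with values copied from the output above.

```python
import json
from typing import Optional


def not_ready_reason(node: dict) -> Optional[str]:
    """Return "reason: message" of the Ready condition if the node is not Ready.

    Returns None when the node reports Ready=True.
    """
    for cond in node.get("status", {}).get("conditions", []):
        if cond.get("type") == "Ready":
            if cond.get("status") == "True":
                return None
            return f'{cond.get("reason", "")}: {cond.get("message", "")}'
    # A node object without a Ready condition is treated as not ready.
    return "no Ready condition reported"


# Trimmed sample; a real node object contains many more fields.
sample = json.loads("""
{
  "status": {
    "conditions": [
      {"type": "NetworkUnavailable", "status": "False", "reason": "CiliumIsUp",
       "message": "Cilium is running on this node"},
      {"type": "Ready", "status": "False", "reason": "KubeletNotReady",
       "message": "container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized"}
    ]
  }
}
""")

if __name__ == "__main__":
    print(not_ready_reason(sample))
```

Note that NetworkUnavailable=False (set by Cilium) and Ready=False (reported by the kubelet) can coexist, which is exactly the situation described here.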
The cilium pods were all running, output of cilium status:

cilium status
   /¯¯\
/¯¯\__/¯¯\    Cilium:             OK
\__/¯¯\__/    Operator:           OK
/¯¯\__/¯¯\    Envoy DaemonSet:    disabled (using embedded mode)
\__/¯¯\__/    Hubble Relay:       disabled
   \__/       ClusterMesh:        disabled

DaemonSet              cilium             Desired: 5, Ready: 5/5, Available: 5/5
Deployment             cilium-operator    Desired: 2, Ready: 2/2, Available: 2/2
Containers:            cilium             Running: 5
                      cilium-operator    Running: 2
Cluster Pods:          0/2 managed by Cilium
Helm chart version:
Image versions         cilium             quay.io/cilium/cilium:v1.14.0: 5
                      cilium-operator    quay.io/cilium/operator:v1.14.0: 2

cilium version had the following output:
cilium version

cilium-cli: v0.15.22 compiled with go1.21.6 on linux/amd64
cilium image (default): v1.15.0
cilium image (stable): v1.14.6
cilium image (running): unknown. Unable to obtain cilium version. Reason: release: not found

What did you expect to happen?

The nodes should become Ready, with Cilium detected by the kubelet as the CNI plugin.

How can we reproduce it (as minimally and precisely as possible)?

Running kubespray with

kube_network_plugin: cilium  
cilium_version: "v1.14.0"

as variables, with Ubuntu 22.04 targets

OS

Nodes:

Linux 5.15.0-91-generic x86_64
PRETTY_NAME="Ubuntu 22.04.3 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.3 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy

I ran Ansible in the quay.io/kubespray/kubespray:v2.24.0 container image.

Version of Ansible

ansible [core 2.15.8]
  config file = /kubespray/ansible.cfg
  configured module search path = ['/kubespray/library']
  ansible python module location = /usr/local/lib/python3.10/dist-packages/ansible
  ansible collection location = /root/.ansible/collections:/usr/share/ansible/collections
  executable location = /usr/local/bin/ansible
  python version = 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (/usr/bin/python3)
  jinja version = 3.1.2
  libyaml = True

Version of Python

Python 3.10.12

Version of Kubespray (commit)

The container image quay.io/kubespray/kubespray:v2.24.0 was used.

Network plugin used

cilium

Full inventory with variables

This contains some sensitive information; if it is strictly required, I can upload it later with some redactions.
It does contain "cilium_version": "v1.14.0" and "kube_network_plugin": "cilium".

Command used to invoke ansible

ansible-playbook -i inventory/inventory.yaml cluster.yaml

Output of ansible run

https://gist.github.com/RubenMakandra/933719a5caa6cb1daa92b115dd6e37ef

Anything else we need to know

No response

RubenMakandra added the kind/bug label Feb 5, 2024
@RubenMakandra
Author

RubenMakandra commented Feb 7, 2024

I got it working by using the kubespray v2.23.1 container and not setting the cilium version explicitly, only setting kube_network_plugin: cilium as an Ansible variable. I do not know what caused the error with kubespray v2.24 and cilium 1.14.

@manicole

manicole commented Feb 8, 2024

Same problem encountered here, please reopen.

These are actually two issues in one:

  • The first issue is that Cilium is not writing its configuration on the nodes, at least not in /etc/cni/net.d, so it is not recognized as the CNI plugin.

  • The second issue is that the kube_network_plugin variable may not be well scoped. As long as it stays in group_vars/k8s_cluster/k8s_cluster.yml, it is not taken into account when launching the cluster.yaml playbook. The consequence is that certs are not generated for nodes that are in neither the control plane nor the etcd group, making the playbook fail. The workaround is to move the variable to group_vars/all/all.yml.
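
A quick way to confirm the first issue on a node is to look for CNI configuration files in that directory. The snippet below is a minimal sketch, not part of kubespray: the helper name cni_configs is hypothetical, and the suffix list assumes the extensions that the reference CNI library loads (.conf, .conflist, .json).

```python
from pathlib import Path
from typing import List

# Assumption: file extensions picked up by the reference CNI library.
CNI_SUFFIXES = {".conf", ".conflist", ".json"}


def cni_configs(conf_dir: str = "/etc/cni/net.d") -> List[str]:
    """Return the sorted names of CNI config files found in conf_dir.

    An empty list means the container runtime has nothing to load,
    which matches the "cni plugin not initialized" symptom.
    """
    directory = Path(conf_dir)
    if not directory.is_dir():
        return []
    return sorted(
        entry.name
        for entry in directory.iterdir()
        if entry.is_file() and entry.suffix in CNI_SUFFIXES
    )


if __name__ == "__main__":
    found = cni_configs()
    print(found if found else "no CNI configuration found")
```

On a healthy Cilium node one would expect an entry such as 05-cilium.conflist; an empty result points at the agent not writing its config to the host.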

RubenMakandra reopened this Feb 8, 2024
@RubenMakandra
Author

RubenMakandra commented Feb 8, 2024

Reopening, since I only avoided the issue and did not actually fix it, and apparently others run into the same.

@manicole We do have kube_network_plugin set to cilium (along with other vars) in inventory/group_vars/k8s_cluster/k8s_cluster.yml, and the playbook finished without error with a working cluster (as long as I didn't set the cilium version).

While I no longer have the broken cluster, I am pretty sure that it also had an empty /etc/cni/net.d directory.

@manicole

manicole commented Feb 9, 2024

@RubenMakandra Thanks for reopening and for your insights. Testing is in progress on my side ;)

In addition, I found issue #10684, which mentions the problem and proposes a way to solve it. I commented on it to ask for a PR.

@RubenMakandra
Author

Thanks for looking into the other issue and pull request!

To sum it up (correct me if I'm wrong!), kubespray v2.24 only added support for upgrading cilium <1.14 to 1.14, but does not support provisioning clusters directly with cilium 1.14.
It would be nice if someone from the cilium team could update the 2.24 release notes to change the bullet point

[cilium] Adds support for deploying clusters with cilium 1.14+ (#10684, @rl0nergan)

to

[cilium] Adds support for updating clusters with cilium <1.14 to cilium 1.14+ (#10684, @rl0nergan)

since the current release note does not appear to be correct in implying that it is possible to deploy a [new] cluster with cilium 1.14.

Fixing the provisioning of new clusters with Cilium 1.14 would be highly appreciated too of course!

@cccsss01

I'm experiencing similar issues.

@cleman95
Contributor

/assign @cleman95

@ZeleniJure

ZeleniJure commented Feb 26, 2024

I'm not completely sure #10945 solves the issue, at least for cilium 1.15.X.
The issue seems to be that the configuration file no longer accepts the write-cni-conf-when-ready parameter, while the parameter is still available for the cilium-agent. So we used this in the kubespray configuration instead:

# Extra arguments for the Cilium agent
cilium_agent_custom_args:
  - --write-cni-conf-when-ready=/host/etc/cni/net.d/05-cilium.conflist

Should we perhaps add this hint to the sample group vars?

@cleman95
Contributor

cleman95 commented Feb 26, 2024

I might have made a mistake with the path in the ConfigMap. It should have been /host/etc/cni/net.d/, as you mentioned, instead of /etc/cni/net.d, since the DaemonSet also mounts it to /host/etc/cni/net.d/.
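
To make the path mix-up concrete, here is a small sketch of how a path inside the agent container maps back to the host. It assumes only the mount described above (host /etc/cni/net.d mounted at /host/etc/cni/net.d in the container) and that no other volume covers these paths; the function name host_path is hypothetical.

```python
from pathlib import PurePosixPath
from typing import Optional

# Assumption: the single mount discussed in this thread.
HOST_DIR = PurePosixPath("/etc/cni/net.d")
CONTAINER_DIR = PurePosixPath("/host/etc/cni/net.d")


def host_path(container_path: str) -> Optional[str]:
    """Map a path inside the container to its location on the host.

    Returns None when the path is outside the mounted directory, meaning
    a file written there stays inside the container and the kubelet's
    container runtime never sees it.
    """
    try:
        relative = PurePosixPath(container_path).relative_to(CONTAINER_DIR)
    except ValueError:
        return None
    return str(HOST_DIR / relative)


if __name__ == "__main__":
    # Written under the mount: visible on the host.
    print(host_path("/host/etc/cni/net.d/05-cilium.conflist"))
    # Written to the container's own /etc/cni/net.d: not visible on the host.
    print(host_path("/etc/cni/net.d/05-cilium.conflist"))
```

Under these assumptions, a write-cni-conf-when-ready target of /etc/cni/net.d/... would leave the host directory empty, which is consistent with the empty /etc/cni/net.d reported earlier in this thread.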
