When using the ZFS containerd snapshotter, kubeadm join fails because it cannot contact the API server on 127.0.0.1. Containerd snapshotter not specified for nerdctl.
#11734
Labels
kind/bug
What happened?
I added a node to a Kubernetes cluster, where /var/lib/containerd is on ZFS. To make that work with Linux kernels 5.x, one needs to use the ZFS snapshotter for containerd. Fortunately kubespray provides a containerd_snapshotter variable to customize which snapshotter to use.
Note: I solved this issue for myself in a somewhat hacky way. I don't think it should be upstreamed as-is, but it will help you get an idea of what it is: https://github.com/kubernetes-sigs/kubespray/compare/release-2.25...AlignmentResearch:kubespray:far/zfs-gpu-fixes?expand=1
Symptom 1: kubeadm join fails to connect to localhost
The first observable symptom was the deployment failing at the kubeadm join step with:
Complete trace of running kubeadm
The reason for this is that there's no nginx static pod running on localhost, redirecting API server calls to the actual API server. But why would there be, if kubelet hasn't been set up yet?
Running systemctl start kubelet made it fail with an error about /etc/kubernetes/bootstrap-kubelet.conf, which makes sense because that file is created by kubeadm.
Symptom 2: all the container images that were supposed to be downloaded had failed
I tried to set up an nginx proxy container manually with ctr, and got the following error:
The reason is that the images had been pulled without specifying the snapshotter, and ctr does not read /etc/containerd/config.toml to decide which snapshotter to use. So containerd had been using overlayfs, and that fails on ZFS with Linux 5.x.
Symptom 3: kubelet still stuck, missing ca.crt
Even after making sure /etc/containerd/config.toml has ZFS specified and the correct images were downloaded with the ZFS snapshotter, checking systemctl status kubelet after kubeadm fails to contact localhost shows this error:
Is there something missing in the bootstrapping here? What is wrong?
Summary of changes I made to fix this:
- Add --snapshotter={{ containerd_snapshotter }} to nerdctl_image_pull_command in roles/kubespray-defaults/defaults/main/download.yml
- Use containerd_snapshotter instead of the nonexistent nerdctl_snapshotter in roles/container-engine/nerdctl/templates/nerdctl.toml.j2
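For illustration, the first change could look roughly like the fragment below. This is a sketch under assumptions, not the exact upstream file: the base value of nerdctl_image_pull_command shown here is my recollection of the kubespray default and may differ between releases, and the inventory file path in the comment is hypothetical. The only part taken from the summary above is the added --snapshotter flag, which nerdctl accepts as a global option.

```yaml
# Inventory variable (hypothetical location: inventory/cluster/group_vars/all/containerd.yml).
# Selects the snapshotter kubespray configures for containerd.
containerd_snapshotter: zfs

# roles/kubespray-defaults/defaults/main/download.yml (sketch, not the file verbatim).
# Assumption: the existing default is roughly
#   "{{ bin_dir }}/nerdctl -n k8s.io pull --quiet"
# The --snapshotter flag is the change described in the summary, so that images
# are pulled with the same snapshotter that containerd is configured to use.
nerdctl_image_pull_command: "{{ bin_dir }}/nerdctl -n k8s.io --snapshotter={{ containerd_snapshotter }} pull --quiet"
```

With something like this in place, images pulled during the download phase should land in the ZFS snapshotter instead of falling back to overlayfs. The second change (the nerdctl.toml.j2 template) is a one-variable rename and is not shown here.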
What did you expect to happen?
I expected the node to join the cluster without problems.
How can we reproduce it (as minimally and precisely as possible)?
Deploy a cluster on a ZFS filesystem, setting containerd_snapshotter=zfs.
OS
Linux 5.15.0-126-generic x86_64
PRETTY_NAME="Ubuntu 22.04.5 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.5 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
Version of Ansible
ansible [core 2.16.13]
config file = None
configured module search path = ['/Users/adria/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
ansible python module location = /Users/adria/.pyenv/versions/3.10.13/envs/fl-3.10/lib/python3.10/site-packages/ansible
ansible collection location = /Users/adria/.ansible/collections:/usr/share/ansible/collections
executable location = /Users/adria/.pyenv/versions/fl-3.10/bin/ansible
python version = 3.10.13 (main, Mar 23 2024, 16:18:41) [Clang 15.0.0 (clang-1500.1.0.2.5)] (/Users/adria/.pyenv/versions/3.10.13/envs/fl-3.10/bin/python)
jinja version = 3.1.4
libyaml = True
(yes, I'm running this from a Mac host that is definitely not part of the cluster and connects to all the Linux machines)
Version of Python
Python 3.10.13
Version of Kubespray (commit)
586ba66
Network plugin used
calico
Full inventory with variables
Too much sensitive stuff -- I'll provide if really necessary
Command used to invoke ansible
ansible-playbook -v scale.yaml -i ../inventory/cluster/hosts.yaml --become --become-user=root \
  --extra-vars="@../inventory/cluster/group_vars/hardening.yaml" \
  --extra-vars="ansible_ssh_private_key_file=${SSH_PRIVKEY}"
Output of ansible run
I'll provide this if relevant -- it's not easy for me to recapture now.
Anything else we need to know
I apologize for not providing a reproduction and hope this is enough info. At the very least I hope you'll appreciate the surefire bug that is using the nerdctl_snapshotter variable, which does not exist.