3.27.2 mount-bpffs init container fails to load libpcap on Rockylinux9.3/Arm64 #8542

Closed
RyrieNorth opened this issue Feb 21, 2024 · 13 comments
Labels
area/arm64 relates to arm64

Comments

@RyrieNorth

When I install the calico network plugin after initializing the kubernetes cluster the following occurs:

[root@k8s-master docker.io]# kubectl create -f calico.yaml
......

[root@k8s-master docker.io]# kubectl get pods -A
NAMESPACE     NAME                                       READY   STATUS       RESTARTS   AGE
kube-system   calico-kube-controllers-68cdf756d9-r75jj   0/1     Pending      0          3s
kube-system   calico-node-25ctj                          0/1     Init:Error   0          3s
kube-system   calico-node-7t8qv                          0/1     Init:2/3     0          3s
kube-system   calico-node-qrnsr                          0/1     Init:1/3     0          3s
kube-system   coredns-857d9ff4c9-j6jmj                   0/1     Pending      0          105s
kube-system   coredns-857d9ff4c9-nh8tf                   0/1     Pending      0          98s
......

You can see that the status switches to Init:Error within a few seconds.

Describing the pod showed the relevant event:

[root@k8s-master docker.io]# kubectl describe -n kube-system pods calico-node-25ctj
Events:
Type     Reason   Age                     From     Message
Warning  BackOff  4m59s (x24 over 9m58s)  kubelet  Back-off restarting failed container mount-bpffs in pod calico-node-25ctj_kube-system(ec997881-48b9-4bc0-9203-d25ef3171052)
......

When I looked at the system logs, I found one error that appeared repeatedly:

[root@k8s-master docker.io]# cat /var/log/messages | grep "qrnsr"
......
Feb 22 01:48:00 localhost kubelet[15596]: E0222 01:48:00.536276 15596 pod_workers.go:1298] "Error syncing pod, skipping" err="failed to "StartContainer" for "mount-bpffs" with CrashLoopBackOff: "back-off 5m0s restarting failed container=mount-bpffs pod=calico-node-qrnsr_kube-system(5b0c855d-482c-4cc0-98f7-da9ae03070c1)"" pod="kube-system/calico-node-qrnsr" podUID="5b0c855d-482c-4cc0-98f7-da9ae03070c1"
......

I used the mount -l command to confirm that bpffs is mounted on my system:
[root@k8s-master docker.io]# mount -l | grep bpf
bpf on /sys/fs/bpf type bpf (rw,nosuid,nodev,noexec,relatime,mode=700)
......

I've tried many things, but nothing works.
Then I tried switching the calico version to v3.27.0 and it worked!
[root@k8s-master docker.io]# kubectl get pods -A
NAMESPACE     NAME                                       READY   STATUS    RESTARTS   AGE
kube-system   calico-kube-controllers-5fc7d6cf67-lcczd   0/1     Pending   0          53s
kube-system   calico-node-ptnz8                          1/1     Running   0          53s
kube-system   calico-node-sthqf                          1/1     Running   0          53s
kube-system   calico-node-vhwxn                          1/1     Running   0          53s

[root@k8s-master docker.io]# cat calico.yaml | grep image
image: docker.io/calico/cni:v3.27.0
image: docker.io/calico/node:v3.27.0
image: docker.io/calico/kube-controllers:v3.27.0
......
I'm puzzled by this. Is it an operating system problem, or is it the kernel version? I hope the maintainers can answer my question.

Possible Solution

Rolling back calico to v3.27.0

My Environment

  • Calico version: v3.27.2
  • Kubernetes: v1.29.2
  • Operating System and version: Rocky Linux release 9.3 (Blue Onyx), kernel 5.14.0-362.8.1.el9_3.aarch64, arm64
  • Link to your project (optional): none
@tomastigera tomastigera added the area/bpf eBPF Dataplane issues label Feb 21, 2024
@tomastigera tomastigera changed the title from "calico-nodes init fali when install on kubernetes v1.29.2 Rockylinux9.3/Arm64" to "calico ebpf init fails when installing on kubernetes v1.29.2 Rockylinux9.3/Arm64" Feb 21, 2024
@tomastigera
Contributor

When you try 3.27.0, does the eBPF dataplane come up correctly, or is it only that the init containers do not fail? There was definitely a regression in 3.27.0 where eBPF was not built correctly for Arm. That got fixed in #8470, but the fix may not have completely restored it. That said, 3.27.0 may just be a false positive.

Could you share calico-node logs from 3.27.0 just for verification? Would you be able to provide more logs from the failed 3.27.2 init container?

@RyrieNorth
Author

Okay, I've collected some of the logs, so hopefully that will be helpful

calico-v3.27.0.zip
calico-v3.27.2.zip

@RyrieNorth
Author

I tried v3.27.1 later and got the same results as v3.27.2

@tomastigera
Contributor

First of all, you did not enable the BPF dataplane, right? The logs show that BPFEnabled is false. But it seems like 3.27.0 is not quite healthy either:

Readiness probe failed: calico/node is not ready: BIRD is not ready: Error querying BIRD: unable to connect to BIRDv4 socket: dial unix /var/run/calico/bird.ctl: connect: connection refused

Is /var/run/calico/ writable?

Let me check why 3.27.2 is trying to run mount-bpffs when BPF is disabled 🤔

@tomastigera tomastigera added the area/arm64 relates to arm64 label Feb 22, 2024
@RyrieNorth
Author

Yes, I didn't enable BPF dataplane because I didn't find the relevant configuration item in my previous deployment method, but the cluster's network is able to forward data traffic normally.
......

Pod status:
[root@k8s-master ~]# kubectl get pods
NAME READY STATUS RESTARTS AGE
nginx 1/1 Running 0 4m40s

[root@k8s-master ~]# kubectl get svc
NAME         TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)        AGE
kubernetes   ClusterIP   10.96.0.1       <none>        443/TCP        8h
nginx        NodePort    10.96.130.124   <none>        80:32737/TCP   4m25s
......

Traffic test:
[root@k8s-master ~]# curl http://$(kubectl get svc nginx -o jsonpath={.spec.clusterIP})
nginx test page.

Equivalent to:
[root@k8s-master ~]# curl http://10.96.130.124
......

[root@k8s-master ~]# curl http://$(hostname -i):$(kubectl get svc nginx -o jsonpath={.spec.ports..nodePort})
nginx test page.

Equivalent to:
[root@k8s-master ~]# curl http://192.168.1.11:32737
......

But I'm sure the /var/run/calico directory is readable and writable.

This is the result of viewing it with the command:
[root@k8s-master ~]# ls /var/run/calico/ -ld
drwxr-xr-x 3 root root 100 Feb 23 00:59 /var/run/calico/
......

[root@k8s-master ~]# ls /var/run/calico/
bird6.ctl bird.ctl cgroup
......

[root@k8s-master ~]# ls /var/run/calico/ -l
total 0
srw-rw---- 1 root root 0 Feb 23 00:59 bird6.ctl
srw-rw---- 1 root root 0 Feb 23 00:59 bird.ctl
dr-xr-xr-x 12 root root 0 Feb 23 00:58 cgroup
......

And SELinux is disabled:
[root@k8s-master ~]# getenforce
Disabled

@tomastigera
Contributor

Could you provide us with logs for the failing mount-bpffs container, not the default calico-node one?
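
For readers following along, a minimal sketch of how logs for a specific init container can be fetched with `kubectl logs` and its `-c` flag. The helper name `container_logs` is hypothetical; the pod and container names in the usage comment are the ones shown earlier in this thread.

```shell
# Fetch logs for one container of a pod; this also works for init containers.
# Usage: container_logs NAMESPACE POD CONTAINER [extra kubectl flags...]
# (container_logs is a hypothetical helper name, not part of kubectl.)
container_logs() {
    ns=$1; pod=$2; container=$3
    shift 3
    kubectl logs -n "$ns" "$pod" -c "$container" "$@"
}

# Example usage:
#   container_logs kube-system calico-node-25ctj mount-bpffs
#   container_logs kube-system calico-node-25ctj mount-bpffs --previous
```

When a container is in CrashLoopBackOff, `--previous` asks for the output of the last terminated instance, which is usually where the actual error message is.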

@tomastigera tomastigera removed the area/bpf eBPF Dataplane issues label Feb 22, 2024
@tomastigera tomastigera changed the title from "calico ebpf init fails when installing on kubernetes v1.29.2 Rockylinux9.3/Arm64" to "mount-bpffs init container fails when installing on kubernetes v1.29.2 Rockylinux9.3/Arm64 in iptables mode" Feb 22, 2024
@RyrieNorth
Author

Sorry, I found it.

[root@k8s-master docker.io]# crictl logs 5a1
calico-node: error while loading shared libraries: libpcap.so.0.8: cannot open shared object file: No such file or directory
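
An error like this means the dynamic loader could not resolve a library the binary was linked against. One possible way to list every unresolved library at once is to filter `ldd` output, sketched below. This assumes `ldd` is available wherever the binary is inspected, which may not hold inside a minimal container image; `missing_libs` is a hypothetical helper name.

```shell
# Print the names of shared libraries that the dynamic loader cannot resolve.
# Reads `ldd` output on stdin; unresolved entries look like:
#     libpcap.so.0.8 => not found
missing_libs() {
    awk '/=> not found/ { print $1 }'
}

# Example usage (assumes ldd and calico-node are both on PATH):
#   ldd "$(command -v calico-node)" | missing_libs
```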

@tomastigera tomastigera changed the title from "mount-bpffs init container fails when installing on kubernetes v1.29.2 Rockylinux9.3/Arm64 in iptables mode" to "3.27.2 mount-bpffs init container fails to load libpcap on Rockylinux9.3/Arm64" Feb 22, 2024
@hjiawei
Contributor

hjiawei commented Feb 22, 2024

libpcap issue is related to #8541.

@RyrieNorth
Author

> libpcap issue is related to #8541.

Okay, so it looks like the problem was introduced when the image was built, and this is the library that Rocky Linux 9.3 uses.
So after all that, it turned out to be an image problem. Go figure.

@RyrieNorth
Author

RyrieNorth commented Feb 22, 2024

[root@k8s-slave1 lib64]# ls | grep libpca
libpcap.so.1
libpcap.so.1.10.0
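
The listing shows the mismatch: the directory provides libpcap.so.1 and libpcap.so.1.10.0, while the binary asks for libpcap.so.0.8. A small sketch for checking whether a given library name exists in a directory follows; `has_soname` is a hypothetical helper, and the directory in the usage comment is taken from the prompt above.

```shell
# Report whether DIR provides a file or symlink named SONAME.
# Usage: has_soname DIR SONAME
# Returns 0 if present, 1 if missing.
has_soname() {
    if [ -e "$1/$2" ]; then
        echo "$2: present in $1"
    else
        echo "$2: missing from $1"
        return 1
    fi
}

# Example usage:
#   has_soname /usr/lib64 libpcap.so.0.8
#   has_soname /usr/lib64 libpcap.so.1
```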

@RyrieNorth
Author

Guys, I found a temporary workaround: install an older version of the libpcap package via yum, then modify calico.yaml to get it running:
......

Note that this step replaces the system's libpcap with the older libpcap.so.1.9.1 library:

yum install -y https://dl.rockylinux.org/pub/rocky/8/Devel/aarch64/os/Packages/l/libpcap-1.9.1-5.el8.aarch64.rpm
......

Modify calico.yaml

    - name: "mount-bpffs"
      image: docker.io/calico/node:v3.27.2
      imagePullPolicy: IfNotPresent
      command: ["calico-node", "-init", "-best-effort"]
      volumeMounts:
        - mountPath: /nodeproc
          name: nodeproc
          readOnly: true
        - mountPath: /usr/lib64/libpcap.so.0.8  # Add it here
          name: libpcap-mount
    - name: "calico-node"
      image: docker.io/calico/node:v3.27.2
      imagePullPolicy: IfNotPresent
      volumeMounts:
        - mountPath: /nodeproc
          name: nodeproc
          readOnly: true
        - mountPath: /usr/lib64/libpcap.so.0.8  # Add it here
          name: libpcap-mount

  volumes:
      # (tail of the preceding, existing volume entry)
      hostPath:
        type: DirectoryOrCreate
        path: /var/run/nodeagent
    - name: libpcap-mount  # Add it here
      hostPath:
        path: /usr/lib64/libpcap.so.1.9.1


Finally, it works


@RyrieNorth
Author

This may be the best solution.
......

  initContainers:
    - name: "mount-bpffs"
      image: docker.io/calico/node:v3.27.2
      imagePullPolicy: IfNotPresent
      #command: ["calico-node", "-init", "-best-effort"]
      # Symlink the library name the binary expects to /usr/lib64/libpcap.so.1.9.1, then run the original command.
      command: ["/bin/sh", "-c", "ln -s /usr/lib64/libpcap.so.1.9.1 /usr/lib64/libpcap.so.0.8 && calico-node -init -best-effort"]
  containers:
    - name: calico-node
      image: docker.io/calico/node:v3.27.2
      imagePullPolicy: IfNotPresent
      # Same symlink, then launch start_runit.
      command: [ "/bin/sh", "-c", "ln -s /usr/lib64/libpcap.so.1.9.1 /usr/lib64/libpcap.so.0.8 && start_runit"]

......

It also works on v3.27.1.
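
The symlink step in the workaround above can also be written so that it is safe to run more than once and fails clearly when the target is absent. This is only a sketch of the same technique, not an official fix; `compat_symlink` is a hypothetical helper name, and the paths in the usage comment mirror the ones in the workaround.

```shell
# Create (or refresh) a compatibility symlink LINK pointing at TARGET.
# Returns 1 without creating anything if TARGET does not exist.
# Usage: compat_symlink TARGET LINK
compat_symlink() {
    target=$1; link=$2
    [ -e "$target" ] || { echo "missing: $target" >&2; return 1; }
    # -f replaces an existing link; -n avoids descending into a link to a directory
    ln -sfn "$target" "$link"
}

# Example usage:
#   compat_symlink /usr/lib64/libpcap.so.1.9.1 /usr/lib64/libpcap.so.0.8
```

Using `ln -sfn` rather than plain `ln -s` means a second run does not fail because the link already exists.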

@tomastigera
Contributor

@NorthSkybk thank you for sharing your workaround and reporting that issue. We will try to come up with a proper fix for that.
