[BUG] cgroups: invalid group path leads to a segmentation violation #1524

mbana commented Nov 13, 2024

Cross-posting from containerd/containerd#11001 as I am not sure which component I should raise the bug report under.


Description

I cannot disclose the full details of the setup I am running as it is proprietary. That said, I am running k3d (which uses k3s) as shown in the reproduction steps below; in particular, note the --volume /sys/fs/cgroup:/sys/fs/cgroup:rw mount.

In any event, the error cgroups: invalid group path should not lead to a SIGSEGV, unless of course you have other opinions.
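For reference, here is a minimal sketch of the kind of guard I'd expect around loading the manager, written against the github.com/containerd/cgroups/v3/cgroup2 API. This is an illustration, not the actual shim code; the names pid, group, and mgr are mine:

package main

// Minimal sketch, not the containerd shim source: load the cgroup2
// manager for a pid and treat a Load failure as fatal instead of
// carrying on with a nil *cgroup2.Manager.

import (
	"fmt"
	"log"
	"os"

	"github.com/containerd/cgroups/v3/cgroup2"
)

func main() {
	pid := os.Getpid()

	// Resolve /proc/<pid>/cgroup into a group path like "/user.slice/...".
	group, err := cgroup2.PidGroupPath(pid)
	if err != nil {
		log.Fatalf("resolving cgroup2 path for %d: %v", pid, err)
	}

	// Load rejects paths that are not clean and absolute with
	// "cgroups: invalid group path"; the returned *Manager is nil on
	// error and must not be used past this point.
	mgr, err := cgroup2.Load(group)
	if err != nil {
		log.Fatalf("loading cgroup2 for %d: %v", pid, err)
	}

	controllers, err := mgr.RootControllers()
	if err != nil {
		log.Fatalf("reading root controllers: %v", err)
	}
	fmt.Println("root controllers:", controllers)
}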

Steps to reproduce the issue

$ k3d cluster create \
    --no-lb \
    --no-rollback \
    --agents 6 \
    --image="${IMAGE}" \
    --gpus=all \
    --k3s-arg "--disable=traefik,servicelb,metrics-server@server:*" \
    --k3s-arg "-v=6@server:*" \
    --k3s-arg "--debug@server:*" \
    --k3s-arg "--alsologtostderr@server:*" \
    --volume /sys/fs/cgroup:/sys/fs/cgroup:rw \
    --trace --verbose
# Install something
$ helm upgrade cert-manager cert-manager \
    --repo=https://charts.jetstack.io \
    --namespace cert-manager \
    --create-namespace \
    --install \
    --version=v1.10.2 \
    --set=installCRDs=true \
    --wait \
    --wait-for-jobs

I've tried all combinations of K3D_FIX_CGROUPV2=0|1 and K3D_FIX_MOUNTS=0|1, but none of them made a difference.

Additional Information

$ docker exec -it k3d-k3s-default-server-0 bash -c 'mount -v | grep -i cgroup'
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime)
freezer on /sys/fs/cgroup/freezer type cgroup (rw,relatime,freezer)

Describe the results you received and expected

Logs

containerd

time="2024-11-12T22:21:27.911145320Z" level=debug msg="shim bootstrap parameters" address="unix:///run/containerd/s/92c72650133423e0998fdd4bd84988d1f197650fc55d21e9468a9e22218d1f80" namespace=k8s.io protocol=ttrpc
time="2024-11-12T22:21:27.916349466Z" level=info msg="loading plugin ½"io.containerd.event.v1.publisher½"..." runtime=io.containerd.runc.v2 type=io.containerd.event.v1
time="2024-11-12T22:21:27.916430504Z" level=info msg="loading plugin ½"io.containerd.internal.v1.shutdown½"..." runtime=io.containerd.runc.v2 type=io.containerd.internal.v1
time="2024-11-12T22:21:27.916445573Z" level=info msg="loading plugin ½"io.containerd.ttrpc.v1.task½"..." runtime=io.containerd.runc.v2 type=io.containerd.ttrpc.v1
time="2024-11-12T22:21:27.916551908Z" level=debug msg="registering ttrpc service" id=io.containerd.ttrpc.v1.task
time="2024-11-12T22:21:27.916570129Z" level=info msg="loading plugin ½"io.containerd.ttrpc.v1.pause½"..." runtime=io.containerd.runc.v2 type=io.containerd.ttrpc.v1
time="2024-11-12T22:21:27.916581130Z" level=debug msg="registering ttrpc service" id=io.containerd.ttrpc.v1.pause
time="2024-11-12T22:21:27.916714737Z" level=debug msg="serving api on socket" socket="ÿinherited from parent¦"
time="2024-11-12T22:21:27.916751507Z" level=debug msg="starting signal loop" namespace=k8s.io path=/run/k3s/containerd/io.containerd.runtime.v2.task/k8s.io/ad5170f629e608538b87774a2fc5c74fca871396c2b49d705b38f168283b8cd8 pid=2225 runtime=io.containerd.runc.v2
time="2024-11-12T22:21:27.975199118Z" level=error msg="loading cgroup2 for 2249" error="cgroups: invalid group path"
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x6f835a]

goroutine 27 [running]:
github.com/containerd/cgroups/v3/cgroup2.(*Manager).RootControllers(0xc0000c4660?)
	/go/src/github.com/k3s-io/k3s/build/src/github.com/containerd/containerd/vendor/github.com/containerd/cgroups/v3/cgroup2/manager.go:270 +0x1a
github.com/containerd/containerd/runtime/v2/runc/task.(*service).Start(0xc000164d80, {0xb96838, 0xc000382700}, 0xc0002594f0)
	/go/src/github.com/k3s-io/k3s/build/src/github.com/containerd/containerd/runtime/v2/runc/task/service.go:314 +0x305
github.com/containerd/containerd/api/runtime/task/v2.RegisterTaskService.func3({0xb96838, 0xc000382700}, 0xc00004e2e0)
	/go/src/github.com/k3s-io/k3s/build/src/github.com/containerd/containerd/vendor/github.com/containerd/containerd/api/runtime/task/v2/shim_ttrpc.pb.go:53 +0x8c
github.com/containerd/ttrpc.defaultServerInterceptor({0xb96838?, 0xc000382700?}, 0x7fffb14c0e68?, 0x10?, 0x7ffff7fb95b8?)
	/go/src/github.com/k3s-io/k3s/build/src/github.com/containerd/containerd/vendor/github.com/containerd/ttrpc/interceptor.go:52 +0x22
github.com/containerd/ttrpc.(*serviceSet).unaryCall(0xc0000b2330, {0xb96838, 0xc000382700}, 0xc0000b2378, 0xc0002c0740, {0xc0001b4140, 0x42, 0x50})
	/go/src/github.com/k3s-io/k3s/build/src/github.com/containerd/containerd/vendor/github.com/containerd/ttrpc/services.go:75 +0xe3
github.com/containerd/ttrpc.(*serviceSet).handle.func1()
	/go/src/github.com/k3s-io/k3s/build/src/github.com/containerd/containerd/vendor/github.com/containerd/ttrpc/services.go:118 +0x158
created by github.com/containerd/ttrpc.(*serviceSet).handle in goroutine 41
	/go/src/github.com/k3s-io/k3s/build/src/github.com/containerd/containerd/vendor/github.com/containerd/ttrpc/services.go:111 +0x14c
time="2024-11-12T22:21:27.989682727Z" level=info msg="shim disconnected" id=ad5170f629e608538b87774a2fc5c74fca871396c2b49d705b38f168283b8cd8 namespace=k8s.io
time="2024-11-12T22:21:27.989743198Z" level=warning msg="cleaning up after shim disconnected" id=ad5170f629e608538b87774a2fc5c74fca871396c2b49d705b38f168283b8cd8 namespace=k8s.io
time="2024-11-12T22:21:27.989753741Z" level=info msg="cleaning up dead shim" namespace=k8s.io
time="2024-11-12T22:21:27.989902787Z" level=error msg="Failed to delete sandbox container ½"ad5170f629e608538b87774a2fc5c74fca871396c2b49d705b38f168283b8cd8½"" error="ttrpc: closed: unknown"
time="2024-11-12T22:21:27.990347663Z" level=error msg="encountered an error cleaning up failed sandbox ½"ad5170f629e608538b87774a2fc5c74fca871396c2b49d705b38f168283b8cd8½", marking sandbox state as SANDBOX_UNKNOWN" error="ttrpc: closed: unknown"
time="2024-11-12T22:21:27.990394639Z" level=error msg="RunPodSandbox for &PodSandboxMetadata¨Name:k8s-device-plugin-daemonset-qj8x8,Uid:c3d6073e-395d-4cca-a27d-57a30c29be6c,Namespace:nvidia,Attempt:6,¼ failed, error" error="failed to start sandbox container task ½"ad5170f629e608538b87774a2fc5c74fca871396c2b49d705b38f168283b8cd8½": ttrpc: closed: unknown"

kubectl

$ kubectl describe pod -n cert-manager cert-manager-5dfb9c94b5-k6hj2
Events:
  Type     Reason                  Age                      From     Message
  ----     ------                  ----                     ----     -------
  Warning  FailedCreatePodSandBox  8m46s (x13192 over 11h)  kubelet  (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to start sandbox container task "15867aa6216f302ec59607cbfa619dbc766b5432109731711406bb82c293d1a5": ttrpc: closed: unknown
  Normal   SandboxChanged          3m45s (x15307 over 11h)  kubelet  Pod sandbox changed, it will be killed and re-created.

Really, the kubelet event should surface something like level=error msg="loading cgroup2 for 2249" error="cgroups: invalid group path"; the generic warning (not an error) leaves one clueless.

What version of containerd are you using?

v1.7.22-k3s1.28

Any other relevant information

$ docker exec -it k3d-k3s-default-server-0 bash -c 'containerd --version'
containerd github.com/k3s-io/containerd v1.7.22-k3s1.28
$ docker exec -it k3d-k3s-default-server-0 bash -c 'runc --version'
runc version 1.1.14
commit: 12de61f
spec: 1.0.2-dev
go: go1.22.8
libseccomp: 2.5.5
$ docker info
Client:
 Version:    24.0.7
 Context:    default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc.)
    Version:  0.12.1
    Path:     /usr/libexec/docker/cli-plugins/docker-buildx

Server:
 Containers: 8
  Running: 7
  Paused: 0
  Stopped: 1
 Images: 28
 Server Version: 24.0.7
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Using metacopy: false
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 nvidia runc
 Default Runtime: nvidia
 Init Binary: docker-init
 containerd version: 83031836b2cf55637d7abf847b17134c51b38e53
 runc version: v1.1.12-0-g51d5e946
 init version:
 Security Options:
  apparmor
  seccomp
   Profile: builtin
  cgroupns
 Kernel Version: 6.8.0-48-generic
 Operating System: Ubuntu 22.04.5 LTS
 OSType: linux
 Architecture: x86_64
 CPUs: 16
 Total Memory: 62.79GiB
 Name: mbana-2
 ID: 6e707ea5-5476-44c3-82ee-616d9f97a99a
 Docker Root Dir: /var/lib/docker
 Debug Mode: true
  File Descriptors: 80
  Goroutines: 78
  System Time: 2024-11-13T09:50:55.884211506Z
  EventsListeners: 0
 Username: mohamedbana
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

Show configuration if it is related to CRI plugin.

$ docker exec -it k3d-k3s-default-server-0 bash -c 'cat /etc/containerd/config.toml' 
version = 2

[plugins]

  [plugins."io.containerd.grpc.v1.cri"]

    [plugins."io.containerd.grpc.v1.cri".containerd]
      default_runtime_name = "nvidia"

      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]

        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
          privileged_without_host_devices = false
          runtime_engine = ""
          runtime_root = ""
          runtime_type = "io.containerd.runc.v2"

          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
            BinaryName = "/usr/bin/nvidia-container-runtime"