containerd-shim processes are leaking inotify instances with cgroups v2 #5670
Comments
@AkihiroSuda @Random-Liu Can someone help me debug this?
Thanks for reporting. This seems to be happening around here: https://github.com/containerd/cgroups/blob/616d4347a49f244931740491cac6fedf8d2dd873/v2/manager.go#L551-L640 Do you know whether the issue is specific to Kubernetes? How long does it take to hit the issue?
@AkihiroSuda Thanks for the pointer.
Not sure if this is specific to Kubernetes. It doesn't take long to observe this issue since we have a very large number of pods running per node. However, I have noticed that when a container crashes the count goes up by 1, suggesting that the previous inotify instance was not reaped.
https://github.com/containerd/cgroups/blob/616d4347a49f244931740491cac6fedf8d2dd873/v2/manager.go#L578-L579
Quoting from the inotify man page: "When all file descriptors referring to an inotify instance have been closed (using close(2)), the underlying object and its resources are freed for reuse by the kernel; all associated watches are automatically freed."
@mikebrow Good catch. But considering the above info, shouldn't the watches be freed automatically when the fd is closed, thus making the first defer statement redundant?
nod..
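For context, a rough sketch of the inotify pattern under discussion (illustrative only, not the actual containerd/cgroups code; the cgroup path and helper name are made up): an inotify fd is created, a watch is added on the cgroup's memory.events file, and both are released via defer. Per inotify(7), closing the fd already frees all associated watches, which is why the explicit remove-watch defer looks redundant.

package main

import (
	"fmt"
	"path/filepath"

	"golang.org/x/sys/unix"
)

// watchMemoryEvents blocks until the cgroup's memory.events file is modified
// (for example on an OOM event) and then returns.
func watchMemoryEvents(cgroupDir string) error {
	fd, err := unix.InotifyInit1(0)
	if err != nil {
		return err
	}
	// Closing the inotify fd frees the underlying instance and all of its
	// watches, per inotify(7).
	defer unix.Close(fd)

	wd, err := unix.InotifyAddWatch(fd, filepath.Join(cgroupDir, "memory.events"), unix.IN_MODIFY)
	if err != nil {
		return err
	}
	// Arguably redundant: this watch is already released when the fd is
	// closed by the defer above.
	defer unix.InotifyRmWatch(fd, uint32(wd))

	// Block until memory.events changes.
	buf := make([]byte, unix.SizeofInotifyEvent*16)
	if _, err := unix.Read(fd, buf); err != nil {
		return err
	}
	fmt.Println("memory.events modified")
	return nil
}

func main() {
	// Hypothetical cgroup path, for illustration only.
	if err := watchMemoryEvents("/sys/fs/cgroup/mycontainer"); err != nil {
		fmt.Println("watch error:", err)
	}
}

The leak discussed in this thread appears to be a different angle on the same pattern: if the watcher never returns for a container that has already exited, the defers never run and each restart leaves an instance open, which is what the later containerd/cgroups#212 fix addresses by terminating the watch when the cgroup goes away.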
@bharathguvvala How are you running containerd with cgroup v2? I did not find good documentation on that. Trying to repro this issue on my setup.
@alakesh Sorry about the delayed response. You can start here. You need to enable the cgroup v2 controller via the corresponding operating system's GRUB configuration (Debian example).
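A rough sketch of the Debian-style change being referred to, assuming GRUB and systemd (exact steps vary by distribution; the kernel parameter below is the systemd switch for the unified cgroup v2 hierarchy):

# In /etc/default/grub, append the switch to the kernel command line, e.g.:
#   GRUB_CMDLINE_LINUX="... systemd.unified_cgroup_hierarchy=1"
$ sudo update-grub
$ sudo reboot
# After the reboot, /sys/fs/cgroup should be a cgroup2 mount:
$ stat -fc %T /sys/fs/cgroup/
# prints "cgroup2fs" when cgroup v2 is active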
Hi, I tracked this issue down: it's not specific to k8s but to any situation where a long-running containerd-shim repeatedly sets up the cgroup v2 OOM watch for a container that keeps exiting, since each exited container leaves its inotify instance behind. This is easy to reproduce on current Flatcar releases because we switched to cgroup2. Run k8s and start a crashing container that gets auto-restarted (the manifest from #6323 is enough). I've prepared a fix here containerd/cgroups#212, which mirrors what we do for cgroup1 (get notified when the container exits). I'll be happy to prepare a PR here to update containerd/cgroups when that's merged. We'll also need this change to gracefully handle exits:
commit 6d2d2e7ba3eb3f25d7c4f6c10b025a7636428559 (HEAD -> jepio/fix-cgroupv2-oom-event)
Author: Jeremi Piotrowski <[email protected]>
Date: Fri Dec 3 13:53:19 2021 +0000
cgroup2: handle EventChan closure gracefully
EventChan() now terminates when the cgroup exits, and closes the errorChannel.
This is expected so check the error and skip logging when error is nil.
Signed-off-by: Jeremi Piotrowski <[email protected]>
diff --git a/pkg/oom/v2/v2.go b/pkg/oom/v2/v2.go
index c0717eb99..8b3e0d387 100644
--- a/pkg/oom/v2/v2.go
+++ b/pkg/oom/v2/v2.go
@@ -102,10 +102,12 @@ func (w *watcher) Add(id string, cgx interface{}) error {
i.ev = ev
w.itemCh <- i
case err := <-errCh:
- i.err = err
- w.itemCh <- i
- // we no longer get any event/err when we got an err
- logrus.WithError(err).Warn("error from *cgroupsv2.Manager.EventChan")
+ if err != nil {
+ i.err = err
+ w.itemCh <- i
+ // we no longer get any event/err when we got an err
+ logrus.WithError(err).Warn("error from *cgroupsv2.Manager.EventChan")
+ }
return
}
Helps to mitigate the inotify fd leak issue related to the containerd runtime: containerd/containerd#5670. Created as a flag to be able to support both docker and containerd runtime setups.
#6961 will fix the issue in 1.5.x.
Resolving.
Description
containerd-shim processes are leaking inotify instances when set up with cgroups v2. The following is the count of the instances held by the shim processes (other processes have been filtered out of the output). The count only goes up as time progresses. With cgroups v1, no inotify instances are held by the containerd-shim processes. I suspect this could be due to a runc PR that listens for OOM events using the inotify mechanism, but that code handles the inotify instances cleanly and closes them. Since I don't have much insight into the containerd-shim and runc interaction, I suspect some corner case in that interaction is causing this.
To add more context, this is running on a Kubernetes node with containerd as the runtime. Pods that are crash-looping contribute the most to the problem.
$ find /proc/*/fd -lname anon_inode:inotify | cut -d/ -f3 | xargs -I '{}' -- ps --no-headers -o '%p %U %c %a %P' -p '{}' | uniq -c | sort -nr | grep root
COUNT PID USER COMMAND COMMAND PPID
46 3276111 root containerd-shim /usr/local/bin/containerd-s 1
15 1585532 root containerd-shim /usr/local/bin/containerd-s 1
14 19306 root containerd-shim /usr/local/bin/containerd-s 1
14 19279 root containerd-shim /usr/local/bin/containerd-s 1
7 2937840 root containerd-shim /usr/local/bin/containerd-s 1
3 28459 root containerd-shim /usr/local/bin/containerd-s 1
3 19232 root containerd-shim /usr/local/bin/containerd-s 1
2 3251811 root containerd-shim /usr/local/bin/containerd-s 1
2 19200 root containerd-shim /usr/local/bin/containerd-s 1
2 19156 root containerd-shim /usr/local/bin/containerd-s 1
1 3602946 root containerd-shim /usr/local/bin/containerd-s 1
1 3171279 root containerd-shim /usr/local/bin/containerd-s 1
1 213009 root containerd /usr/local/bin/containerd 1
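A rough way to dig further, where <PID> is one of the containerd-shim PIDs from the listing above and <FD> is one of its inotify descriptors (both placeholders): the kernel's fdinfo interface shows the watches each inotify fd holds.

$ sudo ls -l /proc/<PID>/fd | grep anon_inode:inotify
$ sudo cat /proc/<PID>/fdinfo/<FD>
# each "inotify wd:... ino:..." line is one watch held by that instance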
Steps to reproduce the issue:
Describe the results you received:
The inotify instances are increasing in an unbounded manner, raising concern that the fs.inotify.max_user_instances limit will be breached over time. The impact on containerd-shim when the limit is breached is not known to me right now. On a Kubernetes node that has hit the limit, 'kubectl logs -f' on any pod fails because it is unable to create a new inotify instance.
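A rough check of the headroom against that limit (the fd count is an upper bound, since one instance can be referenced by several fds); raising the limit is only a temporary mitigation and does not fix the leak:

$ sysctl fs.inotify.max_user_instances
$ sudo find /proc/*/fd -lname anon_inode:inotify 2>/dev/null | wc -l
# temporary mitigation only; the underlying leak remains
$ sudo sysctl -w fs.inotify.max_user_instances=512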
Describe the results you expected:
The number of instances should be bounded and should not increase/leak.
What version of containerd are you using:
Any other relevant information (runC version, CRI configuration, OS/Kernel version, etc.):
runc --version
uname -a