
Zombie slirp4netns processes left on the system when using podman API service #9777

Closed
henryhchchc opened this issue Mar 22, 2021 · 17 comments · Fixed by #10851
Labels
kind/bug · locked - please file new issue/PR

Comments

@henryhchchc

henryhchchc commented Mar 22, 2021

Is this a BUG REPORT or FEATURE REQUEST? (leave only one on its own line)

/kind bug

Description

The podman REST API service leaves zombie slirp4netns processes on the system until the service itself is stopped.

I think the reason is that at Line 468 of libpod/networking_linux.go the slirp4netns process is started but never waited on by podman.
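For illustration, a minimal Go sketch of that failure mode (not podman's actual code): a child is started but never waited on, so once it exits it lingers as a zombie for as long as the parent lives.

```go
package main

import (
	"os/exec"
	"time"
)

func main() {
	// Start a short-lived child, analogous to podman launching slirp4netns.
	cmd := exec.Command("true")
	if err := cmd.Start(); err != nil {
		panic(err)
	}
	// No cmd.Wait() here: after the child exits, the kernel keeps its exit
	// status in the process table, and `ps` shows it as <defunct> until
	// this parent waits on it or terminates.
	time.Sleep(30 * time.Second)
}
```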

Steps to reproduce the issue:

  1. Run `podman system service -t 0` to start the API service and keep it running

  2. Start and stop N containers via the podman REST API (see the sketch after this list)
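Concretely, the reproduction can be driven like this, assuming rootless podman with the socket at its default location; `podman --remote` goes through the same REST API, and the image and loop count are arbitrary:

```console
$ podman system service -t 0 &
$ for i in $(seq 1 5); do podman --remote run --rm alpine true; done
$ ps -eo pid,ppid,stat,comm | awk '$3 ~ /Z/'   # zombies show state "Z" (<defunct>)
```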

Describe the results you received:

There will be N zombie slirp4netns processes left on the system until the API service is stopped.

Describe the results you expected:

There are no zombie slirp4netns processes left on the system.

Additional information you deem important (e.g. issue happens only occasionally):

Output of podman version:

Version:      3.0.0-dev
API Version:  3.0.0
Go Version:   go1.15.7
Built:        Wed Feb  3 06:06:33 2021
OS/Arch:      linux/amd64

Output of podman info --debug:

host:
  arch: amd64
  buildahVersion: 1.19.2
  cgroupManager: cgroupfs
  cgroupVersion: v1
  conmon:
    package: conmon-2.0.25-1.module_el8.4.0+673+eabfc99d.x86_64
    path: /usr/bin/conmon
    version: 'conmon version 2.0.25, commit: 897f4ebd69b9e9c725621fabf1d7c918ef635a68'
  cpus: 36
  distribution:
    distribution: '"centos"'
    version: "8"
  eventLogger: file
  hostname: <host name>
  idMappings:
    gidmap:
    - container_id: 0
      host_id: <gid>
      size: 1
    - container_id: 1
      host_id: 200000
      size: 65536
    uidmap:
    - container_id: 0
      host_id: <uid>
      size: 1
    - container_id: 1
      host_id: 200000
      size: 65536
  kernel: 4.18.0-277.el8.x86_64
  linkmode: dynamic
  memFree: 21868609536
  memTotal: 269860601856
  ociRuntime:
    name: crun
    package: crun-0.17-1.module_el8.4.0+673+eabfc99d.x86_64
    path: /usr/bin/crun
    version: |-
      crun version 0.17
      commit: 0e9229ae34caaebcb86f1fde18de3acaf18c6d9a
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +YAJL
  os: linux
  remoteSocket:
    exists: true
    path: /run/user/<id>/podman/podman.sock
  security:
    apparmorEnabled: false
    capabilities: CAP_NET_RAW,CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: true
    seccompEnabled: true
    selinuxEnabled: false
  slirp4netns:
    executable: /bin/slirp4netns
    package: slirp4netns-1.1.8-1.module_el8.4.0+641+6116a774.x86_64
    version: |-
      slirp4netns version 1.1.8
      commit: d361001f495417b880f20329121e3aa431a8f90f
      libslirp: 4.3.1
      SLIRP_CONFIG_VERSION_MAX: 3
      libseccomp: 2.5.1
  swapFree: 137420599296
  swapTotal: 137438949376
  uptime: 58h 9m 49.4s (Approximately 2.42 days)
registries:
  search:
  - registry.access.redhat.com
  - registry.redhat.io
  - docker.io
store:
  configFile: /data/<user>/.config/containers/storage.conf
  containerStore:
    number: 1
    paused: 0
    running: 1
    stopped: 0
  graphDriverName: overlay
  graphOptions:
    overlay.mount_program:
      Executable: /usr/bin/fuse-overlayfs
      Package: fuse-overlayfs-1.4.0-2.module_el8.4.0+673+eabfc99d.x86_64
      Version: |-
        fusermount3 version: 3.2.1
        fuse-overlayfs: version 1.4
        FUSE library version 3.2.1
        using FUSE kernel interface version 7.26
    overlay.mountopt: nodev,metacopy=on
  graphRoot: /data/<user>/ssd/containers/storage
  graphStatus:
    Backing Filesystem: xfs
    Native Overlay Diff: "false"
    Supports d_type: "true"
    Using metacopy: "false"
  imageStore:
    number: 39
  runRoot: /run/user/<id>/containers
  volumePath: /data/<user>/ssd/containers/storage/volumes
version:
  APIVersion: 3.0.0
  Built: 1612303593
  BuiltTime: Wed Feb  3 06:06:33 2021
  GitCommit: ""
  GoVersion: go1.15.7
  OsArch: linux/amd64
  Version: 3.0.0-dev

Package info (e.g. output of rpm -q podman or apt list podman):

podman-3.0.0-0.33rc2.module_el8.4.0+673+eabfc99d.x86_64

Have you tested with the latest version of Podman and have you checked the Podman Troubleshooting Guide?

Yes

Additional environment details (AWS, VirtualBox, physical, etc.):

  • CentOS Stream 8
  • i9 10980XE
@henryhchchc changed the title from "Zombie slirp4netns processed left on the system when using podman API service" to "Zombie slirp4netns processes left on the system when using podman API service" Mar 22, 2021
@openshift-ci-robot added the kind/bug label Mar 22, 2021
@lsm5
Member

lsm5 commented Mar 22, 2021

btw, the podman 3.0.0-rc2 mentioned here is directly from the CentOS Stream repos, not Kubic. @vrothberg @mheon @jnovy do you know the status of the update for that one?

@vrothberg
Member

> btw, the podman 3.0.0-rc2 mentioned here is directly from the CentOS Stream repos, not Kubic. @vrothberg @mheon @jnovy do you know the status of the update for that one?

No idea why Stream hasn't seen v3.0.1.

@vrothberg
Member

@Luap99 do you know what's going on with the slirp processes?

@Luap99
Member

Luap99 commented Mar 23, 2021

This looks pretty clear to me. The slirp process is tied to a pipe file descriptor: one end is sent to conmon, the other to the slirp process. Once conmon exits, slirp will exit as well. However, the parent process (the podman service) is still alive, so Linux expects it to collect the child's exit status with wait.

I am not sure what a good solution is; maybe fork/exec the slirp process?

@giuseppe @AkihiroSuda WDYT?
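A minimal sketch of the lifetime coupling described above, with generic stand-ins (`sleep` for conmon, `cat` for slirp4netns) rather than the actual podman wiring: child B exits when child A closes the last write end of the pipe, but the long-lived parent never reaps B.

```go
package main

import (
	"os"
	"os/exec"
	"time"
)

func main() {
	r, w, err := os.Pipe()
	if err != nil {
		panic(err)
	}

	// Child A (stand-in for conmon) inherits the write end as fd 3.
	a := exec.Command("sleep", "1")
	a.ExtraFiles = []*os.File{w}
	if err := a.Start(); err != nil {
		panic(err)
	}

	// Child B (stand-in for slirp4netns) blocks reading the pipe and
	// exits on EOF, i.e. once the last writer is gone.
	b := exec.Command("cat")
	b.Stdin = r
	if err := b.Start(); err != nil {
		panic(err)
	}

	// Drop the parent's copies so A holds the only write end.
	w.Close()
	r.Close()

	a.Wait() // A is reaped; its exit closes the write end, so B exits too.
	// B is never waited on: `ps` now shows cat as <defunct> until this
	// long-lived parent exits, which matches the reported behavior.
	time.Sleep(30 * time.Second)
}
```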

@mheon
Member

mheon commented Mar 23, 2021

We could consider having podman-system-service set the subreaper prctl (PR_SET_CHILD_SUBREAPER)? Everything should be a direct child of system service, so the zombies will reparent to us and we can wait on them.
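A minimal sketch of that suggestion, assuming `golang.org/x/sys/unix`; this only illustrates the prctl call, not podman's implementation:

```go
package main

import "golang.org/x/sys/unix"

func main() {
	// PR_SET_CHILD_SUBREAPER: orphaned descendant processes get
	// reparented to this process instead of PID 1, so it receives their
	// SIGCHLD and can wait() on them.
	if err := unix.Prctl(unix.PR_SET_CHILD_SUBREAPER, 1, 0, 0, 0); err != nil {
		panic(err)
	}
	// ... run the API service; a reaper loop is still needed to actually
	// collect the children (see below on why a blanket wait(-1) loop is
	// problematic).
}
```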

@mheon
Member

mheon commented Mar 23, 2021

(Well, everything that would cause this problem, I mean...)

@github-actions

A friendly reminder that this issue had no activity for 30 days.

@mheon
Member

mheon commented Apr 23, 2021

@rhatdan @baude We should probably prioritize this one higher, seems like a significant regression.

@rhatdan
Member

rhatdan commented Apr 23, 2021

I agree this needs to be fixed ASAP. @Luap99 any chance you can look at this?

@github-actions

A friendly reminder that this issue had no activity for 30 days.

@rhatdan
Member

rhatdan commented May 24, 2021

@Luap99 were you ever able to look at this?

@rhatdan
Member

rhatdan commented May 24, 2021

@mheon do you have time to look at this?

@mheon
Member

mheon commented May 24, 2021

I will try and find some time this sprint

@mheon mheon self-assigned this May 24, 2021
@Luap99
Member

Luap99 commented May 24, 2021

I tried to use something like unix.Wait4(-1, nil, unix.WNOHANG, nil) to wait for all child processes in an extra goroutine. However, this did not work because it causes a race with cmd.Wait() from the os/exec package, which is used in many places.
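For illustration, a sketch of that race, again assuming `golang.org/x/sys/unix`: the global reaper goroutine can collect the exit status of a child that os/exec is still managing, so the matching `cmd.Wait()` comes up empty.

```go
package main

import (
	"fmt"
	"os/exec"
	"time"

	"golang.org/x/sys/unix"
)

func main() {
	// Naive global reaper: wait4(-1, ...) reaps *any* exited child,
	// including children owned by os/exec elsewhere in the program.
	go func() {
		for {
			unix.Wait4(-1, nil, unix.WNOHANG, nil)
			time.Sleep(time.Millisecond)
		}
	}()

	cmd := exec.Command("true")
	if err := cmd.Start(); err != nil {
		panic(err)
	}
	// If the goroutine wins the race, the exit status is already gone and
	// this typically fails with ECHILD ("no child processes").
	if err := cmd.Wait(); err != nil {
		fmt.Println("cmd.Wait() lost the race:", err)
	}
}
```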

@github-actions

A friendly reminder that this issue had no activity for 30 days.

@rhatdan
Member

rhatdan commented Jun 30, 2021

@Luap99 @mheon @AkihiroSuda @giuseppe @vrothberg We still have this issue. Ideas for a solution?

@Luap99 Luap99 assigned Luap99 and unassigned mheon Jul 2, 2021
@Luap99
Member

Luap99 commented Jul 2, 2021

I take this.

Luap99 added a commit to Luap99/libpod that referenced this issue Jul 2, 2021
Add a new service reaper package. Podman currently does not reap all
child processes: the slirp4netns and rootlesskit processes are not
reaped. This is not a problem for local podman, since the podman process
dies before the other processes and init will then reap them for us.

However, with podman system service it is possible that the podman
process is still alive after slirp has died. In this case podman has to
reap it, or the slirp process will stay a zombie until the service is
stopped.

The service reaper will listen on SIGCHLD in an extra goroutine. Once it
receives this signal it will try to reap all pids that were added with
`AddPID()`. While I would like to just reap all children, this is not
possible because many parts of the code use `os/exec` with `cmd.Wait()`.
If we reap before `cmd.Wait()`, things can break, so reaping everything
is not an option.

[NO TESTS NEEDED]

Fixes containers#9777

Signed-off-by: Paul Holzinger <[email protected]>
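A rough sketch of the approach the commit message describes; `AddPID()` is named in the message, while the types, locking, and channel handling below are assumptions for illustration, not the actual podman code:

```go
package main

import (
	"os"
	"os/signal"
	"sync"

	"golang.org/x/sys/unix"
)

type serviceReaper struct {
	mu   sync.Mutex
	pids map[int]struct{}
}

// AddPID registers a child that nothing else will ever wait on.
func (r *serviceReaper) AddPID(pid int) {
	r.mu.Lock()
	r.pids[pid] = struct{}{}
	r.mu.Unlock()
}

// start listens for SIGCHLD and reaps only the registered pids, leaving
// os/exec-managed children for their own cmd.Wait() calls.
func (r *serviceReaper) start() {
	ch := make(chan os.Signal, 1)
	signal.Notify(ch, unix.SIGCHLD)
	go func() {
		for range ch {
			r.mu.Lock()
			for pid := range r.pids {
				// WNOHANG: only collect pids that actually exited.
				if wpid, _ := unix.Wait4(pid, nil, unix.WNOHANG, nil); wpid == pid {
					delete(r.pids, pid)
				}
			}
			r.mu.Unlock()
		}
	}()
}

func main() {
	r := &serviceReaper{pids: make(map[int]struct{})}
	r.start()
	// After launching slirp4netns/rootlesskit: r.AddPID(cmd.Process.Pid)
	select {} // stand-in for the long-running API service
}
```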
@github-actions bot added the locked - please file new issue/PR label Sep 21, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 21, 2023