podman.service: KillMode=process leaks pause process, breaks subsequent API invocations #7021

Closed
martinpitt opened this issue Jul 20, 2020 · 2 comments · Fixed by #7023
Labels
kind/bug Categorizes issue or PR as related to a bug. locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments.


@martinpitt
Contributor

martinpitt commented Jul 20, 2020

/kind bug
Description

podman.service, in particular the user one, uses KillMode=process. When the unit stops, systemd therefore kills only the main process and leaves behind a process like this one:

admin      20185  1.3  2.4 1281680 50484 ?       Ssl  05:31   0:00 /usr/bin/podman system service

Not only is this a resource leak; it also breaks the next API invocation.
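For context: with KillMode=process, systemd sends the stop signal only to the unit's main process, while the default, KillMode=control-group, signals every process in the unit's cgroup. A minimal sketch for checking which mode the user's podman.service effectively uses (drop-ins included) could look like this. The function name `check_podman_killmode` is hypothetical, and it assumes a systemd version that supports `systemctl show --value`.

```shell
#!/bin/sh
# Hypothetical helper: report the effective KillMode of the user's podman.service.
# Returns 0 for any mode other than "process", 1 for "process", 2 if systemctl fails.
check_podman_killmode() {
    # --value prints only the property value, e.g. "process" or "control-group"
    mode=$(systemctl --user show -p KillMode --value podman.service) || return 2
    if [ "$mode" = "process" ]; then
        # Only the main process gets signalled; helpers such as "podman pause" survive
        echo "podman.service uses KillMode=process: helper processes may be left behind" >&2
        return 1
    fi
    echo "podman.service uses KillMode=$mode"
    return 0
}

# Usage:
#   check_podman_killmode
```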

Steps to reproduce the issue:

  1. Make sure user's podman service is enabled:

    systemctl --user enable --now podman.socket
    
  2. Make an API request to start podman.service:

    curl --unix $XDG_RUNTIME_DIR/podman/podman.sock http://localhost/
    
  3. Log out the user

  4. As root, confirm with loginctl that the user session is gone. pgrep -au username nevertheless still shows a leftover process:

    # pgrep -au admin
    17536 podman

  5. Log back in as the user

  6. Do another API request with the same curl command
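Because the failure in step 6 shows up as a hang, it can help to bound the request with a timeout so that a reproduction attempt fails instead of blocking. This is only a sketch: `probe_podman_api` is a hypothetical helper, the 10-second limit is an arbitrary choice, and it spells out curl's full `--unix-socket` option name.

```shell
#!/bin/sh
# Hypothetical helper: send one request to the rootless podman API socket.
# Returns curl's exit status; 28 means the request timed out (the "hang" case).
probe_podman_api() {
    sock="${1:-$XDG_RUNTIME_DIR/podman/podman.sock}"
    status=0
    # --max-time caps the whole transfer, so a stuck service cannot block forever
    curl --silent --output /dev/null --max-time 10 \
        --unix-socket "$sock" http://localhost/ || status=$?
    if [ "$status" -eq 28 ]; then
        echo "podman API request timed out" >&2
    fi
    return "$status"
}

# Usage (steps 2 and 6):
#   probe_podman_api
```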

Describe the results you received:

In step 6, the API request hangs. In journalctl you see repeated restarts and failures:

systemd[19935]: Starting Podman API Service...
systemd[19935]: podman.service: Succeeded.
systemd[19935]: podman.service: Unit process 19981 (podman pause) remains running after unit stopped.
systemd[19935]: Finished Podman API Service.
systemd[19935]: podman.service: Found left-over process 19981 (podman pause) in control group while starting unit. Ignoring.
systemd[19935]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
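Both systemd messages about the stray process carry its PID. A minimal sketch that extracts those PIDs from journal text, assuming the message wording shown above (it may differ between systemd versions); `leftover_pids` is a hypothetical name:

```shell
#!/bin/sh
# Hypothetical helper: read journal text on stdin and print the PIDs that systemd
# reported as left over by a unit, one per line, without duplicates.
leftover_pids() {
    sed -n \
        -e 's/.*Unit process \([0-9][0-9]*\) .* remains running after unit stopped.*/\1/p' \
        -e 's/.*Found left-over process \([0-9][0-9]*\) .*/\1/p' |
        sort -u
}

# Usage, for example:
#   journalctl --user-unit=podman.service | leftover_pids
```

The resulting PIDs can then be compared against the output of pgrep from step 4.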

Describe the results you expected:

podman.service stops cleanly, and restarts work.

Additional information you deem important (e.g. issue happens only occasionally):

I noticed this when enabling the user-container tests of cockpit-podman in a Fedora dist-git gating pull request.

As a workaround, I drop KillMode from the unit:

# avoid leftover user podman processes between login sessions
mkdir -p /etc/systemd/user/podman.service.d
printf '[Service]\nKillMode=\n' > /etc/systemd/user/podman.service.d/cleanup.conf
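The drop-in above lives in /etc/systemd/user and therefore applies to every user. For a single user, a similar override can go into that user's own unit directory. The sketch below is hypothetical (`install_killmode_dropin` and the file name `killmode.conf` are made up for illustration), and it sets KillMode=control-group explicitly, which is systemd's default, instead of using an empty assignment.

```shell
#!/bin/sh
# Hypothetical helper: write a podman.service drop-in that selects the default
# KillMode=control-group, so stopping the unit signals every process in its cgroup.
# The target directory defaults to the per-user systemd unit directory.
install_killmode_dropin() {
    unit_dir="${1:-${XDG_CONFIG_HOME:-$HOME/.config}/systemd/user}"
    dropin_dir="$unit_dir/podman.service.d"
    mkdir -p "$dropin_dir" || return 1
    printf '[Service]\nKillMode=control-group\n' > "$dropin_dir/killmode.conf"
}

# Usage (per user):
#   install_killmode_dropin
#   systemctl --user daemon-reload
```

The `daemon-reload` step makes the running user manager pick up the new drop-in.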

Output of podman version:

Version:      2.1.0-dev
API Version:  1
Go Version:   go1.14.4
Built:        Thu Jan  1 00:00:00 1970
OS/Arch:      linux/amd64

Output of podman info --debug:

host:
  arch: amd64
  buildahVersion: 1.16.0-dev
  cgroupVersion: v2
  conmon:
    package: conmon-2.0.19-0.5.dev.giteff699e.fc33.x86_64
    path: /usr/bin/conmon
    version: 'conmon version 2.0.19-dev, commit: 3c47d3797172bffa8ab02661ac4805b593cfb4ba'
  cpus: 3
  distribution:
    distribution: fedora
    version: "33"
  eventLogger: file
  hostname: localhost
  idMappings:
    gidmap:
    - container_id: 0
      host_id: 1001
      size: 1
    - container_id: 1
      host_id: 165536
      size: 65536
    uidmap:
    - container_id: 0
      host_id: 1001
      size: 1
    - container_id: 1
      host_id: 165536
      size: 65536
  kernel: 5.8.0-0.rc5.1.fc33.x86_64
  linkmode: dynamic
  memFree: 261386240
  memTotal: 2075725824
  ociRuntime:
    name: crun
    package: crun-0.14-1.fc33.x86_64
    path: /usr/bin/crun
    version: |-
      crun version 0.14
      commit: ebc56fc9bcce4b3208bb0079636c80545122bf58
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +YAJL
  os: linux
  remoteSocket:
    exists: true
    path: /run/user/1001/podman/podman.sock
  rootless: true
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: slirp4netns-1.1.4-2.dev.git4c6befe.fc33.x86_64
    version: |-
      slirp4netns version 1.1.4+dev
      commit: 4c6befe05c3137232cf06a5c2879daf4c20be6b1
      libslirp: 4.3.1
      SLIRP_CONFIG_VERSION_MAX: 3
  swapFree: 0
  swapTotal: 0
  uptime: 32m 57.27s
registries:
  search:
  - registry.fedoraproject.org
  - registry.access.redhat.com
  - registry.centos.org
  - docker.io
store:
  configFile: /home/admin/.config/containers/storage.conf
  containerStore:
    number: 0
    paused: 0
    running: 0
    stopped: 0
  graphDriverName: overlay
  graphOptions:
    overlay.mount_program:
      Executable: /usr/bin/fuse-overlayfs
      Package: fuse-overlayfs-1.1.0-6.dev.git50ab2c2.fc33.x86_64
      Version: |-
        fusermount3 version: 3.9.2
        fuse-overlayfs: version 1.1.0
        FUSE library version 3.9.2
        using FUSE kernel interface version 7.31
  graphRoot: /home/admin/.local/share/containers/storage
  graphStatus:
    Backing Filesystem: extfs
    Native Overlay Diff: "false"
    Supports d_type: "true"
    Using metacopy: "false"
  imageStore:
    number: 3
  runRoot: /run/user/1001/containers
  volumePath: /home/admin/.local/share/containers/storage/volumes
version:
  APIVersion: 1
  Built: 0
  BuiltTime: Thu Jan  1 00:00:00 1970
  GitCommit: ""
  GoVersion: go1.14.4
  OsArch: linux/amd64
  Version: 2.1.0-dev

Package info (e.g. output of rpm -q podman or apt list podman):

podman-2.1.0-0.77.dev.git60127cf.fc33.x86_64

Additional environment details (AWS, VirtualBox, physical, etc.):

Local Fedora Rawhide VM (from dist-git gating)

openshift-ci-robot added the kind/bug label Jul 20, 2020
vrothberg self-assigned this Jul 20, 2020
@vrothberg
Member

Thanks for opening the issue, @martinpitt! I'll take a look.

@vrothberg
Member

Opened #7023 along with some other cleanups.

vrothberg added a commit to vrothberg/libpod that referenced this issue Jul 20, 2020
Do not set the killmode to process as it only kills the main process and
leaves other processes untouched.  Just remove the line and use the
default cgroup killmode which will kill all processes in the service's
cgroup.

Fixes: containers#7021
Signed-off-by: Valentin Rothberg <[email protected]>
vrothberg added a commit to vrothberg/libpod that referenced this issue Jul 21, 2020
jhvst added a commit to jhvst/stateless-fcos that referenced this issue Apr 27, 2021
more about flaky cache:
- containers/podman#7021
- containers/podman#7294

using mv does not work across boots, running rm -rf seems too adventurous
github-actions bot added the locked - please file new issue/PR label Sep 23, 2023
github-actions bot locked as resolved and limited conversation to collaborators Sep 23, 2023