
Podman lock contention when attempting to restart multiple containers #11940

Closed
gcs278 opened this issue Oct 12, 2021 · 28 comments
Labels
kind/bug · locked - please file new issue/PR · needs-design-doc

Comments

@gcs278

gcs278 commented Oct 12, 2021

Is this a BUG REPORT or FEATURE REQUEST? (leave only one on its own line)

/kind bug

Description

With a restart policy of always or on-failure, podman seems to really struggle and potentially deadlock when it is restarting multiple containers that are constantly exiting. I first noticed this problem using podman play kube, where a couple of containers were constantly dying and the restart policy was always. I then added a script with just exit 1 as the entrypoint and watched podman commands begin to hang longer.

I started 8 containers that run exit 1 with --restart=always via podman run, and podman commands took around 60 seconds to return. After about a minute, podman seemed to deadlock: podman commands weren't returning and I couldn't stop any of the dying containers. I ran rm -f /dev/shm/libpod_lock and pkill podman to release the deadlock.

This is a big problem for us, as we can't trust podman to restart containers without deadlocking. This seems related to #11589, but I thought it would be better to track it separately since it's a different situation.

Steps to reproduce the issue:

podman run -d --restart=always --entrypoint="" image_name bash -c "exit 1"
podman run -d --restart=always --entrypoint="" image_name bash -c "exit 1"
podman run -d --restart=always --entrypoint="" image_name bash -c "exit 1"
podman run -d --restart=always --entrypoint="" image_name bash -c "exit 1"
podman run -d --restart=always --entrypoint="" image_name bash -c "exit 1"
podman run -d --restart=always --entrypoint="" image_name bash -c "exit 1"
podman run -d --restart=always --entrypoint="" image_name bash -c "exit 1"
podman run -d --restart=always --entrypoint="" image_name bash -c "exit 1"

Then run podman commands like podman ps and see if podman deadlocks.
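
For convenience, the same reproduction in loop form (image_name is a placeholder for any image that provides bash):

for i in $(seq 1 8); do
    podman run -d --restart=always --entrypoint="" image_name bash -c "exit 1"
done
# then time a read-only command while the containers churn
time podman ps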

Describe the results you received:
Podman gets extremely sluggish and then deadlocks

Describe the results you expected:
Podman wouldn't deadlock

Additional information you deem important (e.g. issue happens only occasionally):

Output of podman version:

podman version 3.2.3

Output of podman info --debug:

host:
  arch: amd64
  buildahVersion: 1.21.3
  cgroupControllers:
  - cpuset
  - cpu
  - cpuacct
  - blkio
  - memory
  - devices
  - freezer
  - net_cls
  - perf_event
  - net_prio
  - hugetlb
  - pids
  - rdma
  cgroupManager: systemd
  cgroupVersion: v1
  conmon:
    package: conmon-2.0.29-1.module+el8.4.0+11822+6cc1e7d7.x86_64
    path: /usr/bin/conmon
    version: 'conmon version 2.0.29, commit: ae467a0c8001179d4d0adf4ada381108a893d7ec'
  cpus: 10
  distribution:
    distribution: '"rhel"'
    version: "8.2"
  eventLogger: file
  hostname: rhel82
  idMappings:
    gidmap: null
    uidmap: null
  kernel: 4.18.0-193.el8.x86_64
  linkmode: dynamic
  memFree: 136581120
  memTotal: 3884625920
  ociRuntime:
    name: runc
    package: runc-1.0.0-74.rc95.module+el8.4.0+11822+6cc1e7d7.x86_64
    path: /usr/bin/runc
    version: |-
      runc version spec: 1.0.2-dev
      go: go1.15.13
      libseccomp: 2.4.1
  os: linux
  remoteSocket:
    path: /run/podman/podman.sock
  security:
    apparmorEnabled: false
    capabilities: CAP_NET_RAW,CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: false
    seccompEnabled: true
    seccompProfilePath: /usr/share/containers/seccomp.json
    selinuxEnabled: true
  serviceIsRemote: false
  slirp4netns:
    executable: ""
    package: ""
    version: ""
  swapFree: 3990089728
  swapTotal: 4190105600
  uptime: 2159h 52m 24.42s (Approximately 89.96 days)
registries:
  registry:5000:
    Blocked: false
    Insecure: true
    Location: registry:5000
    MirrorByDigestOnly: false
    Mirrors: []
    Prefix: registry:5000
  search: ""
store:
  configFile: /etc/containers/storage.conf
  containerStore:
    number: 16
    paused: 0
    running: 0
    stopped: 16
  graphDriverName: overlay
  graphOptions:
    overlay.mountopt: nodev,metacopy=on
  graphRoot: /var/lib/containers/storage
  graphStatus:
    Backing Filesystem: xfs
    Native Overlay Diff: "false"
    Supports d_type: "true"
    Using metacopy: "true"
  imageStore:
    number: 77
  runRoot: /run/containers/storage
  volumePath: /var/lib/containers/storage/volumes
version:
  APIVersion: 3.2.3
  Built: 1627570963
  BuiltTime: Thu Jul 29 11:02:43 2021
  GitCommit: ""
  GoVersion: go1.15.7
  OsArch: linux/amd64
  Version: 3.2.3

Package info (e.g. output of rpm -q podman or apt list podman):

podman-3.2.3-0.11.module+el8.4.0+12050+ef972f71.x86_64

Have you tested with the latest version of Podman and have you checked the Podman Troubleshooting Guide? (https://github.com/containers/podman/blob/master/troubleshooting.md)

Yes

Additional environment details (AWS, VirtualBox, physical, etc.):

@openshift-ci openshift-ci bot added the kind/bug label Oct 12, 2021
@mheon
Member

mheon commented Oct 12, 2021

This doesn't seem like a deadlock - it seems more like Podman is constantly attempting to restart containers, so at least one container has its lock taken at all times, which makes ps take a long time to finish as it waits to acquire locks. After 5 minutes, I haven't been able to replicate a deadlock, though podman ps is taking upwards of a minute to successfully execute. It is absolutely blowing up the load average as well, loading 8 cores to ~80%. I think this is a rather inherent limitation of our daemonless architecture: each restart needs to launch a Podman cleanup process to handle it, which results in a massive process storm. It's why we strongly recommend using systemd-managed containers instead.

Is this a particularly slow system you're testing on? It could explain why things appear to deadlock. I'm fairly convinced there's no actual deadlock here, just a severely taxed system.
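
For reference, a minimal sketch of the systemd-managed alternative mentioned above, using podman generate systemd (the container name "web" and image_name are placeholders); systemd then owns the restart loop instead of Podman spawning cleanup processes:

podman create --name web --entrypoint="" image_name bash -c "exit 1"
podman generate systemd --new --name --restart-policy=always web \
    > /etc/systemd/system/container-web.service
systemctl daemon-reload
systemctl enable --now container-web.service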

@gcs278
Author

gcs278 commented Oct 12, 2021

Thanks for looking into this @mheon. Yeah, it's a dedicated server with 48 cores; the deadlocking is somewhat inconsistent for me. I tried it again and couldn't get it to deadlock, but other times it deadlocks after the first couple of restart cycles on 8 containers. I would let podman commands hang for 5-10 minutes before removing the lock file and killing processes.

I'm using podman play so I don't think there is an option for using systemd with podman play.

@github-actions

A friendly reminder that this issue had no activity for 30 days.

@github-actions

A friendly reminder that this issue had no activity for 30 days.

@vrothberg
Member

vrothberg commented Mar 22, 2022

FWIW, I think that podman ps is way too expensive. The lock of a single container is acquired and released ~ a dozen times just to query certain data (e.g., state, mappings, root FS, etc.). I think we need to optimize querying that data and put it into a single locked function (rather than N locked ones).

@vrothberg vrothberg self-assigned this Mar 22, 2022
@vrothberg
Member

I'll take a stab at it.

@vrothberg
Member

FWIW, I think that podman ps is way too expensive. The lock of a single container is acquired and released ~ a dozen times just to query certain data (e.g., state, mappings, root FS, etc.). I think we need to optimize querying that data and put it into a single locked function (rather than N locked ones).

Scratch that ... these operations are batched.
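
For context, "batched" means the container lock is taken once and all the fields ps needs are read under that single hold, rather than locking and unlocking per field. A schematic Go sketch of the idea (illustrative only, not the actual libpod API):

package main

import (
	"fmt"
	"sync"
)

// ctrState stands in for a container's state; the real libpod types and
// batching helpers look different.
type ctrState struct {
	sync.Mutex
	status   string
	rootFS   string
	mappings string
}

// snapshot reads everything under one lock acquisition instead of N.
func (c *ctrState) snapshot() (status, rootFS, mappings string) {
	c.Lock()
	defer c.Unlock()
	return c.status, c.rootFS, c.mappings
}

func main() {
	c := &ctrState{status: "running", rootFS: "overlay", mappings: "host"}
	fmt.Println(c.snapshot())
}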

@vrothberg vrothberg removed their assignment Mar 22, 2022
@mheon
Member

mheon commented Mar 25, 2022

I was seeing this earlier this week in a slightly different context (podman ps and podman rm -af), so I took a further look. Current observations support it being contention of the container locks, which is exacerbated by the number of parallel processes we run. I believe our algorithm is CPU cores * 3 + 1, which means that on my system I have 25 threads going for both podman ps and podman rm, each contending for CPU time and each aggressively trying to take locks for the containers they are operating on. In short, we aren't waiting on a single lock for a minute; we're waiting on a hundred locks for a second or two each. I don't really know if we can improve this easily.

One thought I have is to print results as they come, instead of all at once when the command is done. This isn't perfect, but it would make it a lot clearer to the user what is happening (at least, it will be obvious that the commands are not deadlocked).

@mheon
Member

mheon commented Mar 25, 2022

Other possible thought: randomize the order in which we act on containers. podman ps and podman rm were operating on the same set of containers in the same order, with one being a lot slower than the other, so ps was run second but caught up quickly and ended up waiting on locks until rm finished. Random ordering much improves our odds of getting containers that aren't in contention.
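
A sketch of the randomized-ordering idea (illustrative only, not Podman's actual code):

package main

import (
	"fmt"
	"math/rand"
)

func main() {
	// Hypothetical container IDs; workers would claim them in this shuffled
	// order, so two concurrent commands are unlikely to walk the same
	// containers in the same sequence.
	containers := []string{"ctr1", "ctr2", "ctr3", "ctr4"}
	rand.Shuffle(len(containers), func(i, j int) {
		containers[i], containers[j] = containers[j], containers[i]
	})
	fmt.Println(containers)
}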

@mheon
Member

mheon commented Mar 25, 2022

I added a bit of randomization to the ordering, but it wasn't enough - no appreciable increase in performance; there are still too many collisions (25 parallel jobs over 200 test containers means ps and stop, for example, are each working on 1/8 of the total containers at any given time - high odds of collisions, which cause lock contention, which causes ps to slow down...).

@vrothberg
Member

@mheon, that is a great trail you're on.

Maybe we should think in terms of a work pool rather than in terms of workers per caller. Could we have a global shared semaphore to limit the number of parallel batch workers? That would limit lock contention etc. AFAIK the locks are already fair.

@mheon
Member

mheon commented Mar 25, 2022

We do have a semaphore right now, but it's per-process, not global. Making it global is potentially interesting, if we can get a MP-safe shared-memory semaphore.
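
For illustration, the per-process limit described above can be sketched with a weighted semaphore sized at NumCPU*3+1 (schematic only, not Podman's code; the container IDs are made up). A second Podman process gets its own semaphore, so the global degree of parallelism stays unbounded:

package main

import (
	"context"
	"fmt"
	"runtime"
	"sync"

	"golang.org/x/sync/semaphore"
)

func main() {
	maxJobs := int64(runtime.NumCPU()*3 + 1) // the sizing discussed above
	sem := semaphore.NewWeighted(maxJobs)
	ctx := context.Background()

	containers := []string{"ctr1", "ctr2", "ctr3"} // hypothetical IDs
	var wg sync.WaitGroup
	for _, id := range containers {
		wg.Add(1)
		go func(id string) {
			defer wg.Done()
			if err := sem.Acquire(ctx, 1); err != nil {
				return
			}
			defer sem.Release(1)
			// the real worker would take this container's lock and act on it
			fmt.Println("processing", id)
		}(id)
	}
	wg.Wait()
}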

@mheon
Member

mheon commented Mar 25, 2022

Shared semaphore looks viable. My only concern is making sure that crashes and SIGKILL don't affect us - if, say, podman stop is running and using all available jobs, and then gets a SIGKILL, we want the semaphore to be released back to its maximum value.

@rhatdan
Member

rhatdan commented May 17, 2022

@mheon Any movement on this?

@mheon
Member

mheon commented May 18, 2022

Negative. Might be worth discussing at the cabal if we have time? I don't have a solid feel for how to fix this.

@tyler92
Contributor

tyler92 commented Jul 12, 2022

I have investigated this issue (it reproduces in my case too). A simple program based on the shm_lock code shows the following picture:

LockID = 1 (Pod)              owner PID = 462221
LockID = 2 (infra container)  owner PID = 462221
LockID = 3 (app container)    owner PID = 462207

462207 is the process that is started when a restart occurs - podman container cleanup
462221 is any other process, in my case podman pod rm -f -a

And these processes are deadlocked because they are waiting on each other (a lock-ordering problem).
The simplest way to reproduce it is to run the following script:

#!/bin/bash

set -o errexit

for x in {1..10000}; do
    echo "* $x *"
    podman play kube ./my-pod.yaml
    podman pod rm -f -a
    podman rm -a
done

where my-pod.yaml looks like:

apiVersion: v1
kind: Pod
metadata:
  labels:
    app: my-pod
  name: my-pod
spec:
  containers:
  - name: app
    image: debian
    imagePullPolicy: Never
    command:
    - /bin/sleep
    args:
    - 0.001
  hostNetwork: true
  restartPolicy: Always

@tyler92
Contributor

tyler92 commented Jul 12, 2022

So it looks like we should lock a container's pod before locking the container. Is that a good idea?
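
A minimal Go sketch of that ordering rule (illustrative, not libpod code): if every path that needs both locks always takes the pod lock before the container lock, the cleanup process and pod rm can still block each other, but they can no longer hold the locks in opposite orders and deadlock:

package main

import "sync"

type pod struct{ mu sync.Mutex }

type container struct {
	mu  sync.Mutex
	pod *pod
}

// withLocks enforces the order: pod lock first, then container lock.
func withLocks(c *container, fn func()) {
	if c.pod != nil {
		c.pod.mu.Lock()
		defer c.pod.mu.Unlock()
	}
	c.mu.Lock()
	defer c.mu.Unlock()
	fn()
}

func main() {
	p := &pod{}
	c := &container{pod: p}
	withLocks(c, func() {
		// e.g. the work done by "podman container cleanup" or "podman pod rm"
	})
}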

@mheon
Member

mheon commented Jul 12, 2022 via email

@tyler92
Contributor

tyler92 commented Jul 13, 2022

No problem: #14921

@umohnani8 umohnani8 changed the title from "Podman deadlocks when attempting to restart multiple containers" to "Podman lock contention when attempting to restart multiple containers" Jul 21, 2022
@github-actions

A friendly reminder that this issue had no activity for 30 days.

@rhatdan
Member

rhatdan commented Aug 23, 2022

@mheon Any progress on this?

@mheon
Member

mheon commented Aug 23, 2022

Negative. I don't think we have a good solution yet.

@github-actions

A friendly reminder that this issue had no activity for 30 days.

@vrothberg
Member

I'm using podman play so I don't think there is an option for using systemd with podman play.

@gcs278, running kube play under systemd works now. The podman-kube@ systemd template works, but I find Quadlet to be better suited.

FWIW, I had another look at the issue. I couldn't see any deadlocks, and ps performs much better than back in October '21. Podman's daemonless architecture makes it subject to lock contention, which hits pretty hard with --restart=always and failing containers.
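
For reference, both options in brief (all paths are placeholders):

# Option 1: the podman-kube@ systemd template
systemctl enable --now podman-kube@$(systemd-escape /path/to/my-pod.yaml).service

# Option 2: a Quadlet unit; quadlet generates my-pod.service from this file
cat > /etc/containers/systemd/my-pod.kube <<'EOF'
[Kube]
Yaml=/path/to/my-pod.yaml

[Install]
WantedBy=default.target
EOF
systemctl daemon-reload
systemctl start my-pod.service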

@vrothberg
Member

@rhatdan @mheon I feel like we can close this issue at this point. One thing to consider is changing kube play to stop defaulting to --restart=always for containers. I know it's for K8s compatibility, but I find it less appealing for the Podman use cases.

@vrothberg
Member

Cc: @Luap99 @giuseppe

@rhatdan
Member

rhatdan commented Jun 20, 2023

It's funny that we just had a discussion with a BU student where restart always might come in handy. Imagine you have two or more containers in a pod, or multiple pods, that require services from each other. In Compose you can set which containers need to come up before a second container starts.

In podman we start the containers sequentially, and if container A requires container B, then when container A fails the start fails without ever starting container B. If they all started simultaneously, then container A could fail, container B would succeed, and when container A restarted, container B would be running and we would get to a good state. I think in the current design container A keeps restarting and container B never gets a chance. If we fix this with simultaneous start, then restart always will make some sense.

@vrothberg
Member

I will go ahead and close the issue. As mentioned in #11940 (comment), things have improved considerably since the initial report in Oct '21. Feel free to drop a comment or reopen if you think otherwise.

@github-actions github-actions bot added the locked - please file new issue/PR label Sep 20, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 20, 2023