Deadlock when RestartPolicy is used #14921

Closed · tyler92 opened this issue Jul 13, 2022 · 4 comments
Labels: kind/bug · locked - please file new issue/PR · stale-issue

Comments

tyler92 (Contributor) commented Jul 13, 2022

Is this a BUG REPORT or FEATURE REQUEST? (leave only one on its own line)

/kind bug

Description

Sometimes Podman is unable to do anything, even podman info. This happens when a container with RestartPolicy=Always is restarting very frequently. It looks like a deadlock, and I've investigated it.

Steps to reproduce the issue:

  1. Create the following kube yaml:
apiVersion: v1
kind: Pod
metadata:
  labels:
    app: my-pod
  name: my-pod
spec:
  containers:
  - name: app
    image: debian
    imagePullPolicy: Never
    command:
    - /bin/sleep
    args:
    - 0.001
  hostNetwork: true
  restartPolicy: Always
  2. Run the following script:
#!/bin/bash

set -o errexit

# Repeatedly create and tear down the pod until podman hangs.
for x in {1..10000}; do
    echo "* $x *"
    podman play kube ./my-pod.yaml
    podman pod rm -f -a
    podman rm -a
done
  3. Observe the script output until the deadlock occurs.

Describe the results you received:
The script hangs and does not exit.

Describe the results you expected:
The script runs to completion and exits with a success exit code.

Additional information you deem important (e.g. issue happens only occasionally):
This issue happens occasionally, and the probability depends on the restart rate. I also added log messages to the SHMLock::Lock and SHMLock::Unlock functions (before and after each operation); here is what I got:

Process 'podman pod rm -f -a':

Lock 0
Locked 0
Lock 1
Locked 1
Lock 2

Process 'podman ... container cleanup ...':

Lock 2
Locked 2
Lock 1
Locked 1
Unlock 1
Unlocked 1
Lock 1

So the first process has locked Lock 1 and is trying to lock Lock 2, while the second process has locked Lock 2 and is trying to lock Lock 1. The two mutexes are being locked in inverse order.
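
For illustration, here is a minimal Go sketch of the same lock-order inversion, using plain sync.Mutex rather than Podman's SHM locks (the roles assigned to each goroutine are hypothetical): one goroutine takes the locks in the order 1 then 2, like 'podman pod rm', while the other takes them in the order 2 then 1, like the cleanup process. With unlucky timing, each ends up waiting for the lock the other holds:

package main

import "sync"

// A hypothetical, minimal reproduction of the inverse-order locking
// seen in the logs above. It uses ordinary sync.Mutex, not Podman's
// SHM locks, but the failure mode is the same.
func main() {
	var lock1, lock2 sync.Mutex
	var wg sync.WaitGroup
	wg.Add(2)

	// Plays the role of 'podman pod rm': Lock 1, then Lock 2.
	go func() {
		defer wg.Done()
		lock1.Lock()
		defer lock1.Unlock()
		lock2.Lock() // waits forever if the other goroutine holds lock2
		defer lock2.Unlock()
	}()

	// Plays the role of container cleanup: Lock 2, then Lock 1.
	go func() {
		defer wg.Done()
		lock2.Lock()
		defer lock2.Unlock()
		lock1.Lock() // waits forever if the other goroutine holds lock1
		defer lock1.Unlock()
	}()

	// With unlucky timing both goroutines block and the Go runtime
	// aborts with "all goroutines are asleep - deadlock!".
	wg.Wait()
}

The standard cure is a global lock ordering: every code path that needs both locks must acquire them in the same order, which is what the fix referenced below applies.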

Output of podman version:

Client:       Podman Engine
Version:      4.1.1
API Version:  4.1.1
Go Version:   go1.18rc1
Built:        Thu Jan  1 03:00:00 1970
OS/Arch:      linux/amd64

Output of podman info --debug:

host:
  arch: amd64
  buildahVersion: 1.26.1
  cgroupControllers: []
  cgroupManager: cgroupfs
  cgroupVersion: v1
  conmon:
    package: conmon_100:2.1.0-2_amd64
    path: /usr/libexec/podman/conmon
    version: 'conmon version 2.1.0, commit: '
  cpuUtilization:
    idlePercent: 91.36
    systemPercent: 2.69
    userPercent: 5.95
  cpus: 8
  distribution:
    codename: focal
    distribution: ubuntu
    version: "20.04"
  eventLogger: journald
  hostname: misha-pc
  idMappings:
    gidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 100000
      size: 65536
    uidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 100000
      size: 65536
  kernel: 5.4.0-121-generic
  linkmode: dynamic
  logDriver: journald
  memFree: 1091043328
  memTotal: 16425828352
  networkBackend: cni
  ociRuntime:
    name: crun
    package: crun_100:1.2-2_amd64
    path: /usr/bin/crun
    version: |-
      crun version UNKNOWN
      commit: ea1fe3938eefa14eb707f1d22adff4db670645d6
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +CRIU +YAJL
  os: linux
  remoteSocket:
    path: /run/user/1000/podman/podman.sock
  security:
    apparmorEnabled: false
    capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: true
    seccompEnabled: true
    seccompProfilePath: /usr/share/containers/seccomp.json
    selinuxEnabled: false
  serviceIsRemote: false
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: slirp4netns_100:1.1.8-4_amd64
    version: |-
      slirp4netns version 1.1.8
      commit: unknown
      libslirp: 4.3.1-git
      SLIRP_CONFIG_VERSION_MAX: 3
      libseccomp: 2.4.3
  swapFree: 1420554240
  swapTotal: 2147479552
  uptime: 173h 26m 33.94s (Approximately 7.21 days)
plugins:
  log:
  - k8s-file
  - none
  - passthrough
  - journald
  network:
  - bridge
  - macvlan
  - ipvlan
  volume:
  - local
registries:
  search:
  - docker.io
  - quay.io
store:
  configFile: /home/misha/.config/containers/storage.conf
  containerStore:
    number: 0
    paused: 0
    running: 0
    stopped: 0
  graphDriverName: overlay
  graphOptions: {}
  graphRoot: /home/misha/.local/share/containers/storage
  graphRootAllocated: 205349208064
  graphRootUsed: 185804062720
  graphStatus:
    Backing Filesystem: extfs
    Native Overlay Diff: "false"
    Supports d_type: "true"
    Using metacopy: "false"
  imageCopyTmpDir: /var/tmp
  imageStore:
    number: 7
  runRoot: /run/user/1000/containers
  volumePath: /home/misha/.local/share/containers/storage/volumes
version:
  APIVersion: 4.1.1
  Built: 0
  BuiltTime: Thu Jan  1 03:00:00 1970
  GitCommit: ""
  GoVersion: go1.18rc1
  Os: linux
  OsArch: linux/amd64
  Version: 4.1.1

Package info (e.g. output of rpm -q podman or apt list podman):

manual build

Have you tested with the latest version of Podman and have you checked the Podman Troubleshooting Guide? (https://github.com/containers/podman/blob/main/troubleshooting.md)

Yes

Additional environment details (AWS, VirtualBox, physical, etc.):

openshift-ci bot added the kind/bug label Jul 13, 2022
tyler92 (Contributor, Author) commented Jul 13, 2022

I tried taking the Pod's lock before locking the container, and the bug no longer reproduces. But that is a dirty hack to check my theory.

mheon (Member) commented Jul 13, 2022

I'll look at this today.

tyler92 added a commit to tyler92/podman that referenced this issue Jul 19, 2022
There was a deadlock between two concurrent processes: play kube and the
cleanup process that runs after a container exits when RestartPolicy is used.
Before the fix, the cleanup command did not take the Pod's lock, so the two
processes could acquire the two locks in opposite orders.

[NO NEW TESTS NEEDED]

Closes containers#14921

Signed-off-by: Mikhail Khachayants <[email protected]>
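
To make the idea behind the fix concrete, here is a rough Go sketch with hypothetical Pod and Container types (the names and fields are illustrative, not the actual libpod API): every path that needs both locks takes the pod's lock before the container's lock, so the inverse-order acquisition described above cannot happen.

package main

import "sync"

// Hypothetical stand-ins for Podman's pod and container objects;
// the names and fields are illustrative, not the real libpod API.
type Pod struct{ mu sync.Mutex }

type Container struct {
	mu  sync.Mutex
	pod *Pod // nil if the container does not belong to a pod
}

// cleanup sketches the fixed ordering: take the pod's lock first,
// then the container's lock. As long as every other path (such as
// pod removal) acquires them in the same order, the inverse-order
// deadlock described in this issue cannot occur.
func cleanup(c *Container) {
	if c.pod != nil {
		c.pod.mu.Lock()
		defer c.pod.mu.Unlock()
	}
	c.mu.Lock()
	defer c.mu.Unlock()
	// ... perform the actual cleanup while both locks are held ...
}

func main() {
	p := &Pod{}
	cleanup(&Container{pod: p})
}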
github-actions bot commented

A friendly reminder that this issue had no activity for 30 days.

tyler92 (Contributor, Author) commented Aug 15, 2022

This issue no longer reproduces with the 'main' branch. But I still see other deadlocks; I'll report them as separate issues. Thanks.

rhatdan closed this as completed Aug 15, 2022
github-actions bot added the locked - please file new issue/PR label Sep 19, 2023
github-actions bot locked as resolved and limited conversation to collaborators Sep 19, 2023