libpod: do not lock all containers on pod rm #14976

giuseppe · 2022-07-19T15:46:27Z

do not attempt to lock all containers on pod rm since it can cause
deadlocks when other podman cleanup processes are attempting to lock
the same containers in a different order.
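The hazard described above is a lock-ordering problem: one process that holds every container lock can block forever against another process that takes the same locks in a different order. The Go sketch below illustrates the general idea of holding at most one container lock at a time. It is not the patch itself; the names `container`, `removeOneAtATime` and `removeConcurrently` are hypothetical, and `sync.Mutex` stands in for libpod's real locks.

```go
package main

import (
	"fmt"
	"sync"
)

// container is a hypothetical stand-in for a libpod container; only the
// per-object lock matters for this illustration.
type container struct {
	name    string
	lock    sync.Mutex
	removed bool
}

// removeOneAtATime holds at most one container lock at any moment, so the
// order in which callers visit the containers cannot create a lock cycle.
// It returns how many containers this caller actually removed.
func removeOneAtATime(ctrs []*container) int {
	removed := 0
	for _, c := range ctrs {
		c.lock.Lock()
		if !c.removed {
			c.removed = true
			removed++
		}
		c.lock.Unlock()
	}
	return removed
}

// removeConcurrently walks the same containers from two goroutines in
// opposite orders, the situation that deadlocks if each side tries to hold
// all locks at once. It returns the total number of removals.
func removeConcurrently(ctrs []*container) int {
	reversed := make([]*container, len(ctrs))
	for i, c := range ctrs {
		reversed[len(ctrs)-1-i] = c
	}

	counts := make([]int, 2)
	var wg sync.WaitGroup
	for i, order := range [][]*container{ctrs, reversed} {
		wg.Add(1)
		go func(i int, order []*container) {
			defer wg.Done()
			counts[i] = removeOneAtATime(order)
		}(i, order)
	}
	wg.Wait()
	return counts[0] + counts[1]
}

func main() {
	ctrs := []*container{{name: "infra"}, {name: "app"}, {name: "sidecar"}}
	fmt.Println("containers removed:", removeConcurrently(ctrs))
}
```

Each container is removed exactly once no matter which goroutine reaches it first, and neither goroutine ever waits while holding a lock.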

[NO NEW TESTS NEEDED]

Closes: #14929

Signed-off-by: Giuseppe Scrivano [email protected]

Does this PR introduce a user-facing change?

Solved a race condition in `podman rm` and `podman pod rm` that could cause a deadlock of the Podman process

openshift-ci · 2022-07-19T15:46:34Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: giuseppe

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [giuseppe]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

vrothberg

Just nits, LGTM ... reproducer does not kick in anymore

libpod/runtime_pod_linux.go

vrothberg · 2022-07-19T15:58:08Z

It should also fix #14921

tyler92 · 2022-07-20T06:48:16Z

I ran a long-running test with this patch (based on the #14921 description) and got the following situation:

cleanup process tries to lock Pod
rm -a process tries to stop container in infinite loop (and Pod lock is locked by this process)

# podman ps -a
CONTAINER ID  IMAGE                               COMMAND     CREATED       STATUS           PORTS       NAMES
c8dbac5f8977  localhost/podman-pause:4.2.0-dev-0              12 hours ago  Up 12 hours ago              1e3e24676088-infra
bca34d1e0625  docker.io/library/debian:latest     0.001       12 hours ago  stopping                     test-container

It's definitely not the same situation, but it also looks like a deadlock.

giuseppe · 2022-07-20T08:26:38Z

> cleanup process tries to lock Pod
>
> rm -a process tries to stop container in infinite loop (and Pod lock is locked by this process)

I think we still need the patch to first lock the pod on container cleanup

EDIT: could you share the reproducer you are using?
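The ordering suggested above (pod lock first, then container lock) is an instance of a lock hierarchy: if every code path acquires locks in the same order, no cycle of waiters can form. A minimal Go sketch under that assumption follows. The types `pod` and `container` and the functions `cleanup`, `removePod` and `run` are hypothetical illustrations, not libpod's API.

```go
package main

import (
	"fmt"
	"sync"
)

// pod and container are hypothetical stand-ins; sync.Mutex replaces the
// real locks.
type pod struct{ lock sync.Mutex }

type container struct {
	lock    sync.Mutex
	pod     *pod // nil for a container that is not part of a pod
	touched int
}

// cleanup takes the pod lock (when there is one) before the container lock.
func cleanup(c *container) {
	if c.pod != nil {
		c.pod.lock.Lock()
		defer c.pod.lock.Unlock()
	}
	c.lock.Lock()
	defer c.lock.Unlock()
	c.touched++
}

// removePod follows the same order: pod lock first, then each container
// lock in turn, one at a time.
func removePod(p *pod, ctrs []*container) {
	p.lock.Lock()
	defer p.lock.Unlock()
	for _, c := range ctrs {
		c.lock.Lock()
		c.touched++
		c.lock.Unlock()
	}
}

// run starts one pod removal and one cleanup per container concurrently and
// returns the total number of times a container was handled.
func run(n int) int {
	p := &pod{}
	ctrs := make([]*container, n)
	for i := range ctrs {
		ctrs[i] = &container{pod: p}
	}

	var wg sync.WaitGroup
	wg.Add(n + 1)
	go func() { defer wg.Done(); removePod(p, ctrs) }()
	for _, c := range ctrs {
		go func(c *container) { defer wg.Done(); cleanup(c) }(c)
	}
	wg.Wait()

	total := 0
	for _, c := range ctrs {
		total += c.touched
	}
	return total
}

func main() {
	fmt.Println("operations completed:", run(3))
}
```

Because both paths go pod first and container second, a cleanup either runs entirely before the pod removal or waits for it to finish.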

tyler92 · 2022-07-20T08:45:53Z

> I think we still need the patch to first lock the pod on container cleanup

See #14969

> could you share the reproducer you are using?

Create the following kube yaml (pull image in advance):

apiVersion: v1
kind: Pod
metadata:
  labels:
    app: my-pod
  name: my-pod
spec:
  containers:
  - name: app
    image: debian
    imagePullPolicy: Never
    command:
    - /bin/sleep
    args:
    - 0.001
  hostNetwork: true
  restartPolicy: Always

Run the following script:

#!/bin/bash

set -o errexit

for x in {1..10000};
    do echo "* $x *"
    podman play kube ./my-pod.yaml
    podman pod rm -f -a
    podman rm -a
done

Observe script output until deadlock.
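Watching the output by hand can be replaced by a timeout. The Go sketch below runs the same three commands and stops at the first one that exceeds a deadline. The two-minute limit, the helper name `hangs`, and treating a timeout as a probable deadlock are assumptions for illustration, not part of the reproducer.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"os/exec"
	"time"
)

// stepTimeout is an arbitrary limit chosen for this sketch.
const stepTimeout = 2 * time.Minute

// hangs runs the command and reports true if it was still running when the
// timeout expired. exec.CommandContext kills the process in that case.
func hangs(timeout time.Duration, name string, args ...string) (bool, error) {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	err := exec.CommandContext(ctx, name, args...).Run()
	if errors.Is(ctx.Err(), context.DeadlineExceeded) {
		return true, nil
	}
	return false, err
}

func main() {
	// The three commands from the reproducer loop above.
	steps := [][]string{
		{"podman", "play", "kube", "./my-pod.yaml"},
		{"podman", "pod", "rm", "-f", "-a"},
		{"podman", "rm", "-a"},
	}
	for i := 1; i <= 10000; i++ {
		fmt.Printf("* %d *\n", i)
		for _, s := range steps {
			hung, err := hangs(stepTimeout, s[0], s[1:]...)
			if hung {
				fmt.Printf("iteration %d: %v exceeded %v, possible deadlock\n", i, s, stepTimeout)
				return
			}
			if err != nil {
				// Same behavior as `set -o errexit` in the shell script.
				fmt.Printf("iteration %d: %v failed: %v\n", i, s, err)
				return
			}
		}
	}
}
```

Note that this variant kills the timed-out process, which is convenient for counting iterations but destroys the state a debugger would need. To inspect a hung process, the shell script above is the better choice.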

Without your patch, the issue reproduces after several minutes.
With your patch, it takes a little longer (I was able to reproduce the issue twice).

tyler92 · 2022-07-20T08:49:57Z

> rm -a process tries to stop container in infinite loop

Stack with this infinite loop:

libpod.(*Container).WaitForExit (container_api.go:565) github.com/containers/podman/v4/libpod
libpod.(*Container).Wait (container_api.go:495) github.com/containers/podman/v4/libpod
libpod.(*ConmonOCIRuntime).UpdateContainerStatus (oci_conmon_linux.go:343) github.com/containers/podman/v4/libpod
libpod.(*ConmonOCIRuntime).KillContainer (oci_conmon_linux.go:399) github.com/containers/podman/v4/libpod
libpod.(*ConmonOCIRuntime).StopContainer (oci_conmon_linux.go:433) github.com/containers/podman/v4/libpod
libpod.(*Container).stop (container_internal.go:1291) github.com/containers/podman/v4/libpod
libpod.(*Runtime).removeContainer (runtime_ctr.go:708) github.com/containers/podman/v4/libpod
libpod.(*Runtime).removePod.func1 (runtime_pod_linux.go:236) github.com/containers/podman/v4/libpod
libpod.(*Runtime).removePod (runtime_pod_linux.go:237) github.com/containers/podman/v4/libpod
libpod.(*Runtime).RemovePod (runtime_pod.go:46) github.com/containers/podman/v4/libpod
abi.(*ContainerEngine).PodRm (pods.go:274) github.com/containers/podman/v4/pkg/domain/infra/abi
pods.removePods (rm.go:94) github.com/containers/podman/v4/cmd/podman/pods
pods.rm (rm.go:86) github.com/containers/podman/v4/cmd/podman/pods
cobra.(*Command).execute (command.go:872) github.com/spf13/cobra
cobra.(*Command).ExecuteC (command.go:990) github.com/spf13/cobra
cobra.(*Command).Execute (command.go:918) github.com/spf13/cobra
cobra.(*Command).ExecuteContext (command.go:911) github.com/spf13/cobra
main.Execute (root.go:99) main
main.main (main.go:40) main
runtime.main (proc.go:250) runtime
runtime.goexit (asm_amd64.s:1571) runtime

giuseppe · 2022-07-20T08:52:33Z

thanks! Could you also share the stack trace for the other process?

tyler92 · 2022-07-20T09:09:00Z

> Could you also share the stack trace for the other process?

To be honest, I don't know how to do this. If I attach to the process via GDB or GoLand, there are no Go functions in the stack, just C code (thread mutex locking):

Thread 10 (LWP 2040534 "podman"):
#0  0x00007f8753f39c9b in ?? ()
#1  0x00000000004387b9 in runtime.newAllocBits (nelems=18446744073709551104, ~r0=<optimized out>) at /home/misha/go-18/go1.18rc1/src/runtime/mheap.go:2057
#2  0x7bbc9aee71ee8e00 in ?? ()
#3  0x000000c00052f040 in ?? ()
#4  0x000000c0003a7d90 in ?? ()
#5  0x00007f872c0d3040 in ?? ()
#6  0x000000c0003a8000 in ?? ()
#7  0x0000000000000030 in ?? ()
#8  0x000000c00052f040 in ?? ()
#9  0x00007f872cedde07 in ?? ()
#10 0x0000000001415c10 in take_mutex (mutex=0x7f872c0d3040) at shm_lock.c:27
#11 0x00000000014163d2 in lock_semaphore (shm=<optimized out>, sem_index=<optimized out>) at shm_lock.c:516
#12 0x0000000001415b1e in _cgo_884153080b96_Cfunc_lock_semaphore (v=0xc0003a7d90) at cgo-gcc-prolog:165
#13 0x0000000000479544 in runtime.asmcgocall () at /home/misha/go-18/go1.18rc1/src/runtime/asm_amd64.s:821
#14 0x0000000000000001 in ?? ()
#15 0x000000c000bd1d00 in ?? ()
#16 0x000000c0003a7758 in ?? ()
#17 0x000000000047ba66 in time.now () at /home/misha/go-18/go1.18rc1/src/runtime/time_linux_amd64.s:52
#18 0x00000000004776c9 in runtime.systemstack () at /home/misha/go-18/go1.18rc1/src/runtime/asm_amd64.s:469
#19 0x00007f8724ff896f in ?? ()
#20 0x0000000000800000 in google.golang.org/protobuf/proto.UnmarshalOptions.unmarshalList (o=..., b=<error reading variable: access outside bounds of object referenced via synthetic pointer>, wtyp=0 '\000', list=..., fd=..., n=<optimized out>, err=...) at /home/misha/work/relax/podman/vendor/google.golang.org/protobuf/proto/decode_gen.go:298
#21 0x0000000000000000 in ?? ()

giuseppe · 2022-07-20T09:18:10Z

> To be honest, I don't know how to do this. If I attach to the process via GDB or GoLand, there are no Go functions in the stack, just C code (thread mutex locking)

What do you get if you run `t apply all bt` under gdb?
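As general Go background: gdb lists OS threads, while the state of a Go program lives mostly in goroutines, so a goroutine dump is often easier to read than a thread backtrace. By default the Go runtime prints all goroutine stacks when it receives SIGQUIT, unless the program installs its own handler. A process can also produce the dump itself, as in the sketch below. `dumpGoroutines` is a hypothetical helper, and nothing in this thread shows podman exposing such a hook.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime/pprof"
)

// dumpGoroutines returns the stacks of all goroutines in this process.
// Debug level 2 uses the same text format as an unrecovered panic.
func dumpGoroutines() string {
	var buf bytes.Buffer
	if err := pprof.Lookup("goroutine").WriteTo(&buf, 2); err != nil {
		return ""
	}
	return buf.String()
}

func main() {
	// A goroutine that never gets its value, standing in for a stuck
	// lock waiter.
	blocked := make(chan struct{})
	go func() { <-blocked }()

	fmt.Print(dumpGoroutines())
}
```

The dump names the Go function each goroutine is in, which is the information missing from the thread-level view.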

tyler92 · 2022-07-20T09:27:41Z

Output of thread apply all bt:

(gdb) thread apply all bt

Thread 14 (LWP 2040712 "podman"):
#0 runtime.futex () at /home/misha/go-18/go1.18rc1/src/runtime/sys_linux_amd64.s:553
#1 0x0000000000442136 in runtime.futexsleep (addr=0xfffffffffffffe00, val=0, ns=4699971) at /home/misha/go-18/go1.18rc1/src/runtime/os_linux.go:66
#2 0x0000000000419607 in runtime.notesleep (n=0xfffffffffffffe00) at /home/misha/go-18/go1.18rc1/src/runtime/lock_futex.go:159
#3 0x000000000044c58d in runtime.mPark () at /home/misha/go-18/go1.18rc1/src/runtime/proc.go:1449
#4 runtime.stopm () at /home/misha/go-18/go1.18rc1/src/runtime/proc.go:2228
#5 0x000000000044da65 in runtime.findrunnable (gp=, inheritTime=) at /home/misha/go-18/go1.18rc1/src/runtime/proc.go:2804
#6 0x000000000044e999 in runtime.schedule () at /home/misha/go-18/go1.18rc1/src/runtime/proc.go:3187
#7 0x000000000044eeed in runtime.park_m (gp=0xc0000c1380) at /home/misha/go-18/go1.18rc1/src/runtime/proc.go:3336
#8 0x0000000000477643 in runtime.mcall () at /home/misha/go-18/go1.18rc1/src/runtime/asm_amd64.s:425
#9 0x00007f8727ffe95f in ?? ()
#10 0x0000000000800000 in google.golang.org/protobuf/proto.UnmarshalOptions.unmarshalList (o=..., b=, wtyp=0 '\000', list=..., fd=..., n=, err=...) at /home/misha/work/relax/podman/vendor/google.golang.org/protobuf/proto/decode_gen.go:298
#11 0x0000000000000000 in ?? ()

Thread 13 (LWP 2040538 "podman"):
#0 runtime.futex () at /home/misha/go-18/go1.18rc1/src/runtime/sys_linux_amd64.s:553
#1 0x0000000000442136 in runtime.futexsleep (addr=0xfffffffffffffe00, val=0, ns=4699971) at /home/misha/go-18/go1.18rc1/src/runtime/os_linux.go:66
#2 0x0000000000419607 in runtime.notesleep (n=0xfffffffffffffe00) at /home/misha/go-18/go1.18rc1/src/runtime/lock_futex.go:159
#3 0x000000000044c58d in runtime.mPark () at /home/misha/go-18/go1.18rc1/src/runtime/proc.go:1449
#4 runtime.stopm () at /home/misha/go-18/go1.18rc1/src/runtime/proc.go:2228
#5 0x000000000044da65 in runtime.findrunnable (gp=, inheritTime=) at /home/misha/go-18/go1.18rc1/src/runtime/proc.go:2804
#6 0x000000000044e999 in runtime.schedule () at /home/misha/go-18/go1.18rc1/src/runtime/proc.go:3187
#7 0x000000000044eeed in runtime.park_m (gp=0xc000582340) at /home/misha/go-18/go1.18rc1/src/runtime/proc.go:3336
#8 0x0000000000477643 in runtime.mcall () at /home/misha/go-18/go1.18rc1/src/runtime/asm_amd64.s:425
#9 0x00007f8726ffca4f in ?? ()
#10 0x0000000000800000 in google.golang.org/protobuf/proto.UnmarshalOptions.unmarshalList (o=..., b=, wtyp=0 '\000', list=..., fd=..., n=, err=...) at /home/misha/work/relax/podman/vendor/google.golang.org/protobuf/proto/decode_gen.go:298
#11 0x0000000000000000 in ?? ()

Thread 12 (LWP 2040537 "podman"):
#0 runtime.futex () at /home/misha/go-18/go1.18rc1/src/runtime/sys_linux_amd64.s:553
#1 0x0000000000442136 in runtime.futexsleep (addr=0xfffffffffffffe00, val=0, ns=4699971) at /home/misha/go-18/go1.18rc1/src/runtime/os_linux.go:66
#2 0x0000000000419607 in runtime.notesleep (n=0xfffffffffffffe00) at /home/misha/go-18/go1.18rc1/src/runtime/lock_futex.go:159
#3 0x000000000044c58d in runtime.mPark () at /home/misha/go-18/go1.18rc1/src/runtime/proc.go:1449
#4 runtime.stopm () at /home/misha/go-18/go1.18rc1/src/runtime/proc.go:2228
#5 0x000000000044da65 in runtime.findrunnable (gp=, inheritTime=) at /home/misha/go-18/go1.18rc1/src/runtime/proc.go:2804
#6 0x000000000044e999 in runtime.schedule () at /home/misha/go-18/go1.18rc1/src/runtime/proc.go:3187
#7 0x000000000044eeed in runtime.park_m (gp=0xc00052eea0) at /home/misha/go-18/go1.18rc1/src/runtime/proc.go:3336
#8 0x0000000000477643 in runtime.mcall () at /home/misha/go-18/go1.18rc1/src/runtime/asm_amd64.s:425
#9 0x00007f87277fd95f in ?? ()
#10 0x0000000000800000 in google.golang.org/protobuf/proto.UnmarshalOptions.unmarshalList (o=..., b=, wtyp=0 '\000', list=..., fd=..., n=, err=...) at /home/misha/work/relax/podman/vendor/google.golang.org/protobuf/proto/decode_gen.go:298
#11 0x0000000000000000 in ?? ()

Thread 11 (LWP 2040535 "podman"):
#0 runtime.epollwait () at /home/misha/go-18/go1.18rc1/src/runtime/sys_linux_amd64.s:699
#1 0x0000000000441e5c in runtime.netpoll (delay=, ~r0=...) at /home/misha/go-18/go1.18rc1/src/runtime/netpoll_epoll.go:126
#2 0x000000000044d793 in runtime.findrunnable (gp=, inheritTime=) at /home/misha/go-18/go1.18rc1/src/runtime/proc.go:2767
#3 0x000000000044e999 in runtime.schedule () at /home/misha/go-18/go1.18rc1/src/runtime/proc.go:3187
#4 0x000000000044eeed in runtime.park_m (gp=0xc00052eea0) at /home/misha/go-18/go1.18rc1/src/runtime/proc.go:3336
#5 0x0000000000477643 in runtime.mcall () at /home/misha/go-18/go1.18rc1/src/runtime/asm_amd64.s:425
#6 0x00007f8707ffe96f in ?? ()
#7 0x0000000000800000 in google.golang.org/protobuf/proto.UnmarshalOptions.unmarshalList (o=..., b=, wtyp=0 '\000', list=..., fd=..., n=, err=...) at /home/misha/work/relax/podman/vendor/google.golang.org/protobuf/proto/decode_gen.go:298
#8 0x0000000000000000 in ?? ()

Thread 10 (LWP 2040534 "podman"):
#0 0x00007f8753f39c9b in ?? ()
#1 0x00000000004387b9 in runtime.newAllocBits (nelems=18446744073709551104, ~r0=<optimized out>) at /home/misha/go-18/go1.18rc1/src/runtime/mheap.go:2057
#2 0x7bbc9aee71ee8e00 in ?? ()
#3 0x000000c00052f040 in ?? ()
#4 0x000000c0003a7d90 in ?? ()
#5 0x00007f872c0d3040 in ?? ()
#6 0x000000c0003a8000 in ?? ()
#7 0x0000000000000030 in ?? ()
#8 0x000000c00052f040 in ?? ()
#9 0x00007f872cedde07 in ?? ()
#10 0x0000000001415c10 in take_mutex (mutex=0x7f872c0d3040) at shm_lock.c:27
#11 0x00000000014163d2 in lock_semaphore (shm=<optimized out>, sem_index=<optimized out>) at shm_lock.c:516
#12 0x0000000001415b1e in _cgo_884153080b96_Cfunc_lock_semaphore (v=0xc0003a7d90) at cgo-gcc-prolog:165
#13 0x0000000000479544 in runtime.asmcgocall () at /home/misha/go-18/go1.18rc1/src/runtime/asm_amd64.s:821
#14 0x0000000000000001 in ?? ()
#15 0x000000c000bd1d00 in ?? ()
#16 0x000000c0003a7758 in ?? ()
#17 0x000000000047ba66 in time.now () at /home/misha/go-18/go1.18rc1/src/runtime/time_linux_amd64.s:52
#18 0x00000000004776c9 in runtime.systemstack () at /home/misha/go-18/go1.18rc1/src/runtime/asm_amd64.s:469
#19 0x00007f8724ff896f in ?? ()
#20 0x0000000000800000 in google.golang.org/protobuf/proto.UnmarshalOptions.unmarshalList (o=..., b=<error reading variable: access outside bounds of object referenced via synthetic pointer>, wtyp=0 '\000', list=..., fd=..., n=<optimized out>, err=...) at /home/misha/work/relax/podman/vendor/google.golang.org/protobuf/proto/decode_gen.go:298
#21 0x0000000000000000 in ?? ()

Thread 9 (LWP 2040533 "podman"):
#0 runtime.futex () at /home/misha/go-18/go1.18rc1/src/runtime/sys_linux_amd64.s:553
#1 0x0000000000442136 in runtime.futexsleep (addr=0xfffffffffffffe00, val=0, ns=4699971) at /home/misha/go-18/go1.18rc1/src/runtime/os_linux.go:66
#2 0x0000000000419805 in runtime.notetsleep_internal (n=0x24c4e60 <runtime.sig>, ns=-1, ~r0=) at /home/misha/go-18/go1.18rc1/src/runtime/lock_futex.go:182
#3 0x0000000000419925 in runtime.notetsleepg (n=0x24c4e60 <runtime.sig>, ns=-1, ~r0=) at /home/misha/go-18/go1.18rc1/src/runtime/lock_futex.go:236
#4 0x0000000000475d4f in os/signal.signal_recv (~r0=) at /home/misha/go-18/go1.18rc1/src/runtime/sigqueue.go:151
#5 0x00000000007a4319 in os/signal.loop () at /home/misha/go-18/go1.18rc1/src/os/signal/signal_unix.go:23
#6 0x0000000000479881 in runtime.goexit () at /home/misha/go-18/go1.18rc1/src/runtime/asm_amd64.s:1571
#7 0x0000000000000000 in ?? ()

Thread 8 (LWP 2040532 "podman"):
#0 runtime.futex () at /home/misha/go-18/go1.18rc1/src/runtime/sys_linux_amd64.s:553
#1 0x0000000000442136 in runtime.futexsleep (addr=0xfffffffffffffe00, val=0, ns=4699971) at /home/misha/go-18/go1.18rc1/src/runtime/os_linux.go:66
#2 0x0000000000419607 in runtime.notesleep (n=0xfffffffffffffe00) at /home/misha/go-18/go1.18rc1/src/runtime/lock_futex.go:159
#3 0x000000000044c58d in runtime.mPark () at /home/misha/go-18/go1.18rc1/src/runtime/proc.go:1449
#4 runtime.stopm () at /home/misha/go-18/go1.18rc1/src/runtime/proc.go:2228
#5 0x000000000044da65 in runtime.findrunnable (gp=, inheritTime=) at /home/misha/go-18/go1.18rc1/src/runtime/proc.go:2804
#6 0x000000000044e999 in runtime.schedule () at /home/misha/go-18/go1.18rc1/src/runtime/proc.go:3187
#7 0x000000000044eeed in runtime.park_m (gp=0xc0000c1520) at /home/misha/go-18/go1.18rc1/src/runtime/proc.go:3336
#8 0x0000000000477643 in runtime.mcall () at /home/misha/go-18/go1.18rc1/src/runtime/asm_amd64.s:425
#9 0x00007f8725ffa96f in ?? ()
#10 0x0000000000800000 in google.golang.org/protobuf/proto.UnmarshalOptions.unmarshalList (o=..., b=, wtyp=0 '\000', list=..., fd=..., n=, err=...) at /home/misha/work/relax/podman/vendor/google.golang.org/protobuf/proto/decode_gen.go:298
#11 0x0000000000000000 in ?? ()

Thread 7 (LWP 2040531 "podman"):
#0 runtime.futex () at /home/misha/go-18/go1.18rc1/src/runtime/sys_linux_amd64.s:553
#1 0x0000000000442136 in runtime.futexsleep (addr=0xfffffffffffffe00, val=0, ns=4699971) at /home/misha/go-18/go1.18rc1/src/runtime/os_linux.go:66
#2 0x0000000000419607 in runtime.notesleep (n=0xfffffffffffffe00) at /home/misha/go-18/go1.18rc1/src/runtime/lock_futex.go:159
#3 0x000000000044c58d in runtime.mPark () at /home/misha/go-18/go1.18rc1/src/runtime/proc.go:1449
#4 runtime.stopm () at /home/misha/go-18/go1.18rc1/src/runtime/proc.go:2228
#5 0x000000000044da65 in runtime.findrunnable (gp=, inheritTime=) at /home/misha/go-18/go1.18rc1/src/runtime/proc.go:2804
#6 0x000000000044e999 in runtime.schedule () at /home/misha/go-18/go1.18rc1/src/runtime/proc.go:3187
#7 0x000000000044eeed in runtime.park_m (gp=0xc000502340) at /home/misha/go-18/go1.18rc1/src/runtime/proc.go:3336
#8 0x0000000000477643 in runtime.mcall () at /home/misha/go-18/go1.18rc1/src/runtime/asm_amd64.s:425
#9 0x00007f87277fd95f in ?? ()
#10 0x0000000000800000 in google.golang.org/protobuf/proto.UnmarshalOptions.unmarshalList (o=..., b=, wtyp=0 '\000', list=..., fd=..., n=, err=...) at /home/misha/work/relax/podman/vendor/google.golang.org/protobuf/proto/decode_gen.go:298
#11 0x0000000000000000 in ?? ()

Thread 6 (LWP 2040530 "podman"):
#0 runtime.futex () at /home/misha/go-18/go1.18rc1/src/runtime/sys_linux_amd64.s:553
#1 0x0000000000442136 in runtime.futexsleep (addr=0xfffffffffffffe00, val=0, ns=4699971) at /home/misha/go-18/go1.18rc1/src/runtime/os_linux.go:66
#2 0x0000000000419607 in runtime.notesleep (n=0xfffffffffffffe00) at /home/misha/go-18/go1.18rc1/src/runtime/lock_futex.go:159
#3 0x000000000044c58d in runtime.mPark () at /home/misha/go-18/go1.18rc1/src/runtime/proc.go:1449
#4 runtime.stopm () at /home/misha/go-18/go1.18rc1/src/runtime/proc.go:2228
#5 0x000000000044da65 in runtime.findrunnable (gp=, inheritTime=) at /home/misha/go-18/go1.18rc1/src/runtime/proc.go:2804
#6 0x000000000044e999 in runtime.schedule () at /home/misha/go-18/go1.18rc1/src/runtime/proc.go:3187
#7 0x000000000044eeed in runtime.park_m (gp=0xc0000c16c0) at /home/misha/go-18/go1.18rc1/src/runtime/proc.go:3336
#8 0x0000000000477643 in runtime.mcall () at /home/misha/go-18/go1.18rc1/src/runtime/asm_amd64.s:425
#9 0x00007f87277fd95f in ?? ()
#10 0x0000000000800000 in google.golang.org/protobuf/proto.UnmarshalOptions.unmarshalList (o=..., b=, wtyp=0 '\000', list=..., fd=..., n=, err=...) at /home/misha/work/relax/podman/vendor/google.golang.org/protobuf/proto/decode_gen.go:298
#11 0x0000000000000000 in ?? ()

Thread 5 (LWP 2040529 "podman"):
#0 runtime.futex () at /home/misha/go-18/go1.18rc1/src/runtime/sys_linux_amd64.s:553
#1 0x0000000000442136 in runtime.futexsleep (addr=0xfffffffffffffe00, val=0, ns=4699971) at /home/misha/go-18/go1.18rc1/src/runtime/os_linux.go:66
#2 0x0000000000419607 in runtime.notesleep (n=0xfffffffffffffe00) at /home/misha/go-18/go1.18rc1/src/runtime/lock_futex.go:159
#3 0x000000000044c471 in runtime.templateThread () at /home/misha/go-18/go1.18rc1/src/runtime/proc.go:2206
#4 0x000000000044b073 in runtime.mstart1 () at /home/misha/go-18/go1.18rc1/src/runtime/proc.go:1418
#5 0x000000000044afb9 in runtime.mstart0 () at /home/misha/go-18/go1.18rc1/src/runtime/proc.go:1376
#6 0x00000000004775c5 in runtime.mstart () at /home/misha/go-18/go1.18rc1/src/runtime/asm_amd64.s:367
#7 0x0000000001410472 in crosscall_amd64 () at gcc_amd64.S:40
#8 0x00007f8726ffcec0 in ?? ()
#9 0x00007ffd267b8540 in ?? ()
#10 0x00007ffd267b844f in ?? ()
#11 0x00007ffd267b844e in ?? ()
#12 0x000000c000003860 in ?? ()
#13 0x00000000004775c0 in ?? ()
#14 0x000000000140fe34 in threadentry (v=) at gcc_linux_amd64.c:92
#15 0x00007f8753f37609 in ?? ()
#16 0x0000000000000000 in ?? ()

Thread 4 (LWP 2040528 "podman"):
#0 runtime.futex () at /home/misha/go-18/go1.18rc1/src/runtime/sys_linux_amd64.s:553
#1 0x0000000000442136 in runtime.futexsleep (addr=0xfffffffffffffe00, val=0, ns=4699971) at /home/misha/go-18/go1.18rc1/src/runtime/os_linux.go:66
#2 0x0000000000419607 in runtime.notesleep (n=0xfffffffffffffe00) at /home/misha/go-18/go1.18rc1/src/runtime/lock_futex.go:159
#3 0x000000000044c58d in runtime.mPark () at /home/misha/go-18/go1.18rc1/src/runtime/proc.go:1449
#4 runtime.stopm () at /home/misha/go-18/go1.18rc1/src/runtime/proc.go:2228
#5 0x000000000044da65 in runtime.findrunnable (gp=, inheritTime=) at /home/misha/go-18/go1.18rc1/src/runtime/proc.go:2804
#6 0x000000000044e999 in runtime.schedule () at /home/misha/go-18/go1.18rc1/src/runtime/proc.go:3187
#7 0x000000000044eeed in runtime.park_m (gp=0xc0000031e0) at /home/misha/go-18/go1.18rc1/src/runtime/proc.go:3336
#8 0x0000000000477643 in runtime.mcall () at /home/misha/go-18/go1.18rc1/src/runtime/asm_amd64.s:425
#9 0x00007ffd267b829f in ?? ()
#10 0x0000000000800000 in google.golang.org/protobuf/proto.UnmarshalOptions.unmarshalList (o=..., b=, wtyp=0 '\000', list=..., fd=..., n=, err=...) at /home/misha/work/relax/podman/vendor/google.golang.org/protobuf/proto/decode_gen.go:298
#11 0x0000000000000000 in ?? ()

Thread 3 (LWP 2040527 "podman"):
#0 runtime.futex () at /home/misha/go-18/go1.18rc1/src/runtime/sys_linux_amd64.s:553
#1 0x0000000000442136 in runtime.futexsleep (addr=0xfffffffffffffe00, val=0, ns=4699971) at /home/misha/go-18/go1.18rc1/src/runtime/os_linux.go:66
#2 0x0000000000419607 in runtime.notesleep (n=0xfffffffffffffe00) at /home/misha/go-18/go1.18rc1/src/runtime/lock_futex.go:159
#3 0x000000000044c58d in runtime.mPark () at /home/misha/go-18/go1.18rc1/src/runtime/proc.go:1449
#4 runtime.stopm () at /home/misha/go-18/go1.18rc1/src/runtime/proc.go:2228
#5 0x000000000044da65 in runtime.findrunnable (gp=, inheritTime=) at /home/misha/go-18/go1.18rc1/src/runtime/proc.go:2804
#6 0x000000000044e999 in runtime.schedule () at /home/misha/go-18/go1.18rc1/src/runtime/proc.go:3187
#7 0x000000000044eeed in runtime.park_m (gp=0xc000582000) at /home/misha/go-18/go1.18rc1/src/runtime/proc.go:3336
#8 0x0000000000477643 in runtime.mcall () at /home/misha/go-18/go1.18rc1/src/runtime/asm_amd64.s:425
#9 0x00007ffd267b834f in ?? ()
#10 0x0000000000800000 in google.golang.org/protobuf/proto.UnmarshalOptions.unmarshalList (o=..., b=, wtyp=0 '\000', list=..., fd=..., n=, err=...) at /home/misha/work/relax/podman/vendor/google.golang.org/protobuf/proto/decode_gen.go:298
#11 0x0000000000000000 in ?? ()

Thread 2 (LWP 2040526 "podman"):
#0 runtime.futex () at /home/misha/go-18/go1.18rc1/src/runtime/sys_linux_amd64.s:553
#1 0x00000000004421af in runtime.futexsleep (addr=, val=, ns=) at /home/misha/go-18/go1.18rc1/src/runtime/os_linux.go:72
#2 0x0000000000419745 in runtime.notetsleep_internal (n=0x2495918 <runtime.sched+248>, ns=90050320, ~r0=) at /home/misha/go-18/go1.18rc1/src/runtime/lock_futex.go:201
#3 0x0000000000419894 in runtime.notetsleep (n=0xfffffffffffffdfc, ns=0, ~r0=) at /home/misha/go-18/go1.18rc1/src/runtime/lock_futex.go:224
#4 0x00000000004534a9 in runtime.sysmon () at /home/misha/go-18/go1.18rc1/src/runtime/proc.go:5102
#5 0x000000000044b073 in runtime.mstart1 () at /home/misha/go-18/go1.18rc1/src/runtime/proc.go:1418
#6 0x000000000044afb9 in runtime.mstart0 () at /home/misha/go-18/go1.18rc1/src/runtime/proc.go:1376
#7 0x00000000004775c5 in runtime.mstart () at /home/misha/go-18/go1.18rc1/src/runtime/asm_amd64.s:367
#8 0x0000000001410472 in crosscall_amd64 () at gcc_amd64.S:40
#9 0x00007f872cd67ec0 in ?? ()
#10 0x00007ffd267b84a0 in ?? ()
#11 0x00007ffd267b83af in ?? ()
#12 0x00007ffd267b83ae in ?? ()
#13 0x000000c0000029c0 in ?? ()
#14 0x00000000004775c0 in ?? ()
#15 0x000000000140fe34 in threadentry (v=) at gcc_linux_amd64.c:92
#16 0x00007f8753f37609 in ?? ()
#17 0x0000000000000000 in ?? ()

Thread 1 (LWP 2040524 "podman"):
#0 runtime.futex () at /home/misha/go-18/go1.18rc1/src/runtime/sys_linux_amd64.s:553
#1 0x0000000000442136 in runtime.futexsleep (addr=0xfffffffffffffe00, val=0, ns=4699971) at /home/misha/go-18/go1.18rc1/src/runtime/os_linux.go:66
#2 0x0000000000419607 in runtime.notesleep (n=0xfffffffffffffe00) at /home/misha/go-18/go1.18rc1/src/runtime/lock_futex.go:159
#3 0x000000000044b165 in runtime.mPark () at /home/misha/go-18/go1.18rc1/src/runtime/proc.go:1449
#4 0x000000000044ccc5 in runtime.stoplockedm () at /home/misha/go-18/go1.18rc1/src/runtime/proc.go:2422
#5 0x000000000044e79d in runtime.schedule () at /home/misha/go-18/go1.18rc1/src/runtime/proc.go:3119
#6 0x000000000044eeed in runtime.park_m (gp=0xc000bda9c0) at /home/misha/go-18/go1.18rc1/src/runtime/proc.go:3336
#7 0x0000000000477643 in runtime.mcall () at /home/misha/go-18/go1.18rc1/src/runtime/asm_amd64.s:425
#8 0x000000000047bf45 in runtime.newproc (fn=0x0) at :1
#9 0x000000000248ece0 in ?? ()
#10 0x0000000000000000 in ?? ()

giuseppe · 2022-07-20T09:34:11Z

I am trying your reproducer locally and I am at iteration * 641 * without any luck reproducing it yet.

Could you try with Delve (`dlv attach $PID`)? Do you get any more useful information out of it?

Are there only two podman processes running?

tyler92 · 2022-07-20T09:38:11Z

> I am at iteration * 641 * without any luck reproducing it yet.

In my case it was iteration 1700.

> Are there only two podman processes running?

Yes

> Could you try with delve

I will try it a little later.

vrothberg · 2022-07-20T10:07:38Z

I ran my reproducer for hours and nothing has happened.

giuseppe · 2022-07-20T10:23:19Z

same here, cannot end up in that state.

Might be something unrelated

giuseppe · 2022-07-20T10:52:33Z

tests are green

vrothberg

LGTM
@mheon PTAL

mheon · 2022-07-20T17:20:26Z

libpod/runtime_pod_linux.go

 		}
 	}
+	if removalErr != nil {
+		return removalErr
+	}

 	// We're going to be removing containers.
 	// If we are Cgroupfs cgroup driver, to avoid races, we need to hit


Should this block be moved up, to ensure we're still preventing cleanup processes?

I think that would require locking all containers at the same time, which would reintroduce the deadlock.

@giuseppe WDYT?

could we just remove this hack?

Since there is no locking now for removing containers, the cleanup processes should not hang for too long so it is fine if they run.

I've pushed a new version without this code block; let's see if it causes any issue in the CI.


giuseppe · 2022-07-21T20:20:41Z

CI is green again

vrothberg

LGTM

mheon · 2022-07-22T17:24:19Z

Let's merge
/lgtm

openshift-ci bot added the release-note label Jul 19, 2022

openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 19, 2022

giuseppe mentioned this pull request Jul 19, 2022

deadlock(?) in podman rm(?) #14929

Closed

vrothberg reviewed Jul 19, 2022


tyler92 mentioned this pull request Jul 19, 2022

WIP: fix deadlock between play kube and cleanup #14969

Closed

giuseppe force-pushed the do-not-lock-containers-pod-rm branch from 6925454 to 5b2a562 Compare July 20, 2022 08:34

vrothberg reviewed Jul 20, 2022


mheon reviewed Jul 20, 2022


giuseppe force-pushed the do-not-lock-containers-pod-rm branch from 5b2a562 to 3dde31b Compare July 21, 2022 07:11

giuseppe force-pushed the do-not-lock-containers-pod-rm branch from 3dde31b to af118f7 Compare July 21, 2022 07:17

vrothberg reviewed Jul 22, 2022


openshift-ci bot assigned mheon Jul 22, 2022

openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jul 22, 2022

openshift-merge-robot merged commit 05618a5 into containers:main Jul 22, 2022

github-actions bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Sep 21, 2023

github-actions bot locked as resolved and limited conversation to collaborators Sep 21, 2023
