libpod: do not lock all containers on pod rm #14976

Merged

Conversation

giuseppe
Member

do not attempt to lock all containers on pod rm since it can cause
deadlocks when other podman cleanup processes are attempting to lock
the same containers in a different order.
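
As a rough illustration of the idea described above (not the actual libpod change), the sketch below locks each container only while it is being removed. The `container` type, `removeOneAtATime`, and `runConcurrently` are hypothetical names, and `sync.Mutex` stands in for libpod's shared-memory locks. Because no worker ever holds two container locks at once, two workers that walk the containers in opposite orders cannot block each other forever.

```go
package main

import (
	"fmt"
	"sync"
)

// container is a hypothetical stand-in for a libpod container.
// The real code uses shared-memory locks, not a sync.Mutex.
type container struct {
	name    string
	mu      sync.Mutex
	removed bool
}

// removeOneAtATime locks each container only for the duration of its own
// removal, so the caller never holds more than one container lock.
func removeOneAtATime(ctrs []*container) int {
	removed := 0
	for _, c := range ctrs {
		c.mu.Lock()
		if !c.removed {
			c.removed = true
			removed++
		}
		c.mu.Unlock()
	}
	return removed
}

// runConcurrently lets two workers process the same containers in opposite
// orders and returns how many removals happened in total.
func runConcurrently() int {
	a := &container{name: "infra"}
	b := &container{name: "app"}
	counts := make([]int, 2)

	var wg sync.WaitGroup
	wg.Add(2)
	go func() {
		defer wg.Done()
		counts[0] = removeOneAtATime([]*container{a, b})
	}()
	go func() {
		defer wg.Done()
		counts[1] = removeOneAtATime([]*container{b, a})
	}()
	wg.Wait()

	return counts[0] + counts[1]
}

func main() {
	fmt.Println("containers removed:", runConcurrently())
}
```

If each worker instead locked every container up front and only then started removing, the opposite orders could leave each worker holding one lock and waiting for the other.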

[NO NEW TESTS NEEDED]

Closes: #14929

Signed-off-by: Giuseppe Scrivano [email protected]

Does this PR introduce a user-facing change?

Solved a race condition in `podman rm` and `podman pod rm` that could cause a deadlock of the Podman process

@openshift-ci
Contributor

openshift-ci bot commented Jul 19, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: giuseppe

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved label Jul 19, 2022
Member

@vrothberg vrothberg left a comment


Just nits, LGTM ... reproducer does not kick in anymore

libpod/runtime_pod_linux.go (resolved)
libpod/runtime_pod_linux.go (outdated, resolved)
@vrothberg
Member

It should also fix #14921

@tyler92
Contributor

tyler92 commented Jul 20, 2022

I ran a long-running test with this patch (based on the description in #14921) and ended up in the following situation:

  1. cleanup process tries to lock Pod
  2. rm -a process tries to stop container in infinite loop (and Pod lock is locked by this process)
# podman ps -a
CONTAINER ID  IMAGE                               COMMAND     CREATED       STATUS           PORTS       NAMES
c8dbac5f8977  localhost/podman-pause:4.2.0-dev-0              12 hours ago  Up 12 hours ago              1e3e24676088-infra
bca34d1e0625  docker.io/library/debian:latest     0.001       12 hours ago  stopping                     test-container

It's definitely not the same situation, but it also looks like a deadlock.

@giuseppe
Member Author

giuseppe commented Jul 20, 2022

  • cleanup process tries to lock Pod
  • rm -a process tries to stop container in infinite loop (and Pod lock is locked by this process)

I think we still need the patch to first lock the pod on container cleanup

EDIT: could you share the reproducer you are using?
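
One way to read "first lock the pod on container cleanup" is as a fixed lock order: every code path takes the pod lock before any container lock. The sketch below only illustrates that ordering idea and is not taken from the patch in question. The `pod` and `ctr` types, the state strings, and the `cleanup`, `removePod`, and `simulate` functions are hypothetical, and `sync.Mutex` stands in for libpod's locks.

```go
package main

import (
	"fmt"
	"sync"
)

// pod and ctr are hypothetical stand-ins for libpod's pod and container.
type pod struct {
	mu sync.Mutex
}

type ctr struct {
	mu    sync.Mutex
	pod   *pod
	state string
}

// cleanup takes the pod lock first and the container lock second.
func cleanup(c *ctr) {
	c.pod.mu.Lock()
	defer c.pod.mu.Unlock()
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.state != "removed" {
		c.state = "exited"
	}
}

// removePod follows the same order: pod lock first, then each container
// lock, one at a time.
func removePod(p *pod, ctrs []*ctr) {
	p.mu.Lock()
	defer p.mu.Unlock()
	for _, c := range ctrs {
		c.mu.Lock()
		c.state = "removed"
		c.mu.Unlock()
	}
}

// simulate races a cleanup against a pod removal and returns the final
// container state.
func simulate() string {
	p := &pod{}
	c := &ctr{pod: p, state: "running"}

	var wg sync.WaitGroup
	wg.Add(2)
	go func() {
		defer wg.Done()
		cleanup(c)
	}()
	go func() {
		defer wg.Done()
		removePod(p, []*ctr{c})
	}()
	wg.Wait()

	return c.state
}

func main() {
	fmt.Println("final state:", simulate())
}
```

Since both paths agree on the order, whichever goroutine gets the pod lock first runs to completion before the other starts, so neither can end up holding one lock while waiting on the other.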

@giuseppe giuseppe force-pushed the do-not-lock-containers-pod-rm branch from 6925454 to 5b2a562 Compare July 20, 2022 08:34
@tyler92
Contributor

tyler92 commented Jul 20, 2022

I think we still need the patch to first lock the pod on container cleanup

See #14969

could you share the reproducer you are using?

  1. Create the following kube yaml (pull image in advance):
apiVersion: v1
kind: Pod
metadata:
  labels:
    app: my-pod
  name: my-pod
spec:
  containers:
  - name: app
    image: debian
    imagePullPolicy: Never
    command:
    - /bin/sleep
    args:
    - 0.001
  hostNetwork: true
  restartPolicy: Always
  2. Run the following script:
#!/bin/bash

set -o errexit

for x in {1..10000}; do
    echo "* $x *"
    podman play kube ./my-pod.yaml
    podman pod rm -f -a
    podman rm -a
done
  3. Watch the script output until it deadlocks.

Without your patch, the issue reproduces after several minutes.
With your patch, it takes a little longer (I was able to reproduce the issue twice).

@tyler92
Contributor

tyler92 commented Jul 20, 2022

rm -a process tries to stop container in infinite loop

Stack with this infinite loop:

libpod.(*Container).WaitForExit (container_api.go:565) github.com/containers/podman/v4/libpod
libpod.(*Container).Wait (container_api.go:495) github.com/containers/podman/v4/libpod
libpod.(*ConmonOCIRuntime).UpdateContainerStatus (oci_conmon_linux.go:343) github.com/containers/podman/v4/libpod
libpod.(*ConmonOCIRuntime).KillContainer (oci_conmon_linux.go:399) github.com/containers/podman/v4/libpod
libpod.(*ConmonOCIRuntime).StopContainer (oci_conmon_linux.go:433) github.com/containers/podman/v4/libpod
libpod.(*Container).stop (container_internal.go:1291) github.com/containers/podman/v4/libpod
libpod.(*Runtime).removeContainer (runtime_ctr.go:708) github.com/containers/podman/v4/libpod
libpod.(*Runtime).removePod.func1 (runtime_pod_linux.go:236) github.com/containers/podman/v4/libpod
libpod.(*Runtime).removePod (runtime_pod_linux.go:237) github.com/containers/podman/v4/libpod
libpod.(*Runtime).RemovePod (runtime_pod.go:46) github.com/containers/podman/v4/libpod
abi.(*ContainerEngine).PodRm (pods.go:274) github.com/containers/podman/v4/pkg/domain/infra/abi
pods.removePods (rm.go:94) github.com/containers/podman/v4/cmd/podman/pods
pods.rm (rm.go:86) github.com/containers/podman/v4/cmd/podman/pods
cobra.(*Command).execute (command.go:872) github.com/spf13/cobra
cobra.(*Command).ExecuteC (command.go:990) github.com/spf13/cobra
cobra.(*Command).Execute (command.go:918) github.com/spf13/cobra
cobra.(*Command).ExecuteContext (command.go:911) github.com/spf13/cobra
main.Execute (root.go:99) main
main.main (main.go:40) main
runtime.main (proc.go:250) runtime
runtime.goexit (asm_amd64.s:1571) runtime

@giuseppe
Member Author

thanks! Could you also share the stack trace for the other process?

@tyler92
Contributor

tyler92 commented Jul 20, 2022

Could you also share the stack trace for the other process?

To be honest, I don't know how to do this. If I attach to the process via GDB or GoLand, there are no Go functions in the stack, just C code (thread mutex locking)

Thread 10 (LWP 2040534 "podman"):
#0  0x00007f8753f39c9b in ?? ()
#1  0x00000000004387b9 in runtime.newAllocBits (nelems=18446744073709551104, ~r0=<optimized out>) at /home/misha/go-18/go1.18rc1/src/runtime/mheap.go:2057
#2  0x7bbc9aee71ee8e00 in ?? ()
#3  0x000000c00052f040 in ?? ()
#4  0x000000c0003a7d90 in ?? ()
#5  0x00007f872c0d3040 in ?? ()
#6  0x000000c0003a8000 in ?? ()
#7  0x0000000000000030 in ?? ()
#8  0x000000c00052f040 in ?? ()
#9  0x00007f872cedde07 in ?? ()
#10 0x0000000001415c10 in take_mutex (mutex=0x7f872c0d3040) at shm_lock.c:27
#11 0x00000000014163d2 in lock_semaphore (shm=<optimized out>, sem_index=<optimized out>) at shm_lock.c:516
#12 0x0000000001415b1e in _cgo_884153080b96_Cfunc_lock_semaphore (v=0xc0003a7d90) at cgo-gcc-prolog:165
#13 0x0000000000479544 in runtime.asmcgocall () at /home/misha/go-18/go1.18rc1/src/runtime/asm_amd64.s:821
#14 0x0000000000000001 in ?? ()
#15 0x000000c000bd1d00 in ?? ()
#16 0x000000c0003a7758 in ?? ()
#17 0x000000000047ba66 in time.now () at /home/misha/go-18/go1.18rc1/src/runtime/time_linux_amd64.s:52
#18 0x00000000004776c9 in runtime.systemstack () at /home/misha/go-18/go1.18rc1/src/runtime/asm_amd64.s:469
#19 0x00007f8724ff896f in ?? ()
#20 0x0000000000800000 in google.golang.org/protobuf/proto.UnmarshalOptions.unmarshalList (o=..., b=<error reading variable: access outside bounds of object referenced via synthetic pointer>, wtyp=0 '\000', list=..., fd=..., n=<optimized out>, err=...) at /home/misha/work/relax/podman/vendor/google.golang.org/protobuf/proto/decode_gen.go:298
#21 0x0000000000000000 in ?? ()
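
gdb walks OS threads rather than goroutines, which may be why the trace above shows mostly runtime and cgo frames. As a hedged sketch of how a Go program can print goroutine-level stacks from inside the process, the standard `runtime/pprof` package can write them out. The `goroutineDump` helper is a hypothetical name, and nothing here is claimed to be an existing Podman feature.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime/pprof"
)

// goroutineDump returns the stacks of all goroutines in the current process.
// debug=2 asks pprof for the same human-readable format the Go runtime
// prints when a program dies from an unrecovered panic.
func goroutineDump() string {
	var buf bytes.Buffer
	if err := pprof.Lookup("goroutine").WriteTo(&buf, 2); err != nil {
		return ""
	}
	return buf.String()
}

func main() {
	fmt.Print(goroutineDump())
}
```

For an already running binary, the Go runtime by default also prints all goroutine stacks and exits when it receives SIGQUIT, unless the program installs its own handler for that signal.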

@giuseppe
Member Author

To be honest, I don't know how to do this. If I attach to the process via GDB or GoLand, there are no Go functions in the stack, just C code (thread mutex locking)

What do you get if you run `t apply all bt` under gdb?

@tyler92
Contributor

tyler92 commented Jul 20, 2022

Output of thread apply all bt:

(gdb) thread apply all bt

Thread 14 (LWP 2040712 "podman"):
#0 runtime.futex () at /home/misha/go-18/go1.18rc1/src/runtime/sys_linux_amd64.s:553
#1 0x0000000000442136 in runtime.futexsleep (addr=0xfffffffffffffe00, val=0, ns=4699971) at /home/misha/go-18/go1.18rc1/src/runtime/os_linux.go:66
#2 0x0000000000419607 in runtime.notesleep (n=0xfffffffffffffe00) at /home/misha/go-18/go1.18rc1/src/runtime/lock_futex.go:159
#3 0x000000000044c58d in runtime.mPark () at /home/misha/go-18/go1.18rc1/src/runtime/proc.go:1449
#4 runtime.stopm () at /home/misha/go-18/go1.18rc1/src/runtime/proc.go:2228
#5 0x000000000044da65 in runtime.findrunnable (gp=, inheritTime=) at /home/misha/go-18/go1.18rc1/src/runtime/proc.go:2804
#6 0x000000000044e999 in runtime.schedule () at /home/misha/go-18/go1.18rc1/src/runtime/proc.go:3187
#7 0x000000000044eeed in runtime.park_m (gp=0xc0000c1380) at /home/misha/go-18/go1.18rc1/src/runtime/proc.go:3336
#8 0x0000000000477643 in runtime.mcall () at /home/misha/go-18/go1.18rc1/src/runtime/asm_amd64.s:425
#9 0x00007f8727ffe95f in ?? ()
#10 0x0000000000800000 in google.golang.org/protobuf/proto.UnmarshalOptions.unmarshalList (o=..., b=, wtyp=0 '\000', list=..., fd=..., n=, err=...) at /home/misha/work/relax/podman/vendor/google.golang.org/protobuf/proto/decode_gen.go:298
#11 0x0000000000000000 in ?? ()

Thread 13 (LWP 2040538 "podman"):
#0 runtime.futex () at /home/misha/go-18/go1.18rc1/src/runtime/sys_linux_amd64.s:553
#1 0x0000000000442136 in runtime.futexsleep (addr=0xfffffffffffffe00, val=0, ns=4699971) at /home/misha/go-18/go1.18rc1/src/runtime/os_linux.go:66
#2 0x0000000000419607 in runtime.notesleep (n=0xfffffffffffffe00) at /home/misha/go-18/go1.18rc1/src/runtime/lock_futex.go:159
#3 0x000000000044c58d in runtime.mPark () at /home/misha/go-18/go1.18rc1/src/runtime/proc.go:1449
#4 runtime.stopm () at /home/misha/go-18/go1.18rc1/src/runtime/proc.go:2228
#5 0x000000000044da65 in runtime.findrunnable (gp=, inheritTime=) at /home/misha/go-18/go1.18rc1/src/runtime/proc.go:2804
#6 0x000000000044e999 in runtime.schedule () at /home/misha/go-18/go1.18rc1/src/runtime/proc.go:3187
#7 0x000000000044eeed in runtime.park_m (gp=0xc000582340) at /home/misha/go-18/go1.18rc1/src/runtime/proc.go:3336
#8 0x0000000000477643 in runtime.mcall () at /home/misha/go-18/go1.18rc1/src/runtime/asm_amd64.s:425
#9 0x00007f8726ffca4f in ?? ()
#10 0x0000000000800000 in google.golang.org/protobuf/proto.UnmarshalOptions.unmarshalList (o=..., b=, wtyp=0 '\000', list=..., fd=..., n=, err=...) at /home/misha/work/relax/podman/vendor/google.golang.org/protobuf/proto/decode_gen.go:298
#11 0x0000000000000000 in ?? ()

Thread 12 (LWP 2040537 "podman"):
#0 runtime.futex () at /home/misha/go-18/go1.18rc1/src/runtime/sys_linux_amd64.s:553
#1 0x0000000000442136 in runtime.futexsleep (addr=0xfffffffffffffe00, val=0, ns=4699971) at /home/misha/go-18/go1.18rc1/src/runtime/os_linux.go:66
#2 0x0000000000419607 in runtime.notesleep (n=0xfffffffffffffe00) at /home/misha/go-18/go1.18rc1/src/runtime/lock_futex.go:159
#3 0x000000000044c58d in runtime.mPark () at /home/misha/go-18/go1.18rc1/src/runtime/proc.go:1449
#4 runtime.stopm () at /home/misha/go-18/go1.18rc1/src/runtime/proc.go:2228
#5 0x000000000044da65 in runtime.findrunnable (gp=, inheritTime=) at /home/misha/go-18/go1.18rc1/src/runtime/proc.go:2804
#6 0x000000000044e999 in runtime.schedule () at /home/misha/go-18/go1.18rc1/src/runtime/proc.go:3187
#7 0x000000000044eeed in runtime.park_m (gp=0xc00052eea0) at /home/misha/go-18/go1.18rc1/src/runtime/proc.go:3336
#8 0x0000000000477643 in runtime.mcall () at /home/misha/go-18/go1.18rc1/src/runtime/asm_amd64.s:425
#9 0x00007f87277fd95f in ?? ()
#10 0x0000000000800000 in google.golang.org/protobuf/proto.UnmarshalOptions.unmarshalList (o=..., b=, wtyp=0 '\000', list=..., fd=..., n=, err=...) at /home/misha/work/relax/podman/vendor/google.golang.org/protobuf/proto/decode_gen.go:298
#11 0x0000000000000000 in ?? ()

Thread 11 (LWP 2040535 "podman"):
#0 runtime.epollwait () at /home/misha/go-18/go1.18rc1/src/runtime/sys_linux_amd64.s:699
#1 0x0000000000441e5c in runtime.netpoll (delay=, ~r0=...) at /home/misha/go-18/go1.18rc1/src/runtime/netpoll_epoll.go:126
#2 0x000000000044d793 in runtime.findrunnable (gp=, inheritTime=) at /home/misha/go-18/go1.18rc1/src/runtime/proc.go:2767
#3 0x000000000044e999 in runtime.schedule () at /home/misha/go-18/go1.18rc1/src/runtime/proc.go:3187
#4 0x000000000044eeed in runtime.park_m (gp=0xc00052eea0) at /home/misha/go-18/go1.18rc1/src/runtime/proc.go:3336
#5 0x0000000000477643 in runtime.mcall () at /home/misha/go-18/go1.18rc1/src/runtime/asm_amd64.s:425
#6 0x00007f8707ffe96f in ?? ()
#7 0x0000000000800000 in google.golang.org/protobuf/proto.UnmarshalOptions.unmarshalList (o=..., b=, wtyp=0 '\000', list=..., fd=..., n=, err=...) at /home/misha/work/relax/podman/vendor/google.golang.org/protobuf/proto/decode_gen.go:298
#8 0x0000000000000000 in ?? ()

Thread 10 (LWP 2040534 "podman"):
#0 0x00007f8753f39c9b in ?? ()
#1 0x00000000004387b9 in runtime.newAllocBits (nelems=18446744073709551104, ~r0=) at /home/misha/go-18/go1.18rc1/src/runtime/mheap.go:2057
#2 0x7bbc9aee71ee8e00 in ?? ()
#3 0x000000c00052f040 in ?? ()
#4 0x000000c0003a7d90 in ?? ()
#5 0x00007f872c0d3040 in ?? ()
#6 0x000000c0003a8000 in ?? ()
#7 0x0000000000000030 in ?? ()
#8 0x000000c00052f040 in ?? ()
#9 0x00007f872cedde07 in ?? ()
#10 0x0000000001415c10 in take_mutex (mutex=0x7f872c0d3040) at shm_lock.c:27
#11 0x00000000014163d2 in lock_semaphore (shm=, sem_index=) at shm_lock.c:516
#12 0x0000000001415b1e in _cgo_884153080b96_Cfunc_lock_semaphore (v=0xc0003a7d90) at cgo-gcc-prolog:165
#13 0x0000000000479544 in runtime.asmcgocall () at /home/misha/go-18/go1.18rc1/src/runtime/asm_amd64.s:821
#14 0x0000000000000001 in ?? ()
#15 0x000000c000bd1d00 in ?? ()
#16 0x000000c0003a7758 in ?? ()
#17 0x000000000047ba66 in time.now () at /home/misha/go-18/go1.18rc1/src/runtime/time_linux_amd64.s:52
#18 0x00000000004776c9 in runtime.systemstack () at /home/misha/go-18/go1.18rc1/src/runtime/asm_amd64.s:469
#19 0x00007f8724ff896f in ?? ()
#20 0x0000000000800000 in google.golang.org/protobuf/proto.UnmarshalOptions.unmarshalList (o=..., b=, wtyp=0 '\000', list=..., fd=..., n=, err=...) at /home/misha/work/relax/podman/vendor/google.golang.org/protobuf/proto/decode_gen.go:298
#21 0x0000000000000000 in ?? ()

Thread 9 (LWP 2040533 "podman"):
#0 runtime.futex () at /home/misha/go-18/go1.18rc1/src/runtime/sys_linux_amd64.s:553
#1 0x0000000000442136 in runtime.futexsleep (addr=0xfffffffffffffe00, val=0, ns=4699971) at /home/misha/go-18/go1.18rc1/src/runtime/os_linux.go:66
#2 0x0000000000419805 in runtime.notetsleep_internal (n=0x24c4e60 <runtime.sig>, ns=-1, ~r0=) at /home/misha/go-18/go1.18rc1/src/runtime/lock_futex.go:182
#3 0x0000000000419925 in runtime.notetsleepg (n=0x24c4e60 <runtime.sig>, ns=-1, ~r0=) at /home/misha/go-18/go1.18rc1/src/runtime/lock_futex.go:236
#4 0x0000000000475d4f in os/signal.signal_recv (~r0=) at /home/misha/go-18/go1.18rc1/src/runtime/sigqueue.go:151
#5 0x00000000007a4319 in os/signal.loop () at /home/misha/go-18/go1.18rc1/src/os/signal/signal_unix.go:23
#6 0x0000000000479881 in runtime.goexit () at /home/misha/go-18/go1.18rc1/src/runtime/asm_amd64.s:1571
#7 0x0000000000000000 in ?? ()

Thread 8 (LWP 2040532 "podman"):
#0 runtime.futex () at /home/misha/go-18/go1.18rc1/src/runtime/sys_linux_amd64.s:553
#1 0x0000000000442136 in runtime.futexsleep (addr=0xfffffffffffffe00, val=0, ns=4699971) at /home/misha/go-18/go1.18rc1/src/runtime/os_linux.go:66
#2 0x0000000000419607 in runtime.notesleep (n=0xfffffffffffffe00) at /home/misha/go-18/go1.18rc1/src/runtime/lock_futex.go:159
#3 0x000000000044c58d in runtime.mPark () at /home/misha/go-18/go1.18rc1/src/runtime/proc.go:1449
#4 runtime.stopm () at /home/misha/go-18/go1.18rc1/src/runtime/proc.go:2228
#5 0x000000000044da65 in runtime.findrunnable (gp=, inheritTime=) at /home/misha/go-18/go1.18rc1/src/runtime/proc.go:2804
#6 0x000000000044e999 in runtime.schedule () at /home/misha/go-18/go1.18rc1/src/runtime/proc.go:3187
#7 0x000000000044eeed in runtime.park_m (gp=0xc0000c1520) at /home/misha/go-18/go1.18rc1/src/runtime/proc.go:3336
#8 0x0000000000477643 in runtime.mcall () at /home/misha/go-18/go1.18rc1/src/runtime/asm_amd64.s:425
#9 0x00007f8725ffa96f in ?? ()
#10 0x0000000000800000 in google.golang.org/protobuf/proto.UnmarshalOptions.unmarshalList (o=..., b=, wtyp=0 '\000', list=..., fd=..., n=, err=...) at /home/misha/work/relax/podman/vendor/google.golang.org/protobuf/proto/decode_gen.go:298
#11 0x0000000000000000 in ?? ()

Thread 7 (LWP 2040531 "podman"):
#0 runtime.futex () at /home/misha/go-18/go1.18rc1/src/runtime/sys_linux_amd64.s:553
#1 0x0000000000442136 in runtime.futexsleep (addr=0xfffffffffffffe00, val=0, ns=4699971) at /home/misha/go-18/go1.18rc1/src/runtime/os_linux.go:66
#2 0x0000000000419607 in runtime.notesleep (n=0xfffffffffffffe00) at /home/misha/go-18/go1.18rc1/src/runtime/lock_futex.go:159
#3 0x000000000044c58d in runtime.mPark () at /home/misha/go-18/go1.18rc1/src/runtime/proc.go:1449
#4 runtime.stopm () at /home/misha/go-18/go1.18rc1/src/runtime/proc.go:2228
#5 0x000000000044da65 in runtime.findrunnable (gp=, inheritTime=) at /home/misha/go-18/go1.18rc1/src/runtime/proc.go:2804
#6 0x000000000044e999 in runtime.schedule () at /home/misha/go-18/go1.18rc1/src/runtime/proc.go:3187
#7 0x000000000044eeed in runtime.park_m (gp=0xc000502340) at /home/misha/go-18/go1.18rc1/src/runtime/proc.go:3336
#8 0x0000000000477643 in runtime.mcall () at /home/misha/go-18/go1.18rc1/src/runtime/asm_amd64.s:425
#9 0x00007f87277fd95f in ?? ()
#10 0x0000000000800000 in google.golang.org/protobuf/proto.UnmarshalOptions.unmarshalList (o=..., b=, wtyp=0 '\000', list=..., fd=..., n=, err=...) at /home/misha/work/relax/podman/vendor/google.golang.org/protobuf/proto/decode_gen.go:298
#11 0x0000000000000000 in ?? ()

Thread 6 (LWP 2040530 "podman"):
#0 runtime.futex () at /home/misha/go-18/go1.18rc1/src/runtime/sys_linux_amd64.s:553
#1 0x0000000000442136 in runtime.futexsleep (addr=0xfffffffffffffe00, val=0, ns=4699971) at /home/misha/go-18/go1.18rc1/src/runtime/os_linux.go:66
#2 0x0000000000419607 in runtime.notesleep (n=0xfffffffffffffe00) at /home/misha/go-18/go1.18rc1/src/runtime/lock_futex.go:159
#3 0x000000000044c58d in runtime.mPark () at /home/misha/go-18/go1.18rc1/src/runtime/proc.go:1449
#4 runtime.stopm () at /home/misha/go-18/go1.18rc1/src/runtime/proc.go:2228
#5 0x000000000044da65 in runtime.findrunnable (gp=, inheritTime=) at /home/misha/go-18/go1.18rc1/src/runtime/proc.go:2804
#6 0x000000000044e999 in runtime.schedule () at /home/misha/go-18/go1.18rc1/src/runtime/proc.go:3187
#7 0x000000000044eeed in runtime.park_m (gp=0xc0000c16c0) at /home/misha/go-18/go1.18rc1/src/runtime/proc.go:3336
#8 0x0000000000477643 in runtime.mcall () at /home/misha/go-18/go1.18rc1/src/runtime/asm_amd64.s:425
#9 0x00007f87277fd95f in ?? ()
#10 0x0000000000800000 in google.golang.org/protobuf/proto.UnmarshalOptions.unmarshalList (o=..., b=, wtyp=0 '\000', list=..., fd=..., n=, err=...) at /home/misha/work/relax/podman/vendor/google.golang.org/protobuf/proto/decode_gen.go:298
#11 0x0000000000000000 in ?? ()

Thread 5 (LWP 2040529 "podman"):
#0 runtime.futex () at /home/misha/go-18/go1.18rc1/src/runtime/sys_linux_amd64.s:553
#1 0x0000000000442136 in runtime.futexsleep (addr=0xfffffffffffffe00, val=0, ns=4699971) at /home/misha/go-18/go1.18rc1/src/runtime/os_linux.go:66
#2 0x0000000000419607 in runtime.notesleep (n=0xfffffffffffffe00) at /home/misha/go-18/go1.18rc1/src/runtime/lock_futex.go:159
#3 0x000000000044c471 in runtime.templateThread () at /home/misha/go-18/go1.18rc1/src/runtime/proc.go:2206
#4 0x000000000044b073 in runtime.mstart1 () at /home/misha/go-18/go1.18rc1/src/runtime/proc.go:1418
#5 0x000000000044afb9 in runtime.mstart0 () at /home/misha/go-18/go1.18rc1/src/runtime/proc.go:1376
#6 0x00000000004775c5 in runtime.mstart () at /home/misha/go-18/go1.18rc1/src/runtime/asm_amd64.s:367
#7 0x0000000001410472 in crosscall_amd64 () at gcc_amd64.S:40
#8 0x00007f8726ffcec0 in ?? ()
#9 0x00007ffd267b8540 in ?? ()
#10 0x00007ffd267b844f in ?? ()
#11 0x00007ffd267b844e in ?? ()
#12 0x000000c000003860 in ?? ()
#13 0x00000000004775c0 in ?? ()
#14 0x000000000140fe34 in threadentry (v=) at gcc_linux_amd64.c:92
#15 0x00007f8753f37609 in ?? ()
#16 0x0000000000000000 in ?? ()

Thread 4 (LWP 2040528 "podman"):
#0 runtime.futex () at /home/misha/go-18/go1.18rc1/src/runtime/sys_linux_amd64.s:553
#1 0x0000000000442136 in runtime.futexsleep (addr=0xfffffffffffffe00, val=0, ns=4699971) at /home/misha/go-18/go1.18rc1/src/runtime/os_linux.go:66
#2 0x0000000000419607 in runtime.notesleep (n=0xfffffffffffffe00) at /home/misha/go-18/go1.18rc1/src/runtime/lock_futex.go:159
#3 0x000000000044c58d in runtime.mPark () at /home/misha/go-18/go1.18rc1/src/runtime/proc.go:1449
#4 runtime.stopm () at /home/misha/go-18/go1.18rc1/src/runtime/proc.go:2228
#5 0x000000000044da65 in runtime.findrunnable (gp=, inheritTime=) at /home/misha/go-18/go1.18rc1/src/runtime/proc.go:2804
#6 0x000000000044e999 in runtime.schedule () at /home/misha/go-18/go1.18rc1/src/runtime/proc.go:3187
#7 0x000000000044eeed in runtime.park_m (gp=0xc0000031e0) at /home/misha/go-18/go1.18rc1/src/runtime/proc.go:3336
#8 0x0000000000477643 in runtime.mcall () at /home/misha/go-18/go1.18rc1/src/runtime/asm_amd64.s:425
#9 0x00007ffd267b829f in ?? ()
#10 0x0000000000800000 in google.golang.org/protobuf/proto.UnmarshalOptions.unmarshalList (o=..., b=, wtyp=0 '\000', list=..., fd=..., n=, err=...) at /home/misha/work/relax/podman/vendor/google.golang.org/protobuf/proto/decode_gen.go:298
#11 0x0000000000000000 in ?? ()

Thread 3 (LWP 2040527 "podman"):
#0 runtime.futex () at /home/misha/go-18/go1.18rc1/src/runtime/sys_linux_amd64.s:553
#1 0x0000000000442136 in runtime.futexsleep (addr=0xfffffffffffffe00, val=0, ns=4699971) at /home/misha/go-18/go1.18rc1/src/runtime/os_linux.go:66
#2 0x0000000000419607 in runtime.notesleep (n=0xfffffffffffffe00) at /home/misha/go-18/go1.18rc1/src/runtime/lock_futex.go:159
#3 0x000000000044c58d in runtime.mPark () at /home/misha/go-18/go1.18rc1/src/runtime/proc.go:1449
#4 runtime.stopm () at /home/misha/go-18/go1.18rc1/src/runtime/proc.go:2228
#5 0x000000000044da65 in runtime.findrunnable (gp=, inheritTime=) at /home/misha/go-18/go1.18rc1/src/runtime/proc.go:2804
#6 0x000000000044e999 in runtime.schedule () at /home/misha/go-18/go1.18rc1/src/runtime/proc.go:3187
#7 0x000000000044eeed in runtime.park_m (gp=0xc000582000) at /home/misha/go-18/go1.18rc1/src/runtime/proc.go:3336
#8 0x0000000000477643 in runtime.mcall () at /home/misha/go-18/go1.18rc1/src/runtime/asm_amd64.s:425
#9 0x00007ffd267b834f in ?? ()
#10 0x0000000000800000 in google.golang.org/protobuf/proto.UnmarshalOptions.unmarshalList (o=..., b=, wtyp=0 '\000', list=..., fd=..., n=, err=...) at /home/misha/work/relax/podman/vendor/google.golang.org/protobuf/proto/decode_gen.go:298
#11 0x0000000000000000 in ?? ()

Thread 2 (LWP 2040526 "podman"):
#0 runtime.futex () at /home/misha/go-18/go1.18rc1/src/runtime/sys_linux_amd64.s:553
#1 0x00000000004421af in runtime.futexsleep (addr=, val=, ns=) at /home/misha/go-18/go1.18rc1/src/runtime/os_linux.go:72
#2 0x0000000000419745 in runtime.notetsleep_internal (n=0x2495918 <runtime.sched+248>, ns=90050320, ~r0=) at /home/misha/go-18/go1.18rc1/src/runtime/lock_futex.go:201
#3 0x0000000000419894 in runtime.notetsleep (n=0xfffffffffffffdfc, ns=0, ~r0=) at /home/misha/go-18/go1.18rc1/src/runtime/lock_futex.go:224
#4 0x00000000004534a9 in runtime.sysmon () at /home/misha/go-18/go1.18rc1/src/runtime/proc.go:5102
#5 0x000000000044b073 in runtime.mstart1 () at /home/misha/go-18/go1.18rc1/src/runtime/proc.go:1418
#6 0x000000000044afb9 in runtime.mstart0 () at /home/misha/go-18/go1.18rc1/src/runtime/proc.go:1376
#7 0x00000000004775c5 in runtime.mstart () at /home/misha/go-18/go1.18rc1/src/runtime/asm_amd64.s:367
#8 0x0000000001410472 in crosscall_amd64 () at gcc_amd64.S:40
#9 0x00007f872cd67ec0 in ?? ()
#10 0x00007ffd267b84a0 in ?? ()
#11 0x00007ffd267b83af in ?? ()
#12 0x00007ffd267b83ae in ?? ()
#13 0x000000c0000029c0 in ?? ()
#14 0x00000000004775c0 in ?? ()
#15 0x000000000140fe34 in threadentry (v=) at gcc_linux_amd64.c:92
#16 0x00007f8753f37609 in ?? ()
#17 0x0000000000000000 in ?? ()

Thread 1 (LWP 2040524 "podman"):
#0 runtime.futex () at /home/misha/go-18/go1.18rc1/src/runtime/sys_linux_amd64.s:553
#1 0x0000000000442136 in runtime.futexsleep (addr=0xfffffffffffffe00, val=0, ns=4699971) at /home/misha/go-18/go1.18rc1/src/runtime/os_linux.go:66
#2 0x0000000000419607 in runtime.notesleep (n=0xfffffffffffffe00) at /home/misha/go-18/go1.18rc1/src/runtime/lock_futex.go:159
#3 0x000000000044b165 in runtime.mPark () at /home/misha/go-18/go1.18rc1/src/runtime/proc.go:1449
#4 0x000000000044ccc5 in runtime.stoplockedm () at /home/misha/go-18/go1.18rc1/src/runtime/proc.go:2422
#5 0x000000000044e79d in runtime.schedule () at /home/misha/go-18/go1.18rc1/src/runtime/proc.go:3119
#6 0x000000000044eeed in runtime.park_m (gp=0xc000bda9c0) at /home/misha/go-18/go1.18rc1/src/runtime/proc.go:3336
#7 0x0000000000477643 in runtime.mcall () at /home/misha/go-18/go1.18rc1/src/runtime/asm_amd64.s:425
#8 0x000000000047bf45 in runtime.newproc (fn=0x0) at :1
#9 0x000000000248ece0 in ?? ()
#10 0x0000000000000000 in ?? ()

@giuseppe
Member Author

I am trying your reproducer locally and I am at iteration * 641 * without any luck reproducing it yet.

Could you try with delve (delve attach $PID)? Do you get any more useful information out of it?

Are there only two podman processes running?

@tyler92
Contributor

tyler92 commented Jul 20, 2022

I am at iteration * 641 * without any luck reproducing it yet.

In my case it was iteration 1700

Are there only two podman processes running?

Yes

Could you try with delve

I will try it a little later

@vrothberg
Member

I ran my reproducer for hours and nothing has happened.

@giuseppe
Member Author

same here, cannot end up in that state.

Might be something unrelated

@giuseppe
Member Author

tests are green

Member

@vrothberg vrothberg left a comment


LGTM
@mheon PTAL

}
}
if removalErr != nil {
return removalErr
}

// We're going to be removing containers.
// If we are Cgroupfs cgroup driver, to avoid races, we need to hit
Member


Should this block be moved up, to ensure we're still preventing cleanup processes?

Member


I think that would require locking all containers at the same time, which would reintroduce the deadlock.

@giuseppe WDYT?

Member Author


could we just remove this hack?

Since there is no locking now for removing containers, the cleanup processes should not hang for too long so it is fine if they run.

I've pushed a new version without this code block; let's see if it causes any issue in the CI.

@giuseppe giuseppe force-pushed the do-not-lock-containers-pod-rm branch from 5b2a562 to 3dde31b Compare July 21, 2022 07:11
@giuseppe giuseppe force-pushed the do-not-lock-containers-pod-rm branch from 3dde31b to af118f7 Compare July 21, 2022 07:17
@giuseppe
Member Author

CI is green again

Member

@vrothberg vrothberg left a comment


LGTM

@mheon
Member

mheon commented Jul 22, 2022

Let's merge
/lgtm

@openshift-ci openshift-ci bot added the lgtm label Jul 22, 2022
@openshift-merge-robot openshift-merge-robot merged commit 05618a5 into containers:main Jul 22, 2022
@github-actions github-actions bot added the locked - please file new issue/PR label Sep 21, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 21, 2023
Labels
approved: Indicates a PR has been approved by an approver from all required OWNERS files.
lgtm: Indicates that a PR is ready to be merged.
locked - please file new issue/PR: Assist humans wanting to comment on an old issue or PR with locked comments.
release-note
Successfully merging this pull request may close these issues.

deadlock(?) in podman rm(?)
5 participants