
WIP: fix deadlock between play kube and cleanup #14969

Closed
wants to merge 1 commit

Conversation

tyler92
Contributor

@tyler92 tyler92 commented Jul 19, 2022

There was a deadlock between two concurrent processes: play kube and cleanup,
which is invoked after a container exits when a RestartPolicy is used. Before
this fix, the cleanup command did not take the Pod's lock, so the two processes
could acquire the same pair of locks in different orders.

[NO NEW TESTS NEEDED]

Closes #14921

Signed-off-by: Mikhail Khachayants [email protected]

Does this PR introduce a user-facing change?

None
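To make the lock inversion described above concrete, here is a minimal, self-contained Go sketch. The names (podLock, ctrLock) and the two goroutines are stand-ins for play kube and the pre-fix cleanup, not Podman's actual types:

// Minimal sketch of the lock inversion: one goroutine takes the pod lock
// and then the container lock, the other takes them in the opposite order.
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	var podLock, ctrLock sync.Mutex

	// "play kube" path: pod lock first, then container lock.
	go func() {
		podLock.Lock()
		time.Sleep(10 * time.Millisecond) // widen the race window
		ctrLock.Lock()                    // blocks: the other goroutine holds it
		ctrLock.Unlock()
		podLock.Unlock()
	}()

	// pre-fix "cleanup" path: container lock first, then pod lock.
	go func() {
		ctrLock.Lock()
		time.Sleep(10 * time.Millisecond)
		podLock.Lock() // blocks: the other goroutine holds it -> deadlock
		podLock.Unlock()
		ctrLock.Unlock()
	}()

	time.Sleep(time.Second)
	fmt.Println("both goroutines are now stuck waiting for each other's lock")
}

The PR's approach, as described above, is to have cleanup take the Pod's lock as well, so the two code paths no longer acquire the locks in conflicting orders.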

@openshift-ci openshift-ci bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. do-not-merge/release-note-label-needed Enforce release-note requirement, even if just None labels Jul 19, 2022
@openshift-ci
Contributor

openshift-ci bot commented Jul 19, 2022

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: tyler92
To complete the pull request process, please assign vrothberg after the PR has been reviewed.
You can assign the PR to them by writing /assign @vrothberg in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tyler92
Contributor Author

tyler92 commented Jul 19, 2022

This is my idea for fixing #14921. In my tests the change works properly, but I'm not very familiar with Podman's code, so this may be an incorrect approach from an architectural point of view.

Please let me know whether this PR makes sense or should be rejected.

@tyler92 tyler92 force-pushed the fix-deadlock-on-cleanup branch from ddcc796 to ba7a72b Compare July 19, 2022 08:25
@tyler92
Contributor Author

tyler92 commented Jul 19, 2022

/release-note-none

@openshift-ci openshift-ci bot added release-note-none and removed do-not-merge/release-note-label-needed Enforce release-note requirement, even if just None labels Jul 19, 2022
@vrothberg
Member

What is the reproducer for the issue? I tried the reproducer for #14929, but I still see the deadlock.

@tyler92
Contributor Author

tyler92 commented Jul 19, 2022

#14929 and #14921 look like different issues. Steps to reproduce the issue this PR addresses:

  1. Create the following kube yaml:
apiVersion: v1
kind: Pod
metadata:
  labels:
    app: my-pod
  name: my-pod
spec:
  containers:
  - name: app
    image: debian
    imagePullPolicy: Never
    command:
    - /bin/sleep
    args:
    - 0.001
  hostNetwork: true
  restartPolicy: Always
  2. Run the following script:
#!/bin/bash

set -o errexit

for x in {1..10000}; do
    echo "* $x *"
    podman play kube ./my-pod.yaml
    podman pod rm -f -a
    podman rm -a
done
  3. Observe the script output until it deadlocks.

@mheon
Member

mheon commented Jul 19, 2022

I don't like this because it serializes cleanups across the pod, which could have perf implications. I'm trying to track down exactly where Cleanup locks the pod, thus far without success, to make a more precise fix.
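A rough illustration of that concern, using stand-in types rather than Podman's real ones: if every per-container cleanup has to hold the pod's lock, cleanups that could otherwise overlap now run one at a time.

// Stand-in sketch of the serialization concern: per-container cleanups
// that all hold a shared pod lock can no longer overlap.
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	var podLock sync.Mutex // hypothetical shared pod lock
	var wg sync.WaitGroup

	start := time.Now()
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func(ctr int) {
			defer wg.Done()
			podLock.Lock() // every cleanup now queues behind the pod lock
			defer podLock.Unlock()
			time.Sleep(100 * time.Millisecond) // stand-in for the cleanup work
			fmt.Printf("container %d cleaned up after %v\n", ctr, time.Since(start))
		}(i)
	}
	wg.Wait() // ~400ms total instead of ~100ms if the cleanups could overlap
}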

@vrothberg
Member

Look at @giuseppe's analysis in #14929 (comment). It is the very same problem.

podman pod rm keeps all containers locked simultaneously. During cleanup, a container may take the lock of its dependency container and "boom". So podman pod rm takes lock A, container cleanup takes lock B, and then each tries to take the lock the other is holding.

@tyler92 tyler92 force-pushed the fix-deadlock-on-cleanup branch from ba7a72b to e7dfdba Compare July 19, 2022 14:19
@tyler92
Contributor Author

tyler92 commented Jul 19, 2022

I'm trying to track down exactly where Cleanup locks the pod, thus far without success, to make a more precise fix.

It's a slightly tangled path:

Container.Cleanup locks container ->
Pod.stopIfOnlyInfraRemains locks Pod ->
Pod.stopWithTimeout locks containers ->
Container.Cleanup locks container

I don't like this because it serializes cleanups across the pod, which could have perf implications

I agree with you, so please consider my PR as just an idea for a potential fix.

@vrothberg
Member

I think we need to solve the problem in podman pod rm #14929 (comment). IMHO that is the source of the problem: unless the order of locking there gets fixed, any code path where a container takes the lock of its dependency has the potential to deadlock. AFAICS, there are more such code paths than just (*Container).Cleanup().
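One common way to remove such an inversion, in general, is to acquire every lock a code path needs in a single, globally consistent order (for example, sorted by a stable ID), so that no two callers can ever take the same pair of locks in opposite orders. A hypothetical Go sketch of that idea, not Podman's actual locking code and not necessarily what #14976 does:

// Hypothetical sketch: take a set of locks in one global order (by ID)
// so two callers can never grab the same pair in opposite orders.
package locking

import (
	"sort"
	"sync"
)

// lockable stands in for anything with a stable ID and its own lock,
// e.g. a pod or a container; it is not a real Podman type.
type lockable struct {
	id int
	mu sync.Mutex
}

// lockAll locks all items in ascending ID order and returns a function
// that unlocks them in reverse order.
func lockAll(items []*lockable) (unlock func()) {
	sorted := append([]*lockable(nil), items...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].id < sorted[j].id })
	for _, it := range sorted {
		it.mu.Lock()
	}
	return func() {
		for i := len(sorted) - 1; i >= 0; i-- {
			sorted[i].mu.Unlock()
		}
	}
}

With that discipline, both "pod rm locks its containers" and "cleanup locks a container plus its dependency" would go through the same ordered acquisition and could no longer end up waiting on each other.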

@vrothberg
Member

vrothberg commented Jul 19, 2022

@tyler92 stopIfOnlyInfraRemains does not take the pod lock while the container is locked. Have a look at the comment here: https://github.com/containers/podman/blob/main/libpod/container_internal.go#L1955. If that were the case, the issue would not be racy but 100 percent reproducible.

@tyler92
Contributor Author

tyler92 commented Jul 19, 2022

I've launched a long stress test with #14976 and the scripts from #14921; so far the problem has not reproduced, but I want to run the test in a loop for several hours.

OK, #14976 looks good to the maintainers, so should I close this PR?

@rhatdan
Member

rhatdan commented Aug 2, 2022

I think I will close this, as it is fixed.

@rhatdan rhatdan closed this Aug 2, 2022
@github-actions github-actions bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Sep 20, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 20, 2023