
WIP: fix deadlock between play kube and cleanup #14969

Closed
wants to merge 1 commit

Conversation

tyler92
Contributor

@tyler92 tyler92 commented Jul 19, 2022

There was a deadlock between two concurrent processes: play kube and cleanup,
which is invoked after a container exits when a RestartPolicy is used. Before
this fix, the cleanup command did not take the Pod's lock, so the two processes
could acquire the same pair of locks in different orders.

[NO NEW TESTS NEEDED]

Closes #14921

Signed-off-by: Mikhail Khachayants [email protected]

Does this PR introduce a user-facing change?

None
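To make the lock inversion described above concrete, here is a minimal, self-contained Go sketch. The names (podLock, ctrLock) and the two goroutines are stand-ins for play kube and the pre-fix cleanup, not Podman's actual types:

// Minimal sketch of the lock inversion: one goroutine takes the pod lock
// and then the container lock, the other takes them in the opposite order.
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	var podLock, ctrLock sync.Mutex

	// "play kube" path: pod lock first, then container lock.
	go func() {
		podLock.Lock()
		time.Sleep(10 * time.Millisecond) // widen the race window
		ctrLock.Lock()                    // blocks: the other goroutine holds it
		ctrLock.Unlock()
		podLock.Unlock()
	}()

	// pre-fix "cleanup" path: container lock first, then pod lock.
	go func() {
		ctrLock.Lock()
		time.Sleep(10 * time.Millisecond)
		podLock.Lock() // blocks: the other goroutine holds it -> deadlock
		podLock.Unlock()
		ctrLock.Unlock()
	}()

	time.Sleep(time.Second)
	fmt.Println("both goroutines are now stuck waiting for each other's lock")
}

The PR's approach, as described above, is to have cleanup take the Pod's lock as well, so the two code paths no longer acquire the locks in conflicting orders.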

@openshift-ci openshift-ci bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. do-not-merge/release-note-label-needed Enforce release-note requirement, even if just None labels Jul 19, 2022
@openshift-ci
Contributor

openshift-ci bot commented Jul 19, 2022

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: tyler92
To complete the pull request process, please assign vrothberg after the PR has been reviewed.
You can assign the PR to them by writing /assign @vrothberg in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tyler92
Contributor Author

tyler92 commented Jul 19, 2022

This is my idea for fixing #14921. In my tests the change works properly, but I'm not very familiar with Podman's code, so this may be an incorrect approach from an architectural point of view.

Please let me know whether this PR makes sense or should be rejected.

@tyler92 tyler92 force-pushed the fix-deadlock-on-cleanup branch from ddcc796 to ba7a72b Compare July 19, 2022 08:25
@tyler92
Contributor Author

tyler92 commented Jul 19, 2022

/release-note-none

@openshift-ci openshift-ci bot added release-note-none and removed do-not-merge/release-note-label-needed Enforce release-note requirement, even if just None labels Jul 19, 2022
@vrothberg
Member

What is the reproducer for the issue? I tried the reproducer for #14929, but I still see the deadlock.

@tyler92
Contributor Author

tyler92 commented Jul 19, 2022

#14929 and #14921 look like different issues. Steps to reproduce the issue this PR addresses:

  1. Create the following kube yaml:
apiVersion: v1
kind: Pod
metadata:
  labels:
    app: my-pod
  name: my-pod
spec:
  containers:
  - name: app
    image: debian
    imagePullPolicy: Never
    command:
    - /bin/sleep
    args:
    - 0.001
  hostNetwork: true
  restartPolicy: Always
  2. Run the following script:
#!/bin/bash

set -o errexit

for x in {1..10000}; do
    echo "* $x *"
    podman play kube ./my-pod.yaml
    podman pod rm -f -a
    podman rm -a
done
  3. Observe the script output until it deadlocks.

@mheon
Member

mheon commented Jul 19, 2022

I don't like this because it serializes cleanups across the pod, which could have perf implications. I'm trying to track down exactly where Cleanup locks the pod, thus far without success, to make a more precise fix.
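A rough illustration of that concern, using stand-in types rather than Podman's real ones: if every per-container cleanup has to hold the pod's lock, cleanups that could otherwise overlap now run one at a time.

// Stand-in sketch of the serialization concern: per-container cleanups
// that all hold a shared pod lock can no longer overlap.
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	var podLock sync.Mutex // hypothetical shared pod lock
	var wg sync.WaitGroup

	start := time.Now()
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func(ctr int) {
			defer wg.Done()
			podLock.Lock() // every cleanup now queues behind the pod lock
			defer podLock.Unlock()
			time.Sleep(100 * time.Millisecond) // stand-in for the cleanup work
			fmt.Printf("container %d cleaned up after %v\n", ctr, time.Since(start))
		}(i)
	}
	wg.Wait() // ~400ms total instead of ~100ms if the cleanups could overlap
}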

@vrothberg
Member

Look at @giuseppe's analysis in #14929 (comment). It is the very same problem.

podman pod rm keeps all containers locked simultaneously. During cleanup, a container may take the lock of its dependency container and "boom". So podman pod rm takes lock A, container cleanup takes lock B, and then each tries to take the lock the other is holding.

@tyler92 tyler92 force-pushed the fix-deadlock-on-cleanup branch from ba7a72b to e7dfdba Compare July 19, 2022 14:19
@tyler92
Contributor Author

tyler92 commented Jul 19, 2022

I'm trying to track down exactly where Cleanup locks the pod, thus far without success, to make a more precise fix.

It's a slightly tangled path:

Container.Cleanup locks container ->
Pod.stopIfOnlyInfraRemains locks Pod ->
Pod.stopWithTimeout locks containers ->
Container.Cleanup locks container

I don't like this because it serializes cleanups across the pod, which could have perf implications

I agree with you, so please consider my PR as just an idea for a potential fix.

@vrothberg
Member

I think we need to solve the problem in podman pod rm #14929 (comment). IMHO that is the source of the problem: unless the order of locking there gets fixed, any code path where a container takes the lock of its dependency has the potential to deadlock. AFAICS, there are more such code paths than just (*Container).Cleanup().
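One common way to remove such an inversion, in general, is to acquire every lock a code path needs in a single, globally consistent order (for example, sorted by a stable ID), so that no two callers can ever take the same pair of locks in opposite orders. A hypothetical Go sketch of that idea, not Podman's actual locking code and not necessarily what #14976 does:

// Hypothetical sketch: take a set of locks in one global order (by ID)
// so two callers can never grab the same pair in opposite orders.
package locking

import (
	"sort"
	"sync"
)

// lockable stands in for anything with a stable ID and its own lock,
// e.g. a pod or a container; it is not a real Podman type.
type lockable struct {
	id int
	mu sync.Mutex
}

// lockAll locks all items in ascending ID order and returns a function
// that unlocks them in reverse order.
func lockAll(items []*lockable) (unlock func()) {
	sorted := append([]*lockable(nil), items...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].id < sorted[j].id })
	for _, it := range sorted {
		it.mu.Lock()
	}
	return func() {
		for i := len(sorted) - 1; i >= 0; i-- {
			sorted[i].mu.Unlock()
		}
	}
}

With that discipline, both "pod rm locks its containers" and "cleanup locks a container plus its dependency" would go through the same ordered acquisition and could no longer end up waiting on each other.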

@vrothberg
Member

vrothberg commented Jul 19, 2022

@tyler92 stopIfOnlyInfraRemains does not take the pod lock while the container is locked. Have a look at the comment here: https://github.com/containers/podman/blob/main/libpod/container_internal.go#L1955. If that were the case, the issue would not be racy but 100 percent reproducible.

@tyler92
Contributor Author

tyler92 commented Jul 19, 2022

I've launched a long stress test with #14976 and the scripts from #14921; so far the problem has not reproduced, but I want to run the test in a loop for several hours.

OK, #14976 looks good to the maintainers, so should I close this PR?

@rhatdan
Member

rhatdan commented Aug 2, 2022

I think I will close this, as it is fixed.

@rhatdan rhatdan closed this Aug 2, 2022
@github-actions github-actions bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Sep 20, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 20, 2023