podman wait: new timeout, possibly deadlock #14761

Closed
edsantiago opened this issue Jun 28, 2022 · 11 comments · Fixed by #14830
Labels
flakes Flakes from Continuous Integration locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. remote Problem is in podman-remote

Comments

@edsantiago
Member

Started right after #14685, but please don't anybody get tunnel vision on it: correlation causation etc.

# podman-remote run ...
# podman-remote kill ...
# podman-remote wait ...
timeout: sending signal TERM to command ‘/var/tmp/go/src/github.com/containers/podman/bin/podman-remote’

Seen root and rootless; fedora 35, 36, and ubuntu. So far, podman-remote only.

Once it triggers, the entire system is unusable, podman-everything hangs, and tests die after the Cirrus timeout.

[sys] 101 podman stop - unlock while waiting for timeout

(Labeling remote, and waiting to see if stupid bot removes the tag. I'm betting it will.)

@edsantiago edsantiago added flakes Flakes from Continuous Integration remote Problem is in podman-remote labels Jun 28, 2022
@github-actions github-actions bot removed the remote Problem is in podman-remote label Jun 28, 2022
@edsantiago edsantiago added the remote Problem is in podman-remote label Jun 28, 2022
@edsantiago
Member Author

Sigh. Does anyone know why the bot removes my remote label?

@Luap99
Member

Luap99 commented Jun 28, 2022

> Sigh. Does anyone know why the bot removes my remote label?

Yes, because the action for some reason automatically removes the label if the regex is not matched, and at least from my quick look there is no way to turn this off.

@edsantiago
Member Author

Ohhhhhhhh..... this:

remote:
# we cannot use multiline regex so we check for serviceIsRemote in podman info
- 'serviceIsRemote:\strue'

...which, in its documentation, states

Should the regular expression not match, the label will be removed.

...which seems stupid to me: if the reporter has taken the time to explicitly set a label, an inflexible rule must not override it. This is such an obvious bug that there's already an issue open for it. Unfortunately, it's been ignored for two years.

Oh well. Thanks for the pointer @Luap99. I guess we have to live with that for now.
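
For anyone puzzled by the label flapping, here is a rough model of the behavior described in the quoted documentation. It is only a sketch, not the labeler action's actual code: `label_action` is a hypothetical helper, and it uses `[[:space:]]` where the rule uses `\s` so that it works with any POSIX `grep -E`.

```shell
# Hypothetical model of the labeler rule, NOT the action's real code.
# Prints "add" when the issue body matches the rule's regex and "remove"
# otherwise, which is why a manually set label does not survive.
label_action() {
    # $1 is the issue body text
    if printf '%s\n' "$1" | grep -Eq 'serviceIsRemote:[[:space:]]true'; then
        echo add
    else
        echo remove
    fi
}

label_action 'serviceIsRemote: true'        # body contains podman info output
label_action 'podman-remote wait times out' # body has no podman info output
```

This issue's description has no `podman info` output in it, so under this model the rule falls into the "remove" branch no matter who set the label.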

@vrothberg
Member

@mheon it looks very similar to what I've been observing in the gitlab PR

[+0543s] # $ /var/tmp/go/src/github.com/containers/podman/bin/podman-remote --url unix:/tmp/podman_tmp_EjlR kill stopme
[+0543s] # stopme
[+0543s] # $ /var/tmp/go/src/github.com/containers/podman/bin/podman-remote --url unix:/tmp/podman_tmp_EjlR wait stopme
[+0543s] # timeout: sending signal TERM to command ‘/var/tmp/go/src/github.com/containers/podman/bin/podman-remote’
[+0543s] # [ rc=124 (** EXPECTED 0 **) ]
[+0543s] # *** TIMED OUT ***
[+0543s] # # [teardown]
[+0543s] # $ /var/tmp/go/src/github.com/containers/podman/bin/podman-remote --url unix:/tmp/podman_tmp_EjlR pod rm -t 0 --all --force --ignore
[+0543s] # $ /var/tmp/go/src/github.com/containers/podman/bin/podman-remote --url unix:/tmp/podman_tmp_EjlR rm -t 0 --all --force --ignore
[+0543s] # timeout: sending signal TERM to command ‘/var/tmp/go/src/github.com/containers/podman/bin/podman-remote’
[+0543s] # [ rc=124 ]

A container gets killed and all subsequent attempts to wait for it or even to remove it time out.

@vrothberg
Member

I extracted the following reproducer:

echo "..."
date
echo run
$PODMAN run -d --replace --name=123 alpine sh -c "trap 'echo Received SIGTERM, ignoring' SIGTERM; echo READY; while :; do sleep 0.2; done"
echo stop
$PODMAN stop -t 3 123 &
echo kill
$PODMAN kill 123
echo wait
$PODMAN wait 123

Works with local podman. Failed on the 2nd run with podman-remote. podman ps etc. hang, which leads me to believe there is a deadlock. I have not analyzed it further yet.
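
Since the failure does not show up on every run, a small loop can help with catching it. This is a sketch under a few assumptions: `repeat_until_hang` is a hypothetical helper, it relies on GNU coreutils `timeout` exiting with status 124 when the timer fires (the same rc=124 seen in the CI log), and the reproducer is assumed to be saved as an executable script, because `timeout` cannot run shell functions.

```shell
# Hypothetical helper: run a command up to MAX times, each under
# `timeout SECS`, and stop at the first run that gets killed by the timer.
# Usage: repeat_until_hang MAX SECS CMD [ARGS...]
repeat_until_hang() {
    max="$1"; secs="$2"; shift 2
    i=1
    while [ "$i" -le "$max" ]; do
        rc=0
        timeout "$secs" "$@" || rc=$?
        # GNU timeout exits with 124 when the command timed out
        if [ "$rc" -eq 124 ]; then
            echo "hang on iteration $i"
            return 1
        fi
        i=$((i + 1))
    done
    echo "no hang in $max iterations"
}
```

Something like `repeat_until_hang 20 30 ./reproducer.sh` would then run the script up to 20 times with a 30-second limit each; the file name and both numbers are placeholders.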

@vrothberg
Member

Note that the concurrent stop and kill trigger the deadlock. It somehow aligns with the observations in the CI where all subsequent attempts to remove the container time out.

@vrothberg
Member

Started looking into it again. Just saw the following error on the server side:

ERRO[0022] Waiting for container 0fd9adb2417e2b380ad4d21209055c2918983fcc71163c1c8934c8c11a4c7433 to exit: getting exit code of container 0fd9adb2417e2b380ad4d21209055c2918983fcc71163c1c8934c8c11a4c7433 from DB: no such exit code

No analysis yet, just want to share breadcrumbs.

@vrothberg
Member

vrothberg commented Jul 4, 2022

Another thing that looks suspicious when running podman-remote:

WARN[0023] StopSignal SIGTERM failed to stop container 123 in 3 seconds, resorting to SIGKILL

The default stop timeout is 10 seconds. Not yet sure why it's 3 for remote.

Please ignore: it was a testing fart on my end.

@vrothberg
Member

The deadlock happens when the container is in "stopping" state during kill.

@vrothberg
Member

Ah, got it. I'll wrap up a PR.

The problem was that kill didn't collect the exit code. Neither does stop when kill kicks in in the "stopping" state.
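
To illustrate why a missing exit code leaves `wait` stuck, here is a toy model. It is not libpod code and every name in it is made up: the "database" is a temporary directory, an exit code is a file in it, and 137 (128 + SIGKILL) is just an illustrative value. It only models the missing-exit-code half of the problem, not the locking.

```shell
# Toy model, NOT libpod code: the "database" is a temp directory and a
# container's exit code is a file inside it.
db=$(mktemp -d)

record_exit_code() { echo "$2" > "$db/$1.exit"; }

# Poll for the exit code, giving up after TRIES attempts.
# Usage: wait_for_exit NAME TRIES
wait_for_exit() {
    name="$1"; tries="$2"
    while [ "$tries" -gt 0 ]; do
        if [ -f "$db/$name.exit" ]; then
            cat "$db/$name.exit"
            return 0
        fi
        tries=$((tries - 1))
        sleep 0.1
    done
    echo "no such exit code" >&2
    return 1
}

# Buggy kill: the process is gone, but nobody writes the exit code.
kill_buggy() { :; }

# Fixed kill: record the exit code right after killing.
kill_fixed() { record_exit_code "$1" 137; }
```

With `kill_buggy`, `wait_for_exit` polls until it gives up, which is roughly what the timed-out `podman-remote wait` looks like from the outside; with `kill_fixed` it returns on the first poll.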

@vrothberg
Member

vrothberg commented Jul 4, 2022

-> #14821

vrothberg added a commit to vrothberg/libpod that referenced this issue Jul 4, 2022
Make sure to record the exit code after killing a container.
Otherwise, a concurrent `stop` may not record the exit code
and yield the container unusable.

Fixes: containers#14761
Signed-off-by: Valentin Rothberg <[email protected]>
vrothberg added a commit to vrothberg/libpod that referenced this issue Jul 5, 2022
Make sure `Sync()` handles state transitions and exit codes correctly.
The function was only being called when batching which could render
containers in an unusable state when running concurrently with other
state-altering functions/commands since the state must be re-read from
the database before acting upon it.

Fixes: containers#14761
Signed-off-by: Valentin Rothberg <[email protected]>
@github-actions github-actions bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Sep 20, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 20, 2023