podman wait: new timeout, possibly deadlock #14761
Sigh. Does anyone know why the bot removes my `remote` label?
Yes, because the action for some reason automatically removes the label if the regex does not match, and from my quick look there is no way to turn this off.
Ohhhhhhhh... this: `podman/.github/issue-labeler.yml`, lines 11 to 13 at commit d095053
...which, in its documentation, states
...which seems stupid to me: if the reporter has taken the time to explicitly set a label, an inflexible rule should not override it. This is such an obvious bug that there's already an issue open for it. Unfortunately, it's been ignored for two years. Oh well. Thanks for the pointer @Luap99. I guess we have to live with it for now.
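For context, issue-labeler configs map a label name to a list of regexes matched against the issue body. A hypothetical fragment in that shape (the label name and pattern here are illustrative, not podman's actual config) would look like:

```yaml
# .github/issue-labeler.yml (illustrative fragment, not the real file)
# When the issue body matches the regex, the action adds the label;
# as discussed above, it also removes the label again whenever the
# regex does NOT match, even if a human set the label by hand.
remote:
  - '\bremote\b'
```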
@mheon it looks very similar to what I've been observing in the GitLab PR: a container gets killed, and all subsequent attempts to wait for it, or even to remove it, time out.
I extracted the following reproducer:

```sh
echo "..."
date
echo run
$PODMAN run -d --replace --name=123 alpine sh -c "trap 'echo Received SIGTERM, ignoring' SIGTERM; echo READY; while :; do sleep 0.2; done"
echo stop
$PODMAN stop -t 3 123 &
echo kill
$PODMAN kill 123
echo wait
$PODMAN wait 123
```

Works with local podman. Failed on the 2nd run with podman-remote.
Note that the concurrent …
Started looking into it again. Just saw the following error on the server side:
No analyses yet, just want to share breadcrumbs.
Another thing that looks suspicious when running podman-remote:

Please ignore: it was a testing fart on my end.
The deadlock happens when the container is in the "stopping" state during kill.
Ah, got it. I'll wrap up a PR. The problem was that …
Make sure to record the exit code after killing a container. Otherwise, a concurrent `stop` may not record the exit code and leave the container unusable. Fixes: containers#14761 Signed-off-by: Valentin Rothberg <[email protected]>
Make sure `Sync()` handles state transitions and exit codes correctly. The function was previously only called when batching, which could leave containers in an unusable state when running concurrently with other state-altering functions/commands, since the state must be re-read from the database before acting upon it. Fixes: containers#14761 Signed-off-by: Valentin Rothberg <[email protected]>
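The second commit's "re-read state from the database before acting" pattern can be sketched as follows (a toy model with hypothetical names, not podman's real BoltDB state API): a container object caches its state in memory, a concurrent command changes the authoritative state, and acting on the stale cache without a sync is the bug.

```go
package main

import (
	"fmt"
	"sync"
)

// stateDB is a toy stand-in for the authoritative container database.
type stateDB struct {
	mu     sync.Mutex
	states map[string]string
}

func (db *stateDB) get(id string) string {
	db.mu.Lock()
	defer db.mu.Unlock()
	return db.states[id]
}

func (db *stateDB) set(id, s string) {
	db.mu.Lock()
	defer db.mu.Unlock()
	db.states[id] = s
}

// ctr caches its state in memory; acting on the cached copy while a
// concurrent command mutates the database is the hazard.
type ctr struct {
	id     string
	cached string
}

// sync re-reads the authoritative state, which is what the fix ensures
// happens before acting on a container.
func (c *ctr) sync(db *stateDB) { c.cached = db.get(c.id) }

func main() {
	db := &stateDB{states: map[string]string{"123": "running"}}
	c := &ctr{id: "123", cached: "running"}

	// Meanwhile, a concurrent `stop` moves the container to
	// "stopping" behind our back.
	db.set("123", "stopping")

	fmt.Println("stale cached state:", c.cached) // still "running"
	c.sync(db)
	fmt.Println("after sync:", c.cached) // "stopping"
}
```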
Started right after #14685, but please don't anybody get tunnel vision on it: correlation is not causation, etc.
Seen with root and rootless; Fedora 35, 36, and Ubuntu. So far, podman-remote only. Once it triggers, the entire system is unusable: everything podman hangs, and tests die after the Cirrus timeout.
[sys] 101 podman stop - unlock while waiting for timeout
(Labeling remote, and waiting to see if the stupid bot removes the tag. I'm betting it will.)