
remote: run --restart=always, then wait: timeout #23473

Closed
edsantiago opened this issue Aug 1, 2024 · 9 comments

Assignees: Luap99
Labels: flakes (Flakes from Continuous Integration), locked - please file new issue/PR, remote (Problem is in podman-remote)

Comments

@edsantiago (Member)

[+0113s] not ok 110 [030] podman run --restart=always -- wait
...
<+     > # # podman-remote run -d --restart=always --name=c_t110-iolrdtkp quay.io/libpod/testimage:20240123 false
<+131ms> # eaa555d6a4f6dba6cdac3b177a514e853731c2bdc39ed41feb6f252be111ce66
         #
<+009ms> # # podman-remote wait c_t110-iolrdtkp
<+0020s> # timeout: sending signal TERM to command ‘/var/tmp/go/src/github.com/containers/podman/bin/podman-remote’
<+005ms> # [ rc=124 (** EXPECTED 0 **) ]
         # *** TIMED OUT ***
Flake instances:
sys(4) remote(4) fedora-40-aarch64(2) root(3) host(4) sqlite(4)
fedora-40(1) rootless(1)
debian-13(1)
@edsantiago added the flakes (Flakes from Continuous Integration) and remote (Problem is in podman-remote) labels Aug 1, 2024
@Luap99 (Member) commented Aug 1, 2024

I looked into that as it happened a lot in my parallel PR as well, but I didn't find anything obvious. Overall the wait code is rather ugly: it polls the status, so in theory it could miss an exit on a fast-restarting container, but I don't think that explains a 20s hang.

And I tried reproducing for a while without luck.
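
For illustration only, not the actual libpod wait implementation: a waiter that samples the container state on a fixed interval can, in principle, land on "running" or "created" at every sample and never observe the short-lived "exited" window of a fast-restarting container. A rough console sketch, using the test image from the log above and a made-up container name:

$ podman run -d --name fastloop --restart=always quay.io/libpod/testimage:20240123 false
$ # each sample may land mid-restart and miss the brief "exited" state entirely
$ while sleep 0.1; do podman inspect --format '{{.State.Status}}' fastloop; done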

@edsantiago (Member, Author)

Soooo..... this is kube play, not run --restart=always, but it really strikes me as possibly the same thing:

not ok 216 |700| podman kube play --wait with siginterrupt in 12276ms
         # tags: ci:parallel
         # (from function `bail-now' in file test/system/helpers.bash, line 192,
         #  from function `die' in file test/system/helpers.bash, line 905,
         #  from function `run_podman' in file test/system/helpers.bash, line 507,
         #  in test file test/system/700-play.bats, line 652)
         #   `PODMAN_TIMEOUT=2 run_podman 124 kube play --wait $fname' failed
         #
<+     > # # podman-remote --url unix:///tmp/CI_95jg/podman_tmp_ivMv kube play --wait /tmp/CI_95jg/podman_bats.TUY18V/play_kube_wait_QqFONf.yaml
<+0012s> # Use ctrl+c to clean up or wait for pods to exit
         # Pod:
         # 12a782b3e3a069933e1444f37ae5e3b0401e744b0c9567208a63f4b9c719db48
         # Container:
         # ba7f57f0a7304b4b6ab3453888eabf0ff5400e9654eda59a30bb9b420b61ad01
         #
         # timeout: sending signal TERM to command ‘/var/tmp/go/src/github.com/containers/podman/bin/podman-remote’
         #
         # Cleaning up containers, pods, and volumes...
         # Pods stopped:
         # 12a782b3e3a069933e1444f37ae5e3b0401e744b0c9567208a63f4b9c719db48
         # Pods removed:
         # 12a782b3e3a069933e1444f37ae5e3b0401e744b0c9567208a63f4b9c719db48
         # Secrets removed:
         # Volumes removed:
         # timeout: sending signal KILL to command ‘/var/tmp/go/src/github.com/containers/podman/bin/podman-remote’
<+005ms> # [ rc=137 (** EXPECTED 124 **) ]

I've lumped it into this issue. [remote rawhide root](https://api.cirrus-ci.com/v1/artifact/task/5363842849439744/html/sys-remote-rawhide-root-host-sqlite.log.html#t--00216p)

@Luap99 (Member) commented Aug 1, 2024

Yes, you are right: kube play --wait uses the same wait API internally, so it's possible they have the same cause.
That said, the output there clearly shows that kube got the TERM and cleaned up, but then somehow hung and got killed instead of exiting.
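
For context on the rc values in these logs (this is plain GNU timeout(1) behavior in the test harness, nothing podman-specific): 124 means the command timed out and exited after SIGTERM, while 137 (128+9) means it ignored SIGTERM and was eventually SIGKILLed. A quick illustration:

$ timeout 2 sleep 10; echo $?                # exits on SIGTERM -> 124
124
$ timeout -k 3 2 bash -c 'trap "" TERM; while :; do sleep 1; done'; echo $?   # ignores SIGTERM, gets SIGKILL -> 137
137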

@cevich (Member) commented Aug 8, 2024

@edsantiago (Member, Author)

Current flake list. There is one non-remote instance (debian), all others are remote. I'm tempted to remove the remote label but will wait to see if it happens again on a non-remote test.

Flake instances:
sys(11) remote(10) fedora-40-aarch64(7) root(10) host(11) sqlite(11)
podman(1) debian-13(2) rootless(1)
fedora-40(1)
rawhide(1)

@Luap99 (Member) commented Aug 12, 2024

I will try to reproduce this; last time I ran it in a loop for hours without any errors. Not sure if there is some magic requirement I was missing to trigger it.
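
A minimal sketch of that kind of reproduction loop (the container name c_repro is made up; the image and flags come from the failing test; the trailing echo just makes a hung wait visible):

$ while :; do podman run -d --restart=always --name c_repro quay.io/libpod/testimage:20240123 false; timeout 20 podman wait c_repro || echo HANG; podman rm -f c_repro; done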

@Luap99 (Member) commented Aug 13, 2024

$ bin/podman run --name c1 -d --restart on-failure  quay.io/libpod/testimage:20240123 false
930bb7f5bab705b0838c0a73a8ea8acd18a4d31c361ce566ab7c5969e45f1fd0
$ time bin/podman wait c1
1

real    0m20.492s
user    0m0.077s
sys     0m0.164s

--restart=always has some extra special logic to “make it work” better, but using on-failure shows that the current wait logic is clearly not working properly.

@Luap99 self-assigned this Aug 13, 2024
Luap99 added a commit to Luap99/libpod that referenced this issue Aug 15, 2024
The current code did several complicated state checks that simply do not
work properly on a fast-restarting container. It uses a special case for
--restart=always but forgot to take care of --restart=on-failure, which
always hangs for 20s until it runs into the timeout.

The old logic also used to call CheckConmonRunning() but synced the
state beforehand, which means it may check a new conmon every time and
thus miss exits.

The fix makes the new code much simpler: check the conmon pid; if it is
no longer running, check the exit file and get the exit code.

This is related to containers#23473 but I am not sure if this fixes it because we
cannot reproduce.

Signed-off-by: Paul Holzinger <[email protected]>
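
Conceptually, the approach described in that commit can be sketched from the CLI as well (illustration only, not the actual libpod Go code; it assumes podman inspect's State.ConmonPid and State.ExitCode fields, and c1 is the container from the reproduction above):

$ conmon_pid=$(podman inspect --format '{{.State.ConmonPid}}' c1)
$ # kill -0 only tests whether the pid is still alive; once conmon has exited,
$ # the container's exit code has already been recorded and can simply be read back
$ kill -0 "$conmon_pid" 2>/dev/null || podman inspect --format '{{.State.ExitCode}}' c1
1
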
@Luap99 (Member) commented Aug 15, 2024

Let me know if you see any hangs after #23601

@Luap99 (Member) commented Sep 4, 2024

Last logged instance on the flake tracker was Aug 14, so I am going to close this.

@Luap99 closed this as completed Sep 4, 2024
@stale-locking-app bot added the locked - please file new issue/PR label Dec 4, 2024
@stale-locking-app bot locked as resolved and limited conversation to collaborators Dec 4, 2024