
remote: run --restart=always, then wait: timeout #23473

Closed
edsantiago opened this issue Aug 1, 2024 · 9 comments

Assignees: Luap99
Labels: flakes (Flakes from Continuous Integration), locked - please file new issue/PR, remote (Problem is in podman-remote)

Comments

@edsantiago (Member)

[+0113s] not ok 110 [030] podman run --restart=always -- wait
...
<+     > # # podman-remote run -d --restart=always --name=c_t110-iolrdtkp quay.io/libpod/testimage:20240123 false
<+131ms> # eaa555d6a4f6dba6cdac3b177a514e853731c2bdc39ed41feb6f252be111ce66
         #
<+009ms> # # podman-remote wait c_t110-iolrdtkp
<+0020s> # timeout: sending signal TERM to command ‘/var/tmp/go/src/github.com/containers/podman/bin/podman-remote’
<+005ms> # [ rc=124 (** EXPECTED 0 **) ]
         # *** TIMED OUT ***
Flake instances:
sys(4) remote(4) fedora-40-aarch64(2) root(3) host(4) sqlite(4)
fedora-40(1) rootless(1)
debian-13(1)
@edsantiago added the flakes (Flakes from Continuous Integration) and remote (Problem is in podman-remote) labels Aug 1, 2024
@Luap99 (Member) commented Aug 1, 2024

I looked into that as it happened a lot in my parallel PR as well, but I didn't find anything obvious. Overall the wait code is rather ugly: it polls the status, so in theory it could miss an exit on a fast-restarting container, but I don't think that explains a 20s hang.

And I tried reproducing for a while without luck.
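
For illustration only, not the actual libpod wait implementation: a waiter that samples the container state on a fixed interval can, in principle, land on "running" or "created" at every sample and never observe the short-lived "exited" window of a fast-restarting container. A rough console sketch, using the test image from the log above and a made-up container name:

$ podman run -d --name fastloop --restart=always quay.io/libpod/testimage:20240123 false
$ # each sample may land mid-restart and miss the brief "exited" state entirely
$ while sleep 0.1; do podman inspect --format '{{.State.Status}}' fastloop; done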

@edsantiago (Member, Author)

Soooo..... this is kube play, not run --restart=always, but it really strikes me as possibly the same thing:

not ok 216 |700| podman kube play --wait with siginterrupt in 12276ms
         # tags: ci:parallel
         # (from function `bail-now' in file test/system/helpers.bash, line 192,
         #  from function `die' in file test/system/helpers.bash, line 905,
         #  from function `run_podman' in file test/system/helpers.bash, line 507,
         #  in test file test/system/700-play.bats, line 652)
         #   `PODMAN_TIMEOUT=2 run_podman 124 kube play --wait $fname' failed
         #
<+     > # # podman-remote --url unix:///tmp/CI_95jg/podman_tmp_ivMv kube play --wait /tmp/CI_95jg/podman_bats.TUY18V/play_kube_wait_QqFONf.yaml
<+0012s> # Use ctrl+c to clean up or wait for pods to exit
         # Pod:
         # 12a782b3e3a069933e1444f37ae5e3b0401e744b0c9567208a63f4b9c719db48
         # Container:
         # ba7f57f0a7304b4b6ab3453888eabf0ff5400e9654eda59a30bb9b420b61ad01
         #
         # timeout: sending signal TERM to command ‘/var/tmp/go/src/github.com/containers/podman/bin/podman-remote’
         #
         # Cleaning up containers, pods, and volumes...
         # Pods stopped:
         # 12a782b3e3a069933e1444f37ae5e3b0401e744b0c9567208a63f4b9c719db48
         # Pods removed:
         # 12a782b3e3a069933e1444f37ae5e3b0401e744b0c9567208a63f4b9c719db48
         # Secrets removed:
         # Volumes removed:
         # timeout: sending signal KILL to command ‘/var/tmp/go/src/github.com/containers/podman/bin/podman-remote’
<+005ms> # [ rc=137 (** EXPECTED 124 **) ]

I've lumped it into this issue. [remote rawhide root](https://api.cirrus-ci.com/v1/artifact/task/5363842849439744/html/sys-remote-rawhide-root-host-sqlite.log.html#t--00216p)

@Luap99 (Member) commented Aug 1, 2024

Yes, you are right: kube play --wait uses the same wait API internally, so it's possible they have the same cause.
That said, the output there clearly shows that kube got the TERM and cleaned up, but then somehow hung and got killed instead of exiting.
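
For context on the rc values in these logs (this is plain GNU timeout(1) behavior in the test harness, nothing podman-specific): 124 means the command timed out and exited after SIGTERM, while 137 (128+9) means it ignored SIGTERM and was eventually SIGKILLed. A quick illustration:

$ timeout 2 sleep 10; echo $?                # exits on SIGTERM -> 124
124
$ timeout -k 3 2 bash -c 'trap "" TERM; while :; do sleep 1; done'; echo $?   # ignores SIGTERM, gets SIGKILL -> 137
137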

@cevich (Member) commented Aug 8, 2024

@edsantiago (Member, Author)

Current flake list. There is one non-remote instance (debian), all others are remote. I'm tempted to remove the remote label but will wait to see if it happens again on a non-remote test.

Flake instances:
sys(11) remote(10) fedora-40-aarch64(7) root(10) host(11) sqlite(11)
podman(1) debian-13(2) rootless(1)
fedora-40(1)
rawhide(1)

@Luap99 (Member) commented Aug 12, 2024

I will try to reproduce this; last time I ran it in a loop for hours without any errors. Not sure if there is some magic requirement I was missing to trigger it.
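
A minimal sketch of that kind of reproduction loop (the container name c_repro is made up; the image and flags come from the failing test; the trailing echo just makes a hung wait visible):

$ while :; do podman run -d --restart=always --name c_repro quay.io/libpod/testimage:20240123 false; timeout 20 podman wait c_repro || echo HANG; podman rm -f c_repro; done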

@Luap99 (Member) commented Aug 13, 2024

$ bin/podman run --name c1 -d --restart on-failure  quay.io/libpod/testimage:20240123 false
930bb7f5bab705b0838c0a73a8ea8acd18a4d31c361ce566ab7c5969e45f1fd0
$ time bin/podman wait c1
1

real    0m20.492s
user    0m0.077s
sys     0m0.164s

--restart=always has some extra special logic to “make it work” better, but using on-failure shows that the current wait logic is clearly not working properly.

@Luap99 self-assigned this Aug 13, 2024
Luap99 added a commit to Luap99/libpod that referenced this issue Aug 15, 2024
The current code did several complicated state checks that simply do not
work properly on a fast-restarting container. It uses a special case for
--restart=always but forgot to take care of --restart=on-failure, which
always hangs for 20s until it runs into the timeout.

The old logic also used to call CheckConmonRunning() but synced the
state beforehand, which means it may check a new conmon every time and
thus miss exits.

The fix makes the new code much simpler: check the conmon pid; if it is
no longer running, check the exit file and get the exit code.

This is related to containers#23473 but I am not sure if this fixes it because we
cannot reproduce.

Signed-off-by: Paul Holzinger <[email protected]>
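
Conceptually, the approach described in that commit can be sketched from the CLI as well (illustration only, not the actual libpod Go code; it assumes podman inspect's State.ConmonPid and State.ExitCode fields, and c1 is the container from the reproduction above):

$ conmon_pid=$(podman inspect --format '{{.State.ConmonPid}}' c1)
$ # kill -0 only tests whether the pid is still alive; once conmon has exited,
$ # the container's exit code has already been recorded and can simply be read back
$ kill -0 "$conmon_pid" 2>/dev/null || podman inspect --format '{{.State.ExitCode}}' c1
1
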
@Luap99 (Member) commented Aug 15, 2024

Let me know if you see any hangs after #23601

@Luap99 (Member) commented Sep 4, 2024

Last logged instance on the flake tracker was Aug 14, so I am going to close this.

@Luap99 closed this as completed Sep 4, 2024
@stale-locking-app bot added the locked - please file new issue/PR label Dec 4, 2024
@stale-locking-app bot locked as resolved and limited conversation to collaborators Dec 4, 2024