
podman healthcheck + sdnotify: Error: container is stopped #22760

Closed
edsantiago opened this issue May 20, 2024 · 3 comments · Fixed by #22764
Labels: flakes (Flakes from Continuous Integration) · jira · locked - please file new issue/PR

Comments

@edsantiago
Member

<+015ms> # # podman run --name IK28z8EJtw --health-cmd=touch /terminate --sdnotify=healthy quay.io/libpod/testimage:20240123 sh -c while test \! -e /terminate; do sleep 0.1; done; echo finished
<+572ms> # finished
         # Error: container is stopped
<+004ms> # [ rc=126 (** EXPECTED 0 **) ]

Reproducer fails within seconds on 1mt:

# while :;do bin/podman run --health-cmd="touch /terminate" --sdnotify=healthy quay.io/libpod/testimage:20240123 sh -c "while test \! -e /terminate; do sleep 0.1; done; echo finished" || break; done
finished
finished
finished
finished
finished
finished
finished
finished
finished
finished
Error: container is stopped

While I'm at it, this is probably a bug too: the above command, run with --rm, fails instantly:

# bin/podman run --rm --health-cmd="touch /terminate" --sdnotify=healthy quay.io/libpod/testimage:20240123 sh -c "while test \! -e /terminate; do sleep 0.1; done; echo finished"
finished
# echo $?
127

Almost certainly related to #22658. @giuseppe PTAL. Only seen on aarch64, but that's consistent with the previous flake that you fixed in your PR.

sys(2) podman(2) fedora-40-aarch64(2) root(2) host(2) sqlite(2)
@edsantiago edsantiago added the flakes Flakes from Continuous Integration label May 20, 2024
@edsantiago
Member Author

OBTW: vim /usr/share/containers/storage.conf and comment out the thinpool line, otherwise you get lots of nasty warnings.

@giuseppe
Member

Thanks for the report. The --sdnotify=healthy feature has a race condition in its implementation, because we release the lock on the container midway through https://github.com/containers/podman/blob/main/libpod/container_internal.go#L1316-L1323. When the lock is released, the cleanup process deletes the container. I am not sure yet how this can be solved, other than trying not to report "container is stopped" as an error.
The issue is even more evident with --rm, since the container is gone once the lock is released (and the cleanup process was faster). Maybe the easiest fix for now is to just disallow --rm together with --sdnotify=healthy.
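
For illustration, here is a minimal Go sketch of that locking pattern, using toy stand-in types rather than the real libpod code: the waiter has to drop the container lock between health-status polls, and in that unlocked window the cleanup process can stop and remove the container before the waiter ever observes the healthy transition.

package main

import (
    "errors"
    "fmt"
    "sync"
    "time"
)

// Toy stand-in for the libpod container; this only illustrates the race,
// it is not podman's actual data structure.
type Container struct {
    lock    sync.Mutex
    healthy bool
    stopped bool // set by the cleanup process once the container exits
}

// waitForHealthy polls the health status until the container is healthy.
// It must drop the lock between polls so other podman processes (such as
// "podman container cleanup") can acquire it; that unlock is the window
// in which the container can be stopped and removed.
func waitForHealthy(ctr *Container, interval time.Duration) error {
    for {
        ctr.lock.Lock()
        if ctr.stopped {
            ctr.lock.Unlock()
            return errors.New("container is stopped") // what the flake reports
        }
        healthy := ctr.healthy
        ctr.lock.Unlock() // lock released midway: cleanup may run right here

        if healthy {
            return nil
        }
        time.Sleep(interval)
    }
}

func main() {
    ctr := &Container{}

    // Simulate the cleanup process winning the race: the container is
    // stopped before the waiter ever sees a healthy status.
    go func() {
        time.Sleep(5 * time.Millisecond)
        ctr.lock.Lock()
        ctr.stopped = true
        ctr.lock.Unlock()
    }()

    fmt.Println(waitForHealthy(ctr, time.Millisecond)) // prints: container is stopped
}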

@giuseppe
Member

opened a tentative PR: #22764

Marked as a draft since I want to test it better. It doesn't solve the root issue, as it would still fail if the healthcheck status change takes much longer than one waiting interval.
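
As a rough sketch of that approach, reusing the toy types from the earlier example (this is not the actual code in #22764): when the waiter sees the container has transitioned to stopped, it grants one extra polling interval for the healthcheck status to land before reporting the error, which narrows the race window but does not eliminate it.

// waitForHealthyWithGrace behaves like waitForHealthy above, but tolerates
// seeing the container stopped once, giving the healthcheck status one
// more interval to change before failing.
func waitForHealthyWithGrace(ctr *Container, interval time.Duration) error {
    sawStopped := false
    for {
        ctr.lock.Lock()
        healthy, stopped := ctr.healthy, ctr.stopped
        ctr.lock.Unlock()

        if healthy {
            return nil
        }
        if stopped {
            if sawStopped {
                return errors.New("container is stopped")
            }
            sawStopped = true // allow one extra interval before giving up
        }
        time.Sleep(interval)
    }
}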

@giuseppe giuseppe added the jira label May 21, 2024
giuseppe added a commit to giuseppe/libpod that referenced this issue May 22, 2024
wait for another interval when the container has transitioned to "stopped",
to give the healthcheck status more time to change.

Closes: containers#22760

Signed-off-by: Giuseppe Scrivano <[email protected]>