
test/system/260-sdnotify.bats: fix test flake #18319

Merged 1 commit into containers:main on Apr 24, 2023

Conversation

vrothberg (Member)

The exec session sometimes exits with 137 because it races with the cleanup process of the exiting container. Fix the flake by running a detached exec session.

Fixes: #10825

Does this PR introduce a user-facing change?

None

The `exec` session sometimes exits with 137 because it races with the
cleanup process of the exiting container.  Fix the flake by running a
detached exec session.

Fixes: containers#10825
Signed-off-by: Valentin Rothberg <[email protected]>
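
For illustration, a minimal before/after sketch of the change (hypothetical test code; the actual diff may differ):

```bash
# Before (racy): podman exec waits for `touch` to exit. If the container
# sees /stop and starts cleaning up first, the exec session can be
# SIGKILLed and report exit code 137 (128 + 9).
run_podman exec $cid touch /stop

# After: detach immediately. podman returns without waiting on the
# exec'd process, so the container teardown can no longer kill a
# process the test is still waiting on.
run_podman exec -d $cid touch /stop
```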
@openshift-ci bot (Contributor) commented Apr 24, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: vrothberg

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci bot added the `approved` label on Apr 24, 2023
@Luap99 (Member) left a comment


LGTM but I want @edsantiago to have a look.

@edsantiago (Member)

I keep staring at it but I'm not sure I fully understand it. Would I be correct in saying that this is the timeline in question?

t0       container starts, loops waiting for /stop
t1                exec touch /stop     (begins)
t2       container sees /stop, exits
t3                exec touch /stop      (still running, but container stops and cleans up, so it gets killed)

Is that accurate? If so, then is it safe to say that the following, my previous mental model, is inaccurate?

t0     container starts, loops waiting for /stop
t1              exec touch /stop   (atomic, begins and terminates nice and cleanly)
t2     container sees /stop, exits

...and then if so, there are a LOT of other places where I use that convention for signaling containers to terminate. This PR doesn't even address the most recently reported flake. Shouldn't all of those be updated to `exec -d`? Or, to take a step back: is there a better non-signal way to tell containers to exit cleanly?

test/system/030-run.bats:371:    run_podman exec $cid touch /tmp/stop
test/system/065-cp.bats:896:    run_podman exec cpcontainer touch /tmp/file.txt
test/system/065-cp.bats:917:    run_podman exec cpcontainer touch /tmp/empty.txt
test/system/075-exec.bats:83:    run_podman exec $cid touch /stop
test/system/075-exec.bats:100:    run_podman exec --user=$(id -un) $cid touch /tmp/stop
test/system/130-kill.bats:75:    run_podman exec $cid touch /stop
test/system/220-healthcheck.bats:52:    run_podman exec healthcheck_c touch /uh-oh
test/system/220-healthcheck.bats:94:    run_podman exec $ctr touch /uh-oh
test/system/220-healthcheck.bats:136:        run_podman exec $ctr touch /uh-oh
test/system/250-systemd.bats:341:    run_podman exec $cname touch /uh-oh
test/system/260-sdnotify.bats:138:    run_podman exec $cid touch /stop
test/system/260-sdnotify.bats:178:    run_podman exec $cid touch /stop
test/system/260-sdnotify.bats:231:    run_podman exec test_pod-test touch /tearsinrain
test/system/260-sdnotify.bats:342:    run_podman exec $container_a /bin/touch /stop
test/system/320-system-df.bats:127:    run_podman exec c2 touch /stop
test/system/410-selinux.bats:107:    run_podman exec myc touch /stop

@vrothberg (Member, Author)

> Is that accurate? If so, then is it safe to say that the following, my previous mental model, is inaccurate?

I believe so, yes. `podman exec` is not atomic; the container is not locked for the lifespan of the exec session.

> This PR doesn't even address #10825 (comment).

Can you elaborate? That is literally the flake I wanted to fix.

> Shouldn't all of those be updated to `exec -d`?

Good thinking. It sounds logical that these other tests are also subject to this race.

> Or, to take a step back: is there a better non-signal way to tell containers to exit cleanly?

Having the container wait for a 2nd signal and exit 0 could be an alternative approach. Then we'd replace the `exec touch /stop` with a single `podman kill -s 2ndSignal`.
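
A rough sketch of that alternative, assuming the suite's `run_podman` and `is` helpers (hypothetical, not part of this PR):

```bash
# Container traps USR1 and exits 0; no /stop file, no exec session.
run_podman run -d $IMAGE sh -c \
    'trap "exit 0" USR1; while :; do sleep 0.1; done'
cid="$output"

# A single podman invocation replaces the racy `exec touch /stop`.
run_podman kill -s USR1 $cid
run_podman wait $cid
is "$output" "0" "container exited cleanly on USR1"
```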

@Luap99 (Member) commented Apr 24, 2023

exec is never atomic; it just spawns a process. We have no control over how the kernel scheduler runs them. I don't even think podman kills the exec session: when the main container process (PID 1 in the PID namespace) exits, the kernel will SIGKILL all processes in that namespace.
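
That mechanism is easy to reproduce outside the test suite (a hypothetical demo, not from this PR):

```bash
# PID 1 in the container exits after 2 seconds; the kernel then tears
# down the PID namespace and SIGKILLs everything left in it, including
# the exec'd sleep.
podman run -d --name demo alpine sleep 2
podman exec demo sleep 10
echo $?   # typically 137 (128 + SIGKILL) once the main process exits
podman rm demo
```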

@edsantiago (Member)

> Can you elaborate? That is literally the flake I wanted to fix.

Gah. Sorry, I got myself turned around. The flake not fixed is the one in comment 0, the originally-reported one in the "sdnotify : container" test.

@vrothberg (Member, Author)

> > Can you elaborate? That is literally the flake I wanted to fix.
>
> Gah. Sorry, I got myself turned around. The flake not fixed is the one in comment 0, the originally-reported one in the "sdnotify : container" test.

Ah, thanks! That is good, I think, since it fits well into the exec-stop race theory and your words of wisdom that we should take a step back and look at all candidates. For now I think `exec -d` is a good way to address the race. I will take another look tomorrow morning.

@edsantiago (Member)

/lgtm

I'm going to take a hard look at the rest of the instances, with my new understanding in mind. Thanks to everyone for setting me straight.

@openshift-ci bot added the `lgtm` label on Apr 24, 2023
@vrothberg (Member, Author)

Appreciate it, thanks, @edsantiago

@openshift-ci bot merged commit 9f4f429 into containers:main on Apr 24, 2023
@Luap99 (Member) commented Apr 24, 2023

@edsantiago One thing you could try if you really want to avoid signals is to bind mount a directory and touch the file from the host there.
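
A sketch of what that could look like, with hypothetical names (`:Z` relabels the mount for SELinux):

```bash
# Signal the container from the host via a shared directory; no exec
# session is involved, so there is nothing to race with.
stopdir=$(mktemp -d)
run_podman run -d -v "$stopdir:/signals:Z" $IMAGE \
    sh -c 'while ! test -e /signals/stop; do sleep 0.1; done'
cid="$output"

touch "$stopdir/stop"   # plain host-side touch, no container lock taken
run_podman wait $cid
```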

@edsantiago (Member)

@Luap99 good suggestion, thank you

@github-actions bot added the `locked - please file new issue/PR` label, locked the conversation as resolved, and limited it to collaborators on Aug 26, 2023
Labels: approved, lgtm, locked - please file new issue/PR, release-note-none
Linked issue: podman exec sometimes exits 137