
test/system/260-sdnotify.bats: fix test flake #18319

Merged 1 commit into containers:main on Apr 24, 2023

Conversation

vrothberg (Member)

The exec session sometimes exits with 137 because it races with the cleanup process of the exiting container. Fix the flake by running a detached exec session.

Fixes: #10825

Does this PR introduce a user-facing change?

None

The `exec` session sometimes exits with 137 because it races with the
cleanup process of the exiting container.  Fix the flake by running a
detached exec session.

Fixes: containers#10825
Signed-off-by: Valentin Rothberg <[email protected]>
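
For illustration, a minimal before/after sketch of the change (hypothetical test code; the actual diff may differ):

```bash
# Before (racy): podman exec waits for `touch` to exit. If the container
# sees /stop and starts cleaning up first, the exec session can be
# SIGKILLed and report exit code 137 (128 + 9).
run_podman exec $cid touch /stop

# After: detach immediately. podman returns without waiting on the
# exec'd process, so the container teardown can no longer kill a
# process the test is still waiting on.
run_podman exec -d $cid touch /stop
```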
@openshift-ci bot (Contributor) commented Apr 24, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: vrothberg

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci bot added the `approved` label on Apr 24, 2023
@Luap99 (Member) left a comment


LGTM but I want @edsantiago to have a look.

@edsantiago (Member)

I keep staring at it but I'm not sure I fully understand it. Would I be correct in saying that this is the timeline in question?

t0       container starts, loops waiting for /stop
t1                exec touch /stop     (begins)
t2       container sees /stop, exits
t3                exec touch /stop      (still running, but container stops and cleans up, so it gets killed)

Is that accurate? If so, then is it safe to say that the following, my previous mental model, is inaccurate?

t0     container starts, loops waiting for /stop
t1              exec touch /stop   (atomic, begins and terminates nice and cleanly)
t2     container sees /stop, exits

...and then if so, there are a LOT of other places where I use that convention for signaling containers to terminate. This PR doesn't even address the most recently reported flake. Shouldn't all of those be updated to `exec -d`? Or, to take a step back: is there a better non-signal way to tell containers to exit cleanly?

test/system/030-run.bats:371:    run_podman exec $cid touch /tmp/stop
test/system/065-cp.bats:896:    run_podman exec cpcontainer touch /tmp/file.txt
test/system/065-cp.bats:917:    run_podman exec cpcontainer touch /tmp/empty.txt
test/system/075-exec.bats:83:    run_podman exec $cid touch /stop
test/system/075-exec.bats:100:    run_podman exec --user=$(id -un) $cid touch /tmp/stop
test/system/130-kill.bats:75:    run_podman exec $cid touch /stop
test/system/220-healthcheck.bats:52:    run_podman exec healthcheck_c touch /uh-oh
test/system/220-healthcheck.bats:94:    run_podman exec $ctr touch /uh-oh
test/system/220-healthcheck.bats:136:        run_podman exec $ctr touch /uh-oh
test/system/250-systemd.bats:341:    run_podman exec $cname touch /uh-oh
test/system/260-sdnotify.bats:138:    run_podman exec $cid touch /stop
test/system/260-sdnotify.bats:178:    run_podman exec $cid touch /stop
test/system/260-sdnotify.bats:231:    run_podman exec test_pod-test touch /tearsinrain
test/system/260-sdnotify.bats:342:    run_podman exec $container_a /bin/touch /stop
test/system/320-system-df.bats:127:    run_podman exec c2 touch /stop
test/system/410-selinux.bats:107:    run_podman exec myc touch /stop

@vrothberg (Member, Author)

> Is that accurate? If so, then is it safe to say that the following, my previous mental model, is inaccurate?

I believe so, yes. `podman exec` is not atomic; the container is not locked for the lifespan of the exec session.

> This PR doesn't even address #10825 (comment).

Can you elaborate? That is literally the flake I wanted to fix.

> Shouldn't all of those be updated to `exec -d`?

Good thinking. It sounds logical that these other tests are also subject to this race.

> Or, to take a step back: is there a better non-signal way to tell containers to exit cleanly?

Having the container wait for a 2nd signal and exit 0 could be an alternative approach. Then we'd replace the `exec touch /stop` with a single `podman kill -s 2ndSignal`.
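
A rough sketch of that alternative, assuming the suite's `run_podman` and `is` helpers (hypothetical, not part of this PR):

```bash
# Container traps USR1 and exits 0; no /stop file, no exec session.
run_podman run -d $IMAGE sh -c \
    'trap "exit 0" USR1; while :; do sleep 0.1; done'
cid="$output"

# A single podman invocation replaces the racy `exec touch /stop`.
run_podman kill -s USR1 $cid
run_podman wait $cid
is "$output" "0" "container exited cleanly on USR1"
```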

@Luap99 (Member) commented Apr 24, 2023

exec is never atomic; it just spawns a process. We have no control over how the kernel scheduler runs them. I don't even think podman kills the exec session: when the main container process (PID 1 in the PID namespace) exits, the kernel will SIGKILL all processes in that namespace.
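
That mechanism is easy to reproduce outside the test suite (a hypothetical demo, not from this PR):

```bash
# PID 1 in the container exits after 2 seconds; the kernel then tears
# down the PID namespace and SIGKILLs everything left in it, including
# the exec'd sleep.
podman run -d --name demo alpine sleep 2
podman exec demo sleep 10
echo $?   # typically 137 (128 + SIGKILL) once the main process exits
podman rm demo
```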

@edsantiago (Member)

> Can you elaborate? That is literally the flake I wanted to fix.

Gah. Sorry, I got myself turned around. The flake not fixed is the one in comment 0, the originally-reported one in the "sdnotify : container" test.

@vrothberg (Member, Author)

> > Can you elaborate? That is literally the flake I wanted to fix.
>
> Gah. Sorry, I got myself turned around. The flake not fixed is the one in comment 0, the originally-reported one in the "sdnotify : container" test.

Ah, thanks! That is good, I think, since it fits well into the exec-stop race theory and your words of wisdom that we should take a step back and look at all candidates. For now I think `exec -d` is a good way to address the race. I will take another look tomorrow morning.

@edsantiago (Member)

/lgtm

I'm going to take a hard look at the rest of the instances, with my new understanding in mind. Thanks to everyone for setting me straight.

@openshift-ci bot added the `lgtm` label on Apr 24, 2023
@vrothberg (Member, Author)

Appreciate it, thanks, @edsantiago

@openshift-ci bot merged commit 9f4f429 into containers:main on Apr 24, 2023
@Luap99 (Member) commented Apr 24, 2023

@edsantiago One thing you could try if you really want to avoid signals is to bind mount a directory and touch the file from the host there.
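
A sketch of what that could look like, with hypothetical names (`:Z` relabels the mount for SELinux):

```bash
# Signal the container from the host via a shared directory; no exec
# session is involved, so there is nothing to race with.
stopdir=$(mktemp -d)
run_podman run -d -v "$stopdir:/signals:Z" $IMAGE \
    sh -c 'while ! test -e /signals/stop; do sleep 0.1; done'
cid="$output"

touch "$stopdir/stop"   # plain host-side touch, no container lock taken
run_podman wait $cid
```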

@edsantiago (Member)

@Luap99 good suggestion, thank you

@github-actions bot added the `locked - please file new issue/PR` label, locked the conversation as resolved, and limited it to collaborators on Aug 26, 2023
Labels: approved, lgtm, locked - please file new issue/PR, release-note-none
Linked issue: podman exec sometimes exits 137