remote: podman kill: loses exit code #9751

Closed
edsantiago opened this issue Mar 18, 2021 · 13 comments · Fixed by #10028
Assignees
Labels
flakes (Flakes from Continuous Integration), kind/bug (Categorizes issue or PR as related to a bug), locked - please file new issue/PR, remote (Problem is in podman-remote)

Comments

@edsantiago
Member

edsantiago commented Mar 18, 2021

New flake. Only three instances in two months, but that's enough to make me worry:

[+0104s] not ok 51 podman stop - unlock while waiting for timeout
(from function `is' in file test/system/helpers.bash, line 406,
 in test file test/system/050-stop.bats, line 95)
  `is "$output" "137" "Exit code of killed container"' failed
# podman-remote rm --all --force
# podman-remote ps --all --external --format {{.ID}} {{.Names}}
# podman-remote images --all --format {{.Repository}}:{{.Tag}} {{.ID}}
quay.io/libpod/testimage:20200929 766ff5a3a7e4
# podman-remote run --name stopme -d quay.io/libpod/testimage:20200929 sh -c trap 'echo Received SIGTERM, ignoring' SIGTERM; echo READY; while :; do sleep 1; done
2da656948070442ce2585b9f2a83d003b0a1c44ea5a67b47a3b7b4a9d1f4bc9d
# podman-remote ps -a
CONTAINER ID  IMAGE                              COMMAND               CREATED                 STATUS                     PORTS   NAMES
2da656948070  quay.io/libpod/testimage:20200929  sh -c trap 'echo ...  Less than a second ago  Up Less than a second ago          stopme
# podman-remote inspect --format {{.State.Status}} stopme
stopping
# podman-remote kill stopme
2da656948070442ce2585b9f2a83d003b0a1c44ea5a67b47a3b7b4a9d1f4bc9d
# podman-remote inspect --format {{.State.ExitCode}} stopme
0
#/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
#|     FAIL: Exit code of killed container
#| expected: '137'
#|   actual: '0'
#\^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

sys: podman stop - unlock while waiting for timeout

It's possible that the bug is in the test itself.

@edsantiago edsantiago added flakes Flakes from Continuous Integration kind/bug Categorizes issue or PR as related to a bug. remote Problem is in podman-remote labels Mar 18, 2021
@edsantiago
Member Author

I just saw it again, f33:

 ✗ podman stop - unlock while waiting for timeout
   (from function `is' in file /usr/share/podman/test/system/helpers.bash, line 411,
    in test file /usr/share/podman/test/system/050-stop.bats, line 138)
     `is "$output" "137" "Exit code of killed container"' failed
   # /usr/bin/podman-remote rm --all --force
   # /usr/bin/podman-remote ps --all --external --format {{.ID}} {{.Names}}
   # /usr/bin/podman-remote images --all --format {{.Repository}}:{{.Tag}} {{.ID}}
   quay.io/libpod/testimage:20210223 4abe85964f5c
   # /usr/bin/podman-remote run --name stopme -d quay.io/libpod/testimage:20210223 sh -c trap 'echo Received SIGTERM, ignoring' SIGTERM; echo READY; while :; do sleep 1; done
   cd52dbb101531433498ca372ffd9db273684ecac76a1e091e078261a396084ac
   # /usr/bin/podman-remote ps -a
   CONTAINER ID  IMAGE                              COMMAND               CREATED                 STATUS    PORTS   NAMES
   cd52dbb10153  quay.io/libpod/testimage:20210223  sh -c trap 'echo ...  Less than a second ago  stopping          stopme
   # /usr/bin/podman-remote inspect --format {{.State.Status}} stopme
   stopping
   # /usr/bin/podman-remote kill stopme
   stopme
   # /usr/bin/podman-remote inspect --format {{.State.ExitCode}} stopme
   0
   #/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
   #|     FAIL: Exit code of killed container
   #| expected: '137'
   #|   actual: '0'
   #\^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

@edsantiago
Member Author

I'm starting to think there's a more general race in podman-remote exit-status handling:

Trivial reproducer (assumes podman system service running elsewhere):

$ cat Containerfile
FROM quay.io/libpod/nosuchimage:nosuchtag
RUN echo hi
$ while :;do ../bin/podman-remote build -t build_test --pull-never . && break;done
...runs for a minute or two, then

STEP 1: FROM quay.io/libpod/nosuchimage:nosuchtag
Error: error creating build container: pull policy is "never" but "quay.io/libpod/nosuchimage:nosuchtag" could not be found locally

STEP 1: FROM quay.io/libpod/nosuchimage:nosuchtag
$ <--- back to prompt, with exit status 0. This should never happen.

My interpretation: the FROM, running on the server, is failing. The server aborts, the client aborts, but the client somehow doesn't get the error message nor exit status. The above is, so far, a 100% reliable reproducer, root and rootless.

@edsantiago
Member Author

Another one in kill

sys: podman stop - unlock while waiting for timeout

@edsantiago
Member Author

And a few more:

sys: podman stop - unlock while waiting for timeout

I'm having trouble keeping track of which ones I've already posted, so I may have missed some.

@rhatdan
Member

rhatdan commented Apr 14, 2021

This is very strange, since podman and podman-remote kill use the same code path for kill. One would think this is a race condition between kill and inspect, but local execution should be faster than remote. It could be that the kill kills the process inside the container, but the cleanup call does not happen until after the inspect. I would expect that to be the same race locally as remotely.

@mheon WDYT?

@rhatdan
Member

rhatdan commented Apr 14, 2021

We do have this code for the Docker-compatible endpoint; we could do the same for libpod. I'm not sure why we wouldn't want to.

	// Docker waits for the container to stop if the signal is 0 or
	// SIGKILL.
	if !utils.IsLibpodRequest(r) {
		sig, err := signal.ParseSignalNameOrNumber(query.Signal)
		if err != nil {
			utils.InternalServerError(w, err)
			return
		}
		if sig == 0 || syscall.Signal(sig) == syscall.SIGKILL {
			opts := entities.WaitOptions{
				Condition: []define.ContainerStatus{define.ContainerStateExited, define.ContainerStateStopped},
				Interval:  time.Millisecond * 250,
			}
			if _, err := containerEngine.ContainerWait(r.Context(), []string{name}, opts); err != nil {
				utils.Error(w, "Something went wrong.", http.StatusInternalServerError, err)
				return
			}
		}
	}

@mheon
Member

mheon commented Apr 14, 2021

The test as written is inherently racy. podman kill does not, and has never, promised to wait until the container is dead. As such, once podman-remote kill exits, the signal has been sent, and the kernel will deliver it when it next wakes the container. So the race here is whether the container gets scheduled before Podman gets past its initial state check for the container. I imagine the container being a child of podman-remote somewhat reduces its priority to run, as it's backgrounded?

The test should be rewritten to use podman stop to remove races.

@mheon
Member

mheon commented Apr 14, 2021

Alternatively, it needs something to force inspect to run after the container has stopped. Maybe a loop to check container state?

@edsantiago
Member Author

@mheon this is more than just podman-remote kill; or perhaps there's another, different bug? Search in-page for 9920; that one manifests in podman-remote build.

@mheon
Member

mheon commented Apr 14, 2021

That sounds like a separate bug? From where I stand, the original issue here with podman kill is definitely a race in the code. podman build may also be a race but since the commands are different it sounds distinct.

@rhatdan
Member

rhatdan commented Apr 14, 2021

If we changed the test to do

podman kill
podman wait
podman inspect

The issue probably goes away, correct?
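The proposed sequence could be sketched as below. This is illustrative only: `podman` here is a stubbed shell function so the flow runs without a daemon; the real test calls the actual binary, and `podman wait` blocks until the container exits (printing its exit code):

```shell
#!/usr/bin/env bash
set -e

# Stubbed podman for illustration only; the real test invokes the podman binary.
podman() {
  case "$1" in
    kill)    echo stopme ;;
    wait)    echo 137 ;;   # podman wait blocks until exit and prints the exit code
    inspect) echo 137 ;;
  esac
}

podman kill stopme
podman wait stopme                  # blocks until the container has exited
rc=$(podman inspect --format '{{.State.ExitCode}}' stopme)
echo "exit code: $rc"
```

With the `wait` in between, `inspect` can no longer observe the container before its exit code has been recorded.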

@edsantiago
Member Author

I'll submit a PR for that. (And will file the build problem as a new issue)

edsantiago added a commit to edsantiago/libpod that referenced this issue Apr 14, 2021
Add 'podman wait' between kill & inspect.

Fixes: containers#9751

Signed-off-by: Ed Santiago <[email protected]>
jmguzik pushed a commit to jmguzik/podman that referenced this issue Apr 26, 2021
Add 'podman wait' between kill & inspect.

Fixes: containers#9751

Signed-off-by: Ed Santiago <[email protected]>
@github-actions github-actions bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Sep 22, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 22, 2023