
podman exec sometimes exits 137 #10825

Closed
edsantiago opened this issue Jun 30, 2021 · 58 comments · Fixed by #18319 or #18323
Labels
bugweek flakes Flakes from Continuous Integration locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. rootless

Comments

@edsantiago
Member

I see this once in a while. Too infrequently to have a good reproducer. Only rootless so far:

$ while :;do bats --filter container /usr/share/podman/test/system/*sdnot*bats || break;done
...mostly works... then:
 ✗ sdnotify : container
   (from function `die' in file /usr/share/podman/test/system/helpers.bash, line 413,
    from function `run_podman' in file /usr/share/podman/test/system/helpers.bash, line 221,
    in test file /usr/share/podman/test/system/260-sdnotify.bats, line 163)
     `run_podman exec $cid touch /stop' failed with status 137
   $ podman rm --all --force
   $ podman ps --all --external --format {{.ID}} {{.Names}}
   $ podman images --all --format {{.Repository}}:{{.Tag}} {{.ID}}
   quay.io/libpod/testimage:20210610 9f9ec7f2fdef
   $ podman pull quay.io/libpod/fedora:31
   Trying to pull quay.io/libpod/fedora:31...
   Getting image source signatures
   Copying blob sha256:c28ace6b0c4ae099f6f81091731bdf41d9771d28bad96ae4a3507fe950560930
   Copying config sha256:a7a37f74ff864eec55891b64ad360d07020827486e30a92ea81d16459645b26a
   Writing manifest to image destination
   Storing signatures
   a7a37f74ff864eec55891b64ad360d07020827486e30a92ea81d16459645b26a
   $ podman run -d --sdnotify=container quay.io/libpod/fedora:31 sh -c printenv NOTIFY_SOCKET;echo READY;systemd-notify --ready;while ! test -f /stop;do sleep 0.1;done
   a98e2e8a856772144e2297d4dc3b2d21ffa7ab5ff6088a998168d08add9264f4
   $ podman logs a98e2e8a856772144e2297d4dc3b2d21ffa7ab5ff6088a998168d08add9264f4
   /tmp/podman_bats.HrMDlg/container.sock/notify
   READY
   $ podman logs a98e2e8a856772144e2297d4dc3b2d21ffa7ab5ff6088a998168d08add9264f4
   /tmp/podman_bats.HrMDlg/container.sock/notify
   READY
   $ podman exec a98e2e8a856772144e2297d4dc3b2d21ffa7ab5ff6088a998168d08add9264f4 touch /stop
   [ rc=137 (** EXPECTED 0 **) ]
   #/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
   #| FAIL: exit code is 137; expected 0
   #\^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   # [teardown]
   $ podman pod rm --all --force
   $ podman rm --all --force
   a98e2e8a856772144e2297d4dc3b2d21ffa7ab5ff6088a998168d08add9264f4

1 test, 1 failure
@edsantiago edsantiago added flakes Flakes from Continuous Integration rootless labels Jun 30, 2021
@mheon
Member

mheon commented Jun 30, 2021

Is this latest code, or a released version?

@edsantiago
Member Author

Sorry. The above is podman-3.3.0-0.20.dev.git599b7d7.fc35 but I've also seen it in f34 (bodhi-testing rpms). I do not see any instances in actual CI.

@mheon
Member

mheon commented Jun 30, 2021

Alright. I made a pretty substantial change to how exit codes are retrieved in main, but it has not reached F34 yet so that suggests it was not my changes.

@Luap99
Member

Luap99 commented Jul 2, 2021

@edsantiago
Member Author

Interesting - yes, that's the same symptom, and now it's in CI. (FWIW I've started preserving CI logs of flakes, and this is the only instance of the touch /stop.*137 error I can find).

@edsantiago
Member Author

Another one: sys podman fedora-33 root host in #10880

@github-actions

A friendly reminder that this issue had no activity for 30 days.

@rhatdan
Member

rhatdan commented Aug 12, 2021

@flouthoc PTAL

@flouthoc
Collaborator

@edsantiago are we still seeing this issue? Could you point me to any more such cases? Maybe the root cause is the underlying infra running out of memory; I'll try to recreate this.

@edsantiago
Member Author

edsantiago commented Aug 16, 2021

I was able to reproduce the failure using the recipe in comment 0. It took about 45 minutes (I didn't add timestamps, I just checked that window periodically). podman-3.3.0-0.26.rc1.fc35 on 5.13.0-0.rc7.20210625git44db63d1ad8d.55.fc35

I don't see any new instances in CI, though; and I don't regularly monitor bodhi failures.

EDIT: in the (unlikely) case that error log is useful:

 ✗ sdnotify : container
   (from function `die' in file /usr/share/podman/test/system/helpers.bash, line 431,
    from function `run_podman' in file /usr/share/podman/test/system/helpers.bash, line 221,
    in test file /usr/share/podman/test/system/260-sdnotify.bats, line 163)
     `run_podman exec $cid touch /stop' failed with status 137
   $ podman rm --all --force
   $ podman ps --all --external --format {{.ID}} {{.Names}}
   $ podman images --all --format {{.Repository}}:{{.Tag}} {{.ID}}
   quay.io/libpod/testimage:20210610 9f9ec7f2fdef
   $ podman pull quay.io/libpod/fedora:31
   Trying to pull quay.io/libpod/fedora:31...
   Getting image source signatures
   Copying blob sha256:c28ace6b0c4ae099f6f81091731bdf41d9771d28bad96ae4a3507fe950560930
   Copying config sha256:a7a37f74ff864eec55891b64ad360d07020827486e30a92ea81d16459645b26a
   Writing manifest to image destination
   Storing signatures
   a7a37f74ff864eec55891b64ad360d07020827486e30a92ea81d16459645b26a
   $ podman run -d --sdnotify=container quay.io/libpod/fedora:31 sh -c printenv NOTIFY_SOCKET;echo READY;systemd-notify --ready;while ! test -f /stop;do sleep 0.1;done
   52cff7db8ddd73a7fef9dc200d21a6e853f40b2e08c68273f6cc12b767b88562
   $ podman logs 52cff7db8ddd73a7fef9dc200d21a6e853f40b2e08c68273f6cc12b767b88562
   /tmp/podman_bats.njUKUq/container.sock/notify
   READY
   $ podman logs 52cff7db8ddd73a7fef9dc200d21a6e853f40b2e08c68273f6cc12b767b88562
   /tmp/podman_bats.njUKUq/container.sock/notify
   READY
   $ podman exec 52cff7db8ddd73a7fef9dc200d21a6e853f40b2e08c68273f6cc12b767b88562 touch /stop
   [ rc=137 (** EXPECTED 0 **) ]
   #/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
   #| FAIL: exit code is 137; expected 0
   #\^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   # [teardown]
   $ podman pod rm --all --force
   $ podman rm --all --force
   52cff7db8ddd73a7fef9dc200d21a6e853f40b2e08c68273f6cc12b767b88562

@github-actions

A friendly reminder that this issue had no activity for 30 days.

@rhatdan
Member

rhatdan commented Sep 16, 2021

Exit code 137 seems to mean that the process was killed by SIGKILL. (Google mentions that it might be the oom-killer, but that is effectively the same thing.)
Something is killing the exec process. Likely candidates are Podman, systemd, or the kernel.

@flouthoc
Collaborator

Not sure if this is really frequent in CI as @edsantiago mentioned, but I'll try to recreate it.

@edsantiago
Member Author

Seen September 20 in ubuntu-2104-rootless. No, it's not frequent, but it's still out there.

@github-actions

A friendly reminder that this issue had no activity for 30 days.

@edsantiago
Member Author

sys: podman system df - with active containers and volumes

sys: sdnotify : container

@github-actions

A friendly reminder that this issue had no activity for 30 days.

@edsantiago
Member Author

My cirrus flake logs show no further instances of this since the 10-18 one. Closing; I'll reopen if it becomes a problem again.

@edsantiago
Member Author

Aaaand, reopening.

[sys] 235 sdnotify : container

@vrothberg
Member

@containers/podman-maintainers since this is now being seen in the Real World, maybe it's a good candidate for Bug Week?

I created a new bugweek label. Feel free to label more issues.

@vrothberg
Member

Cc: @mheon

@edsantiago edsantiago changed the title [placeholder] podman exec sometimes exits 137 podman exec sometimes exits 137 Apr 20, 2023
@rhatdan
Member

rhatdan commented Apr 20, 2023

Since 137 indicates the process was killed by the OOM killer, I am not sure how we solve this problem. In testing, @Luap99's cleanup of leaks might potentially help.

@Luap99
Member

Luap99 commented Apr 21, 2023

Assuming the memory reporting in the cirrus tasks is accurate, these tests have plenty of free memory, so the oom killer shouldn't do anything here.

@rhatdan
Member

rhatdan commented Apr 21, 2023

Could it be cgroups memory limits triggering it? If tools are getting 137 exit codes, is there any way to get that other than an OOM kill? Google seems to point at it.

@rhatdan
Member

rhatdan commented Apr 21, 2023

Google exit code 137

First response is:

Exit code 137 means a container or pod is trying to use more memory than it's allowed. The process gets terminated to prevent memory usage ballooning indefinitely, which could cause your host system to become unstable.

@vrothberg
Member

The memory thing could totally be a red herring. Just kill a running container:

$ podman run alpine sleep infinity
$ echo $?                         
137                                                  
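The same status can be produced with no podman or container at all; a minimal sketch in plain shell:

```shell
# A process that dies from SIGKILL has no exit code of its own;
# the calling shell reports the conventional 128 + 9 = 137 instead.
sh -c 'kill -KILL $$'   # the child kills itself with signal 9
echo "exit status: $?"  # prints "exit status: 137"
```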

@vrothberg
Member

What I see in the journal is that the container dies during the 2nd exec. I do not have a good explanation yet for why, though. The kube play session is run via `timeout --foreground -v --kill=10 60 podman &` but the test runs for a second, so way below 1 minute.

@Luap99
Member

Luap99 commented Apr 24, 2023

Could this be a very simple race with the touch /stop logic? The main container waits for the file and then exits. When the container exits, it kills all exec sessions AFAIK. So there is a very tiny window where touch creates the file but gets interrupted before it exits 0: the main container process sees the file and exits, causing podman to kill the exec session.
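If that theory is right, the failure mode can be imitated without podman; in this hypothetical stand-in, the "exec session" is SIGKILLed right after creating the stop file but before it can exit 0, so the caller sees 137 even though the file exists:

```shell
# Hypothetical stand-in for the race (no podman involved).
stop=$(mktemp -u)                        # an unused temp path as the stop file
sh -c "touch '$stop'; kill -KILL \$\$" & # "exec" creates the file, then is killed
wait $!
echo "exec session status: $?"           # 137, despite the file existing
test -f "$stop" && echo "stop file was created anyway"
rm -f "$stop"
```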

@vrothberg
Member

Very good thinking, @Luap99. That indeed is a nice explanation of what might be happening. It has also happened in older versions of this very test and always on the touch /stop exec.

vrothberg added a commit to vrothberg/libpod that referenced this issue Apr 24, 2023
The `exec` session sometimes exits with 137 because the exec session races
with the cleanup process of the exiting container.  Fix the flake by
running a detached exec session.

Fixes: containers#10825
Signed-off-by: Valentin Rothberg <[email protected]>
@vrothberg
Member

Opened #18319

@rhatdan
Member

rhatdan commented Apr 24, 2023

$ podman run alpine sleep infinity
$ echo $?                         
137    

Could that be a kernel bug?

@rhatdan
Member

rhatdan commented Apr 24, 2023

man timeout
...
137 if COMMAND (or timeout itself) is sent the KILL (9) signal (128+9)

Looks like this is expected behavior. Sadly, I never knew this:
a process killed by SIGKILL exits with 137.

@vrothberg
Member

Yes, that's expected: 128+n with n=9 (SIGKILL) makes 137. I assume an OOM kill is just an ordinary SIGKILL, which would explain the 137 in the OOM case.

@Luap99
Member

Luap99 commented Apr 24, 2023

A SIGKILL process, exits with 137

No, it does not. A process killed by a signal has no exit code at all. The parent reaps the child and gets the number of the signal instead of an exit code. It is the shell that returns 128 + the signal number back to us; it is by no means required to do that.
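This is easy to see from a shell: the kernel reports only the terminating signal through wait(2), and it is bash that synthesizes the 128+n status. A minimal bash sketch:

```shell
# bash maps "child terminated by signal n" to exit status 128 + n.
sleep 30 &
pid=$!
kill -KILL "$pid"     # terminate the child with signal 9
wait "$pid"
echo "status: $?"     # 137 = 128 + 9
```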

@rhatdan
Member

rhatdan commented Apr 24, 2023

Well, podman is catching the exit status and reporting it back via its own exit code.

@rhatdan
Member

rhatdan commented Apr 24, 2023

$ podman run --entrypoint sleep alpine 100; echo $?
137

Just wanted to make sure this was not the internal sh doing this.

@rhatdan
Member

rhatdan commented Apr 24, 2023

This looks interesting:

./vendor/github.com/containers/storage/pkg/unshare/unshare_linux.go: exit(int(waitStatus.Signal()) + 128)

@Luap99
Member

Luap99 commented Apr 24, 2023

Yes, it is common practice to return 128 + the signal number, but it is by no means enforced by the kernel APIs.

@rhatdan
Member

rhatdan commented Apr 24, 2023

Bottom line: an exit code of 137 should be expected from containers that ignore SIGTERM, since they end up killed with SIGKILL.

edsantiago added a commit to edsantiago/libpod that referenced this issue Apr 24, 2023
Having a container spin-wait on a /stop file, then exit, is
unsafe: 'podman exec $ctr touch /stop' can get sucked into
container cleanup before the exec terminates, resulting in
the podman-exec failing and hence the test failing.

Most existing instances of this pattern are unnecessary.
Replace those with just 'podman rm -f'.

When necessary, use a variety of safer alternatives.

Re-Closes: containers#10825 (already closed; this addresses remaining cases)

Signed-off-by: Ed Santiago <[email protected]>
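One shape such a safer alternative could take (a hypothetical sketch, not the exact pattern from the PR): have the waiter block on a FIFO instead of spin-waiting on a file, so stopping is an explicit handshake rather than a poll racing the waiter's exit:

```shell
# Hypothetical alternative to the /stop spin-wait: a FIFO handshake.
# The "container" blocks reading the FIFO; the "stopper" writes to it.
fifo=$(mktemp -u)
mkfifo "$fifo"
( read -r _ < "$fifo"; echo "container: stopping" ) &
waiter=$!
echo stop > "$fifo"    # open blocks until the reader side is open
rc=$?
wait "$waiter"
rm -f "$fifo"
echo "stopper rc: $rc" # 0: the write completed as part of the handshake
```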
@github-actions github-actions bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Aug 26, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 26, 2023