
podman exec sometimes exits 137 #10825

Closed
edsantiago opened this issue Jun 30, 2021 · 58 comments · Fixed by #18319 or #18323
Labels
bugweek flakes Flakes from Continuous Integration locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. rootless

Comments

@edsantiago
Member

I see this once in a while. Too infrequently to have a good reproducer. Only rootless so far:

$ while :;do bats --filter container /usr/share/podman/test/system/*sdnot*bats || break;done
...mostly works... then:
 ✗ sdnotify : container
   (from function `die' in file /usr/share/podman/test/system/helpers.bash, line 413,
    from function `run_podman' in file /usr/share/podman/test/system/helpers.bash, line 221,
    in test file /usr/share/podman/test/system/260-sdnotify.bats, line 163)
     `run_podman exec $cid touch /stop' failed with status 137
   $ podman rm --all --force
   $ podman ps --all --external --format {{.ID}} {{.Names}}
   $ podman images --all --format {{.Repository}}:{{.Tag}} {{.ID}}
   quay.io/libpod/testimage:20210610 9f9ec7f2fdef
   $ podman pull quay.io/libpod/fedora:31
   Trying to pull quay.io/libpod/fedora:31...
   Getting image source signatures
   Copying blob sha256:c28ace6b0c4ae099f6f81091731bdf41d9771d28bad96ae4a3507fe950560930
   Copying config sha256:a7a37f74ff864eec55891b64ad360d07020827486e30a92ea81d16459645b26a
   Writing manifest to image destination
   Storing signatures
   a7a37f74ff864eec55891b64ad360d07020827486e30a92ea81d16459645b26a
   $ podman run -d --sdnotify=container quay.io/libpod/fedora:31 sh -c printenv NOTIFY_SOCKET;echo READY;systemd-notify --ready;while ! test -f /stop;do sleep 0.1;done
   a98e2e8a856772144e2297d4dc3b2d21ffa7ab5ff6088a998168d08add9264f4
   $ podman logs a98e2e8a856772144e2297d4dc3b2d21ffa7ab5ff6088a998168d08add9264f4
   /tmp/podman_bats.HrMDlg/container.sock/notify
   READY
   $ podman logs a98e2e8a856772144e2297d4dc3b2d21ffa7ab5ff6088a998168d08add9264f4
   /tmp/podman_bats.HrMDlg/container.sock/notify
   READY
   $ podman exec a98e2e8a856772144e2297d4dc3b2d21ffa7ab5ff6088a998168d08add9264f4 touch /stop
   [ rc=137 (** EXPECTED 0 **) ]
   #/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
   #| FAIL: exit code is 137; expected 0
   #\^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   # [teardown]
   $ podman pod rm --all --force
   $ podman rm --all --force
   a98e2e8a856772144e2297d4dc3b2d21ffa7ab5ff6088a998168d08add9264f4

1 test, 1 failure
@edsantiago edsantiago added flakes Flakes from Continuous Integration rootless labels Jun 30, 2021
@mheon
Member

mheon commented Jun 30, 2021

Is this latest code, or a released version?

@edsantiago
Member Author

Sorry. The above is podman-3.3.0-0.20.dev.git599b7d7.fc35 but I've also seen it in f34 (bodhi-testing rpms). I do not see any instances in actual CI.

@mheon
Member

mheon commented Jun 30, 2021

Alright. I made a pretty substantial change to how exit codes are retrieved in main, but it has not reached F34 yet so that suggests it was not my changes.

@Luap99
Member

Luap99 commented Jul 2, 2021

@edsantiago
Member Author

Interesting - yes, that's the same symptom, and now it's in CI. (FWIW I've started preserving CI logs of flakes, and this is the only instance of the touch /stop.*137 error I can find).

@edsantiago
Member Author

Another one: sys podman fedora-33 root host in #10880

@github-actions

A friendly reminder that this issue had no activity for 30 days.

@rhatdan
Member

rhatdan commented Aug 12, 2021

@flouthoc PTAL

@flouthoc
Collaborator

@edsantiago are we still seeing this issue? Could you point me to any more such cases? Maybe the root cause is the underlying infra running out of memory; I'll try to recreate this.

@edsantiago
Member Author

edsantiago commented Aug 16, 2021

I was able to reproduce the failure using the recipe in comment 0. It took about 45 minutes (I didn't add timestamps, I just checked that window periodically). podman-3.3.0-0.26.rc1.fc35 on 5.13.0-0.rc7.20210625git44db63d1ad8d.55.fc35

I don't see any new instances in CI, though; and I don't regularly monitor bodhi failures.

EDIT: in the (unlikely) case that error log is useful:

 ✗ sdnotify : container
   (from function `die' in file /usr/share/podman/test/system/helpers.bash, line 431,
    from function `run_podman' in file /usr/share/podman/test/system/helpers.bash, line 221,
    in test file /usr/share/podman/test/system/260-sdnotify.bats, line 163)
     `run_podman exec $cid touch /stop' failed with status 137
   $ podman rm --all --force
   $ podman ps --all --external --format {{.ID}} {{.Names}}
   $ podman images --all --format {{.Repository}}:{{.Tag}} {{.ID}}
   quay.io/libpod/testimage:20210610 9f9ec7f2fdef
   $ podman pull quay.io/libpod/fedora:31
   Trying to pull quay.io/libpod/fedora:31...
   Getting image source signatures
   Copying blob sha256:c28ace6b0c4ae099f6f81091731bdf41d9771d28bad96ae4a3507fe950560930
   Copying config sha256:a7a37f74ff864eec55891b64ad360d07020827486e30a92ea81d16459645b26a
   Writing manifest to image destination
   Storing signatures
   a7a37f74ff864eec55891b64ad360d07020827486e30a92ea81d16459645b26a
   $ podman run -d --sdnotify=container quay.io/libpod/fedora:31 sh -c printenv NOTIFY_SOCKET;echo READY;systemd-notify --ready;while ! test -f /stop;do sleep 0.1;done
   52cff7db8ddd73a7fef9dc200d21a6e853f40b2e08c68273f6cc12b767b88562
   $ podman logs 52cff7db8ddd73a7fef9dc200d21a6e853f40b2e08c68273f6cc12b767b88562
   /tmp/podman_bats.njUKUq/container.sock/notify
   READY
   $ podman logs 52cff7db8ddd73a7fef9dc200d21a6e853f40b2e08c68273f6cc12b767b88562
   /tmp/podman_bats.njUKUq/container.sock/notify
   READY
   $ podman exec 52cff7db8ddd73a7fef9dc200d21a6e853f40b2e08c68273f6cc12b767b88562 touch /stop
   [ rc=137 (** EXPECTED 0 **) ]
   #/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
   #| FAIL: exit code is 137; expected 0
   #\^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   # [teardown]
   $ podman pod rm --all --force
   $ podman rm --all --force
   52cff7db8ddd73a7fef9dc200d21a6e853f40b2e08c68273f6cc12b767b88562

@github-actions

A friendly reminder that this issue had no activity for 30 days.

@rhatdan
Member

rhatdan commented Sep 16, 2021

Exit code 137 seems to mean that the process was killed by SIGKILL. (Google mentions that it might be the oom-killer, but that is effectively the same thing.)
Something is killing the exec process. Likely candidates are Podman, systemd, or the kernel.

@flouthoc
Collaborator

Not sure if this is really frequent in CI as @edsantiago mentioned, but I'll try to recreate it.

@edsantiago
Member Author

Seen September 20 in ubuntu-2104-rootless. No, it's not frequent, but it's still out there.

@github-actions

A friendly reminder that this issue had no activity for 30 days.

@edsantiago
Member Author

sys: podman system df - with active containers and volumes

sys: sdnotify : container

@github-actions

A friendly reminder that this issue had no activity for 30 days.

@edsantiago
Member Author

My cirrus flake logs show no further instances of this since the 10-18 one. Closing; I'll reopen if it becomes a problem again.

@edsantiago
Member Author

Aaaand, reopening.

[sys] 235 sdnotify : container

@vrothberg
Member

@containers/podman-maintainers since this is now being seen in the Real World, maybe it's a good candidate for Bug Week?

I created a new bugweek label. Feel free to label more issues.

@vrothberg
Member

Cc: @mheon

@edsantiago edsantiago changed the title [placeholder] podman exec sometimes exits 137 podman exec sometimes exits 137 Apr 20, 2023
@rhatdan
Member

rhatdan commented Apr 20, 2023

Since 137 indicates the process was killed by the OOM killer, I am not sure how we solve this problem. In testing, @Luap99's cleanup of leaks might potentially help.

@Luap99
Member

Luap99 commented Apr 21, 2023

Assuming the memory reporting in the cirrus tasks is accurate, these tests have plenty of free memory, so the oom killer shouldn't do anything here.

@rhatdan
Member

rhatdan commented Apr 21, 2023

Could it be cgroups memory limits triggering it? If tools are getting 137 exit codes, is there any way to get that other than an OOM kill? Google seems to point at it.

@rhatdan
Member

rhatdan commented Apr 21, 2023

Google exit code 137

First response is:

Exit code 137 means a container or pod is trying to use more memory than it's allowed. The process gets terminated to prevent memory usage ballooning indefinitely, which could cause your host system to become unstable.

@vrothberg
Member

The memory thing could totally be a red herring. Just kill a running container:

$ podman run alpine sleep infinity
$ echo $?                         
137                                                  
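The same status can be produced with no podman or container at all; a minimal sketch in plain shell:

```shell
# A process that dies from SIGKILL has no exit code of its own;
# the calling shell reports the conventional 128 + 9 = 137 instead.
sh -c 'kill -KILL $$'   # the child kills itself with signal 9
echo "exit status: $?"  # prints "exit status: 137"
```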

@vrothberg
Member

What I see in the journal is that the container dies during the 2nd exec. I do not have a good explanation yet for why, though. The kube play session is run via `timeout --foreground -v --kill=10 60 podman &` but the test runs for a second, so way below 1 minute.

@Luap99
Member

Luap99 commented Apr 24, 2023

Could this be a very simple race with the touch /stop logic? The main container waits for the file and then exits. When the container exits, it kills all exec sessions AFAIK. So there is a very tiny window where touch creates the file but gets interrupted before it exits 0: the main container process sees the file and exits, causing podman to kill the exec session.
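If that theory is right, the failure mode can be imitated without podman; in this hypothetical stand-in, the "exec session" is SIGKILLed right after creating the stop file but before it can exit 0, so the caller sees 137 even though the file exists:

```shell
# Hypothetical stand-in for the race (no podman involved).
stop=$(mktemp -u)                        # an unused temp path as the stop file
sh -c "touch '$stop'; kill -KILL \$\$" & # "exec" creates the file, then is killed
wait $!
echo "exec session status: $?"           # 137, despite the file existing
test -f "$stop" && echo "stop file was created anyway"
rm -f "$stop"
```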

@vrothberg
Member

Very good thinking, @Luap99. That indeed is a nice explanation of what might be happening. It has also happened in older versions of this very test and always on the touch /stop exec.

vrothberg added a commit to vrothberg/libpod that referenced this issue Apr 24, 2023
The `exec` session sometimes exits with 137 because the exec session races
with the cleanup process of the exiting container.  Fix the flake by
running a detached exec session.

Fixes: containers#10825
Signed-off-by: Valentin Rothberg <[email protected]>
@vrothberg
Member

Opened #18319

@rhatdan
Member

rhatdan commented Apr 24, 2023

$ podman run alpine sleep infinity
$ echo $?                         
137    

Could that be a kernel bug?

@rhatdan
Member

rhatdan commented Apr 24, 2023

man timeout
...
137 if COMMAND (or timeout itself) is sent the KILL (9) signal (128+9)

Looks like this is expected behavior. Sadly, I never knew this:
a process killed by SIGKILL exits with 137.

@vrothberg
Member

Yes, that's expected: 128+n with n=9 (SIGKILL) makes 137. I assume an OOM kill is just an ordinary SIGKILL, which would explain the 137 in the OOM case.

@Luap99
Member

Luap99 commented Apr 24, 2023

A SIGKILL process, exits with 137

No, it does not. A process killed by a signal has no exit code at all. The parent reaps the child and gets the number of the signal instead of an exit code. It is the shell that returns 128 + the signal number back to us; it is by no means required to do that.
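This is easy to see from a shell: the kernel reports only the terminating signal through wait(2), and it is bash that synthesizes the 128+n status. A minimal bash sketch:

```shell
# bash maps "child terminated by signal n" to exit status 128 + n.
sleep 30 &
pid=$!
kill -KILL "$pid"     # terminate the child with signal 9
wait "$pid"
echo "status: $?"     # 137 = 128 + 9
```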

@rhatdan
Member

rhatdan commented Apr 24, 2023

Well, podman is catching the exit status and reporting it back via its own exit code.

@rhatdan
Member

rhatdan commented Apr 24, 2023

$ podman run --entrypoint sleep alpine 100; echo $?
137

Just wanted to make sure this was not the internal sh doing this.

@rhatdan
Member

rhatdan commented Apr 24, 2023

This looks interesting:

./vendor/github.com/containers/storage/pkg/unshare/unshare_linux.go: exit(int(waitStatus.Signal()) + 128)

@Luap99
Member

Luap99 commented Apr 24, 2023

Yes, it is common practice to return 128 + the signal number, but it is by no means enforced by the kernel APIs.

@rhatdan
Member

rhatdan commented Apr 24, 2023

Bottom line: an exit code of 137 should be expected from containers that ignore SIGTERM, since they end up killed with SIGKILL.

edsantiago added a commit to edsantiago/libpod that referenced this issue Apr 24, 2023
Having a container spin-wait on a /stop file, then exit, is
unsafe: 'podman exec $ctr touch /stop' can get sucked into
container cleanup before the exec terminates, resulting in
the podman-exec failing and hence the test failing.

Most existing instances of this pattern are unnecessary.
Replace those with just 'podman rm -f'.

When necessary, use a variety of safer alternatives.

Re-Closes: containers#10825 (already closed; this addresses remaining cases)

Signed-off-by: Ed Santiago <[email protected]>
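One shape such a safer alternative could take (a hypothetical sketch, not the exact pattern from the PR): have the waiter block on a FIFO instead of spin-waiting on a file, so stopping is an explicit handshake rather than a poll racing the waiter's exit:

```shell
# Hypothetical alternative to the /stop spin-wait: a FIFO handshake.
# The "container" blocks reading the FIFO; the "stopper" writes to it.
fifo=$(mktemp -u)
mkfifo "$fifo"
( read -r _ < "$fifo"; echo "container: stopping" ) &
waiter=$!
echo stop > "$fifo"    # open blocks until the reader side is open
rc=$?
wait "$waiter"
rm -f "$fifo"
echo "stopper rc: $rc" # 0: the write completed as part of the handshake
```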
@github-actions github-actions bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Aug 26, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 26, 2023