podman exec sometimes exits 137 #10825
Is this latest code, or a released version?
Sorry. The above is podman-3.3.0-0.20.dev.git599b7d7.fc35 but I've also seen it in f34 (bodhi-testing rpms). I do not see any instances in actual CI.
Alright. I made a pretty substantial change to how exit codes are retrieved in main, but it has not reached F34 yet, so that suggests it was not caused by my changes.
https://storage.googleapis.com/cirrus-ci-6707778565701632-fcae48/artifacts/containers/podman/6094258743541760/html/sys-remote-ubuntu-2104-root-host.log.html
Interesting - yes, that's the same symptom, and now it's in CI. (FWIW I've started preserving CI logs of flakes, and this is the only instance of this particular flake I have on file.)
Another one: sys podman fedora-33 root host in #10880
A friendly reminder that this issue had no activity for 30 days. |
@flouthoc PTAL
@edsantiago are we experiencing this issue anymore? Could you point me to any more such cases? Maybe the root cause is the underlying infra running out of memory; I'll try recreating this.
I was able to reproduce the failure using the recipe in comment 0. It took about 45 minutes (I didn't add timestamps, I just checked that window periodically). This was podman-3.3.0-0.26.rc1.fc35 on kernel 5.13.0-0.rc7.20210625git44db63d1ad8d.55.fc35. I don't see any new instances in CI, though, and I don't regularly monitor bodhi failures. EDIT: in the (unlikely) case that the error log is useful:
A friendly reminder that this issue had no activity for 30 days. |
Exit code 137 seems to mean that the process was killed by SIGKILL. (Or Google mentions that it might be the OOM killer, but that's probably the same thing.)
Not sure if this is really frequent in CI, as @edsantiago mentioned, but I'll try to recreate this.
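For reference on the 137 = SIGKILL point, a minimal plain-shell sketch (no podman involved) of where the number comes from: SIGKILL is signal 9, and the shell reports 128 + the signal number for a child killed by a signal.

```bash
# Kill a background process with SIGKILL and check the status the shell
# reports for it: 128 + 9 = 137.
sleep 100 &
pid=$!
kill -KILL "$pid"
wait "$pid"
echo $?   # prints 137
```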
Seen September 20 in ubuntu-2104-rootless. No, it's not frequent, but it's still out there. |
A friendly reminder that this issue had no activity for 30 days. |
sys: podman system df - with active containers and volumes
sys: sdnotify : container
A friendly reminder that this issue had no activity for 30 days. |
My cirrus flake logs show no further instances of this since the 10-18 one. Closing; I'll reopen if it becomes a problem again. |
Aaaand, reopening. [sys] 235 sdnotify : container
I created a new
Cc: @mheon
Since 137 indicates the process was killed by the OOM killer, I am not sure how we solve this problem. In testing, @Luap99's cleanup of leaks might potentially help.
Assuming the memory reporting in the Cirrus tasks is accurate, these tests have plenty of free memory, so the OOM killer should not be the culprit.
Could it be a cgroup memory limit triggering it? If tools are getting 137 exit codes, is there any cause other than an OOM kill? Google seems to point at it.
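For comparison, a minimal sketch of the OOM-kill flavor of 137 (an illustration only; it assumes an alpine image is available and that cgroup memory limits are enforceable in the environment, e.g. cgroup v2 with the memory controller delegated when rootless):

```bash
# Run a memory hog under a hard memory+swap limit; the kernel OOM killer
# SIGKILLs it, so podman reports the container exit code as 137.
podman run --rm --memory=10m --memory-swap=10m alpine sh -c 'tail /dev/zero'
echo $?   # expected: 137 if the process was OOM-killed
```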
Googling "exit code 137", the first response is:
The memory thing could totally be a red herring. Just kill a running container:
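A minimal sketch of that kind of demonstration (assuming an alpine image): a plain SIGKILL of a running container already yields 137, with no OOM involvement.

```bash
# Start a detached container, SIGKILL its main process, and check the
# exit code podman records: 128 + 9 = 137.
podman run -d --name demo alpine sleep 100
podman kill demo       # default signal is SIGKILL
podman wait demo       # prints the recorded exit code: 137
podman rm demo
```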
What I see in the journal is that the container dies during the 2nd exec. I do not have a good explanation yet for why, though.
Could this be a very simple race with the cleanup of the exiting container?
Very good thinking, @Luap99. That indeed is a nice explanation of what might be happening. It has also happened in older versions of this very test and always on the
The `exec` session sometimes exits with 137 as the exec session races with the cleanup process of the exiting container. Fix the flake by running a detached exec session.

Fixes: containers#10825

Signed-off-by: Valentin Rothberg <[email protected]>
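A hedged sketch of the pattern that commit describes (not the actual test code; the /stop file name and the image are placeholders):

```bash
# The container spin-waits and exits as soon as /stop appears.
ctr=$(podman run -d alpine sh -c 'while ! test -e /stop; do sleep 0.1; done')

# Racy: a foreground exec that creates /stop can get caught up in the
# container's exit and cleanup, and itself come back with 137:
#   podman exec "$ctr" touch /stop

# The fix described above: detach the exec session so it no longer races
# with the container's cleanup.
podman exec -d "$ctr" touch /stop
podman wait "$ctr"
```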
Opened #18319
Could that be a kernel bug? |
From `man timeout`: looks like this is expected behavior. Sadly, I never knew this.
Yes, that's expected. |
No, it does not. If a process is killed by a signal, it has no exit code at all. The parent reaps the child and gets the signal number instead of an exit code. It is the shell that returns 128 + the signal number back to us; it is by no means required to do that.
Well, podman is catching that exit status and reporting it back via its own exit code.
$ podman run --entrypoint sleep alpine 100; echo $?
Just wanted to make sure this was not the internal sh doing this.
This looks interesting: ./vendor/github.com/containers/storage/pkg/unshare/unshare_linux.go: exit(int(waitStatus.Signal()) + 128)
Yes, it is common practice to return 128 + the signal number, but it is by no means enforced by the kernel APIs.
Bottom line: containers that ignore SIGTERM should be expected to exit with code 137, because they end up being SIGKILLed.
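To illustrate that bottom line, a small sketch (assuming an alpine image; `podman stop` sends SIGTERM, waits out the grace period, then falls back to SIGKILL):

```bash
# A container whose main process ignores SIGTERM.
podman run -d --name stubborn alpine sh -c 'trap "" TERM; sleep 1000'

# With a 1-second grace period podman ends up sending SIGKILL, so the
# recorded exit code is 128 + 9 = 137.
podman stop -t 1 stubborn
podman wait stubborn   # prints 137
podman rm stubborn
```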
Having a container spin-wait on a /stop file, then exit, is unsafe: 'podman exec $ctr touch /stop' can get sucked into container cleanup before the exec terminates, resulting in the podman exec failing and hence the test failing. Most existing instances of this pattern are unnecessary; replace those with just 'podman rm -f'. Where the pattern is necessary, use a variety of safer alternatives.

Re-Closes: containers#10825 (already closed; this addresses remaining cases)

Signed-off-by: Ed Santiago <[email protected]>
I see this once in a while. Too infrequently to have a good reproducer. Only rootless so far: