
podman run --rm does not work when lots of container processes are killed at the same time. #7051

Closed
coldbloodx opened this issue Jul 22, 2020 · 7 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. stale-issue

Comments

@coldbloodx

/kind bug

Description

Steps to reproduce the issue:

1. Start some containers at the same time.
2. Kill these containers at the same time.
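For reference, a minimal reproducer sketch. The container names (`repro-N`), the image tag, and the helper name are illustrative, not from the original report; the point is simply N `--rm` containers started and then killed at roughly the same time:

```python
# Hypothetical reproducer sketch: build the podman commands used to
# start N --rm containers and then kill them all at the same time.
# Names and image are illustrative, not from the original report.

def build_repro_commands(n, image="localhost/centos:latest"):
    """Return (start_cmds, kill_cmds) for n short-lived containers."""
    start_cmds = [
        "podman run --rm -d --name repro-%d %s sleep 600" % (i, image)
        for i in range(n)
    ]
    # Issuing these concurrently (one process per command) is what
    # appears to trigger the cleanup race.
    kill_cmds = ["podman kill repro-%d" % i for i in range(n)]
    return start_cmds, kill_cmds
```

In practice each kill command would be launched in its own subprocess (e.g. `subprocess.Popen`) so the kills land concurrently rather than one after another.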

Describe the results you received:
Some of the containers are not removed and are left in an incorrect state.

Describe the results you expected:
All containers should be removed, since --rm was specified.

Additional information you deem important (e.g. issue happens only occasionally):

**Output of `podman version`:**

```
[boliu@lsf1x125 ~]$ rpm -qa |grep podman
podman-1.6.4-11.module+el8.2.0+6368+cf16aa14.x86_64
podman-docker-1.6.4-11.module+el8.2.0+6368+cf16aa14.noarch
```

**Output of `podman info --debug`:**

```
[boliu@lsf1x125 ~]$ podman info --debug
debug:
  compiler: gc
  git commit: ""
  go version: go1.13.4
  podman version: 1.6.4
host:
  BuildahVersion: 1.12.0-dev
  CgroupVersion: v1
  Conmon:
    package: conmon-2.0.6-1.module+el8.2.0+6368+cf16aa14.x86_64
    path: /usr/bin/conmon
    version: 'conmon version 2.0.6, commit: 9adfe850ef954416ea5dd0438d428a60f2139473'
  Distribution:
    distribution: '"rhel"'
    version: "8.1"
  IDMappings:
    gidmap:
    - container_id: 0
      host_id: 10007
      size: 1
    - container_id: 1
      host_id: 100000
      size: 65536
    uidmap:
    - container_id: 0
      host_id: 34040
      size: 1
    - container_id: 1
      host_id: 100000
      size: 65536
  MemFree: 4692107264
  MemTotal: 8189198336
  OCIRuntime:
    name: runc
    package: runc-1.0.0-65.rc10.module+el8.2.0+6368+cf16aa14.x86_64
    path: /usr/bin/runc
    version: 'runc version spec: 1.0.1-dev'
  SwapFree: 4287361024
  SwapTotal: 4294963200
  arch: amd64
  cpus: 4
  eventlogger: journald
  hostname: lsf1x125
  kernel: 4.18.0-193.el8.x86_64
  os: linux
  rootless: true
  slirp4netns:
    Executable: /usr/bin/slirp4netns
    Package: slirp4netns-0.4.2-3.git21fdece.module+el8.2.0+6368+cf16aa14.x86_64
    Version: |-
      slirp4netns version 0.4.2+dev
      commit: 21fdece2737dc24ffa3f01a341b8a6854f8b13b4
  uptime: 55h 51m 24.3s (Approximately 2.29 days)
registries:
  blocked: null
  insecure: null
  search:
  - docker.io
store:
  ConfigFile: /home/boliu/.config/containers/storage.conf
  ContainerStore:
    number: 89
  GraphDriverName: overlay
  GraphOptions:
    overlay.mount_program:
      Executable: /usr/bin/fuse-overlayfs
      Package: fuse-overlayfs-0.7.2-5.module+el8.2.0+6368+cf16aa14.x86_64
      Version: |-
        fuse-overlayfs: version 0.7.2
        FUSE library version 3.2.1
        using FUSE kernel interface version 7.26
  GraphRoot: /opt/boliu/podman/containers/storage
  GraphStatus:
    Backing Filesystem: xfs
    Native Overlay Diff: "false"
    Supports d_type: "true"
    Using metacopy: "false"
  ImageStore:
    number: 1
  RunRoot: /tmp/run-34040
  VolumePath: /opt/boliu/podman/containers/storage/volumes
```

**Package info (e.g. output of `rpm -q podman` or `apt list podman`):**

```
[boliu@lsf1x125 ~]$ rpm -qa |grep podman
podman-1.6.4-11.module+el8.2.0+6368+cf16aa14.x86_64
podman-docker-1.6.4-11.module+el8.2.0+6368+cf16aa14.noarch
```

Additional environment details (AWS, VirtualBox, physical, etc.):
virtual machine on vmware host.

@openshift-ci-robot openshift-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Jul 22, 2020
@rhatdan
Member

rhatdan commented Jul 22, 2020

You should be able to get a newer version of podman for RHEL8.2.1. Try to update the podman package and report if the issue is fixed.

RHEL8.2.1 was released yesterday, I believe.

@mheon
Member

mheon commented Jul 22, 2020

If this does persist in 1.9, a proper reproducer would be greatly appreciated - how many containers are involved, does the command in question matter, what command(s) are used to remove them, etc.

@coldbloodx
Author

@rhatdan @mheon,

The overall workflow is a little complicated.

We start an execution daemon (execution server -> res) on the host. The execution daemon starts the podman container through a Python script (container starter -> cstarter). When we kill the container, the execution daemon (or some other daemon) calls another Python script to either:

a. send signals to the running process in the container, or
b. run 'podman kill '

to stop/kill the container.

But both of them hit the issue above. Some of the containers are killed and cleaned up successfully. Some of them remain in various states: "Created", "Up" or "Exited".

Here is an example:

```
boliu@lsf1x125[conf]:$docker ps -a; lsrun -m " lsf1x127" docker ps -a; lsrun -m " lsf1x126" docker ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
c7bb38d9137a localhost/centos:latest /bin/sh -c sleep ... 22 hours ago Exited (-1) 12 hours ago a.task.1
029fa9794390 localhost/centos:latest /bin/sh -c sleep ... 22 hours ago Up 22 hours ago a.task.2
23ca787023e5 localhost/centos:latest /bin/sh -c sleep ... 22 hours ago Exited (137) 22 hours ago a.task.6
268a07fe707e localhost/centos:latest /bin/sh -c sleep ... 22 hours ago Exited (137) 22 hours ago a.task.5
```

E.g. for 029fa9794390, the status is "Up", but the actual state is not correct:

```
boliu@lsf1x125[conf]:$podman exec -it 029fa9794390 /bin/bash
Error: exec failed: cannot exec a container that has stopped: OCI runtime error

boliu@lsf1x125[conf]:$podman stop 029fa9794390
Error: timed out waiting for file /tmp/run-34040/libpod/tmp/exits/029fa97943905e9e9094bb2bb161718b29e10b48adb830888fe34cca8fae319c: internal libpod error

boliu@lsf1x125[conf]:$podman rm 029fa9794390
029fa97943905e9e9094bb2bb161718b29e10b48adb830888fe34cca8fae319c
```

In our container controller script log, we can see messages like the one below, e.g. for container:

```
/sys/fs/cgroup/systemd/user.slice/user-34040.slice/user@34040.service/user.slice/podman-1733134.scope/138ead2c5272d3b33d30ef962086cc144f2cfa37c1b54aa3b8d218c435cdfe21/cgroup.procs
```

The running container process has been killed, but the container remains. When I try to remove the container with "podman stop " and "podman rm -f ", it gives me the error below:

```
2020-07-22 03:40:08,664 lsf-docker[1734843] DEBUG : err:
Error: could not get runtime: error generating default config from memory: cannot mkdir /run/user/0/libpod: mkdir /run/user/0/libpod: permission denied
```

But if I run "podman stop " and "podman rm -f " manually in a console, the commands can remove the container, although they report an internal libpod error.

For this error, I can confirm that the environment variable XDG_RUNTIME_DIR is unset, the uid/gid is not 0, and HOME is set to the container owner's home, but the error still occurs. Here is my code:

```python
if uid == 0:
    logger.log("uid==0, set uid to submitter uid/gid")
    # Drop root's runtime dir so podman does not try /run/user/0
    if os.environ.get("XDG_RUNTIME_DIR") is not None:
        del os.environ["XDG_RUNTIME_DIR"]

    import pwd
    username = os.environ.get("SUB_USER")
    if username is None:
        return

    try:
        # getpwnam raises KeyError for unknown users, it never returns None
        passwd = pwd.getpwnam(username)
    except KeyError:
        return

    logger.log("home dir: %s" % os.environ.get("HOME"))
    # use os.environ (not os.putenv) so the change is visible to this
    # process and inherited by children
    os.environ["HOME"] = passwd.pw_dir

    os.setgid(passwd.pw_gid)
    os.setuid(passwd.pw_uid)
    logger.log("uid==%s, after set uid" % os.getuid())

    cmd = "podman stop %s; podman rm -f %s" % (uuid, uuid)
    ....
```
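One thing that can produce the `/run/user/0/libpod` error is the demoted process still inheriting a stale (or missing) runtime dir after `setuid()`. A minimal sketch of computing the environment for the demoted process, assuming the target user's runtime dir is the conventional `/run/user/<uid>` (the helper name `env_for_user` is mine, not from the script above):

```python
import pwd

def env_for_user(username, base_env):
    """Return a copy of base_env adjusted for running podman as username.

    After setuid()/setgid(), HOME must be the target user's home and
    XDG_RUNTIME_DIR must point at the target user's runtime dir;
    otherwise rootless podman may still try root's paths and fail
    with "permission denied".
    """
    passwd = pwd.getpwnam(username)  # raises KeyError if unknown
    env = dict(base_env)
    env["HOME"] = passwd.pw_dir
    # Conventional location; systemd creates it when the user logs in.
    env["XDG_RUNTIME_DIR"] = "/run/user/%d" % passwd.pw_uid
    return env
```

The resulting dict would then be passed to the podman invocation, e.g. `subprocess.run(["podman", "stop", cid], env=env)`, so the child sees a consistent environment regardless of what the daemon inherited.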

any input?

Regards,
Leo.

@coldbloodx
Author

BTW, I found that podman core dumps sometimes; here is the backtrace:

```
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".

warning: Loadable section ".note.gnu.property" outside of ELF segments

warning: Loadable section ".note.gnu.property" outside of ELF segments

warning: Loadable section ".note.gnu.property" outside of ELF segments

warning: Loadable section ".note.gnu.property" outside of ELF segments

warning: Loadable section ".note.gnu.property" outside of ELF segments

warning: Loadable section ".note.gnu.property" outside of ELF segments
Core was generated by `/usr/bin/podman run --cidfile /tmp/lsf.ulaworks008.job.3844.1595425427.task.2.1'.
Program terminated with signal SIGABRT, Aborted.
#0 0x000055cf971df4c1 in runtime.raise ()
[Current thread is 1 (Thread 0x14a70c974700 (LWP 26050))]
Missing separate debuginfos, use: yum debuginfo-install podman-1.6.4-10.module_el8.2.0+305+5e198a41.x86_64
(gdb) bt
#0 0x000055cf971df4c1 in runtime.raise ()
#1 0x000055cf971c4f2b in runtime.dieFromSignal ()
#2 0x000055cf00000006 in ?? ()
#3 0xffffffffffffffff in ?? ()
#4 0x000000c0001dd9a8 in ?? ()
#5 0x000055cf971c54bd in runtime.sigfwdgo ()
#6 0x000000c000000006 in ?? ()
#7 0x0000000000000000 in ?? ()
```
Leo.

@mheon
Member

mheon commented Jul 23, 2020

Looks like a sig-proxy race: sending a signal to a container that's already dead can lead to a panic. We've fixed several races there since 1.6.4, so I'd be interested to see if it reproduces on 1.9.3 and/or master.

@github-actions

A friendly reminder that this issue had no activity for 30 days.

@rhatdan
Copy link
Member

rhatdan commented Aug 24, 2020

Reopen if this is not fixed in the upstream branch.

@rhatdan rhatdan closed this as completed Aug 24, 2020
@github-actions github-actions bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Sep 22, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 22, 2023