
podman run --rm does not work when lots of container processes are killed at the same time. #7051

Closed
coldbloodx opened this issue Jul 22, 2020 · 7 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. stale-issue

Comments

@coldbloodx

/kind bug

Description

Steps to reproduce the issue:

1. Start some containers at the same time.
2. Kill these containers at the same time.
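For reference, a minimal reproducer sketch. The container names (`repro-N`), the image tag, and the helper name are illustrative, not from the original report; the point is simply N `--rm` containers started and then killed at roughly the same time:

```python
# Hypothetical reproducer sketch: build the podman commands used to
# start N --rm containers and then kill them all at the same time.
# Names and image are illustrative, not from the original report.

def build_repro_commands(n, image="localhost/centos:latest"):
    """Return (start_cmds, kill_cmds) for n short-lived containers."""
    start_cmds = [
        "podman run --rm -d --name repro-%d %s sleep 600" % (i, image)
        for i in range(n)
    ]
    # Issuing these concurrently (one process per command) is what
    # appears to trigger the cleanup race.
    kill_cmds = ["podman kill repro-%d" % i for i in range(n)]
    return start_cmds, kill_cmds
```

In practice each kill command would be launched in its own subprocess (e.g. `subprocess.Popen`) so the kills land concurrently rather than one after another.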

Describe the results you received:
Some of the containers are not removed and are left in an incorrect state.

Describe the results you expected:
All containers should be removed, since --rm was specified.

Additional information you deem important (e.g. issue happens only occasionally):

**Output of `podman version`:**

```
[boliu@lsf1x125 ~]$ rpm -qa |grep podman
podman-1.6.4-11.module+el8.2.0+6368+cf16aa14.x86_64
podman-docker-1.6.4-11.module+el8.2.0+6368+cf16aa14.noarch
```

**Output of `podman info --debug`:**

```
[boliu@lsf1x125 ~]$ podman info --debug
debug:
  compiler: gc
  git commit: ""
  go version: go1.13.4
  podman version: 1.6.4
host:
  BuildahVersion: 1.12.0-dev
  CgroupVersion: v1
  Conmon:
    package: conmon-2.0.6-1.module+el8.2.0+6368+cf16aa14.x86_64
    path: /usr/bin/conmon
    version: 'conmon version 2.0.6, commit: 9adfe850ef954416ea5dd0438d428a60f2139473'
  Distribution:
    distribution: '"rhel"'
    version: "8.1"
  IDMappings:
    gidmap:
    - container_id: 0
      host_id: 10007
      size: 1
    - container_id: 1
      host_id: 100000
      size: 65536
    uidmap:
    - container_id: 0
      host_id: 34040
      size: 1
    - container_id: 1
      host_id: 100000
      size: 65536
  MemFree: 4692107264
  MemTotal: 8189198336
  OCIRuntime:
    name: runc
    package: runc-1.0.0-65.rc10.module+el8.2.0+6368+cf16aa14.x86_64
    path: /usr/bin/runc
    version: 'runc version spec: 1.0.1-dev'
  SwapFree: 4287361024
  SwapTotal: 4294963200
  arch: amd64
  cpus: 4
  eventlogger: journald
  hostname: lsf1x125
  kernel: 4.18.0-193.el8.x86_64
  os: linux
  rootless: true
  slirp4netns:
    Executable: /usr/bin/slirp4netns
    Package: slirp4netns-0.4.2-3.git21fdece.module+el8.2.0+6368+cf16aa14.x86_64
    Version: |-
      slirp4netns version 0.4.2+dev
      commit: 21fdece2737dc24ffa3f01a341b8a6854f8b13b4
  uptime: 55h 51m 24.3s (Approximately 2.29 days)
registries:
  blocked: null
  insecure: null
  search:
  - docker.io
store:
  ConfigFile: /home/boliu/.config/containers/storage.conf
  ContainerStore:
    number: 89
  GraphDriverName: overlay
  GraphOptions:
    overlay.mount_program:
      Executable: /usr/bin/fuse-overlayfs
      Package: fuse-overlayfs-0.7.2-5.module+el8.2.0+6368+cf16aa14.x86_64
      Version: |-
        fuse-overlayfs: version 0.7.2
        FUSE library version 3.2.1
        using FUSE kernel interface version 7.26
  GraphRoot: /opt/boliu/podman/containers/storage
  GraphStatus:
    Backing Filesystem: xfs
    Native Overlay Diff: "false"
    Supports d_type: "true"
    Using metacopy: "false"
  ImageStore:
    number: 1
  RunRoot: /tmp/run-34040
  VolumePath: /opt/boliu/podman/containers/storage/volumes
```

**Package info (e.g. output of `rpm -q podman` or `apt list podman`):**

```
[boliu@lsf1x125 ~]$ rpm -qa |grep podman
podman-1.6.4-11.module+el8.2.0+6368+cf16aa14.x86_64
podman-docker-1.6.4-11.module+el8.2.0+6368+cf16aa14.noarch
```

Additional environment details (AWS, VirtualBox, physical, etc.):
virtual machine on vmware host.

@openshift-ci-robot openshift-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Jul 22, 2020
@rhatdan
Member

rhatdan commented Jul 22, 2020

You should be able to get a newer version of podman for RHEL8.2.1. Try to update the podman package and report if the issue is fixed.

RHEL8.2.1 was released yesterday, I believe.

@mheon
Member

mheon commented Jul 22, 2020

If this does persist in 1.9, a proper reproducer would be greatly appreciated - how many containers are involved, does the command in question matter, what command(s) are used to remove them, etc.

@coldbloodx
Author

@rhatdan @mheon,

The overall workflow is a little complicated.

We start an execution daemon (execution server -> res) on the host. The execution daemon starts the podman container through a Python script (container starter -> cstarter). When we kill the container, the execution daemon (or some other daemon) calls another Python script to either:

a. send signals to the running process in the container, or
b. run 'podman kill '

to stop/kill the container.

But both of them hit the issue above. Some of the containers are killed and cleaned up successfully. Some of them remain in various states: "Created", "Up" or "Exited".

Here is an example:

```
boliu@lsf1x125[conf]:$docker ps -a; lsrun -m " lsf1x127" docker ps -a; lsrun -m " lsf1x126" docker ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
c7bb38d9137a localhost/centos:latest /bin/sh -c sleep ... 22 hours ago Exited (-1) 12 hours ago a.task.1
029fa9794390 localhost/centos:latest /bin/sh -c sleep ... 22 hours ago Up 22 hours ago a.task.2
23ca787023e5 localhost/centos:latest /bin/sh -c sleep ... 22 hours ago Exited (137) 22 hours ago a.task.6
268a07fe707e localhost/centos:latest /bin/sh -c sleep ... 22 hours ago Exited (137) 22 hours ago a.task.5
```

E.g. for 029fa9794390, the status is "Up", but the actual state is not correct:

```
boliu@lsf1x125[conf]:$podman exec -it 029fa9794390 /bin/bash
Error: exec failed: cannot exec a container that has stopped: OCI runtime error

boliu@lsf1x125[conf]:$podman stop 029fa9794390
Error: timed out waiting for file /tmp/run-34040/libpod/tmp/exits/029fa97943905e9e9094bb2bb161718b29e10b48adb830888fe34cca8fae319c: internal libpod error

boliu@lsf1x125[conf]:$podman rm 029fa9794390
029fa97943905e9e9094bb2bb161718b29e10b48adb830888fe34cca8fae319c
```

In our container controller script log, we can see messages like the one below, e.g. for container:

```
/sys/fs/cgroup/systemd/user.slice/user-34040.slice/user@34040.service/user.slice/podman-1733134.scope/138ead2c5272d3b33d30ef962086cc144f2cfa37c1b54aa3b8d218c435cdfe21/cgroup.procs
```

The running container process has been killed, but the container remains. When I try to remove the container with "podman stop " and "podman rm -f ", it gives me the error below:

```
2020-07-22 03:40:08,664 lsf-docker[1734843] DEBUG : err:
Error: could not get runtime: error generating default config from memory: cannot mkdir /run/user/0/libpod: mkdir /run/user/0/libpod: permission denied
```

But if I run "podman stop " and "podman rm -f " manually in a console, the commands can remove the container, although they report an internal libpod error.

For this error, I can confirm that the environment variable XDG_RUNTIME_DIR is unset, the uid/gid is not 0, and HOME is set to the container owner's home, but the error still occurs. Here is my code:

```python
if uid == 0:
    logger.log("uid==0, set uid to submitter uid/gid")
    # Drop root's runtime dir so podman does not try /run/user/0
    if os.environ.get("XDG_RUNTIME_DIR") is not None:
        del os.environ["XDG_RUNTIME_DIR"]

    import pwd
    username = os.environ.get("SUB_USER")
    if username is None:
        return

    try:
        # getpwnam raises KeyError for unknown users, it never returns None
        passwd = pwd.getpwnam(username)
    except KeyError:
        return

    logger.log("home dir: %s" % os.environ.get("HOME"))
    # use os.environ (not os.putenv) so the change is visible to this
    # process and inherited by children
    os.environ["HOME"] = passwd.pw_dir

    os.setgid(passwd.pw_gid)
    os.setuid(passwd.pw_uid)
    logger.log("uid==%s, after set uid" % os.getuid())

    cmd = "podman stop %s; podman rm -f %s" % (uuid, uuid)
    ....
```
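One thing that can produce the `/run/user/0/libpod` error is the demoted process still inheriting a stale (or missing) runtime dir after `setuid()`. A minimal sketch of computing the environment for the demoted process, assuming the target user's runtime dir is the conventional `/run/user/<uid>` (the helper name `env_for_user` is mine, not from the script above):

```python
import pwd

def env_for_user(username, base_env):
    """Return a copy of base_env adjusted for running podman as username.

    After setuid()/setgid(), HOME must be the target user's home and
    XDG_RUNTIME_DIR must point at the target user's runtime dir;
    otherwise rootless podman may still try root's paths and fail
    with "permission denied".
    """
    passwd = pwd.getpwnam(username)  # raises KeyError if unknown
    env = dict(base_env)
    env["HOME"] = passwd.pw_dir
    # Conventional location; systemd creates it when the user logs in.
    env["XDG_RUNTIME_DIR"] = "/run/user/%d" % passwd.pw_uid
    return env
```

The resulting dict would then be passed to the podman invocation, e.g. `subprocess.run(["podman", "stop", cid], env=env)`, so the child sees a consistent environment regardless of what the daemon inherited.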

any input?

Regards,
Leo.

@coldbloodx
Author

BTW, I found that podman core dumps sometimes; here is the backtrace:

```
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".

warning: Loadable section ".note.gnu.property" outside of ELF segments

warning: Loadable section ".note.gnu.property" outside of ELF segments

warning: Loadable section ".note.gnu.property" outside of ELF segments

warning: Loadable section ".note.gnu.property" outside of ELF segments

warning: Loadable section ".note.gnu.property" outside of ELF segments

warning: Loadable section ".note.gnu.property" outside of ELF segments
Core was generated by `/usr/bin/podman run --cidfile /tmp/lsf.ulaworks008.job.3844.1595425427.task.2.1'.
Program terminated with signal SIGABRT, Aborted.
#0 0x000055cf971df4c1 in runtime.raise ()
[Current thread is 1 (Thread 0x14a70c974700 (LWP 26050))]
Missing separate debuginfos, use: yum debuginfo-install podman-1.6.4-10.module_el8.2.0+305+5e198a41.x86_64
(gdb) bt
#0 0x000055cf971df4c1 in runtime.raise ()
#1 0x000055cf971c4f2b in runtime.dieFromSignal ()
#2 0x000055cf00000006 in ?? ()
#3 0xffffffffffffffff in ?? ()
#4 0x000000c0001dd9a8 in ?? ()
#5 0x000055cf971c54bd in runtime.sigfwdgo ()
#6 0x000000c000000006 in ?? ()
#7 0x0000000000000000 in ?? ()
```
Leo.

@mheon
Member

mheon commented Jul 23, 2020

Looks like a sig-proxy race: sending a signal to a container that's already dead can lead to a panic. We've fixed several races there since 1.6.4, so I'd be interested to see if it reproduces on 1.9.3 and/or master.

@github-actions

A friendly reminder that this issue had no activity for 30 days.

@rhatdan
Copy link
Member

rhatdan commented Aug 24, 2020

Reopen if this is not fixed in the upstream branch.

@rhatdan rhatdan closed this as completed Aug 24, 2020
@github-actions github-actions bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Sep 22, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 22, 2023