Container storage creation fragile #7941

Closed · srcshelton opened this issue Oct 6, 2020 · 9 comments · Fixed by #7999
Labels: kind/bug (Categorizes issue or PR as related to a bug.) · locked - please file new issue/PR (Assist humans wanting to comment on an old issue or PR with locked comments.)

Comments

@srcshelton (Contributor)

Is this a BUG REPORT or FEATURE REQUEST? (leave only one on its own line)

/kind bug

Description

podman is fragile, and there are various points where interrupting an operation with ctrl+c will leave podman's state undefined, and cause follow-on breakage.

On invocation, podman should make every effort to roll forward or roll back any state change that a previous run left in progress but incomplete (taking into account that it may be running in parallel with other instances of itself).

For example, I inadvertently interrupted a container starting, I assume during volume creation. I now get this:

Error: error creating container storage: the container name "buildweb-app-admin.webapp-config-1.55-r1" is already in use by "2c68b0f01168b42da187d9bfca25552913a2a6d8e8f3ccddc9c33a4a798e70e9". You have to remove that container to be able to reuse that name.: that name is already in use

# podman rm -v 2c68b0f01168b42da187d9bfca25552913a2a6d8e8f3ccddc9c33a4a798e70e9
Error: no container with name or ID 2c68b0f01168b42da187d9bfca25552913a2a6d8e8f3ccddc9c33a4a798e70e9 found: no such container

# podman rm -v buildweb-app-admin.webapp-config-1.55-r1
Error: no container with name or ID buildweb-app-admin.webapp-config-1.55-r1 found: no such container

... and no amount of system/volume pruning seems to be able to fix this issue :(

I can poke around on the filesystem to try to fix this (... or simply never use that container name ever again!!), but podman shouldn't let things get into this state.

Could locking be added to determine whether a partially-completed resource is under the control of a simultaneous still-running task, or is stale and should be removed?

Ideally, podman should be able to be interrupted at any point during execution and still be able to operate when re-run (or, at the very least, correctly clean up any partial state on a system prune).
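
As a purely illustrative sketch of the kind of staleness check being asked for here (hypothetical code, not Podman's actual storage layer): a creation lock that records the owning PID can be probed with signal 0 to decide whether a half-created resource still belongs to a live process or can safely be rolled back.

```go
// Hypothetical sketch: decide whether a partially-created resource is
// still owned by a running process or is stale and safe to reclaim.
package main

import (
	"errors"
	"fmt"
	"os"
	"strconv"
	"strings"
	"syscall"
)

// ownerAlive reports whether the PID recorded in lockPath still refers to a
// running process. Signal 0 checks for existence without delivering anything.
func ownerAlive(lockPath string) (bool, error) {
	data, err := os.ReadFile(lockPath)
	if errors.Is(err, os.ErrNotExist) {
		return false, nil // no lock file at all: treat as stale
	} else if err != nil {
		return false, err
	}
	pid, err := strconv.Atoi(strings.TrimSpace(string(data)))
	if err != nil {
		return false, fmt.Errorf("malformed lock file %s: %w", lockPath, err)
	}
	err = syscall.Kill(pid, syscall.Signal(0))
	if err == nil || errors.Is(err, syscall.EPERM) {
		return true, nil // process exists (EPERM means it exists but isn't ours)
	}
	return false, nil // ESRCH etc.: owner is gone, resource is stale
}

func main() {
	// The lock path is an assumption for the example only.
	alive, err := ownerAlive("/run/example/creation.lock")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	if alive {
		fmt.Println("creation still in progress elsewhere: leave the resource alone")
	} else {
		fmt.Println("owner is gone: safe to roll back the partial resource")
	}
}
```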

Output of podman version:

Version:      2.1.1
API Version:  2.0.0
Go Version:   go1.14.7
Git Commit:   9f6d6ba0b314d86521b66183c9ce48eaa2da1de2
Built:        Sat Oct  3 11:01:42 2020
OS/Arch:      linux/amd64

Output of podman info --debug:

host:
  arch: amd64
  buildahVersion: 1.16.1
  cgroupManager: cgroupfs
  cgroupVersion: v2
  conmon:
    package: Unknown
    path: /usr/bin/conmon
    version: 'conmon version 2.0.21, commit: 35a2fa83022e56e18af7e6a865ba5d7165fa2a4a'
  cpus: 8
  distribution:
    distribution: gentoo
    version: unknown
  eventLogger: file
  hostname: dellr330
  idMappings:
    gidmap: null
    uidmap: null
  kernel: 5.8.9-gentoo
  linkmode: dynamic
  memFree: 2285371392
  memTotal: 8063447040
  ociRuntime:
    name: crun
    package: Unknown
    path: /usr/bin/crun
    version: |-
      crun version 0.15-dirty
      commit: 56ca95e61639510c7dbd39ff512f80f626404969
      spec: 1.0.0
      +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +YAJL
  os: linux
  remoteSocket:
    path: /run/podman/podman.sock
  rootless: false
  slirp4netns:
    executable: ""
    package: ""
    version: ""
  swapFree: 25193656320
  swapTotal: 25769787392
  uptime: 182h 14m 13.03s (Approximately 7.58 days)
registries:
  search:
  - docker.io
  - docker.pkg.github.com
  - quay.io
store:
  configFile: /etc/containers/storage.conf
  containerStore:
    number: 16
    paused: 0
    running: 16
    stopped: 0
  graphDriverName: overlay
  graphOptions:
    overlay.ignore_chown_errors: "false"
  graphRoot: /space/podman/storage
  graphStatus:
    Backing Filesystem: extfs
    Native Overlay Diff: "true"
    Supports d_type: "true"
    Using metacopy: "false"
  imageStore:
    number: 264
  runRoot: /space/podman/run
  volumePath: /space/podman/volumes
version:
  APIVersion: 2.0.0
  Built: 1601719302
  BuiltTime: Sat Oct  3 11:01:42 2020
  GitCommit: 9f6d6ba0b314d86521b66183c9ce48eaa2da1de2
  GoVersion: go1.14.7
  OsArch: linux/amd64
  Version: 2.1.1

Have you tested with the latest version of Podman and have you checked the Podman Troubleshooting Guide?

Yes

openshift-ci-robot added the kind/bug label Oct 6, 2020
@mheon (Member) commented Oct 6, 2020

Are you certain you interrupted a container when it was starting? The container has somehow been removed from the database, which only ever happens during removal. Starting a container will never cause it to be removed from the database.

@srcshelton (Contributor, Author)

Is it possible the database entry was lost, or was never written in the first place?

To confirm, I'm certain that this occurred during startup: this is a reasonably long-running build-container which I manually invoked, and then just as it was coming up I realised I'd missed something, and so ctrl+c'd it. Now I can't re-run the process as the container's name is blocked, even though it doesn't appear (by name or ID) to podman ps -a.

(This is via a build-script which armours calls to podman with trap '' INT and trap - INT, because I'd previously found that interrupting calls to podman rm -v could also cause state breakage. But I've not turned off ctrl+c processing for the actual run commands, as they legitimately may want to be interrupted!)

@srcshelton (Contributor, Author)

Specifically (hurrah for scroll-back! :)

# ./gentoo-build-web.docker
INFO:  Resolving name 'app-admin/webapp-config' ...
INFO:  Setting build variables for package 'webapp-config' ...
WARN:  No 'package.use/package.use' override found in '/opt/containers/docker-gentoo-build/gentoo-web/etc/portage'

WARN:  No 'package.use/python_targets' override found in '/opt/containers/docker-gentoo-build/gentoo-web/etc/portage'

INFO:   * Building 'app-admin.webapp-config:1.55-r1' root image ...

095a3d7220d3cca58ca630388b681ec428dd9ccf8890175ddd30a800b9025f53
095a3d7220d3cca58ca630388b681ec428dd9ccf8890175ddd30a800b9025f53
Starting build container with command 'podman run --init --name buildweb-app-admin.webapp-config-1.55-r1 --pids-limit 1024 --ulimit nofile=1024:1024 --privileged --env COLUMNS=202 --env LINES=61 --env ROOT --env SYSROOT --env PORTAGE_CONFIGROOT --env TERM --env USE --env CURL_SSL=openssl --mount type=bind,source=/opt/containers/docker-gentoo-build/gentoo-web/etc/portage/package.accept_keywords,destination=/etc/portage/package.accept_keywords/package.accept_keywords,ro=true --mount type=bind,source=/opt/containers/docker-gentoo-build/gentoo-web/etc/portage/package.license,destination=/etc/portage/package.license,ro=true --mount type=bind,source=/opt/containers/docker-gentoo-build/gentoo-web/etc/portage/package.use/webapp.use,destination=/etc/portage/package.use/webapp.use,ro=true --mount type=bind,source=/opt/containers/docker-gentoo-build/gentoo-web/etc/portage/package.mask,destination=/etc/portage/package.mask,ro=true --mount type=bind,source=/etc/portage/repos.conf,destination=/etc/portage/repos.conf,ro=true --mount type=bind,source=/var/db/repo/container,destination=/var/db/repo/container,ro=true --mount type=bind,source=/var/db/repo/gentoo,destination=/var/db/repo/gentoo,ro=true --mount type=bind,source=/var/db/repo/srcshelton,destination=/var/db/repo/srcshelton,ro=true --mount type=bind,source=/var/cache/portage/dist,destination=/var/cache/portage/dist --mount type=bind,source=/var/log/portage,destination=/var/log/portage --mount type=bind,source=/var/cache/portage/pkg/amd64/xeon_e56.docker,destination=/var/cache/portage/pkg/amd64/docker gentoo-build:latest --pre-pkgs=sys-apps/help2man sys-devel/gcc sys-apps/busybox app-admin/eselect app-eselect/eselect-awk dev-lang/php sys-apps/gawk www-servers/lighttpd virtual/httpd-cgi virtual/httpd-fastcgi virtual/httpd-php --pre-use=-lib-only internal-glib python_targets_python3_8 curl_ssl_openssl --with-use=python_targets_python3_8 curl_ssl_openssl --post-pkgs=www-apps/wordpress mail-client/roundcube www-apps/phpsysinfo --post-use=internal-glib python_targets_python3_8 curl_ssl_openssl --usepkg=y --with-bdeps=n --with-bdeps-auto=n =app-admin/webapp-config-1.55-r1'
WARNING: The same type, major and minor should not be used for multiple devices.
WARNING: The same type, major and minor should not be used for multiple devices.
^C

@mheon (Member) commented Oct 6, 2020

Hm. Probably failed midway through creation, after c/storage created the container but before we actually added it to the DB as a container. Those are in very close proximity, so either something seriously slowed down our DB operations and it aborted mid-transaction or this was a very narrow timing window.

We could potentially alter the order of operations so that the storage is created after we are added to the DB, but that introduces another potential race (someone could try and start the container immediately after it was added to the DB but before storage was created, which would itself be an error). I'll think a bit more on this - there may be another way.

@srcshelton (Contributor, Author)

I suspect I was unlucky with a narrow timing window - the system is reasonably powerful with multiple spinning disks, and wasn't heavily loaded.

I'm not familiar with the code, but is this a case where multiple DB states could be used (something along the lines of reserved, storage created, up?) with appropriate cleanup (e.g. if reserved more than a certain period of time previously, assume that something broke before it was able to progress?). Does the DB entry link to the controlling PID in relevant circumstances? Could/should it?
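
A rough sketch of this multi-state idea, using hypothetical names (CreationPhase, needsCleanup) that are not part of libpod's actual schema: each record carries a creation phase plus a timestamp, and anything stuck short of completion past a timeout is treated as abandoned.

```go
// Hypothetical sketch of the "reserved / storage created / up" idea:
// a per-container creation phase plus timestamp, with a cleanup rule
// for reservations that never progressed.
package main

import (
	"fmt"
	"time"
)

type CreationPhase int

const (
	PhaseReserved       CreationPhase = iota // name reserved, nothing created yet
	PhaseStorageCreated                      // c/storage layer exists, not yet in the DB
	PhaseComplete                            // fully registered
)

type ContainerRecord struct {
	ID        string
	Name      string
	Phase     CreationPhase
	UpdatedAt time.Time
	OwnerPID  int // optional link back to the creating process
}

// staleReservationAge: how long a record may sit in an incomplete phase
// before we assume the creating process died mid-way. Arbitrary for the example.
const staleReservationAge = 10 * time.Minute

// needsCleanup reports whether a record looks like an abandoned partial creation.
func needsCleanup(rec ContainerRecord, now time.Time) bool {
	if rec.Phase == PhaseComplete {
		return false
	}
	return now.Sub(rec.UpdatedAt) > staleReservationAge
}

func main() {
	rec := ContainerRecord{
		ID:        "2c68b0f01168",
		Name:      "buildweb-app-admin.webapp-config-1.55-r1",
		Phase:     PhaseStorageCreated,
		UpdatedAt: time.Now().Add(-1 * time.Hour),
	}
	if needsCleanup(rec, time.Now()) {
		fmt.Printf("%s: abandoned at phase %d, rolling back\n", rec.Name, rec.Phase)
	}
}
```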

@srcshelton (Contributor, Author) commented Oct 6, 2020

If it helps:

  • There's an overlay-containers/2c68b0f01168b42da187d9bfca25552913a2a6d8e8f3ccddc9c33a4a798e70e9 directory containing only an (empty) userdata directory;

  • There's an overlay-containers/containers.json entry:

  {
    "id": "2c68b0f01168b42da187d9bfca25552913a2a6d8e8f3ccddc9c33a4a798e70e9",
    "names": [
      "buildweb-app-admin.webapp-config-1.55-r1"
    ],
    "image": "03f3c86d6d4702ee957c9fd10e3520cef99e4dc0918e69b1218f9534b904598c",
    "layer": "6bbc0dd52319fa788f415a9b60830d44f94d964afcfd749d25e568cd7c0ffb08",
    "metadata": "{\"image-name\":\"localhost/gentoo-build:latest\",\"image-id\":\"03f3c86d6d4702ee957c9fd10e3520cef99e4dc0918e69b1218f9534b904598c\",\"name\":\"buildweb-app-admin.webapp-config-1.55-r1\,
    "created": "2020-10-06T13:46:50.552895824Z",
    "flags": {
      "MountLabel": "",
      "ProcessLabel": ""
    }
  }
  • There is a directory overlay/6bbc0dd52319fa788f415a9b60830d44f94d964afcfd749d25e568cd7c0ffb08 which contains three empty, unmounted directories (diff, merged, and work) and two populated files, link (one reference) and lower (several references).

Removing all of the above seems to have fixed the issue...

@mheon (Member) commented Oct 6, 2020

Looking at this further:

We cannot easily move things around - adding to the Libpod DB needs to be the last thing done, because most of the things we do before that point generate information that will need to be added to the container configuration, which can only be written once.

My current thinking is that it may be sufficient to intercept SIGINT and SIGTERM during container creation, and delay them until after the function is run. This does not help with the SIGKILL case, but if things have gotten bad enough to merit a SIGKILL we're probably not going to be able to clean up properly regardless. We hold off on exiting until container creation is finished, and then step out afterwards.

@srcshelton (Contributor, Author)

My current thinking is that it may be sufficient to intercept SIGINT and SIGTERM during container creation, and delay them until after the function is run. This does not help with the SIGKILL case, but if things have gotten bad enough to merit a SIGKILL we're probably not going to be able to clean up properly regardless. We hold off on exiting until container creation is finished, and then step out afterwards.

I think that this sounds like a good solution - if the problem's still happening after that, it can be looked at in more detail, but this will hopefully prevent the majority (if not all) of the occurrences of this problem in the first place!
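
For illustration, a minimal stand-alone sketch of the deferred-signal approach described above (not the actual libpod Shutdown package): take ownership of SIGINT/SIGTERM for the duration of the critical section, then act on any signal that arrived only once creation has finished.

```go
// Stand-alone sketch of deferring SIGINT/SIGTERM across a critical section;
// the real implementation in libpod is more involved.
package main

import (
	"fmt"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	sigs := make(chan os.Signal, 1)
	// Take over SIGINT/SIGTERM so the default "terminate immediately"
	// behaviour is suspended while we hold the critical section.
	signal.Notify(sigs, syscall.SIGINT, syscall.SIGTERM)
	defer signal.Stop(sigs)

	// --- critical section: container creation must not be interrupted ---
	createContainer()
	// --- end of critical section ---

	// If a signal arrived while we were busy, honour it now.
	select {
	case sig := <-sigs:
		fmt.Printf("received %v during creation, exiting now\n", sig)
		os.Exit(1)
	default:
		// no pending signal; continue normally
	}
}

// createContainer stands in for the real work (creating storage,
// writing the container to the database, and so on).
func createContainer() {
	time.Sleep(2 * time.Second)
	fmt.Println("container created and registered")
}
```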

mheon added a commit to mheon/libpod that referenced this issue Oct 12, 2020
Expand the use of the Shutdown package such that we now use it
to handle signals any time we run Libpod. From there, add code to
container creation to use the Inhibit function to prevent a
shutdown from occurring during the critical parts of container
creation.

We also need to turn off signal handling when --sig-proxy is
invoked - we don't want to catch the signals ourselves then, but
instead to forward them into the container via the existing
sig-proxy handler.

Fixes containers#7941

Signed-off-by: Matthew Heon <[email protected]>
@pciavald commented Aug 10, 2022

Got the same issue in podman 3.0.1 on debian 11, removing the reference from /home/user/.local/share/containers/storage/overlay-containers/containers.json fixed it. May be related to #2553

github-actions bot added the locked - please file new issue/PR label Sep 20, 2023
github-actions bot locked as resolved and limited conversation to collaborators Sep 20, 2023