
power-loss while creating containers may leave podman (storage) in a broken state #8005

Closed
w4tsn opened this issue Oct 13, 2020 · 21 comments
Labels: kind/bug · locked - please file new issue/PR · stale-issue

Comments

@w4tsn
Contributor

w4tsn commented Oct 13, 2020

Is this a BUG REPORT or FEATURE REQUEST? (leave only one on its own line)

/kind bug

Description

If power loss occurs within a small time window while containers are being created with podman, the container storage is left broken and no containers can be started or created anymore. Only a podman system prune -a seems to resolve the issue; the other prune commands don't.

Steps to reproduce the issue (maybe in general):

  1. Set up systemd units to create containers on boot
  2. Disconnect the power source while the containers are being created on boot
  3. Restart and observe the units / containers

Steps to reproduce the issue (specifically):

The following are the specific steps for my actual setup. The setup might make a difference, since the Raspberry Pi 3B+ has limited resources, which causes image pulls and container creation to take some time (especially when starting 5 containers in parallel) and could widen the time window for corruption.

  1. Install the latest Fedora IoT on a Raspberry Pi 3B+
  2. Set up some containers that start via systemd on boot, including a pod. Use unit dependencies to start the pod first, then one container with a startup time of more than a minute (e.g. node-red), then four other containers after that
  3. As soon as the containers are being created after a successful boot, disconnect the power source

Describe the results you received:

After the reboot following the power loss, all podman container units fail to start with the following error message:

Error: readlink /var/lib/containers/storage/overlay/l/ORYZLEWFSIV3UXAUDOB4OAH6SW: no such file or directory

Describe the results you expected:

I expect all containers to be created normally. My systemd units remove any left-over containers before attempting to create the new ones. This should work in any case, even on power loss. Podman should not enter a state where I have to manually issue a podman system prune -a or other intervention when something fails at container creation.

Additional information you deem important (e.g. issue happens only occasionally):

I'm starting 5 containers in parallel, which slows down container creation quite a bit on a Raspberry Pi 3B+ and could widen a potential time window for corruption.

Output of podman version:

Version:      2.1.1
API Version:  2.0.0
Go Version:   go1.14.9
Built:        Wed Sep 30 21:31:36 2020
OS/Arch:      linux/arm64

Output of podman info --debug:

host:
  arch: arm64
  buildahVersion: 1.16.1
  cgroupManager: systemd
  cgroupVersion: v2
  conmon:
    package: conmon-2.0.21-2.fc32.aarch64
    path: /usr/bin/conmon
    version: 'conmon version 2.0.21, commit: 5c1a09d48bd2b912c29efe00ec956c8f84ae26b9'
  cpus: 4
  distribution:
    distribution: fedora
    version: "32"
  eventLogger: journald
  hostname: localhost
  idMappings:
    gidmap: null
    uidmap: null
  kernel: 5.8.13-200.fc32.aarch64
  linkmode: dynamic
  memFree: 11911168
  memTotal: 981143552
  ociRuntime:
    name: crun
    package: crun-0.15-5.fc32.aarch64
    path: /usr/bin/crun
    version: |-
      crun version 0.15
      commit: 56ca95e61639510c7dbd39ff512f80f626404969
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +YAJL
  os: linux
  remoteSocket:
    path: /run/podman/podman.sock
  rootless: false
  slirp4netns:
    executable: ""
    package: ""
    version: ""
  swapFree: 370003968
  swapTotal: 466997248
  uptime: 3h 14m 42.71s (Approximately 0.12 days)
registries:
  search:
  - registry.fedoraproject.org
  - registry.access.redhat.com
  - registry.centos.org
  - docker.io
store:
  configFile: /etc/containers/storage.conf
  containerStore:
    number: 6
    paused: 0
    running: 6
    stopped: 0
  graphDriverName: overlay
  graphOptions:
    overlay.mountopt: nodev,metacopy=on
  graphRoot: /var/lib/containers/storage
  graphStatus:
    Backing Filesystem: extfs
    Native Overlay Diff: "false"
    Supports d_type: "true"
    Using metacopy: "true"
  imageStore:
    number: 6
  runRoot: /var/run/containers/storage
  volumePath: /var/lib/containers/storage/volumes
version:
  APIVersion: 2.0.0
  Built: 1601494296
  BuiltTime: Wed Sep 30 21:31:36 2020
  GitCommit: ""
  GoVersion: go1.14.9
  OsArch: linux/arm64
  Version: 2.1.1

Package info (e.g. output of rpm -q podman or apt list podman):

podman-2.1.1-7.fc32.aarch64

Have you tested with the latest version of Podman and have you checked the Podman Troubleshooting Guide?

Yes

Additional environment details (AWS, VirtualBox, physical, etc.):

I'm using the aarch64 variant on a Raspberry Pi 3B+ (limited resources) running Fedora IoT 32. The containers are created automatically on boot via systemd units. Each unit first tries to remove any existing container via an optional command and then runs a podman run command with the --systemd flag, roughly as in the sketch below.
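For illustration, a unit looks roughly like this (a simplified sketch; the unit, container, and image names are placeholders, not my real ones):

[Unit]
Description=Example application container (placeholder)
Wants=network-online.target
After=network-online.target

[Service]
# Best-effort cleanup of a leftover container from a previous boot;
# the leading "-" tells systemd to ignore a failure of this command.
ExecStartPre=-/usr/bin/podman rm -f example-app
ExecStart=/usr/bin/podman run --rm --name example-app --systemd=true docker.io/library/example:latest
ExecStop=/usr/bin/podman stop -t 10 example-app

[Install]
WantedBy=multi-user.target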

@openshift-ci-robot added the kind/bug label Oct 13, 2020
@vrothberg
Member

@mheon PTAL

@w4tsn
Contributor Author

w4tsn commented Oct 13, 2020

may be related to #7941

@mheon
Member

mheon commented Oct 13, 2020 via email

@mheon
Member

mheon commented Oct 13, 2020

Unfortunately, I don't think we've ever had a good explanation as to the exact cause of this. I know a fix was landed earlier (around a year ago) when CRI-O hit this, but it appears that it did not fully resolve the issue.

@mheon
Member

mheon commented Oct 13, 2020

Unlikely to be related to #7941

@w4tsn
Contributor Author

w4tsn commented Oct 26, 2020

Is there a way to work around this broken state without clearing the podman storage with system prune -a?

I have several podman deployments in the field on low-bandwidth or per-byte-billed connections and would like to keep the downloaded images while still recovering from this broken state. Any ideas?

EDIT:

Might be related to #5986 - at least there seems to be a valid workaround using a read-only fs: #5986 (comment)
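My reading of that workaround (an assumption on my part, not something confirmed in this thread) is to keep the images on a store that podman only ever reads. With containers-storage that can be expressed via additionalimagestores in storage.conf, e.g.:

# /etc/containers/storage.conf (fragment) - hypothetical sketch
[storage]
driver = "overlay"

[storage.options]
# Images placed in an additional store (e.g. on a read-only partition) are
# never written to by podman, so they are not at risk during an unclean shutdown.
additionalimagestores = [
  "/usr/share/containers/storage",
]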

@rhatdan
Member

rhatdan commented Oct 27, 2020

Can you remove the image that caused the issue?

@w4tsn
Contributor Author

w4tsn commented Oct 28, 2020

I can remove all images together, or one by one. I actually don't know which image I would have to try; as far as I can tell right now, it's independent of any particular image. To be fair, I did not test the whole thing with each image one by one. I'll have to see when I'm free to try that.

@rhatdan
Member

rhatdan commented Oct 28, 2020

If you do podman images, what error do you see?

We had a similar experience and found that we can edit ./.local/share/containers/storage/overlay-images/images.json, remove the bad container link from the JSON file, and get the images back up and running.

We are looking into making this work with podman rmi IMAGE, but have to figure out how to clean up the image store.
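Not an official recovery procedure, but as a sketch (assuming jq is installed and the rootless storage path from above), this is roughly how to locate the bad entry before hand-editing; back the file up first:

STORE=~/.local/share/containers/storage/overlay-images/images.json
cp "$STORE" "$STORE.bak"
# Print id and names of every image record so the broken entry can be spotted,
# then delete that object from the JSON array by hand and save the file.
jq '.[] | {id, names}' "$STORE"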

@github-actions

A friendly reminder that this issue had no activity for 30 days.

@polygamma

polygamma commented Nov 28, 2020

I'd like to contribute to this issue, since I am also experiencing this behavior and would love to help fix it.

I am able to reproduce it consistently on a Raspberry Pi 4 running Arch Linux ARM with Podman 2.2.0-rc2, which I built from source. The Raspberry Pi is currently not needed for anything important, so I can fiddle around with it in any way you want me to in order to help fix this problem.

So let me describe my setup in more detail and give you as much information as possible:

I cannot give you the Containerfile and everything else you'd need to recreate my image, but it provides a complete work environment including Qt, OpenCV, different compilers, programming IDEs, browsers, etc. The built image is huge, around 15 GB. It also uses systemd, so the entrypoint is set to /sbin/init.

To force this issue, I am simply running a container based on that image with:

sudo podman run -i -t --rm --name auv --privileged --network='host' --ipc='host' --systemd='true' --volume=/lib/modules:/lib/modules:ro auv:latest

While that container is running, I unplug the power cable of the Raspberry Pi. About 1 out of 3 times, that breaks the Podman environment. In that state, I executed the following commands:

[attk@jonny-raspberry ~]$ sudo podman system prune

WARNING! This will remove:
        - all stopped containers
        - all stopped pods
        - all dangling images
        - all build cache
Are you sure you want to continue? [y/N] y
Deleted Pods
Deleted Containers
Deleted Images
[attk@jonny-raspberry ~]$ sudo podman image prune

WARNING! This will remove all dangling images.
Are you sure you want to continue? [y/N] y
[attk@jonny-raspberry ~]$ sudo podman container prune
WARNING! This will remove all non running containers.
Are you sure you want to continue? [y/N] y
[attk@jonny-raspberry ~]$ sudo podman container ls -a
CONTAINER ID  IMAGE   COMMAND  CREATED  STATUS  PORTS   NAMES
[attk@jonny-raspberry ~]$ sudo podman image ls
REPOSITORY                                         TAG     IMAGE ID      CREATED            SIZE
localhost/auv                                      latest  7937f0d3891c  About an hour ago  4.92 GB
docker.io/polygamma/archlinux_arm_generic_aarch64  latest  53c4eb44e69e  3 days ago         1.4 GB

I stripped the Containerfile down a bit so I don't have to wait as long during builds, which is why the image is only about 5 GB here.

[attk@jonny-raspberry ~]$ sudo podman run -i -t --rm --name auv --privileged --network='host' --ipc='host' --systemd='true' --volume=/lib/modules:/lib/modules:ro auv:latest
Error: readlink /var/lib/containers/storage/overlay/l/74I7Y5UEZAA3GKIJJJSSGEMXNQ: no such file or directory

The Raspberry Pi is still in exactly that state, so if you need more information, I can provide it.

@rhatdan Do you have any idea on how to progress with this problem?

@rhatdan
Member

rhatdan commented Dec 26, 2020

This is the same issue as #8437

@github-actions

A friendly reminder that this issue had no activity for 30 days.

@rhatdan
Member

rhatdan commented Jan 26, 2021

Let's concentrate on #8347

@rhatdan closed this as completed Jan 26, 2021
@yangm97
Contributor

yangm97 commented Feb 2, 2021

@rhatdan Sorry, I think you meant #5986?

@giuseppe
Member

giuseppe commented Feb 2, 2021

To create a corrupted storage, we can use a reproducer like:

# podman --root /root/test/safe pull fedora
# sync
# podman --root /root/test/broken pull fedora
# sleep 5
# echo o > /proc/sysrq-trigger

The echo o > /proc/sysrq-trigger immediately powers the machine off, simulating a power loss.

On the next reboot, if it still boots :-) ...

# diff -qr /root/{safe,broken}/overlay/d9e1d1e08de26f234a83c6c737827884dd15c68c80714a5a973d245ed456f7a1/diff
Files safe/overlay/d9e1d1e08de26f234a83c6c737827884dd15c68c80714a5a973d245ed456f7a1/diff/bin/[ and broken/overlay/d9e1d1e08de26f234a83c6c737827884dd15c68c80714a5a973d245ed456f7a1/diff/bin/[ differ
Files safe/overlay/d9e1d1e08de26f234a83c6c737827884dd15c68c80714a5a973d245ed456f7a1/diff/bin/alias and broken/overlay/d9e1d1e08de26f234a83c6c737827884dd15c68c80714a5a973d245ed456f7a1/diff/bin/alias differ
Files safe/overlay/d9e1d1e08de26f234a83c6c737827884dd15c68c80714a5a973d245ed456f7a1/diff/bin/arch and broken/overlay/d9e1d1e08de26f234a83c6c737827884dd15c68c80714a5a973d245ed456f7a1/diff/bin/arch differ
Files safe/overlay/d9e1d1e08de26f234a83c6c737827884dd15c68c80714a5a973d245ed456f7a1/diff/bin/awk and broken/overlay/d9e1d1e08de26f234a83c6c737827884dd15c68c80714a5a973d245ed456f7a1/diff/bin/awk differ
Files safe/overlay/d9e1d1e08de26f234a83c6c737827884dd15c68c80714a5a973d245ed456f7a1/diff/bin/b2sum and broken/overlay/d9e1d1e08de26f234a83c6c737827884dd15c68c80714a5a973d245ed456f7a1/diff/bin/b2sum differ
Files safe/overlay/d9e1d1e08de26f234a83c6c737827884dd15c68c80714a5a973d245ed456f7a1/diff/bin/base32 and broken/overlay/d9e1d1e08de26f234a83c6c737827884dd15c68c80714a5a973d245ed456f7a1/diff/bin/base32 differ
Files safe/overlay/d9e1d1e08de26f234a83c6c737827884dd15c68c80714a5a973d245ed456f7a1/diff/bin/base64 and broken/overlay/d9e1d1e08de26f234a83c6c737827884dd15c68c80714a5a973d245ed456f7a1/diff/bin/base64 differ
Files safe/overlay/d9e1d1e08de26f234a83c6c737827884dd15c68c80714a5a973d245ed456f7a1/diff/bin/basename and broken/overlay/d9e1d1e08de26f234a83c6c737827884dd15c68c80714a5a973d245ed456f7a1/diff/bin/basename differ
Files safe/overlay/d9e1d1e08de26f234a83c6c737827884dd15c68c80714a5a973d245ed456f7a1/diff/bin/basenc and broken/overlay/d9e1d1e08de26f234a83c6c737827884dd15c68c80714a5a973d245ed456f7a1/diff/bin/basenc differ
Files safe/overlay/d9e1d1e08de26f234a83c6c737827884dd15c68c80714a5a973d245ed456f7a1/diff/bin/bash and broken/overlay/d9e1d1e08de26f234a83c6c737827884dd15c68c80714a5a973d245ed456f7a1/diff/bin/bash differ
Files safe/overlay/d9e1d1e08de26f234a83c6c737827884dd15c68c80714a5a973d245ed456f7a1/diff/bin/bashbug and broken/overlay/d9e1d1e08de26f234a83c6c737827884dd15c68c80714a5a973d245ed456f7a1/diff/bin/bashbug differ
Files safe/overlay/d9e1d1e08de26f234a83c6c737827884dd15c68c80714a5a973d245ed456f7a1/diff/bin/bashbug-64 and broken/overlay/d9e1d1e08de26f234a83c6c737827884dd15c68c80714a5a973d245ed456f7a1/diff/bin/bashbug-64 differ
Files safe/overlay/d9e1d1e08de26f234a83c6c737827884dd15c68c80714a5a973d245ed456f7a1/diff/bin/bg and broken/overlay/d9e1d1e08de26f234a83c6c737827884dd15c68c80714a5a973d245ed456f7a1/diff/bin/bg differ
Files safe/overlay/d9e1d1e08de26f234a83c6c737827884dd15c68c80714a5a973d245ed456f7a1/diff/bin/cal and broken/overlay/d9e1d1e08de26f234a83c6c737827884dd15c68c80714a5a973d245ed456f7a1/diff/bin/cal differ
Files safe/overlay/d9e1d1e08de26f234a83c6c737827884dd15c68c80714a5a973d245ed456f7a1/diff/bin/ca-legacy and broken/overlay/d9e1d1e08de26f234a83c6c737827884dd15c68c80714a5a973d245ed456f7a1/diff/bin/ca-legacy differ
Files safe/overlay/d9e1d1e08de26f234a83c6c737827884dd15c68c80714a5a973d245ed456f7a1/diff/bin/cat and broken/overlay/d9e1d1e08de26f234a83c6c737827884dd15c68c80714a5a973d245ed456f7a1/diff/bin/cat differ
Files safe/overlay/d9e1d1e08de26f234a83c6c737827884dd15c68c80714a5a973d245ed456f7a1/diff/bin/catchsegv and broken/overlay/d9e1d1e08de26f234a83c6c737827884dd15c68c80714a5a973d245ed456f7a1/diff/bin/catchsegv differ
Files safe/overlay/d9e1d1e08de26f234a83c6c737827884dd15c68c80714a5a973d245ed456f7a1/diff/bin/cd and broken/overlay/d9e1d1e08de26f234a83c6c737827884dd15c68c80714a5a973d245ed456f7a1/diff/bin/cd differ
Files safe/overlay/d9e1d1e08de26f234a83c6c737827884dd15c68c80714a5a973d245ed456f7a1/diff/bin/chage and broken/overlay/d9e1d1e08de26f234a83c6c737827884dd15c68c80714a5a973d245ed456f7a1/diff/bin/chage differ
Files safe/overlay/d9e1d1e08de26f234a83c6c737827884dd15c68c80714a5a973d245ed456f7a1/diff/bin/chcon and broken/overlay/d9e1d1e08de26f234a83c6c737827884dd15c68c80714a5a973d245ed456f7a1/diff/bin/chcon differ
Files safe/overlay/d9e1d1e08de26f234a83c6c737827884dd15c68c80714a5a973d245ed456f7a1/diff/bin/chgrp and broken/overlay/d9e1d1e08de26f234a83c6c737827884dd15c68c80714a5a973d245ed456f7a1/diff/bin/chgrp differ
Files safe/overlay/d9e1d1e08de26f234a83c6c737827884dd15c68c80714a5a973d245ed456f7a1/diff/bin/chmem and broken/overlay/d9e1d1e08de26f234a83c6c737827884dd15c68c80714a5a973d245ed456f7a1/diff/bin/chmem differ
Files safe/overlay/d9e1d1e08de26f234a83c6c737827884dd15c68c80714a5a973d245ed456f7a1/diff/bin/chmod and broken/overlay/d9e1d1e08de26f234a83c6c737827884dd15c68c80714a5a973d245ed456f7a1/diff/bin/chmod differ
Files safe/overlay/d9e1d1e08de26f234a83c6c737827884dd15c68c80714a5a973d245ed456f7a1/diff/bin/choom and broken/overlay/d9e1d1e08de26f234a83c6c737827884dd15c68c80714a5a973d245ed456f7a1/diff/bin/choom differ
Files safe/overlay/d9e1d1e08de26f234a83c6c737827884dd15c68c80714a5a973d245ed456f7a1/diff/bin/chown and broken/overlay/d9e1d1e08de26f234a83c6c737827884dd15c68c80714a5a973d245ed456f7a1/diff/bin/chown differ
Files safe/overlay/d9e1d1e08de26f234a83c6c737827884dd15c68c80714a5a973d245ed456f7a1/diff/bin/chrt and broken/overlay/d9e1d1e08de26f234a83c6c737827884dd15c68c80714a5a973d245ed456f7a1/diff/bin/chrt differ
Files safe/overlay/d9e1d1e08de26f234a83c6c737827884dd15c68c80714a5a973d245ed456f7a1/diff/bin/cksum and broken/overlay/d9e1d1e08de26f234a83c6c737827884dd15c68c80714a5a973d245ed456f7a1/diff/bin/cksum differ
Files safe/overlay/d9e1d1e08de26f234a83c6c737827884dd15c68c80714a5a973d245ed456f7a1/diff/bin/col and broken/overlay/d9e1d1e08de26f234a83c6c737827884dd15c68c80714a5a973d245ed456f7a1/diff/bin/col differ
Files safe/overlay/d9e1d1e08de26f234a83c6c737827884dd15c68c80714a5a973d245ed456f7a1/diff/bin/colcrt and broken/overlay/d9e1d1e08de26f234a83c6c737827884dd15c68c80714a5a973d245ed456f7a1/diff/bin/colcrt differ
Files safe/overlay/d9e1d1e08de26f234a83c6c737827884dd15c68c80714a5a973d245ed456f7a1/diff/bin/colrm and broken/overlay/d9e1d1e08de26f234a83c6c737827884dd15c68c80714a5a973d245ed456f7a1/diff/bin/colrm differ
Files safe/overlay/d9e1d1e08de26f234a83c6c737827884dd15c68c80714a5a973d245ed456f7a1/diff/bin/column and broken/overlay/d9e1d1e08de26f234a83c6c737827884dd15c68c80714a5a973d245ed456f7a1/diff/bin/column differ
Files safe/overlay/d9e1d1e08de26f234a83c6c737827884dd15c68c80714a5a973d245ed456f7a1/diff/bin/comm and broken/overlay/d9e1d1e08de26f234a83c6c737827884dd15c68c80714a5a973d245ed456f7a1/diff/bin/comm differ
Files safe/overlay/d9e1d1e08de26f234a83c6c737827884dd15c68c80714a5a973d245ed456f7a1/diff/bin/command and broken/overlay/d9e1d1e08de26f234a83c6c737827884dd15c68c80714a5a973d245ed456f7a1/diff/bin/command differ
Files safe/overlay/d9e1d1e08de26f234a83c6c737827884dd15c68c80714a5a973d245ed456f7a1/diff/bin/cp and broken/overlay/d9e1d1e08de26f234a83c6c737827884dd15c68c80714a5a973d245ed456f7a1/diff/bin/cp differ
Files safe/overlay/d9e1d1e08de26f234a83c6c737827884dd15c68c80714a5a973d245ed456f7a1/diff/bin/csplit and broken/overlay/d9e1d1e08de26f234a83c6c737827884dd15c68c80714a5a973d245ed456f7a1/diff/bin/csplit differ
Files safe/overlay/d9e1d1e08de26f234a83c6c737827884dd15c68c80714a5a973d245ed456f7a1/diff/bin/curl and broken/overlay/d9e1d1e08de26f234a83c6c737827884dd15c68c80714a5a973d245ed456f7a1/diff/bin/curl differ
Files safe/overlay/d9e1d1e08de26f234a83c6c737827884dd15c68c80714a5a973d245ed456f7a1/diff/bin/cut and broken/overlay/d9e1d1e08de26f234a83c6c737827884dd15c68c80714a5a973d245ed456f7a1/diff/bin/cut differ
Files safe/overlay/d9e1d1e08de26f234a83c6c737827884dd15c68c80714a5a973d245ed456f7a1/diff/bin/cvtsudoers and broken/overlay/d9e1d1e08de26f234a83c6c737827884dd15c68c80714a5a973d245ed456f7a1/diff/bin/cvtsudoers differ
Files safe/overlay/d9e1d1e08de26f234a83c6c737827884dd15c68c80714a5a973d245ed456f7a1/diff/bin/date and broken/overlay/d9e1d1e08de26f234a83c6c737827884dd15c68c80714a5a973d245ed456f7a1/diff/bin/date differ
...

@rhatdan
Member

rhatdan commented Feb 2, 2021

If you attempt to run a container on that image, what error do you see?

@giuseppe
Member

giuseppe commented Feb 2, 2021

# podman --runtime /usr/bin/crun --root /root/test/broken/ run --rm -ti fedora sh
Error: executable file `sh` not found in $PATH: No such file or directory: OCI not found

# podman --runtime /usr/bin/runc --root /root/test/broken/ run --rm -ti fedora sh
Error: container_linux.go:370: starting container process caused: exec: "sh": executable file not found in $PATH: OCI not found

@rhatdan
Member

rhatdan commented Feb 3, 2021

We have seen these types of errors many times, which gives us a clue about what could cause it.

If you just do podman rmi fedora, or better yet podman pull fedora, does that clean up your images?
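To spell out what I mean (a sketch against giuseppe's reproducer paths above; re-pulling may or may not rewrite the damaged layer):

# Try a plain re-pull first; if the container still fails, remove the image and pull again.
podman --root /root/test/broken pull fedora
podman --root /root/test/broken run --rm -ti fedora sh || {
    podman --root /root/test/broken rmi -f fedora
    podman --root /root/test/broken pull fedora
}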

@pclass-sensonix

I am sporadically seeing this issue as well on my Linux device when I pull power unexpectedly. I have a 'base image' on a separate partition that I spin up a podman container against on boot using systemd. When this failure happens, the container fails to launch and I have to podman rmi my base image and reload it from scratch before I can use my container again. I haven't been able to figure out how to recover the container without removing the image itself.

@LordPraslea

I'm also starting to see this issue on some containers run by systemd. Again, unexpected power disconnect.

@github-actions bot added the locked - please file new issue/PR label Sep 22, 2023
@github-actions bot locked as resolved and limited conversation to collaborators Sep 22, 2023