Container storage creation fragile #7941
Are you certain you interrupted a container when it was starting? The container has somehow been removed from the database, which only ever happens during removal. Starting a container will never cause it to be removed from the database.
Is it possible the database entry was lost, or was never written in the first place? To confirm, I'm certain that this occurred during startup: this is a reasonably long-running build-container which I manually invoked, and then, just as it was coming up, I realised I'd missed something and so ctrl+c'd it. Now I can't re-run the process as the container's name is blocked, even though it doesn't appear (by name or ID) to exist. (This is via a build-script which armours calls to …)
Specifically (hurrah for scroll-back! :)
Hm. Probably failed midway through creation, after c/storage created the container but before we actually added it to the DB as a container. Those are in very close proximity, so either something seriously slowed down our DB operations and it aborted mid-transaction, or this was a very narrow timing window. We could potentially alter the order of operations so that the storage is created after we are added to the DB, but that introduces another potential race (someone could try to start the container immediately after it was added to the DB but before storage was created, which would itself be an error). I'll think a bit more on this - there may be another way.
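To make the window concrete, here is a minimal, self-contained sketch (in Go, but not Podman's actual code - the function and variable names are invented for illustration) of the ordering described above: c/storage reserves the name and creates storage first, and the Libpod DB entry is written second, so an interrupt between the two steps leaves an orphaned name reservation that blocks later runs.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical stand-ins for the two stores involved; the real code lives in
// containers/storage and Libpod's database state.
var (
	storageNames = map[string]bool{} // names reserved by c/storage
	dbNames      = map[string]bool{} // containers known to the Libpod DB
)

func createStorage(name string) error {
	if storageNames[name] {
		return errors.New("the container name is already in use")
	}
	storageNames[name] = true
	return nil
}

func addToDB(name string) { dbNames[name] = true }

func createContainer(name string, interrupted bool) error {
	// Step 1: c/storage reserves the name and creates the storage layer.
	if err := createStorage(name); err != nil {
		return err
	}
	// Step 2: the container is committed to the Libpod DB. A ctrl+c that
	// lands between the two steps leaves storage the DB knows nothing about.
	if interrupted {
		return errors.New("interrupted before DB commit")
	}
	addToDB(name)
	return nil
}

func main() {
	fmt.Println(createContainer("build", true))  // first run dies in the window
	fmt.Println(createContainer("build", false)) // re-run fails: name is blocked
}
```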
I suspect I was unlucky with a narrow timing window - the system is reasonably powerful with multiple spinning disks, and wasn't heavily loaded. I'm not familiar with the code, but is this a case where multiple DB states could be used (something along the lines of …)?
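The suggestion above is cut off, but one way to read it is an intermediate "creating" state recorded in the DB, which a later invocation (or a prune) could safely reap. A rough sketch of that idea with entirely hypothetical names - this is not how Podman's state is actually modelled:

```go
package main

import "fmt"

type containerState int

const (
	stateCreating containerState = iota // DB entry written, storage not yet confirmed
	stateCreated                        // creation fully committed
)

type record struct {
	name  string
	state containerState
}

// reapStale drops records that never made it past "creating"; a later run
// (or a prune) could call this to recover from an interrupted creation.
func reapStale(db []record) []record {
	var kept []record
	for _, r := range db {
		if r.state == stateCreating {
			fmt.Println("cleaning up half-created container:", r.name)
			continue
		}
		kept = append(kept, r)
	}
	return kept
}

func main() {
	db := []record{{"build", stateCreating}, {"web", stateCreated}}
	fmt.Println("remaining:", reapStale(db))
}
```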
If it helps:
Removing all of the above seems to have fixed the issue...
Looking at this further: We cannot easily move things around - adding to the Libpod DB needs to be the last thing done, because most things we do before generate information that will need to be added to the container configuration, which can only be written once. My current thinking is that it may be sufficient to intercept SIGINT and SIGTERM during container creation, and delay them until after the function is run. This does not help with the SIGKILL case, but if things have gotten bad enough to merit a SIGKILL we're probably not going to be able to clean up properly regardless. We hold off on exiting until container creation is finished, and then step out afterwards.
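A minimal sketch of the deferral idea described above, using plain Go signal handling rather than Podman's real shutdown code: SIGINT/SIGTERM are captured into a channel during the critical section and only acted upon once creation has committed.

```go
package main

import (
	"fmt"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGINT, syscall.SIGTERM)

	// Critical section: nothing below may be cut short by ctrl+c.
	fmt.Println("creating container (interrupts are deferred)...")
	time.Sleep(3 * time.Second) // stands in for storage creation + DB commit
	fmt.Println("creation committed")

	// Only now act on any signal that arrived while we were busy.
	signal.Stop(sigs)
	select {
	case sig := <-sigs:
		fmt.Println("exiting on deferred signal:", sig)
		os.Exit(130)
	default:
		fmt.Println("no interrupt received; continuing")
	}
}
```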
I think that this sounds like a good solution - if the problem's still happening after that, it can be looked at in more detail, but this will hopefully prevent the majority (if not all) of the occurrences of this problem in the first place!
Expand the use of the Shutdown package such that we now use it to handle signals any time we run Libpod. From there, add code to container creation to use the Inhibit function to prevent a shutdown from occurring during the critical parts of container creation. We also need to turn off signal handling when --sig-proxy is invoked - we don't want to catch the signals ourselves then, but instead to forward them into the container via the existing sig-proxy handler.
Fixes containers#7941
Signed-off-by: Matthew Heon <[email protected]>
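The commit message refers to the Shutdown package's Inhibit function, but its exact API is not shown in this thread, so the sketch below only illustrates the general pattern with invented names: a read-write lock held around container creation that a single signal handler must acquire exclusively before it is allowed to exit.

```go
package main

import (
	"fmt"
	"os"
	"os/signal"
	"sync"
	"syscall"
)

var inhibitLock sync.RWMutex

// Inhibit blocks shutdown until the matching Uninhibit call.
func Inhibit()   { inhibitLock.RLock() }
func Uninhibit() { inhibitLock.RUnlock() }

// startHandler installs one process-wide signal handler; on SIGINT/SIGTERM it
// waits for all in-flight critical sections before exiting.
func startHandler() {
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGINT, syscall.SIGTERM)
	go func() {
		sig := <-sigs
		inhibitLock.Lock() // blocks until every Inhibit holder has released
		fmt.Println("shutting down on", sig)
		os.Exit(1)
	}()
}

func createContainer() {
	Inhibit()
	defer Uninhibit()
	// ... storage creation and the DB commit would happen here, shielded
	// from the shutdown handler above.
	fmt.Println("container created without being cut short by a signal")
}

func main() {
	startHandler()
	createContainer()
}
```

As the commit message notes, this kind of handler would also need to stand down when --sig-proxy is in effect, so that signals are forwarded into the container rather than consumed by the process itself.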
Got the same issue in …
Is this a BUG REPORT or FEATURE REQUEST? (leave only one on its own line)
/kind bug
Description
podman is fragile, and there are various points where interrupting an operation with ctrl+c will leave podman's state undefined and cause follow-on breakage. podman should make every effort to be in a position to roll forwards or roll back, on invocation, from any state change which was in progress but incomplete in a previous run (taking into account that it may be running in parallel with other instances of itself).
For example, I inadvertently interrupted a container starting, I assume during volume creation. I now get this:
... and no amount of system/volume pruning seems to be able to fix this issue :(
I can poke around on the filesystem to try to fix this (... or simply never use that container name ever again!!), but podman shouldn't let things get into this state. Could locking be added to determine whether a partially-completed resource is under the control of a still-running simultaneous task, or is stale and should be removed?
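One hedged illustration of what such staleness detection could look like (the PID-probing approach below is an assumption for illustration, not something Podman is documented to do): record the creator's PID alongside the partial resource, and on a later run check whether that process is still alive before deciding to clean up.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// ownerAlive reports whether the PID recorded for a partial resource still
// refers to a running process; signal 0 probes without delivering anything.
func ownerAlive(pid int) bool {
	proc, err := os.FindProcess(pid) // always succeeds on Unix
	if err != nil {
		return false
	}
	return proc.Signal(syscall.Signal(0)) == nil
}

func main() {
	// Pretend a half-created resource recorded its creator's PID; here we use
	// our own PID (alive) and an unlikely one (presumed dead) for contrast.
	for _, pid := range []int{os.Getpid(), 999999} {
		if ownerAlive(pid) {
			fmt.Printf("pid %d still running: leave the partial resource alone\n", pid)
		} else {
			fmt.Printf("pid %d gone: the resource is stale and can be cleaned up\n", pid)
		}
	}
}
```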
Ideally, podman should be able to be interrupted at any point during execution, and still be able to operate when re-run (or, at the very least, correctly clean up any partial state on a system prune).
Output of podman version:
Output of podman info --debug:
:Have you tested with the latest version of Podman and have you checked the Podman Troubleshooting Guide?
Yes