podman with custom static_dir slow to tear-down containers #19938
looks like it takes a while to delete the btrfs volume. Can you try with the overlay driver?
It's been like this since before I switched to the … (Is there any way to switch drivers without having to …? Or, is there any instrumentation I can add/debug options I can enable that will get the innards of …?) I believe that subvolume removal should be instant, and whenever I've removed subvolumes manually on the same system, there's never been any noticeable delay.
You could experiment with overlay on a different store and see how that performs. You cannot switch drivers without recreating the images; what you could try is to copy them.
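A plausible sketch of both suggestions (the alternate store path and image name are assumptions):

```console
# Try the overlay driver against a throw-away store:
$ podman --root /var/tmp/podman-overlay --storage-driver overlay run --rm alpine true

# Copy an existing image into the alternate store via save/load instead of
# re-pulling it:
$ podman save localhost/myimage:latest | \
    podman --root /var/tmp/podman-overlay --storage-driver overlay load
```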
The 20 seconds is the delay Podman uses to wait for the exit code to appear in the db. You can see that with … So the real problem is likely the podman container cleanup command not working correctly, or something wrong in conmon maybe.
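Presumably along these lines - debug logging makes the wait visible (a sketch, assuming the standard --log-level flag):

```console
# Timestamps in the debug log expose the ~20s gap between the container
# process exiting and the exit code appearing in the database:
$ podman --log-level=debug run --rm alpine true 2>&1 | grep -iE 'exit|cleanup'
```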
Setup:
Tear-down:
Unfortunately there's no per-debug-statement timing information, but …
(See podman/libpod/container_api.go, lines 600 to 621, at commit e8c4d79.)
Can you run …
Ah - you might be on to something here! Running from …:
… running from NVMe, but using the overlay driver:
So the issue does seem to be related to the … Apologies, my recollection is that it had always been slow even before migrating to … Also, thanks for the image-copy command - I wasn't aware that was an option!
… so the state seems to have moved to …
None of the core maintainers work on the BTRFS driver, so it is unlikely to be fixed unless community members step up.
Is it worth filing a bug against https://github.com/containers/conmon, or is that the same audience of maintainers/contributors as here - or indeed, does the above debugging information indicate that the issue is (triggered) within …?
It's hard to see where the error is without being able to reproduce it. I'd suggest you manually try running …
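Presumably the cleanup step that conmon normally triggers, run by hand - a sketch with an assumed container name (the command may refuse if the container is not in the expected state):

```console
$ podman run --name slowtest alpine true   # exits immediately
$ podman container cleanup slowtest        # does this stall or error?
$ podman rm slowtest
```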
I've reproduced this issue on a couple of systems which are using the …
amd64:
Configuration:
Concurrent containers:
arm64:
Configuration:
Concurrent containers:
… which is interesting, as on the original system the problem went away when temporarily changing to the … Also, both … Given this, could there be any link between the total image size within podman storage and this delay?
Original system:
New amd64 system:
arm64 system:
I am absolutely incapable of reproducing. Can you run …
Output of …:
There's no significant delay removing this …
… although in this case, the container persists with state …
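For reference, lingering containers and their states can be listed like this (illustrative):

```console
$ podman ps -a --format '{{.Names}} {{.Status}}'
```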
I'm not familiar with execsnoop, but I'll take a look now...
Thanks for sharing!
It's part of bcc-tools and a veeeeery helpful script.
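A typical invocation, for anyone following along (the install path and package name vary by distro, e.g. execsnoop-bpfcc on Debian-based systems):

```console
# Trace every exec() on the host while reproducing the slow tear-down; the
# podman/conmon children show what the cleanup path actually runs:
$ sudo /usr/share/bcc/tools/execsnoop -n podman
```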
I feel I'm making progress - the … With a default/unconfigured installation, as expected, … However, if I do nothing more than import my configuration and change the driver to …, … So the problem only appears with some non-default setting in containers.conf. I'll keep digging...
… it's something in …
Urgh - I've found the specific issue:
… which I suspect means that some component has hard-coded the default (…).
Rough reproducer: in a fresh container, install …
… then create …
… and re-run the above: the path used for … (This is with …)
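A minimal sketch of the trigger as described, with all paths assumed:

```console
# Configure a custom static_dir (any non-default path) in containers.conf:
$ mkdir -p /var/lib/containers/podman-static
$ cat >> /etc/containers/containers.conf <<'EOF'
[engine]
static_dir = "/var/lib/containers/podman-static"
EOF

# Tear-down now stalls for ~20s:
$ time podman run --rm alpine true
```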
Furthermore, the specified … It appears that the actual … Interestingly, a …
I also have:
… in …
… why do I have a boltdb file which is more up to date (and much larger) than the sqlite one - I thought sqlite was a complete replacement?
Update: Removing … (Although running with …)
I can reproduce as well with this setting. Great work tracking this down, @srcshelton! Why this happens needs some investigation.
I assume you changed the db_backend, so the previous one will continue to linger around.
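As an aside, recent podman releases report the active database backend, which helps disambiguate the two files (the format key is assumed to match current podman info output):

```console
$ podman info --format '{{.Host.DatabaseBackend}}'
sqlite
```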
I made the observation that the cleanup process does not see (or read) the containers.conf file correctly. I configured mine in the $HOME directory and noticed that …
I can raise a new bug report for this, but I suspect it might be the same root cause: even with …
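For testing, the config location can be forced explicitly; CONTAINERS_CONF is the standard variable, and whether the conmon-spawned cleanup process inherits it is exactly what is in question here:

```console
$ CONTAINERS_CONF=$HOME/.config/containers/containers.conf \
    podman run --rm alpine true
```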
Yes, there is something really strange going on. Cc: @mheon |
This is expected. The blob-info-cache is something the image library was using. |
The processing and setting of the static and volume directories was scattered across the code base (including c/common), leading to subtle errors that surfaced in containers#19938. There were multiple issues that I try to summarize below:

- c/common loaded the graphroot from c/storage to set the defaults for static and volume dir. That ignored Podman's --root flag and surfaced in containers#19938 and other bugs. c/common does not set the defaults anymore, which gives Podman the ability to detect when the user/admin configured a custom directory (not an empty value).
- When parsing the CLI, Podman (ab)uses containers.conf structures to set the defaults but also to override them in case the user specified a flag. The --root flag overrode the static dir, which is wrong and broke a couple of use cases. Now there is a dedicated field in the "PodmanConfig", which also includes a containers.conf struct.
- The defaults for static and volume dir are now being set correctly and adhere to --root.
- The CONTAINERS_CONF_OVERRIDE env variable has not been passed to the cleanup process. I believe that _all_ env variables should be passed to conmon to avoid such subtle bugs.

Overall I find that the code and logic is scattered and hard to understand and follow. I refrained from larger refactorings as I really just want to get containers#19938 fixed and then go back to other priorities.

containers/common#1659 broke three pkg/machine tests. Those have been commented out until getting fixed.

Fixes: containers#19938

Signed-off-by: Valentin Rothberg <[email protected]>
Issue Description
Whilst podman is able to start containers from even very large images fast enough that it feels interactively instant, shutting them down again takes what feels like a long time.
I wonder whether this is because I'm running the btrfs storage-driver, because I'm not running on a system managed by systemd (and so perhaps some call-back is failing?), or whether some default timeout is being adhered to even if the container processes have exited near the start of the timeout period?

Steps to reproduce the issue
Describe the results you received
There is a 20s pause between the container process exiting and the container itself having been fully torn-down.
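A simple way to quantify the pause (illustrative numbers):

```console
$ time podman run --rm alpine true
real    0m20.46s   # almost all of this is tear-down; sub-second is expected
```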
N.B. This is not a new problem - this system has been upgraded from podman 2.x with the VFS storage-driver, but has always been slow to tear-down containers.

Describe the results you expected
In other scenarios, if the process within the container exits then the container is torn-down effectively instantly.
podman run has a --stop-timeout which apparently defaults to 10 seconds, but the stop delay is actually twice that - and in this case the container process has itself exited, so I'd imagine that any timeout dependent on processes within the container (as opposed to related host processes such as crun/runc) should not be adhered to?

podman info output
Podman in a container
No
Privileged Or Rootless
Privileged
Upstream Latest Release
Yes
Additional environment details
Host is running openrc, not systemd.

Additional information
The size of the image (within reason) does not seem to have a significant effect on the tear-down time - for a 1.08GB image, the result is:
… so the delay appears to be a consistent 20s.