error when checkpointing container: Can't lookup mount #860
What kernel do you use? |
(00.000079) Running on wraeclast Linux 5.0.0-36-generic #39~18.04.1-Ubuntu SMP Tue Nov 12 11:09:50 UTC 2019 x86_64 |
Cc: @Snorch |
Do you have mount 795 on the host? If you still have the dumpable ct running, or you can reproduce it, do:
If the above gives anything, please also show the output of:
If (1), you might need to give criu some external mount. (2) is not supported. For (3), I'm not sure if we can handle these. Likely your case is (3): (00.006249) type overlay source overlay mnt_id 898 s_dev 0x55 / @ ./ flags 0x280000 options lowerdir=/var/lib/docker/overlay2/l/FQGSJGYMN3LHZQBI7BRPIJG32M:/var/lib/docker/overlay2/l/5PSVBCDONUTQRDXFW2KXS3OEVC,upperdir=/var/lib/docker/overlay2/d39179831d1d7b13c89b4b5c50168d857c107e805e6c4ccc9cb5fa80af405504/diff,workdir=/var/lib/docker/overlay2/d39179831d1d7b13c89b4b5c50168d857c107e805e6c4ccc9cb5fa80af405504/work,xino=off To verify it, (upd) do inside your docker ct (e.g. via nsenter -t $CTPID -m):
It should give you 795 or, if the number changed, some other id that is not seen in mountinfo. @avagin Also, the strange thing is that the fd number is <0; I'm not sure what that can mean. |
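The check described above (does a file descriptor's mount id appear in mountinfo?) can be sketched in Python. This is only an illustration, not the commands posted in this thread: it assumes the standard Linux `/proc/<pid>/fdinfo/<fd>` field `mnt_id` and the first column of `/proc/<pid>/mountinfo`, and the function names are hypothetical. Descriptors for pipes, sockets and similar objects sit on kernel-internal mounts, so the sketch only looks at descriptors whose link target is an absolute path.

```python
import os


def parse_mount_ids(mountinfo_text):
    """Return the set of mount IDs (first field of each mountinfo line)."""
    ids = set()
    for line in mountinfo_text.splitlines():
        fields = line.split()
        if fields and fields[0].isdigit():
            ids.add(int(fields[0]))
    return ids


def parse_fdinfo_mnt_id(fdinfo_text):
    """Return the mnt_id recorded in an fdinfo dump, or None if absent."""
    for line in fdinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "mnt_id":
            return int(value.strip())
    return None


def fds_with_unknown_mount(pid="self"):
    """List (fd, mnt_id, path) for path-backed fds whose mnt_id is not in mountinfo."""
    with open(f"/proc/{pid}/mountinfo") as f:
        known = parse_mount_ids(f.read())
    unknown = []
    for name in sorted(os.listdir(f"/proc/{pid}/fdinfo"), key=int):
        try:
            target = os.readlink(f"/proc/{pid}/fd/{name}")
            with open(f"/proc/{pid}/fdinfo/{name}") as f:
                mnt_id = parse_fdinfo_mnt_id(f.read())
        except OSError:
            continue  # the fd was closed while we were scanning
        if not target.startswith("/"):
            continue  # pipe:[...], socket:[...], anon_inode:... and similar
        if mnt_id is not None and mnt_id not in known:
            unknown.append((int(name), mnt_id, target))
    return unknown
```

Calling `fds_with_unknown_mount(pid)` for a process inside the container should return an empty list on an unaffected kernel. A non-empty list would point at the kind of mismatch discussed here.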
Returns nothing. Also, the results of the nsenter verification:
I'm also not sure of how relevant this is but:
|
I think the problem might be in the linux kernel. Can you try to downgrade the kernel? |
A similar, though not identical, problem connected to a bad (pseudo) stat st_dev on overlay: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1751243 In git://kernel.ubuntu.com/ubuntu/autotest-client-tests: +++ b/ubuntu_unionmount_overlayfs_suite/0001-Fix-check-for-file-on-overlayfs.patch
|
Looks like we now have the same problem with mnt_id as we had with st_dev. That is bad, because we can't find which mount to open the file on if the kernel hides this information from us. |
That was the problem. Downgrading the kernel solves the error! So, for reference, I was using kernel version 5.0.0-36-generic and downgrading to kernel version 5.0.0-23-generic solves the error. Thank you, I really appreciate the effort! Cheers. |
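To compare a running kernel against the two builds named above, a small helper can split an Ubuntu-style release string into its parts. This is a rough sketch: the function names are hypothetical, and the pattern assumes the `X.Y.Z-ABI-flavour` layout seen in this thread (for example `5.0.0-36-generic`). Other formats raise `ValueError`.

```python
import platform
import re


def parse_release(release):
    """Split '5.0.0-36-generic' into ((5, 0, 0), 36, 'generic')."""
    m = re.fullmatch(r"(\d+)\.(\d+)\.(\d+)-(\d+)(?:-(.+))?", release)
    if m is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    major, minor, patch, abi = (int(g) for g in m.groups()[:4])
    return (major, minor, patch), abi, m.group(5) or ""


def is_older_build(a, b):
    """True if `a` is an older build than `b` of the same base version and flavour."""
    base_a, abi_a, flavour_a = parse_release(a)
    base_b, abi_b, flavour_b = parse_release(b)
    return (base_a, flavour_a) == (base_b, flavour_b) and abi_a < abi_b


def running_release():
    """Kernel release of the current machine, as reported by uname."""
    return platform.release()
```

For instance, `is_older_build("5.0.0-23-generic", "5.0.0-36-generic")` is true, which matches the downgrade reported in this comment.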
I reported the ubuntu kernel problems here: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1857257 |
This bug is now also seen in multiple other projects which have to disable CRIU tests on Travis: containerd/containerd#3898 |
Any update with this? I have the same problem when trying to use CRIU Version: 3.6, Docker version 19.03.6 on the 5.3.0-40-generic Ubuntu kernel. I would prefer not having to downgrade my kernel. |
@anagainaru You have to complain here: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1857257 There is a patch to fix it, but it seems not to have been applied by Ubuntu. You could always switch to another Linux distribution. |
Here is the fix: https://lists.openvz.org/pipermail/criu/2020-April/044973.html |
The status of the launchpad bug is a bit confusing to me--the janitor reported back in May that it was fixed in 5.3/5.4, comments suggest the fix caused other issues and it was either rolled back or mistakenly not rolled back, and now the janitor reports it fixed again in 5.4 in the focal repo, but no mention of other releases with currently-active kernel support. Can anyone give any insight to the status of this? |
I am also confused by the state of that bug. Someone would need to test it. I am not using Ubuntu, so I cannot test it. For CRIU we rely on Travis and Travis is based on Ubuntu, so it would be nice to have a fixed kernel in Travis. But Travis also takes some time to update their images. Even if it is fixed, I cannot verify it via Travis. Additionally, Travis uses the GCE variant of the Ubuntu kernel and I am not sure how that kernel version maps to the kernel version in the launchpad bug report. No answer for your question, but I can confirm that the state of that bug is also very unclear to me. |
Alright. A big part of my CRIU usage is on google cloud, so I went ahead and created some clean ubuntu images with latest stable docker from the download.docker repo, and CRIU from the PPA. I tried to checkpoint a container created from giovannivenancio's MWE above just to see where things stand.
16.04/xenial installs with 18.04/bionic installs with
20.04/focal installs with
TLDR; still broken, but I'm not sure of the mapping between gcp and vanilla ubuntu versions. edit: here's a mapping, but not totally helpful to get back to baseline ubuntu https://people.canonical.com/~kernel/info/kernel-version-map.html
|
Thanks a lot for trying it out! Good to know that it is still not fixed. I will re-open the launchpad bug. As far as I understand it, this is related to Ubuntu's out-of-tree shiftfs implementation, which is not part of the upstream Linux kernel. That is the reason it only happens on Ubuntu. |
No problem at all, thanks for following up on launchpad. I'm keeping my work on xenial to avoid manually blocking kernel updates, but I'll keep these instances around to retry if I see any movement on a fix |
Just wanted to drop by again since launchpad is again confusing. There's another fix-released comment, but still "confirmed" status for focal and "won't fix" for eoan. The two most recent LTS releases are on 5.4 kernels, so I guess it's not surprising that it's still broken in both of them. Summary:
Wonder if it's worth tagging launchpad for bionic as well? |
My guess right now is that this will not be fixed any time soon. As far as I understand it, this is related to non-upstream kernel patches concerning 'shiftfs'. From what I heard, this will not be upstreamed in the version Ubuntu carries. Other implementations of similar features are also not upstream (yet). So if we are lucky it will be fixed with the next LTS release, but that is about 1.5 years of waiting, and as the feature is not yet in the kernel it is not clear if this will happen. If this breaks your workflow you have to either run an old kernel or switch distributions. |
Understood, just wanted to follow up since there was a new janitor post and you had said previously that you didn't have an easy way of testing it. I'll still re-test and check back in if it looks like their fix makes it back to focal. Hopefully that would cover bionic as well, since they're currently on the same LTS kernel for gcloud at least. I've kept my google cloud instances on xenial and my (linux mint) desktops on the still-supported 4.15 kernel. I might move away from ubuntu/derivatives at some point, but I'm trying to keep things static until my current work wraps up. |
I removed the #887 is another example where we were impacted by this kernel bug. |
A friendly reminder that this issue had no activity for 30 days. |
update for checkpoint creation on google cloud images: I ran
and confirmed a zero exit code
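A check of this kind, running a checkpoint and looking at the exit code, can be scripted. The sketch below is an assumption, not the command used in this comment: it wraps the `docker checkpoint create` invocation from the original report, the function name is hypothetical, and the `run` parameter exists only so the call can be replaced in tests.

```python
import subprocess


def checkpoint_succeeded(container, checkpoint, run=subprocess.run):
    """Run `docker checkpoint create` and return (exited_with_zero, stderr)."""
    result = run(
        ["docker", "checkpoint", "create", container, checkpoint],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0, result.stderr
```

With the container from the original report, `checkpoint_succeeded("cr", "checkpoint1")` would return `True` in its first element when the dump works. On failure, the second element carries the daemon's error text.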
|
Thanks everyone for working on fixes and checking the state of the kernel. Closing this now as it seems to be fixed in Ubuntu. |
Edit: Upgrading to kernel: 5.4.0-1068-azure worked. Hey was facing a similar issue. I am guessing I need to update the kernel based on the thread, is that correct? OS: Ubuntu 18.04.4 cmds:
tail of the log:
|
Had the same issue and got it to work on Ubuntu 21.04 LTS after performing a kernel upgrade through |
Unfortunately we have also seen it being reintroduced by Ubuntu. Not much we can do. |
@elchead you need to find what kernel change broke the workflow and report it to ubuntu. |
@elchead What is the error message you see with Ubuntu 22? I noticed that Ubuntu 22.04 has upgraded to glibc 2.35: #1696 $ podman run ubuntu:22.04 ldd --version
ldd (Ubuntu GLIBC 2.35-0ubuntu1) 2.35
Copyright (C) 2022 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Written by Roland McGrath and Ulrich Drepper. |
not sure yet which change broke it but I can further confirm that kernel 5.4.0.1010.11 did not work, but 5.4.0-1068-azure does as stated above. |
Also seeing this:
|
You can try this: Follow the steps from here to change the boot kernel: |
5.13.0-1017-azure has this issue. I filed a new ubuntu issue: |
I'm not using an azure build but standard that comes with ubuntu, so this issue is not limited to that build. |
Thanks @baelter, the problem should be fixed with the standard kernel that comes with Ubuntu: |
I have the same issue on an Ubuntu AWS VM, the kernel version is: I can't find what version this is exactly, and whether it's expected to work or not? |
As far as I understand, your kernel was built on Wed, 17 Aug 2022, so it doesn't contain my last fix for this issue. Please try to upgrade your kernel to the most recent version. It should work. If not, then ping me ;-) |
Updated to |
I have a freshly installed Ubuntu 18.04 and I get the following error when checkpointing a container:
docker checkpoint create cr checkpoint1 Error response from daemon: Cannot checkpoint container cr: runc did not terminate sucessfully: criu failed: type NOTIFY errno 0 path= /run/containerd/io.containerd.runtime.v1.linux/moby/f2e3e31d2e1b75c6ad339f8931d38e296344f87917d7ef54b6de16a1268d2faf/criu-dump.log: unknown
When inspecting the log, there is this error: (00.008480) Error (criu/files-reg.c:1338): Can't lookup mount=795 for fd=-3 path=/bin/sh
criu-dump.log
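To pull this kind of error out of a longer dump log, something like the following Python sketch could be used. The regular expression is modelled only on the single log line quoted above, so other CRIU versions may word the message differently. The function name is hypothetical.

```python
import re

# Pattern modelled on the one error line quoted in this report.
LOOKUP_ERROR = re.compile(
    r"Error \((?P<source>[^)]+)\): Can't lookup mount=(?P<mnt_id>\d+) "
    r"for fd=(?P<fd>-?\d+) path=(?P<path>\S+)"
)


def find_mount_lookup_errors(log_text):
    """Return one dict per "Can't lookup mount" error found in the log text."""
    errors = []
    for line in log_text.splitlines():
        m = LOOKUP_ERROR.search(line)
        if m:
            errors.append(
                {
                    "source": m.group("source"),
                    "mnt_id": int(m.group("mnt_id")),
                    "fd": int(m.group("fd")),
                    "path": m.group("path"),
                }
            )
    return errors
```

The `mnt_id` values it returns are the ones to look for in `/proc/<pid>/mountinfo` of the container process.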
The container was created as follows:
docker run --security-opt=seccomp:unconfined --name cr -d busybox /bin/sh -c 'i=0; while true; do echo $i; i=$(expr $i + 1); sleep 1; done'
I'm using Docker 19.03.5 and CRIU 3.12.
Any ideas? Thanks in advance!