
error when checkpointing container: Can't lookup mount #860

Closed

giovannivenancio opened this issue Nov 19, 2019 · 43 comments

Labels: bug, kernel, no-auto-close (Don't auto-close as a stale issue), stale-issue

@giovannivenancio (Author) commented Nov 19, 2019

I have a freshly installed Ubuntu 18.04 and I get the following error when checkpointing a container:

docker checkpoint create cr checkpoint1
Error response from daemon: Cannot checkpoint container cr: runc did not terminate sucessfully: criu failed: type NOTIFY errno 0 path= /run/containerd/io.containerd.runtime.v1.linux/moby/f2e3e31d2e1b75c6ad339f8931d38e296344f87917d7ef54b6de16a1268d2faf/criu-dump.log: unknown

When inspecting the log, there is this error:

(00.008480) Error (criu/files-reg.c:1338): Can't lookup mount=795 for fd=-3 path=/bin/sh

criu-dump.log

The container was created as follows:

docker run --security-opt=seccomp:unconfined --name cr -d busybox /bin/sh -c 'i=0; while true; do echo $i; i=$(expr $i + 1); sleep 1; done'

I'm using Docker 19.03.5 and CRIU 3.12.

Any ideas? Thanks in advance!

@avagin (Member) commented Nov 19, 2019

What kernel do you use?

@avagin (Member) commented Nov 19, 2019

(00.000079) Running on wraeclast Linux 5.0.0-36-generic #39~18.04.1-Ubuntu SMP Tue Nov 12 11:09:50 UTC 2019 x86_64

@avagin (Member) commented Nov 19, 2019

Cc: @Snorch

@Snorch (Member) commented Nov 20, 2019

Do you have mount 795 on the host? If you still have the dumpable container running, or if you can reproduce it, run:

grep "^795\>" /proc/*/mountinfo

If the above gives anything, please also show the output of:

lsns | grep mnt

  1. You can have an open fd in the dumpable process from a mount outside of the dumpable mntns'es.
  2. You can have an open mount in a detached mntns.
  3. You can have an open file on overlayfs, which sometimes shows the mnt_id of a pseudo mount that never existed.

If (1), you might need some external mount given to criu. (2) is not supported. For (3), I'm not sure whether we can handle it.

Likely your case is (3):

(00.006249) type overlay source overlay mnt_id 898 s_dev 0x55 / @ ./ flags 0x280000 options lowerdir=/var/lib/docker/overlay2/l/FQGSJGYMN3LHZQBI7BRPIJG32M:/var/lib/docker/overlay2/l/5PSVBCDONUTQRDXFW2KXS3OEVC,upperdir=/var/lib/docker/overlay2/d39179831d1d7b13c89b4b5c50168d857c107e805e6c4ccc9cb5fa80af405504/diff,workdir=/var/lib/docker/overlay2/d39179831d1d7b13c89b4b5c50168d857c107e805e6c4ccc9cb5fa80af405504/work,xino=off

To verify it, run the following inside your Docker container (e.g. via nsenter -t $CTPID -m):

exec 100< /bin/sh
cat /proc/$$/fdinfo/100

It should give you 795, or, if the number has changed, some other id which is not seen in mountinfo.

@avagin Also, the strange thing is that the fd number is <0; I'm not sure what that can mean.
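The manual check above can be scripted. Here is a minimal Python sketch (an illustration of the same idea, not part of CRIU) that reads the mnt_id of an open fd from /proc/self/fdinfo and reports whether that id is visible in /proc/self/mountinfo. Run it inside the container's mount namespace; on an affected kernel, the mnt_id of a file on overlayfs will not be found in mountinfo:

```python
import os

def mnt_id_of_fd(fd):
    """Return the mnt_id field from /proc/self/fdinfo/<fd>, or None if absent."""
    with open(f"/proc/self/fdinfo/{fd}") as f:
        for line in f:
            if line.startswith("mnt_id:"):
                return int(line.split()[1])
    return None

def visible_mount_ids():
    """Collect the mount ids listed in /proc/self/mountinfo (first column)."""
    with open("/proc/self/mountinfo") as f:
        return {int(line.split()[0]) for line in f}

fd = os.open("/bin/sh", os.O_RDONLY)
mid = mnt_id_of_fd(fd)
os.close(fd)
print(f"mnt_id={mid} visible_in_mountinfo={mid in visible_mount_ids()}")
```

On a healthy kernel the second value is True; when this bug is present, the overlay reports a pseudo mount id that never existed, so the lookup fails just as CRIU's does.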

@giovannivenancio (Author)

grep "^795\>" /proc/*/mountinfo

Returns nothing. Also, here are the results of the nsenter verification:

root@d5edadab7ac3:/# exec 100< /bin/sh
root@d5edadab7ac3:/# cat /proc/$$/fdinfo/100
pos:	0
flags:	0100000
mnt_id:	795

I'm also not sure how relevant this is, but:

  1. Last week, checkpointing (and restoring too) was working just fine. Without any major changes to the OS (apart from apt updates), it stopped working. For this reason I reinstalled Ubuntu, but the error persists.

  2. I also tried to install CRIU on a VM (using the same setup: Ubuntu 18.04, CRIU 3.12) and the checkpointing works...

@avagin (Member) commented Nov 20, 2019

I think the problem might be in the linux kernel. Can you try to downgrade the kernel?

@Snorch (Member) commented Nov 21, 2019

Here is a not-exactly-the-same problem, connected to a bad (pseudo) stat st_dev on overlayfs:

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1751243

In git://kernel.ubuntu.com/ubuntu/autotest-client-tests:

+++ b/ubuntu_unionmount_overlayfs_suite/0001-Fix-check-for-file-on-overlayfs.patch
@@ -0,0 +1,50 @@
+From e19b161d30d648cf0ac5bd68df84b82322de7451 Mon Sep 17 00:00:00 2001
+From: Kleber Sacilotto de Souza [email protected]
+Date: Thu, 31 May 2018 13:52:30 +0200
+Subject: [PATCH][unionmount-testsuite] Fix check for file on overlayfs
+
+BugLink: https://bugs.launchpad.net/bugs/1751243
+
+After kernel 4.15, the unmodified files do not have the same st_dev as
+the lower filesystem anymore, but they are assigned an anonymous bdev
+instead. So checking if a file which is expected to be unmodified has
+the same st_dev as the lower filesystem doesn't work anymore, we need to
+check if they do not have the same st_dev as the overlay filesystem.
+
+Signed-off-by: Kleber Sacilotto de Souza [email protected]
+Acked-by: Colin Ian King [email protected]
+---

  • context.py | 13 +++++--------
  • 1 file changed, 5 insertions(+), 8 deletions(-)

@Snorch (Member) commented Nov 21, 2019

Looks like we now have the same problem with mnt_id as we had with st_dev. That is bad, because we can't find which mount to open the file on if the kernel hides this information from us.

@giovannivenancio (Author)

I think the problem might be in the linux kernel. Can you try to downgrade the kernel?

That was the problem. Downgrading the kernel solves the error!

So, for reference, I was using kernel version 5.0.0-36-generic and downgrading to kernel version 5.0.0-23-generic solves the error.

Thank you, I really appreciate the effort! Cheers.

@adrianreber (Member)

I reported the ubuntu kernel problems here: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1857257

@adrianreber (Member)

This bug is now also seen in multiple other projects which have to disable CRIU tests on Travis:

containerd/containerd#3898
opencontainers/runc#2196
opencontainers/runc#2198

@anagainaru

Any update with this?

I have the same problem when trying to use CRIU Version: 3.6, Docker version 19.03.6 on the 5.3.0-40-generic Ubuntu kernel. I would prefer not having to downgrade my kernel.

@adrianreber (Member)

@anagainaru You have to complain here:

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1857257

There is a patch to fix it, but it seems not to have been applied by Ubuntu. You could always switch to another Linux distribution.

@avagin (Member) commented Apr 23, 2020

Here is the fix: https://lists.openvz.org/pipermail/criu/2020-April/044973.html

@buck2202 commented Aug 3, 2020

The status of the launchpad bug is a bit confusing to me: the janitor reported back in May that it was fixed in 5.3/5.4; comments suggest the fix caused other issues and was either rolled back or mistakenly not rolled back; and now the janitor reports it fixed again in 5.4 in the focal repo, with no mention of other releases that still have active kernel support. Can anyone give any insight into the status of this?

@adrianreber (Member)

The status of the launchpad bug is a bit confusing to me: the janitor reported back in May that it was fixed in 5.3/5.4; comments suggest the fix caused other issues and was either rolled back or mistakenly not rolled back; and now the janitor reports it fixed again in 5.4 in the focal repo, with no mention of other releases that still have active kernel support. Can anyone give any insight into the status of this?

I am also confused by the state of that bug. Someone would need to test it. I am not using Ubuntu, so I cannot test it. For CRIU we rely on Travis, and Travis is based on Ubuntu, so it would be nice to have a fixed kernel in Travis. But Travis also takes some time to update their images, so even if it is fixed I cannot verify it via Travis. Additionally, Travis uses the GCE variant of the Ubuntu kernel, and I am not sure how that kernel version maps to the kernel version in the launchpad bug report.

No answer for your question, but I can confirm that the state of that bug is also very unclear to me.

@buck2202 commented Aug 4, 2020

Alright. A big part of my CRIU usage is on Google Cloud, so I went ahead and created some clean Ubuntu images with the latest stable Docker from the download.docker repo and CRIU from the PPA. I tried to checkpoint a container created from giovannivenancio's MWE above, just to see where things stand.

docker run --security-opt=seccomp:unconfined --name cr -d busybox /bin/sh -c 'i=0; while true; do echo $i; i=$(expr $i + 1); sleep 1; done'
docker checkpoint create cr chk

16.04/xenial installs with 4.15.0-1080-gcp #90~16.04.1-Ubuntu. Checkpointing still works fine here.

18.04/bionic installs with 5.3.0-1032-gcp #34~18.04.1-Ubuntu. Fails with

Error response from daemon: Cannot checkpoint container cr: runc did not terminate sucessfully: criu failed: type NOTIFY errno 0 path= /run/containerd/io.containerd.runtime.v1.linux/moby/a3fd40353bfa4b041fb4eb9f38e6140e243700106208e9de77ec1f97bc206986/criu-dump.log: unknown

(00.170336) Error (criu/files-reg.c:1372): Can't lookup mount=339 for fd=-3 path=/bin/sh
(00.170347) Error (criu/cr-dump.c:1247): Collect mappings (pid: 1927) failed with -1

20.04/focal installs with 5.4.0-1021-gcp #21-Ubuntu. Fails with

Error response from daemon: Cannot checkpoint container cr: runc did not terminate sucessfully: criu failed: type NOTIFY errno 0 path= /run/containerd/io.containerd.runtime.v1.linux/moby/802ae7117527f88f524fe32db4959b191df694de4ed33cf00a6ceeef5055993a/criu-dump.log: unknown

(00.255897) Error (criu/files-reg.c:1371): Can't lookup mount=436 for fd=-3 path=/bin/sh
(00.255910) Error (criu/cr-dump.c:1247): Collect mappings (pid: 1149) failed with -1

TL;DR: still broken, but I'm not sure of the mapping between GCP and vanilla Ubuntu kernel versions.

Edit: here's a mapping, though it's not totally helpful for getting back to baseline Ubuntu: https://people.canonical.com/~kernel/info/kernel-version-map.html

Ubuntu Kernel Version            Ubuntu Kernel Tag                      Mainline Kernel Version
(bionic) 5.3.0-1032.34~18.04.1   Ubuntu-gcp-5.3-5.3.0-1032.34_18.04.1   5.3.18
(focal)  5.4.0-1021.21           Ubuntu-gcp-5.4.0-1021.21               5.4.44

@adrianreber (Member)

Thanks a lot for trying it out! Good to know that it is still not fixed. I will re-open the launchpad bug.

As far as I understood it, this is related to Ubuntu's out-of-tree shiftfs implementation, which is not part of the upstream Linux kernel. That is the reason it only happens on Ubuntu.

@buck2202 commented Aug 4, 2020

No problem at all; thanks for following up on launchpad. I'm keeping my work on xenial to avoid manually blocking kernel updates, but I'll keep these instances around to retry if I see any movement on a fix.

@buck2202 commented Sep 9, 2020

Just wanted to drop by again, since launchpad is again confusing. There's another fix-released comment, but the status is still "confirmed" for focal and "won't fix" for eoan. The two most recent LTS releases are on 5.4 kernels, so I guess it's not surprising that it's still broken in both of them.

Summary:

Release            Kernel                              Working?
18.04 server LTS   5.4.0-1024-gcp #24~18.04.1-Ubuntu   no
20.04 server LTS   5.4.0-1024-gcp #24-Ubuntu           no

Wonder if it's worth tagging launchpad for bionic as well?

@adrianreber (Member)

My guess right now is that this will not be fixed any time soon. As far as I understand it, this is related to non-upstream kernel patches concerning 'shiftfs'. From what I heard, this will not be upstreamed in the version Ubuntu carries. Other implementations of similar features are also not upstream (yet). So if we are lucky it will be fixed with the next LTS release, but that is about 1.5 years of waiting, and as the feature is not yet in the kernel it is not clear whether this will happen.

If this breaks your workflow you have to either run an old kernel or switch distributions.
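Since several kernel builds in this thread have been reported as working or broken, a throwaway Python sketch can compare a machine's uname -r output against those reports. Note the build lists here are an assumption compiled from the comments above (as of this point in the thread), not an authoritative compatibility matrix:

```python
import re

# Builds reported in this thread so far; an assumption compiled from the
# comments above, NOT an authoritative compatibility matrix.
REPORTED_BAD = {"5.0.0-36", "5.3.0-1032", "5.4.0-1021", "5.4.0-1024"}
REPORTED_GOOD = {"5.0.0-23", "4.15.0-1080"}

def build_of(release):
    """Extract the 'version-abi' prefix from a uname -r string,
    e.g. '5.4.0-1021-gcp' -> '5.4.0-1021'. Returns None if unparsable."""
    m = re.match(r"(\d+\.\d+\.\d+-\d+)", release)
    return m.group(1) if m else None

def status(release):
    """Map a uname -r string to a status based on the thread's reports."""
    build = build_of(release)
    if build in REPORTED_BAD:
        return "reported broken"
    if build in REPORTED_GOOD:
        return "reported working"
    return "unknown"

print(status("5.4.0-1021-gcp"))     # matches a build reported broken above
print(status("4.15.0-1080-gcp"))    # matches a build reported working above
```

In practice you would feed it `platform.release()` or the output of `uname -r`; anything not in the lists simply comes back "unknown", since only the exact builds tested in this thread are covered.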

@buck2202 commented Sep 9, 2020

Understood, just wanted to follow up since there was a new janitor post and you had said previously that you didn't have an easy way of testing it. I'll still re-test and check back in if it looks like their fix makes it back to focal. Hopefully that would cover bionic as well, since they're currently on the same LTS kernel for gcloud at least.

I've kept my google cloud instances on xenial and my (linux mint) desktops on the still-supported 4.15 kernel. I might move away from ubuntu/derivatives at some point, but I'm trying to keep things static until my current work wraps up.

@adrianreber (Member)

I removed the stale-issue label and added the no-auto-close label as long as this is not fixed in Ubuntu.

#887 is another example where we were impacted by this kernel bug.

@github-actions

A friendly reminder that this issue had no activity for 30 days.

@github-actions bot commented Apr 2, 2021

A friendly reminder that this issue had no activity for 30 days.

@mihalicyn (Member)

@buck2202 commented Aug 3, 2021

Update for checkpoint creation on Google Cloud images:

I ran

docker run --security-opt=seccomp:unconfined --name cr -d busybox /bin/sh -c 'i=0; while true; do echo $i; i=$(expr $i + 1); sleep 1; done'
docker checkpoint create cr chk

and confirmed a zero exit code

Base image               Kernel                                  Working?
18.04 server LTS (hwe)   5.4.0-1049-gcp #53~18.04.1-Ubuntu SMP   yes
20.04 server LTS         5.8.0-1038-gcp #40~20.04.1-Ubuntu SMP   yes

@adrianreber (Member)

Thanks everyone for working on fixes and checking the state of the kernel. Closing this now as it seems to be fixed in Ubuntu.

@108anup commented Feb 8, 2022

Edit: Upgrading to kernel: 5.4.0-1068-azure worked.


Hey, I was facing a similar issue. Based on the thread, I am guessing I need to update the kernel; is that correct?

OS: Ubuntu 18.04.4
kernel: 5.3.0-1034-azure (this is on an azure machine)
CRIU version: 3.6 (installed from ppa)
docker version: 20.10.7

cmds:

docker run --security-opt=seccomp:unconfined --name cr -d busybox /bin/sh -c 'i=0; while true; do echo $i; i=$(expr $i + 1); sleep 1; done'
docker checkpoint create cr checkpoint1

tail of the log:

(00.033224) ========================================
(00.033228) Dumping task (pid: 15327)
(00.033232) ========================================
(00.033235) Obtaining task stat ...
(00.033274)
(00.033277) Collecting mappings (pid: 15327)
(00.033280) ----------------------------------------
(00.033454) Error (criu/files-reg.c:1281): Can't lookup mount=387 for fd=-3 path=/bin/sh
(00.033470) Error (criu/cr-dump.c:1249): Collect mappings (pid: 15327) failed with -1
(00.033505) Unlock network
(00.033509) Running network-unlock scripts
(00.033512)     RPC
iptables-restore: invalid option -- 'w'
ip6tables-restore: invalid option -- 'w'
(00.061090) Unfreezing tasks into 1
(00.061159) Error (criu/cr-dump.c:1709): Dumping FAILED.

@elchead commented Mar 7, 2022

Had the same issue and got it to work on Ubuntu 21.04 LTS after performing a kernel upgrade through sudo apt-get dist-upgrade to version 5.11.0-1027-azure x86_64 (Ubuntu 22 did not work with kernel 5.13)

@adrianreber (Member)

Had the same issue and got it to work on Ubuntu 21.04 LTS after performing a kernel upgrade through sudo apt-get dist-upgrade to version 5.11.0-1027-azure x86_64 (Ubuntu 22 did not work with kernel 5.13)

Unfortunately we have also seen it being reintroduced by Ubuntu. Not much we can do.

@avagin avagin reopened this Mar 7, 2022
@avagin (Member) commented Mar 7, 2022

@elchead You need to find which kernel change broke the workflow and report it to Ubuntu.
Cc: @Snorch @mihalicyn

@rst0git (Member) commented Mar 7, 2022

Ubuntu 22 did not work

@elchead What is the error message you see with Ubuntu 22?

I noticed that Ubuntu 22.04 has upgraded to glibc 2.35: #1696

$ podman run ubuntu:22.04 ldd --version
ldd (Ubuntu GLIBC 2.35-0ubuntu1) 2.35
Copyright (C) 2022 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Written by Roland McGrath and Ulrich Drepper.

@elchead commented Mar 18, 2022

I'm not sure yet which change broke it, but I can further confirm that:

kernel 5.4.0.1010.11 did not work, but 5.4.0-1068-azure does, as stated above.
5.13.0.1017.19 also does not work (the latest kernel available through Azure on 20.04 LTS).

@baelter commented Mar 31, 2022

Also seeing this:

~# uname -r
5.13.0-39-generic
~# criu --version
Version: 3.16.1
~# crun --version
crun version 1.4.4
commit: 6521fcc5806f20f6187eb933f9f45130c86da230
spec: 1.0.0
+SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +CRIU +YAJL

@elchead commented Apr 1, 2022

@baelter

You can try this:
apt install -y linux-image-unsigned-5.4.0-1068-azure

Follow the steps from here to change the boot kernel:
https://meetrix.io/blog/aws/changing-default-ubuntu-kernel.html

@avagin (Member) commented Apr 5, 2022

5.13.0-1017-azure has this issue. I filed a new ubuntu issue:
https://bugs.launchpad.net/ubuntu/+source/linux-azure/+bug/1967924

@rst0git rst0git closed this as completed Aug 14, 2022
@baelter commented Aug 15, 2022

I'm not using an Azure build but the standard kernel that comes with Ubuntu, so this issue is not limited to that build.

@rst0git (Member) commented Aug 15, 2022

Thanks @baelter, the problem should be fixed with the standard kernel that comes with Ubuntu:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1967924

@RonaldGalea

I have the same issue on an Ubuntu AWS VM; the kernel version is:
Linux version 5.15.0-1019-aws

I can't find what version this is exactly, or whether it's expected to work or not.

@mihalicyn (Member)

As far as I understand, your kernel was built on Wed, 17 Aug 2022, so it doesn't contain my last fix for this issue. Please try to upgrade your kernel to the most recent version. It should work; if not, ping me ;-)

@RonaldGalea

Updated to 5.15.0-1023-aws and it works as expected, thank you very much :)
