-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support Id mapped mounts for shared volumes #3429
Support Id mapped mounts for shared volumes #3429
Conversation
go.mod
Outdated
@@ -24,3 +25,5 @@ require ( | |||
golang.org/x/sys v0.0.0-20211116061358-0a5406a5449c | |||
google.golang.org/protobuf v1.27.1 | |||
) | |||
|
|||
replace github.com/opencontainers/runtime-spec/specs-go => ./staging/src/github.com/opencontainers/runtime-spec/specs-go |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please open PR for runtime-spec
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
libcontainer/init_linux.go
Outdated
@@ -97,6 +97,7 @@ func newContainerInit(t initType, pipe *os.File, consoleSocket *os.File, fifoFd, | |||
logFd: logFd, | |||
}, nil | |||
case initStandard: | |||
logrus.Infof("mountFds: %v", mountFds) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
logrus.Infof("mountFds: %v", mountFds) | |
logrus.Debugf("mountFds: %v", mountFds) |
go.mod
Outdated
github.com/opencontainers/selinux v1.10.0 | ||
github.com/pkg/errors v0.9.1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Deprecated
libcontainer/mount_linux.go
Outdated
OPEN_TREE_CLONE = 1 | ||
OPEN_TREE_CLOEXEC = 02000000 | ||
MOVE_MOUNT_F_EMPTY_PATH = 0x00000004 | ||
AT_FDCWD = -100 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should these be defined in x/sys/unix? Please open a CL against it -- they are usually more than happy to accept such stuff.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, I wrote about it, in cover message, but definitely x/sys/unix is better candidate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kolyshkin I've seen your https://go-review.googlesource.com/c/sys/+/363444 I assume this change was made w/o code generation e.g. by sys/unix/mkall.sh ?
libcontainer/mount_linux.go
Outdated
|
||
//move_mount(int from_dfd, const char *from_pathname, int to_dfd, | ||
// const char *to_pathname, unsigned int flags) | ||
func moveMount(from_dfd int, fromPathName string, to_dfd int, toPathName string, flags uint) (err error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here -- seems that this belongs to x/sys/unix.
libcontainer/mount_linux.go
Outdated
} | ||
|
||
// int dfd, const char *filename, unsigned int flags) | ||
func openTree(dfd int, fileName string, flags uint) (r int, err error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here -- seems that this belongs to x/sys/unix.
libcontainer/rootfs_linux.go
Outdated
@@ -68,6 +74,7 @@ func prepareRootfs(pipe io.ReadWriter, iConfig *initConfig, mountFds []int) (err | |||
cgroupns: config.Namespaces.Contains(configs.NEWCGROUP), | |||
} | |||
setupDev := needsSetupDev(config) | |||
//time.Sleep(30 * time.Second) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Debug session leftover?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, initial version, after moving helpers to x/sys/unix and runtime-spec, it would be clear.
libcontainer/rootfs_linux.go
Outdated
@@ -380,6 +387,9 @@ func mountToRootfs(m *configs.Mount, c *mountConfig) error { | |||
return err | |||
} | |||
|
|||
if mountFd != nil { | |||
logrus.Infof("mountFd: %d", *mountFd) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
either logrus.Debugf
or remove it.
libcontainer/rootfs_linux.go
Outdated
logrus.Infof("m.Data: %v\n", m.Data) | ||
logrus.Infof("flags: %v\n", flags) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need for \n
here. Also seems like a debug leftover.
libcontainer/rootfs_linux.go
Outdated
@@ -1055,7 +1079,9 @@ func remount(m *configs.Mount, rootfs string, mountFd *int) error { | |||
} | |||
// ... and retry the mount with ro flag set. | |||
flags |= unix.MS_RDONLY | |||
return mount(source, m.Destination, procfd, m.Device, flags, "") | |||
|
|||
logrus.Infof("m.Data: %v\n", m.Data) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need for \n
here. Also seems like a debug leftover.
OK, this should be a draft / DNM until changes to runtime-spec and x/sys/unix are landed. |
This obviously won't work because of e.g. RHEL kernels that backport lots of stuff.
Right, a simple test checking if a syscall is available (perhaps calling it with wrong arguments and expecting As an example, see how we check if runc/libcontainer/cgroups/file.go Lines 79 to 98 in c4c4889
|
c1a18c3
to
26b6744
Compare
This PR depends on golang's x/sys library modifications: |
26b6744
to
27d6967
Compare
|
||
func mountByMove(m *configs.Mount, dest string) error { | ||
// Check once for all mount point that syscall exists | ||
syscallOnce.Do(func() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could this be exported as an interface so other parts may query the syscall availability status? We face the same issue too and seems it could be reused.
@AlexeyPerevalov the runtime spec PR was already merged (opencontainers/runtime-spec#1143). It would be great if you can finish this PR now :) |
27d6967
to
b1187e9
Compare
runtime-spec changes are not yet tagged, latest tag is v1.0.2 |
@AlexeyPerevalov No need to wait for a runtime-spec tag, IMHO. We are not using a tag already: Line 15 in a776ec9
For the x/sys pkg, I think it makes sense to define ourselves the constants if that is not going to be released soon. Do you happen to know when they plan to release that with the constants we need? If there is no visibility into that, I don't know what others think, but I would guess it seems reasonable to use some tmp file with the constants defined here. We can switch whenever that becomes available. |
We are already using a sha for it (due to the lack of runtime-spec maintainers I guess).
This one is never tagged, i.e. all users are just pin it to a particular sha. |
1ba3c92
to
28c40da
Compare
28c40da
to
7fdbd19
Compare
in centos-stream-9 two tests are failed: Tests related to checkpoint restore failed due to segfault in criu, I reproduced it on Centos Stream 8 (I don't have Stream 9), but It is not reproducible on Ubuntu 20.04, with working criu. |
if mountByMove(m, dest) != nil { | ||
if err := mountPropagate(m, rootfs, mountLabel, mountFd); err != nil { | ||
return err |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This means, if we fail to mount with idmap mounts, it just does a bind-mount?
I'm not sure what is the right behavior, because just mounting and ignoring the mapping for that mount will probably create issues to access the files (if the container running inside a userns).
I'm not sure what the right behavior should be. What do you think? And other maintainers?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm leaning towards in case of errors just failing the mount, rather than doing a bind-mount and ignoring the mapping.
It will definitely cause access issues, or even worse: files might get written with completely different UIDs/GIDs and it will be painful to fix once that happens.
@AlexeyPerevalov What do you think?
@AkihiroSuda @kolyshkin What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking a little bit how higher layers will use this, yes, I think we need to error out. At the k8s layer we want to start the container with an id mapped mount or not start it at all. If it is started without the id mapped mounts, the files created will have just random garbage as UID/GID, won't be able to read files... it will be a complete mess.
I think if we just return an error if an id mapped mount was requested and we failed to create it is the right path here.
Also, even if in the future we want to open this door (doubt about it), we can always open a door. But we won't be able to close it if we open it now.
@AlexeyPerevalov Can we make this error out if the id map mount fails?
@AlexeyPerevalov but it seems all test have passed... Was that maybe before the latest fix merged yesterday to fix the CI? (although IIRC it was only about one of those test, not both) |
@AlexeyPerevalov friendly ping? It would be great to have this in for runc 1.2. (otherwise, we will have to wait for another runc release, that takes a while, some more time for stable versions of high level container runtimes to include it, etc. :)) |
Signed-off-by: Alexey Perevalov <[email protected]>
to configs.Mount from spec This patch is using mount_setattr for idmapped mounts mount_setattr syscall requires anonymous mount point, IOW mount point should not be published, but it published by mount syscall. So either fsmount or open_tree should be used here. So open_tree syscall is used instead of mount, for bind mount. Signed-off-by: Alexey Perevalov <[email protected]>
This is not based on kernel version, just to check syscall existence. Signed-off-by: Alexey Perevalov <[email protected]>
7fdbd19
to
7b34ecb
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@AlexeyPerevalov Hey, I was testing this and can't make it work.
It seems the code to create an idmap mount is fine (except probably the move mount should be last), but as this code is run already inside the userns, we always get EPERM for the mountSetAttr() syscall, because that can only be run from the initial userns to create a IDMAP mount.
Have you tested this with runc and it worked for you? It would be great if I'm missing something and there is a simple way to make this work :)
if err := mountPropagate(m, rootfs, mountLabel, mountFd); err != nil { | ||
return err | ||
|
||
if mountByMove(m, dest) != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm testing this, and it seems this can't work. At this point we are inside the userns (I've added some logrus.Debugf
to print the content of /proc/self/uid_map
to verify this) and therefore we create an idmap mount. They only work from the initial userns.
Did you made this runc code work for you at some point? Or was this a c&p from working code in a standalone binary (that you tried from the host, so the initial userns) that never was tested in the runc binary (that IIUC this part runs inside the userns).
Am I missing something? Or do we need a big rework to make this work?
I really wish to include this functionality in the next runc release. This is required for our userns work in Kubernetes. I'll try to make this work. If anyone has any thoughts on the approach to take, please let me know :) |
Thanks @rata , could you open a new PR to carry this PR? |
@AkihiroSuda Thanks! Yes, I'm working on it and I'll open a new PR in the next days. I need to start from scratch and do most things in nsenter/nsexec, so nothing really to carry from here. And quite more involved, that is why I haven't opened it yet :) |
Opened this draft PR with an approach that works: #3717 |
This pull request adds support for id map mount feature for shared volumes.
For rootfs this is already implemented in containerd/containerd#5890.
Both commit has a code in common, which should be placed on the same place in the future.
There are two possibilities to use mount_setattr syscall with fd returned by open_tree or with fd returned by fsmount syscall. This PR uses open_tree, since this approach was used in examples, as well as in crun.
As a fallback scenario, mount syscall still used for bind mount. Open_tree syscall as well as mount_setattr are quite new syscall, to use it latest kernel version is required. In this patch set, kernel version is checked, but as an alternative approach we could just check syscall existence, any suggestions are welcome.
Also this PR includes specs-go which could be moved to separate PR.