-
Notifications
You must be signed in to change notification settings - Fork 558
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Mounts: allow runtimes to create the destination #955
Conversation
Consider the following mount: ``` "mounts": [ { "destination": "/non-existent", "source": "/etc/resolv.conf", "options": ["bind","ro"] } ] ``` When /non-existent does not exist in the container rootfs, the spec does not say whether the runtime can create it or if it should let the mount fails. runc just calls 'mkdir' if it does not exist, after creating and entering the namespaces. It works most of the times but fails if: - the source of the bind mount is a file instead of a directory, or - the container does not have the rights for the mkdir (readonly rootfs or rootfs belonging to an unmapped uid). Doing the mkdir/touch before entering the namespaces is difficult to get correctly because: - symlinks would need to be resolved manually as if we were chrooted in the container (symlinks are resolved by the mount syscall: a symlink itself cannot be a source or a target of a bind mount), and - ordering nested mounts is complicated by possible symlinks So I'd prefer if the spec does not mandate to create the file or directory in all possible scenarios because it's overly complicated. But the runtime should be allowed to address the simple case like runc does. Hence the "MAY" wording. Signed-off-by: Alban Crequy <[email protected]>
I'd rather MUST this, since otherwise runtime-callers cannot rely on it. Previous discussion in #591. |
Don't validate uid mappings and gid mappings separately: containers with only user mappings or with only group mappings are not usable. Additionnally, don't rely on the runtime to create the directories to be mounted. runc mounts them in the easy cases but it does not work with user namespaces. Marking as WIP because this is in discussion in opencontainers/runtime-spec#955 Signed-off-by: Alban Crequy <[email protected]>
Don't validate uid mappings and gid mappings separately: containers with only user mappings or with only group mappings are not usable. Additionnally, don't rely on the runtime to create the directories to be mounted. runc mounts them in the easy cases but it does not work with user namespaces. Marking as WIP/RFC because this is in discussion in opencontainers/runtime-spec#955 Signed-off-by: Alban Crequy <[email protected]>
This is the work that made me want to clarify the spec: opencontainers/runtime-tools#597 |
What should the runtime do if the uid-mappings don't give enough privileges for the directories to be created after entering the namespaces? Should it give up with EPERM like runc is doing now? Or should it be able to work around this by using mkdir/touch before entering the user namespaces? rkt is doing the latter but it has a cost in code complexity: rkt evaluates symlinks manually in the rootfs (EvaluateSymlinksInsideApp and rkt#3490). If the spec were to say "MUST" here, it should also say what to do in the case above. /cc @s-urbaniak @iaguis |
IMO it'd make more sense to make this a Auto-creating mounted directories isn't horrible, but when it creates a directory and a file was intended, it causes more harm than good (as can be and has been seen with |
This is the approach that was considered and rejected in #591 (although we can obviously revisit it). I think the crux there was @crosbymichael pointing out that, if you're about to die because of a missing destination directory, who wouldn't want the runtime to at least attempt to create the destination? |
You can handle this by creating target files for bind mounts where the source resolves to a non-directory (e.g. wking/ccon@ce2d4a45629, fixes in wking/ccon@8b368c61fc). I'm fine with that being a configured option instead of detecting it though, if folks feel like detection is too high a runtime burden. |
I think this should be allowed, but not required. Creating the target dir/file when the container process has sufficient priveleges should be required, because it's helpful and a dozen or two lines of C.
Besides symlink rooting, you also have the "maybe a previpus mount changes one of these" issue. I think you'd need to have the container process check the source type, and the tell the runtime process (over their synchronization pipe) "can you create a {file|dir} at {container path}", and the runtime would have to clone a sub-process into the mount namespace (while staying in the runtime user namespace) to create the target. So not impossible, and great for runtimes that want to support this, but not something I'm ready to impose on runtimes withough more motivation.
Agreed. |
@alban FYI this behavior from Docker caused some frictions in the past. Kubernetes thus introduced dedicated |
+1 to @tianon position |
house keeping: i know the expectations of creating the path are unclear, but just having this as a |
Consider the following mount:
When /non-existent does not exist in the container rootfs, the spec does
not say whether the runtime can create it or if it should let the mount
fails.
runc just calls 'mkdir' if it does not exist, after creating and
entering the namespaces. It works most of the times but fails if:
or rootfs belonging to an unmapped uid).
Doing the mkdir/touch before entering the namespaces is difficult to get
correctly because:
the container (symlinks are resolved by the mount syscall: a symlink
itself cannot be a source or a target of a bind mount), and
So I'd prefer if the spec does not mandate to create the file or
directory in all possible scenarios because it's overly complicated. But
the runtime should be allowed to address the simple case like runc does.
Hence the "MAY" wording.
Signed-off-by: Alban Crequy [email protected]