Consider changing NodePublish to NodePrepare #96

Open
cpuguy83 opened this issue Aug 30, 2017 · 20 comments

@cpuguy83
Contributor

Right now, NodePublish is used to instruct the storage driver to set up everything required on the node, get everything mounted, etc.

One issue with this approach is that we end up having to manage mounts in the runtime's mount namespace. This often leads to leaked mounts, difficulty telling whether something still needs to be mounted, and so on.

Instead, we could change this to be more of a Prepare-type operation, where the SP does everything except the mount itself and returns a mount struct (or a list of mounts) that the caller would then perform.
This would allow passing the actual mount request to the container runtime to deal with (even in the container's mount namespace).
This also has the benefit that the SP does not need access to the mount target.
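
To illustrate (the names and fields here are hypothetical, not anything in the spec), a Prepare-style response might look roughly like this:

// Hypothetical sketch only: one possible shape for the response of a
// "NodePrepare"-style call, where the SP describes the mounts it needs
// instead of performing them itself.
type Mount struct {
	Source  string   // device or path to mount, e.g. "/dev/xvdf" or "server:/export"
	Type    string   // filesystem type, e.g. "ext4", "nfs", or "none" for a bind mount
	Options []string // options the CO/runtime passes to mount(2), e.g. ["ro"] or ["bind"]
}

type NodePrepareVolumeResponse struct {
	Mounts []Mount // mounts the CO (or the container runtime) performs in the workload's namespace
}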

I'm not sure what would need to be returned for block storage here, since nothing is mounted (maybe an mknod? this is platform-dependent, and maybe nothing is required at all), but I think there would still be a benefit.

Note that this is the approach used by containerd storage drivers: https://godoc.org/github.com/containerd/containerd/snapshot#Snapshotter

@julian-hj
Contributor

I feel like we discussed this possibility early on, and eventually discarded it because some plugins will require fuse software or other non-kernel mounting protocols that we don't necessarily want to require the CO to know about.

We also discussed the possibility of letting the plugin optionally do the mount, for cases where we just want to use a standard kernel mount, but I recall that we decided that would add more complexity to the flow than we wanted. So we settled on letting the plugin do the mount as the "least bad" option.

@cpuguy83
Contributor Author

some plugins will require fuse software or other non-kernel mounting protocols

As long as the mount struct can express this, that's OK. The CO should translate the returned mount(s) into one or more mount(2) calls.
The non-mount preparation steps would still happen in the SP. If there's nothing to actually mount, it can just return something like type=none,device=/path/to/data,opts=bind (just as an example).
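
As a rough sketch of the CO side (assuming golang.org/x/sys/unix; the path is just the example above), translating that bind-style return into a mount(2) call could look like:

// Minimal sketch: translating type=none,device=/path/to/data,opts=bind
// into a bind mount performed by the CO.
import "golang.org/x/sys/unix"

func applyBindMount(targetPath string) error {
	// equivalent to: mount --bind /path/to/data <targetPath>
	return unix.Mount("/path/to/data", targetPath, "none", unix.MS_BIND, "")
}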

@julian-hj
Contributor

Fair enough. So if I understand it correctly, basically what you're saying is that, as the spec stands today, the container runtime would always need to do a bind mount to consume the volume. But the OCI runtime spec can take an array of any kind of mount (not just bind mounts). So:

  • for normal kernel mountable volumes, we could return mount options to do a real mount.
  • for volumes that are fuse mounted or otherwise need to be mounted by the SP, we could do the mount in the SP and return mount options to do a bind mount (see the sketch below).
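
For illustration only (values made up; assuming the OCI runtime-spec Go types from github.com/opencontainers/runtime-spec/specs-go), the two cases might map to mount entries like:

import specs "github.com/opencontainers/runtime-spec/specs-go"

// 1) kernel-mountable volume: the SP returns a real mount for the CO/runtime to perform
var nfsMount = specs.Mount{
	Destination: "/data",
	Type:        "nfs",
	Source:      "server:/export/vol1",
	Options:     []string{"rw", "vers=4"},
}

// 2) fuse (or otherwise SP-mounted) volume: the SP mounts it itself and
//    returns a bind mount of its staging path for the CO/runtime to perform
var fuseBindMount = specs.Mount{
	Destination: "/data",
	Type:        "none",
	Source:      "/var/lib/sp/staging/vol1",
	Options:     []string{"bind", "rw"},
}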

@cpuguy83
Contributor Author

Yep.
And even in the case of the fuse mount, the SP never needs to know about the eventual mount target as it does today.

@jieyu
Member

jieyu commented Aug 30, 2017

for volumes that are fuse mounted or need to be mounted by the SP for some reason, we could do the mount in the SP and return mount options to do a bind mount.

Does that still require setting up mount propagation so that the CO can see the mount made by the SP?

I am attaching my original email to the working group for some context on why we chose the existing route:

Hi folks,

There are quite a few discussions going on in the draft design doc, especially related to:
(1) Who should be responsible for doing the 'mount'?
(2) Should the interface be a CLI interface or a Service interface?

Initially, I pushed for letting the Container Orchestrator (CO) do the mount so that the plugin can be non-privileged, and a CLI interface so that the CO does not need to maintain the life cycle of the plugin container.

However, Julian and Saad mentioned that the above might not work for fuse. Therefore, I spent some time researching how fuse works, and it does look like it complicates the design. Take 'sshfs' as an example; this is what happens if one does sshfs me@xxx:/path /mnt:

(1) sshfs executable (linked with libfuse.so) creates a unix domain socket pair.
(2) sshfs executable forks/execs fusermount, and passes the write end of the socket pair as an environment variable (_FUSE_COMMFD) to fusermount.
(3) fusermount executable opens /dev/fuse and gets an fd.
(4) fusermount executable calls the mount syscall with the opened fd as one of the mount options. mount needs the SYS_ADMIN capability; the reason fusermount can be invoked by a non-privileged user is that it's a setuid program.
(5) fusermount executable sends the fd through the write end of the domain socket pair to the parent (the sshfs executable).
(6) sshfs executable receives the translated fd and starts to read/write on that fd (the kernel communicates vfs operations over this fd).

I think fuse poses at least two challenges:
(1) The fuse main process serving vfs operations needs to be long-running. If we go with a CLI interface, the plugin needs to talk back to the CO to launch a container managing that long-running process, which is not very clean.
(2) If the 'mount' is done by the CO, then the plugin needs to send the opened fd (/dev/fuse) to the CO through a domain socket so that it can get translated, which is also not very clean.

As a result, I am now leaning more towards a Service model, and letting the plugin do the mount. Unfortunately, this means the plugin has to be given the SYS_ADMIN capability in its bounding set. Another option is, as Julian pointed out in the doc, to let the CO do the mount for the non-fuse case. I guess this is doable, but it will probably cause quite a bit of confusion.

Let me know if you have any thoughts on that!

Thanks,

  • Jie
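
For what it's worth, here is a minimal sketch of the fd hand-off in steps (5)-(6) above (assuming golang.org/x/sys/unix; error handling kept short):

// Minimal sketch: receive one file descriptor (e.g. the opened /dev/fuse fd)
// sent over a unix domain socket with SCM_RIGHTS.
import "golang.org/x/sys/unix"

func recvFD(sockFD int) (int, error) {
	buf := make([]byte, 1)                 // dummy data byte sent alongside the fd
	oob := make([]byte, unix.CmsgSpace(4)) // room for one 32-bit fd
	_, oobn, _, _, err := unix.Recvmsg(sockFD, buf, oob, 0)
	if err != nil {
		return -1, err
	}
	msgs, err := unix.ParseSocketControlMessage(oob[:oobn])
	if err != nil || len(msgs) == 0 {
		return -1, err
	}
	fds, err := unix.ParseUnixRights(&msgs[0])
	if err != nil || len(fds) == 0 {
		return -1, err
	}
	return fds[0], nil // the receiver's translated fd; the daemon serves vfs ops on it
}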

@cpuguy83
Contributor Author

Does that still require setting up mount propagation so that CO can see the mount by the SP?

This depends heavily on whether the plugin performs any mounts that the CO needs to see.
In the ideal scenario, no: there would be no need for mount propagation if the CO sets up the mounts (or defers to some low-level component).

re: FUSE concerns

This should still work. Having the CO do mounts provides more flexibility here, since the SP can still do the mount(s) itself if required, or leave it to the CO to handle where possible.

@jieyu
Member

jieyu commented Aug 31, 2017

@cpuguy83 can you give a sample fuse workflow (in which the CO does the mount) that does not depend on mount propagation?

@cpuguy83
Contributor Author

@jieyu I'm not sure there is one. In this case the SP would need to handle the root mount, and the mount would need to be propagated back to the CO.

@jieyu
Member

jieyu commented Aug 31, 2017

@cpuguy83 that means the CO has to support mount propagation anyway (for fuse). Given that, I'd prefer a more consistent way of handling mounts (what's specified in the spec).

@cpuguy83
Contributor Author

@jieyu What is inconsistent in this case? Either way a mount would still be returned, containing the path where the data is stored.
Setting up mount propagation depends on the needs of the plugin, and not everything necessarily needs a mount at all (e.g. local storage).

@jieyu
Member

jieyu commented Aug 31, 2017

@cpuguy83 the inconsistency I was referring to was:

  1. for non-fuse plugins, the CO will mount the volume in the CO's mount namespace
  2. for fuse plugins, the plugin will mount the volume in the SP's mount namespace and propagate it to the CO's mount namespace

Mount propagation is set up by the CO when launching the plugin. For instance, you can mark the directory that contains target_path as a shared mount when launching the plugin container. That way, the SP does not need to know about mount propagation; it just needs to mount to target_path (a sketch of this setup follows).
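
A minimal sketch of that setup on the CO side (the directory path is hypothetical; assuming golang.org/x/sys/unix):

// Minimal sketch: before launching the plugin container, the CO makes the
// staging directory a recursively shared mount so anything the SP mounts
// under it propagates back to the CO's mount namespace.
import "golang.org/x/sys/unix"

func makeShared(stagingDir string) error {
	// ensure stagingDir is itself a mount point by bind-mounting it onto itself
	if err := unix.Mount(stagingDir, stagingDir, "", unix.MS_BIND, ""); err != nil {
		return err
	}
	// mark it (and everything under it) as shared so mounts propagate out
	return unix.Mount("", stagingDir, "", unix.MS_SHARED|unix.MS_REC, "")
}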

@julian-hj
Contributor

julian-hj commented Aug 31, 2017 via email

@jieyu
Member

jieyu commented Aug 31, 2017

I think another reason discussed in the past is that some SPs might want to rely on existing mount utility programs (e.g., mount.nfs, mount.glusterfs) to do the mount. If we ask the CO to do the mount, then the SP can no longer use those existing mount utilities.

Also, cross-referencing some relevant discussion in the k8s community on this topic:
kubernetes/community#589

cc @jsafrane

@cpuguy83
Contributor Author

cpuguy83 commented Aug 31, 2017 via email

@cpuguy83
Contributor Author

So, one other thing here, and this probably complicates things but also allows a lot of flexibility (so forgive my Saturday afternoon rambling): what if, instead of supporting only a mount return type (although out of the gate it should likely be limited to this), the return were more open ended and could be extended in the future to pass other things, such as a command or even (potentially) an image+command?
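
A hypothetical sketch (names made up, loosely mirroring a protobuf oneof in Go) of what such an open-ended return could look like:

// Hypothetical sketch only: an open-ended, extensible "action" the SP could
// return, which the CO either carries out or rejects as unsupported.
type PrepareAction interface{ isPrepareAction() }

// MountAction: mounts the CO/runtime should perform (the initial, required case).
type MountAction struct {
	Source, Type string
	Options      []string
}

// ExecAction: a command the CO would run instead (a possible future extension).
type ExecAction struct {
	Command []string
}

func (MountAction) isPrepareAction() {}
func (ExecAction) isPrepareAction()  {}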

@jieyu
Member

jieyu commented Oct 2, 2017

@cpuguy83 What should the CO do in response to those open-ended things?

@cpuguy83
Contributor Author

cpuguy83 commented Oct 3, 2017

@jieyu Determine what type of thing the SP returned and run it if supported; error out if not supported.

action := resp.GetAction() // go-grpc accessor for the "oneof" field
switch action.(type) {
case MountAction:
	// perform the mount(s) described by the SP
default:
	return ErrUnsupported
}

@jdef
Member

jdef commented Oct 30, 2017

#122 separates the responsibilities of publishing a device "globally" for the node, and publishing a volume on a per-workload basis

@saad-ali
Member

saad-ali commented Nov 1, 2018

This is a cool idea. We think we can add it in the future, post v1, as an additive, optional change. It would be a new capability on the node. If the capability exists, and the CO recognizes it and wants to use it, the CO can set a new field on NodePublishVolumeRequest to say it wants this, and the SP can return a new field on NodePublishVolumeResponse with a bind mount struct that the CO can use to do the bind.
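
A very rough, hypothetical sketch of that additive change (field names invented, not part of the spec):

// Hypothetical sketch only: optional, additive fields on the existing RPCs.
type NodePublishVolumeRequest struct {
	// ...existing fields...
	DeferMountToCo bool // CO opts in: asks the SP to skip the final mount
}

type NodePublishVolumeResponse struct {
	// ...existing fields...
	BindMount *BindMount // set by the SP when the final mount is deferred to the CO
}

type BindMount struct {
	Source  string   // path the SP prepared; the CO bind-mounts it to the target
	Options []string // e.g. ["bind", "ro"]
}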

@bergwolf

cc @xing-yang @jingxu97
