Consider changing NodePublish to NodePrepare #96

Open
cpuguy83 opened this issue Aug 30, 2017 · 20 comments

@cpuguy83
Contributor

Right now, NodePublish is used to instruct the storage driver to set up everything required on the node, get everything mounted, etc.

One issue with this approach is that we end up having to manage mounts in the runtime's mount namespace. This often leads to leaked mounts, difficulty telling whether something still needs to be mounted, and so on.

Instead, we could change this to be more of a Prepare-type operation, where the SP does everything except the mount itself and returns a mount struct (or a list of mounts) that the caller would then perform.
This would allow passing the actual mount request to the container runtime to deal with (even in the container's mount namespace).
This also has the benefit that the SP does not need access to the mount target.
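
To illustrate (the names and fields here are hypothetical, not anything in the spec), a Prepare-style response might look roughly like this:

// Hypothetical sketch only: one possible shape for the response of a
// "NodePrepare"-style call, where the SP describes the mounts it needs
// instead of performing them itself.
type Mount struct {
	Source  string   // device or path to mount, e.g. "/dev/xvdf" or "server:/export"
	Type    string   // filesystem type, e.g. "ext4", "nfs", or "none" for a bind mount
	Options []string // options the CO/runtime passes to mount(2), e.g. ["ro"] or ["bind"]
}

type NodePrepareVolumeResponse struct {
	Mounts []Mount // mounts the CO (or the container runtime) performs in the workload's namespace
}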

I'm not sure what would need to be returned for block storage here, since nothing is mounted (maybe an mknod? this is platform-dependent, and maybe nothing is required at all), but I think there would still be a benefit.

Note that this is the approach used by containerd storage drivers: https://godoc.org/github.com/containerd/containerd/snapshot#Snapshotter

@julian-hj
Contributor

I feel like we discussed this possibility early on, and eventually discarded it because some plugins will require fuse software or other non-kernel mounting protocols that we don't necessarily want to require the CO to know about.

We also discussed the possibility of letting the plugin optionally do the mount, for cases where we just want to use a standard kernel mount, but I recall that we decided that would add more complexity to the flow than we wanted. So we settled on letting the plugin do the mount as the "least bad" option.

@cpuguy83
Contributor Author

some plugins will require fuse software or other non-kernel mounting protocols

As long as the mount struct can express this, that's OK. The CO should translate the returned mount(s) into one or more mount(2) calls.
The non-mount preparation steps would still happen in the SP. If there's nothing to actually mount, it can just return something like type=none,device=/path/to/data,opts=bind (just as an example).
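
As a rough sketch of the CO side (assuming golang.org/x/sys/unix; the path is just the example above), translating that bind-style return into a mount(2) call could look like:

// Minimal sketch: translating type=none,device=/path/to/data,opts=bind
// into a bind mount performed by the CO.
import "golang.org/x/sys/unix"

func applyBindMount(targetPath string) error {
	// equivalent to: mount --bind /path/to/data <targetPath>
	return unix.Mount("/path/to/data", targetPath, "none", unix.MS_BIND, "")
}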

@julian-hj
Contributor

Fair enough. So if I understand it correctly, basically what you're saying is that, as the spec stands today, the container runtime would always need to do a bind mount to consume the volume. But the OCI runtime spec can take an array of any kind of mount (not just bind mounts). So:

  • for normal kernel mountable volumes, we could return mount options to do a real mount.
  • for volumes that are fuse mounted or otherwise need to be mounted by the SP, we could do the mount in the SP and return mount options to do a bind mount (see the sketch below).
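
For illustration only (values made up; assuming the OCI runtime-spec Go types from github.com/opencontainers/runtime-spec/specs-go), the two cases might map to mount entries like:

import specs "github.com/opencontainers/runtime-spec/specs-go"

// 1) kernel-mountable volume: the SP returns a real mount for the CO/runtime to perform
var nfsMount = specs.Mount{
	Destination: "/data",
	Type:        "nfs",
	Source:      "server:/export/vol1",
	Options:     []string{"rw", "vers=4"},
}

// 2) fuse (or otherwise SP-mounted) volume: the SP mounts it itself and
//    returns a bind mount of its staging path for the CO/runtime to perform
var fuseBindMount = specs.Mount{
	Destination: "/data",
	Type:        "none",
	Source:      "/var/lib/sp/staging/vol1",
	Options:     []string{"bind", "rw"},
}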

@cpuguy83
Contributor Author

Yep.
And even in the case of the fuse mount, the SP never needs to know about the eventual mount target as it does today.

@jieyu
Member

jieyu commented Aug 30, 2017

for volumes that are fuse mounted or need to be mounted by the SP for some reason, we could do the mount in the SP and return mount options to do a bind mount.

Does that still require setting up mount propagation so that the CO can see the mount made by the SP?

I am attaching my original email to the working group for some context on why we chose the existing route:

Hi folks,

There are quite a few discussions going on in the draft design doc, especially related to:
(1) Who should be responsible for doing the 'mount'?
(2) Should the interface be a CLI interface or a Service interface?

Initially, I pushed for letting the Container Orchestrator (CO) do the mount so that the plugin can be non-privileged, and a CLI interface so that the CO does not need to maintain the life cycle of the plugin container.

However, Julian and Saad mentioned that the above might not work for fuse. Therefore, I spent some time researching how fuse works, and it does look like it complicates the design. Take 'sshfs' as an example; this is what happens if one does sshfs me@xxx:/path /mnt:

(1) sshfs executable (linked with libfuse.so) creates a unix domain socket pair.
(2) sshfs executable forks/execs fusermount, and passes the write end of the socket pair as an environment variable (_FUSE_COMMFD) to fusermount.
(3) fusermount executable opens /dev/fuse and gets an fd.
(4) fusermount executable calls the mount syscall with the opened fd as one of the mount options. mount needs the SYS_ADMIN capability; the reason fusermount can be invoked by a non-privileged user is that it's a setuid program.
(5) fusermount executable sends the fd through the write end of the domain socket pair to the parent (the sshfs executable).
(6) sshfs executable receives the translated fd and starts to read/write on that fd (the kernel communicates vfs operations over this fd).

I think fuse poses at least two challenges:
(1) The fuse main process serving vfs operations needs to be long-running. If we go with a CLI interface, the plugin needs to talk back to the CO to launch a container managing that long-running process, which is not very clean.
(2) If the 'mount' is done by the CO, then the plugin needs to send the opened fd (/dev/fuse) to the CO through a domain socket so that it can get translated, which is also not very clean.

As a result, I am now leaning more towards a Service model, and letting the plugin do the mount. Unfortunately, this means the plugin has to be given the SYS_ADMIN capability in its bounding set. Another option is, as Julian pointed out in the doc, to let the CO do the mount for the non-fuse case. I guess this is doable, but it will probably cause quite a bit of confusion.

Let me know if you have any thoughts on that!

Thanks,

  • Jie
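
For what it's worth, here is a minimal sketch of the fd hand-off in steps (5)-(6) above (assuming golang.org/x/sys/unix; error handling kept short):

// Minimal sketch: receive one file descriptor (e.g. the opened /dev/fuse fd)
// sent over a unix domain socket with SCM_RIGHTS.
import "golang.org/x/sys/unix"

func recvFD(sockFD int) (int, error) {
	buf := make([]byte, 1)                 // dummy data byte sent alongside the fd
	oob := make([]byte, unix.CmsgSpace(4)) // room for one 32-bit fd
	_, oobn, _, _, err := unix.Recvmsg(sockFD, buf, oob, 0)
	if err != nil {
		return -1, err
	}
	msgs, err := unix.ParseSocketControlMessage(oob[:oobn])
	if err != nil || len(msgs) == 0 {
		return -1, err
	}
	fds, err := unix.ParseUnixRights(&msgs[0])
	if err != nil || len(fds) == 0 {
		return -1, err
	}
	return fds[0], nil // the receiver's translated fd; the daemon serves vfs ops on it
}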

@cpuguy83
Contributor Author

Does that still require setting up mount propagation so that CO can see the mount by the SP?

This depends heavily on whether the plugin performs any mounts that the CO needs to see.
In the ideal scenario, no: there would be no need for mount propagation if the CO sets up the mounts (or defers to some low-level component).

re: FUSE concerns

This should still work. Having the CO do mounts provides more flexibility here, since the SP can still do the mount(s) itself if required, or leave it to the CO to handle where possible.

@jieyu
Member

jieyu commented Aug 31, 2017

@cpuguy83 can you give a sample fuse workflow (in which the CO does the mount) that does not depend on mount propagation?

@cpuguy83
Contributor Author

@jieyu I'm not sure there is one. In this case the SP would need to handle the root mount, and the mount would need to be propagated back to the CO.

@jieyu
Member

jieyu commented Aug 31, 2017

@cpuguy83 that means the CO has to support mount propagation anyway (for fuse). Given that, I'd prefer a more consistent way of handling mounts (what's specified in the spec).

@cpuguy83
Contributor Author

@jieyu What is inconsistent in this case? Either way a mount would still be returned, containing the path where the data is stored.
Setting up mount propagation depends on the needs of the plugin, and not everything necessarily needs a mount at all (e.g. local storage).

@jieyu
Member

jieyu commented Aug 31, 2017

@cpuguy83 the inconsistency I was referring to was:

  1. for non-fuse plugins, the CO will mount the volume in the CO's mount namespace
  2. for fuse plugins, the plugin will mount the volume in the SP's mount namespace and propagate it to the CO's mount namespace

Mount propagation is set up by the CO when launching the plugin. For instance, you can mark the directory that contains target_path as a shared mount when launching the plugin container. That way, the SP does not need to know about mount propagation; it just needs to mount to target_path (a sketch of this setup follows).
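
A minimal sketch of that setup on the CO side (the directory path is hypothetical; assuming golang.org/x/sys/unix):

// Minimal sketch: before launching the plugin container, the CO makes the
// staging directory a recursively shared mount so anything the SP mounts
// under it propagates back to the CO's mount namespace.
import "golang.org/x/sys/unix"

func makeShared(stagingDir string) error {
	// ensure stagingDir is itself a mount point by bind-mounting it onto itself
	if err := unix.Mount(stagingDir, stagingDir, "", unix.MS_BIND, ""); err != nil {
		return err
	}
	// mark it (and everything under it) as shared so mounts propagate out
	return unix.Mount("", stagingDir, "", unix.MS_SHARED|unix.MS_REC, "")
}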

@julian-hj
Contributor

julian-hj commented Aug 31, 2017 via email

@jieyu
Member

jieyu commented Aug 31, 2017

I think another reason discussed in the past is that some SPs might want to rely on existing mount utility programs (e.g., mount.nfs, mount.glusterfs) to do the mount. If we ask the CO to do the mount, then the SP can no longer use those existing mount utilities.

Also, cross-referencing some relevant discussion in the k8s community on this topic:
kubernetes/community#589

cc @jsafrane

@cpuguy83
Contributor Author

cpuguy83 commented Aug 31, 2017 via email

@cpuguy83
Contributor Author

So, one other thing here, and this probably complicates things but also allows a lot of flexibility (so forgive my Saturday afternoon rambling): what if, instead of supporting only a mount return type (although out of the gate it should likely be limited to this), the return were more open ended and could be extended in the future to pass other things, such as a command or even (potentially) an image+command?
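
A hypothetical sketch (names made up, loosely mirroring a protobuf oneof in Go) of what such an open-ended return could look like:

// Hypothetical sketch only: an open-ended, extensible "action" the SP could
// return, which the CO either carries out or rejects as unsupported.
type PrepareAction interface{ isPrepareAction() }

// MountAction: mounts the CO/runtime should perform (the initial, required case).
type MountAction struct {
	Source, Type string
	Options      []string
}

// ExecAction: a command the CO would run instead (a possible future extension).
type ExecAction struct {
	Command []string
}

func (MountAction) isPrepareAction() {}
func (ExecAction) isPrepareAction()  {}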

@jieyu
Member

jieyu commented Oct 2, 2017

@cpuguy83 What should the CO do in response to those open-ended things?

@cpuguy83
Contributor Author

cpuguy83 commented Oct 3, 2017

@jieyu Determine what type of thing the SP returned and run it if supported; error out if not supported.

action := resp.GetAction() // go-grpc accessor for the "oneof" field
switch action.(type) {
case MountAction:
	// perform the mount(s) described by the SP
default:
	return ErrUnsupported
}

@jdef
Member

jdef commented Oct 30, 2017

#122 separates the responsibilities of publishing a device "globally" for the node, and publishing a volume on a per-workload basis

@saad-ali
Member

saad-ali commented Nov 1, 2018

This is a cool idea. We think we can add it in the future, post v1, as an additive, optional change. It would be a new capability on the node. If the capability exists, and the CO recognizes it and wants to use it, the CO can set a new field on NodePublishVolumeRequest to say it wants this, and the SP can return a new field on NodePublishVolumeResponse with a bind mount struct that the CO can use to do the bind.
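
A very rough, hypothetical sketch of that additive change (field names invented, not part of the spec):

// Hypothetical sketch only: optional, additive fields on the existing RPCs.
type NodePublishVolumeRequest struct {
	// ...existing fields...
	DeferMountToCo bool // CO opts in: asks the SP to skip the final mount
}

type NodePublishVolumeResponse struct {
	// ...existing fields...
	BindMount *BindMount // set by the SP when the final mount is deferred to the CO
}

type BindMount struct {
	Source  string   // path the SP prepared; the CO bind-mounts it to the target
	Options []string // e.g. ["bind", "ro"]
}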

@bergwolf

cc @xing-yang @jingxu97
