Consider a "MountDevice" equivalent step #119

Closed

saad-ali opened this issue Oct 6, 2017 · 26 comments

@saad-ali
Member

saad-ali commented Oct 6, 2017

Kubernetes has a "MountDevice" step that attachable volumes can optionally implement. This allows a single device to be mounted once at a "global" location; subsequent pod mounts then bind mount that directory to the final pod mount path.

If no equivalent call exists in CSI, volume plugins would be forced to set up their own global mounts and do their own reference counting when unmounting to determine whether the global mount should also be unmounted (which is very error-prone).
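As a rough illustration of that pattern (a sketch, not from the spec; the device, paths, and filesystem type are all assumptions), a Linux plugin would mount the device once and then bind-mount it into each pod's target path:

```go
package main

import (
	"fmt"
	"os"

	"golang.org/x/sys/unix"
)

// mountDevice mounts the block device once at a node-global path
// (the "MountDevice" step).
func mountDevice(device, globalPath, fsType string) error {
	if err := os.MkdirAll(globalPath, 0o750); err != nil {
		return err
	}
	return unix.Mount(device, globalPath, fsType, 0, "")
}

// publishToPod bind-mounts the global path into a pod-specific target
// (the per-workload publish step), so N pods share one device mount
// instead of requiring N device mounts.
func publishToPod(globalPath, podPath string) error {
	if err := os.MkdirAll(podPath, 0o750); err != nil {
		return err
	}
	return unix.Mount(globalPath, podPath, "", unix.MS_BIND, "")
}

func main() {
	global := "/var/lib/plugin/staging/vol-1" // hypothetical layout
	if err := mountDevice("/dev/sdb", global, "ext4"); err != nil {
		fmt.Println("mount device:", err)
		return
	}
	for _, pod := range []string{"pod-a", "pod-b"} {
		target := "/var/lib/kubelet/pods/" + pod + "/volumes/vol-1"
		if err := publishToPod(global, target); err != nil {
			fmt.Println("bind mount:", err)
		}
	}
}
```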

@saad-ali saad-ali added this to the v0.1 milestone Oct 6, 2017
@akutz
Contributor

akutz commented Oct 6, 2017

Hmm. Given that devs already understand that we need to do the ref counting, and that this would be an optional piece that likely still results in plug-ins implementing the same logic for the publish operation, I’d rather leave it as is — opaque and up to the plug-ins to simply provide the volume at the requested target path and access mode.

@akutz
Contributor

akutz commented Oct 6, 2017

Hi @saad-ali,

Another piece of feedback. If you're going to separate NodePublishVolume into two distinct steps - MountDevice and NodePublishVolume - for the purpose of removing the need for ref counting, then shouldn't you also include UnmountDevice? Otherwise NodeUnpublishVolume is back to checking if it's the last time a device is mounted and unmounting the device as well.

@clintkitson

@saad-ali How would this relate to the BlockVolume and MountVolume objects? My assumption has been that the CSI plugin would implicitly do either an attach, or an attach and mount, depending on the object type being passed in.

@akutz
Contributor

akutz commented Oct 6, 2017

Hi @saad-ali,

Lastly, please be clear re: your intentions for NodePublishVolume in a world with MountDevice and possibly UnmountDevice. If those RPCs do not exist, is NodePublishVolume still expected to handle mount/publication logic as it does today? My concern is that the NodePub/Unpub operations will still need to handle things as they do presently while also being able to interact with the new RPCs if they exist.

@akutz
Contributor

akutz commented Oct 6, 2017

Hi @saad-ali,

I lied. One more thing. I think CSI should be as opaque as possible for the health of both the CO and the plug-in, not to mention the backend storage platform. Just as a CO can fail or crash during a CreateVolume RPC and lose state, so too could that happen during a MountDevice or UnmountDevice operation.

I think it's dangerous and engenders too much potential for unsynchronized state for a CO to participate in the ref counting workflow. The process should be opaque to the CO. Its job is to request storage, not care how the plug-in provides it. The CO says "provide me storage at target_path with this access mode." That's it, until the CO says "I no longer need the storage at this target_path." It's simply not, and should not be, up to the CO to care whether or not the device on which that storage was allocated is used by any other volume.

That's my two cents. Thanks!

@clintkitson

clintkitson commented Oct 6, 2017

Is the reference counting central to the issue? With the Docker implementation we have been challenged in this area, since the unmount of any container would trigger an unmount operation to the plugin. It has been on the plugin to maintain the references to figure out when it is appropriate to actually perform a detach of the device. But this has not been a great solution, since the plugin is not aware of the actual state of running containers. For example, the engine may reset and kill containers without proper unmount calls ever making their way to the plugin. Having a single authoritative source of truth will help here.

I am not sure this handles your primary concern for this thread @saad-ali, but here are some suggestions for the specification to make ref counting more deterministic (a CO-side sketch follows the list).

  1. The spec can declare that the CO must maintain the reference counts for the usage of the volume.
  2. The spec can declare NodeUnpublishVolumeRequest with a target_path populated when the reference count is >1.
  3. The spec can declare NodeUnpublishVolumeRequest with a target_path unpopulated when the reference count == 1. An unpopulated target_path should indicate that the plugin must clean up all mounts associated with the volume.
  4. The spec can declare that ControllerUnpublishVolumeRequest only occurs when the reference count == 1.
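A minimal CO-side sketch of rules 1-4 (all names are illustrative; nodeUnpublish and controllerUnpublish stand in for the real RPC round-trips):

```go
package co

// coRefCounter tracks, per volume, how many workloads use it (rule 1).
type coRefCounter struct {
	counts map[string]int
}

func newCoRefCounter() *coRefCounter {
	return &coRefCounter{counts: map[string]int{}}
}

func (c *coRefCounter) publish(volumeID string) {
	c.counts[volumeID]++
}

func (c *coRefCounter) unpublish(volumeID, targetPath string) {
	if c.counts[volumeID] > 1 {
		// Rule 2: other users remain, so only this target_path goes away.
		nodeUnpublish(volumeID, targetPath)
		c.counts[volumeID]--
		return
	}
	// Rule 3: last user; an empty target_path asks the plugin to clean
	// up every mount associated with the volume.
	nodeUnpublish(volumeID, "")
	// Rule 4: controller-level unpublish is issued only for the last user.
	controllerUnpublish(volumeID)
	delete(c.counts, volumeID)
}

// Stubs standing in for the actual gRPC round-trips.
func nodeUnpublish(volumeID, targetPath string) {}
func controllerUnpublish(volumeID string)       {}
```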

@akutz
Contributor

akutz commented Oct 6, 2017

Hi @saad-ali,

Because there's always going to be the edge case of dangling, orphaned devices in some circumstance, this is what I'd recommend. Create an RPC called NodeListVolumes that enables a CO to ask nodes which volumes are mounted. If a CO crashes and needs to reconcile state with a storage plug-in then the CO can use this to figure out what volumes the plug-in thinks are mounted on the node. If the CO decides that no containers are using those volumes, the CO can then issue NodeUnpublishVolume RPCs to the nodes for the affected volumes.

This handles the case of drift, which I think can always happen between a CO and a node if the CO crashes or there's some error in a publish call.

cc @clintkitson @codenrhoden
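A rough sketch of the reconciliation loop this proposes (NodeListVolumes and the surrounding types are hypothetical; no such RPC exists in the spec):

```go
package co

// nodeClient is a hypothetical client for the RPCs described above.
type nodeClient interface {
	NodeListVolumes() ([]string, error) // the proposed RPC
	NodeUnpublishVolume(volumeID, targetPath string) error
}

// reconcile asks the node plugin which volumes it believes are mounted
// and unpublishes any volume the CO no longer has a workload for,
// healing drift after a CO crash or a failed publish call.
func reconcile(c nodeClient, inUse map[string]bool) error {
	mounted, err := c.NodeListVolumes()
	if err != nil {
		return err
	}
	for _, vol := range mounted {
		if inUse[vol] {
			continue
		}
		if err := c.NodeUnpublishVolume(vol, ""); err != nil {
			return err
		}
	}
	return nil
}
```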

@jieyu
Member

jieyu commented Oct 6, 2017

@saad-ali can you elaborate a bit more on the problem? I thought we made it pretty clear that the CO will do ref counting and the plugin simply reacts to the instructions from the CO.

For your case, IIUC, why not simply call NodePublishVolume once at a global location (known to the CO), and call NodeUnpublishVolume when the CO knows that no workload is using this volume?

@akutz
Contributor

akutz commented Oct 6, 2017

I agree with you Jie. It’s not broken, so why “fix” it.

@saad-ali
Member Author

Another piece of feedback. If you're going to separate NodePublishVolume into two distinct steps - MountDevice and NodePublishVolume - for the purpose of removing the need for ref counting, then shouldn't you also include UnmountDevice?

Yes, that would be necessary.

How would this relate to the BlockVolume and MountVolume objects? My assumption has been that the CSI plugin would implicitly do either an attach, or an attach and mount, depending on the object type being passed in.

The "MountDevice" would be optional.

Lastly, please be clear re: your intentions for NodePublishVolume in a world with MountDevice and possibly UnmountDevice. If those RPCs do not exist is NodePublishVolume still expected to handle mount/publication logic as it does today?

Yes. "MountDevice"/"UnmountDevice" would be optional like ControllerPublishVolume/ControllerUnpublishVolume. It is up to a particular volume plugin to decide if it supports it or not. If it does, the CO must call MountDevice before NodePublish... If not, then the CO shall not call MountDevice at all and the plugin can handle all of it's logic in the NodePublish/Unpublish.

I think it's dangerous and engenders too much potential for unsynchronized state for a CO to participate in the ref counting workflow. The process should be opaque to the CO. Its job is to request storage, not care how the plug-in provides it. The CO says "provide me storage at target_path with this access mode." That's it, until the CO says "I no longer need the storage at this target_path." It's simply not, and should not be, up to the CO to care whether or not the device on which that storage was allocated is used by any other volume.

At least on the k8s side we already do it, and adding it to the API would simplify plugin development significantly, as this logic wouldn't need to be reimplemented for each plugin that requires it.

Is the reference counting central to the issue? With the Docker implementation we have been challenged in this area, since the unmount of any container would trigger an unmount operation to the plugin. It has been on the plugin to maintain the references to figure out when it is appropriate to actually perform a detach of the device. But this has not been a great solution, since the plugin is not aware of the actual state of running containers. For example, the engine may reset and kill containers without proper unmount calls ever making their way to the plugin. Having a single authoritative source of truth will help here.

That's exactly the problem this is trying to address.

I am not sure this handles your primary concern for this thread @saad-ali, but here's some suggestions for the specification to make it more deterministic for ref counting.

  1. The spec can declare that the CO must maintain the reference counts for the usage of the volume

Agreed.

  2. The spec can declare NodeUnpublishVolumeRequest with a target_path populated when the reference count is >1.
  3. The spec can declare NodeUnpublishVolumeRequest with a target_path unpopulated when the reference count == 1. An unpopulated target_path should indicate that the plugin must clean up all mounts associated with the volume.

Overloading the existence or absence of a parameter to mean something is error-prone, so I'd prefer a new explicit call (MountDevice/UnmountDevice) to handle this.

  4. The spec can declare that ControllerUnpublishVolumeRequest only occurs when the reference count == 1.

Agreed.

Create an RPC called NodeListVolumes that enables a CO to ask nodes which volumes are mounted. If a CO crashes and needs to reconcile state with a storage plug-in then the CO can use this to figure out what volumes the plug-in thinks are mounted on the node. If the CO decides that no containers are using those volumes, the CO can then issue NodeUnpublishVolume RPCs to the nodes for the affected volumes.

This handles the case of drift, which I think can always happen between a CO and a node if the CO crashes or there's some error in a publish call.

I completely agree, state will shift and CO will need a mechanism to recover its node state. So I really like this idea @akutz! We should pursue this as a separate request as it isn't a direct blocker for this.

@saad-ali can you elaborate a bit more on the problem. I thought we made it pretty clear that CO will do ref counting and and plugin simply react to the instructions from the CO.

For your case, IIUC, why not simply call NodePublishVolume once in a global location (to the CO), and call NodeUnpublishVolume when CO knows that no workload is using this volume?

We discussed offline. To summarize, the NodePublishVolume call was originally intended to do what I described the MountDevice call as doing, but we changed it from once per volume per node to once per volume per workload in order to handle credentials properly. That change, however, makes plugin implementation more complicated for plugins that require an intermediary mount path.

We can discuss this in the community meeting.

@akutz
Contributor

akutz commented Oct 18, 2017

Hi Saad,

I’m slightly concerned about the number of changes this close to a possible tagged release of the spec. I also strongly believe that the spec should remain as opaque as possible with respect to the workflow and data model, keeping things flexible.

The more the CO is responsible for owning certain aspects of the operation, the more opportunities there are for failures along the way.

All that said, this function should be called NodePublishDevice IMO, as the device isn’t being mounted but attached. The publish prefix is consistent with existing function names and is abstract enough to be a suitable description of the operation.

As for NodeListDevices, I am going to file a PR for it and ListDevices so that both are able to accept a series of optional tags on which to filter results. This is because of my next concern WRT this proposal...

Reconciliation. I do not think this spec ever fully handles the fact that the CO can and will drift out of sync with the state of the storage platform. The more moving pieces (distinct RPCs) there are, the more likely this will happen.

akutz added a commit to akutz/csi-spec that referenced this issue Oct 18, 2017
This patch handles issue container-storage-interface#119 by adding two new RPCs,
"NodePublishDevice" and "NodeUnpublishDevice". These RPCs MUST be called
by the CO if the Node Plugin advertises the "PUBLISH_UNPUBLISH_DEVICE"
capability. Plugins that advertise this capability SHOULD defer volume
reference counting to the CO.
@julian-hj
Contributor

This seems an awful lot like the original intent of NodePublishVolume, so I am not convinced that we really need a separate RPC for it.

If the issue is that we don't want the plugin to have to maintain reference counts when there are multiple containers using the same mounted volume, could we not just stipulate that NodePublishVolume gets called by the CO once per unique set of inputs? That way if there are several sets of credentials, we will mount several times, but if the plugin doesn't use credentials, the CO can just re-use the same mount for multiple workloads. We do something very similar to this in Cloud Foundry today, and it works nicely.

Another point I would add is that even if we add this RPC, it will be incompatible with plugins that consume credentials, since the creds are typically injected at mount time, and we require multiple mounts when there are multiple creds. (Unless you're doing something esoteric like bind mounting with bindfs, but I feel certain that nobody is doing that currently.)
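A sketch of the "once per unique set of inputs" rule, with the input set reduced to a volume ID plus a credential digest (everything here is assumed, not from the spec):

```go
package co

// publishKey identifies one unique set of NodePublishVolume inputs; a
// credential digest stands in for the full input set.
type publishKey struct {
	volumeID string
	credHash string
}

// publisher deduplicates publish calls by input set: identical requests
// reuse one mount, while distinct credentials each get their own mount.
type publisher struct {
	published map[publishKey]string // input set -> target_path
}

func (p *publisher) ensurePublished(volumeID, credHash, targetPath string) string {
	k := publishKey{volumeID, credHash}
	if existing, ok := p.published[k]; ok {
		return existing // same inputs: reuse the existing mount
	}
	// The real NodePublishVolume RPC would be issued here.
	p.published[k] = targetPath
	return targetPath
}
```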

@akutz
Contributor

akutz commented Oct 18, 2017

could we not just stipulate that NodePublishVolume gets called by the CO once per unique set of inputs

This still necessitates reference counting on the plug-in side in order to know when it is safe to unmount something.

That said @julian-hj, I agree, I prefer the way it is today with the plug-in responsible for doing reference counting and keeping it opaque from the CO.

@cpuguy83
Contributor

There have been a lot of people asking to change the current Docker volume API such that the plugin would not have to do ref counting.
I can see the reasoning behind such a desire, in particular because state can get out of sync and the plugin could just never get an unmount request for some volume and never be able to unmount it... I'm wondering if, in such a case, the plugin would even get the detach request.

If we can solve the issue of ref-counting going out of sync somehow then I think it'd be a worthy thing to add, however I don't think this particular proposal really solves that problem.

@julian-hj
Contributor

From what I can see in the current spec, NodeUnpublishVolume explicitly states that the operation must be idempotent. So I am confused about where the assumption that the plugin does reference counting comes from. The way I understand it, the CO will call NodePublishVolume once for each credential set, with distinct mount points. The CO will call NodeUnpublishVolume once for each mount point on the back end. At no point does the plugin require a reference count, because NodeUnpublish just means unmount the specified mount point.

@jieyu jieyu removed this from the v0.1 milestone Nov 15, 2017
@clintkitson

I believe this issue is related to #161.

MountDevice may be misleading as a method name. But there is clearly a need for the Node service to contain code that supports storage platforms which require attach operations to occur locally on the host where the storage is to be advertised.

@saad-ali saad-ali added this to the v0.2 milestone Jan 11, 2018
@davidz627
Contributor

Seeing as we’ve crossed the 0.1 milestone, I’d like to reopen discussion on this for 0.2, as well as summarize some of the thoughts in this thread.

What we all currently agree on is that adding a new call for this functionality would move reference counting of volumes from the plugin to the CO.

I propose we move reference counting from the plugin to the CO for two main reasons:

  1. It significantly simplifies plugin development: each plugin would no longer have to reimplement the reference-counting logic.
  2. It aligns with the existing design philosophy in CSI.

To elaborate on the second point: a core design decision in CSI is for the CO to handle complicated behaviors such as failure modes and recovery, while the plugin is meant to opaquely and idempotently react to commands from the CO. This allows much of the complexity to be centralized in a well-maintained and shared part of the system. The plugin's main job is to make volumes available for consumption by workloads, not to be a partial orchestration system on its own. By adding MountDevice we remove some of the burden of failure recovery from the plugin and push it onto the CO.

I would also like to propose naming the new call NodeStageVolume/NodeUnstageVolume. The naming is consistent with current calls: the “Node” part refers to the scope of the call and the “Volume” part follows the existing convention. “Stage” is the action word, reflecting that this call does not finally publish the volume for usage, but performs an intermediary staging step to make it available for publishing. I am open to other suggestions on the naming.
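A sketch of the proposed split as a Go interface (signatures are simplified assumptions, not the spec's message definitions):

```go
package co

// nodePlugin sketches the staged node flow being proposed.
type nodePlugin interface {
	// Once per volume per node: mount the device at a node-global
	// staging path.
	NodeStageVolume(volumeID, stagingTargetPath string) error
	// Once per workload: expose the staged volume at the workload's
	// target path, typically via a bind mount from the staging path.
	NodePublishVolume(volumeID, stagingTargetPath, targetPath string) error
	// Once per workload, on teardown.
	NodeUnpublishVolume(volumeID, targetPath string) error
	// Once per volume per node, after the CO sees the last unpublish.
	NodeUnstageVolume(volumeID, stagingTargetPath string) error
}
```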

I’ve looked at PR #122 by @akutz and would be happy to pick up this work where he left off.

@cpuguy83
Contributor

@davidz627 Are you suggesting we add both MountDevice/UnmountDevice and NodeStageVolume/NodeUnstageVolume?

btw, I would prefer naming such as "Prepare" and "Release" for this, rather than "Stage" and "Unstage".

@davidz627
Contributor

@cpuguy83 Sorry I was unclear. I meant just NodeStageVolume/NodeUnstageVolume. "MountDevice/UnmountDevice" are just the synonymous calls in k8s.

@julian-hj
Contributor

Please note that CSI create/delete publish/unpublish calls are already idempotent in the current spec, and reference counting is already the responsibility of the CO.

Not saying we shouldn't do this, just that we ought to do it for the right reasons.

@akutz
Contributor

akutz commented Jan 13, 2018

Hi @davidz627,

Well, you're free to pick things up, but I was expecting to continue the work once the release was tagged. I'd be more than happy to collaborate with you!

Also, as it is, the spec basically poo-poo'd the notion of reference counting by the SPs. I too was initially under the impression that a volume should be able to be mounted multiple times on a single node host. However, as @codenrhoden and I learned after doing all the work to accomplish this task, the changes to the spec's documentation around NodePublishVolume mean a volume can never be mounted to a single node host more than once if the volume is not MULTI_ capable. The SPs created by myself and Travis initially had the ability to mount the same volume to multiple target paths on a given host, and so we had to perform ref counting. However, since this is no longer permitted by the spec for non-MULTI_ volumes, ref counting by SPs is far less of an issue.

@akutz
Contributor

akutz commented Jan 13, 2018

Hi @jieyu,

Would you please link that super-long discussion where Travis and I realized that the docs had changed and there's no longer the ability to mount a single volume to multiple target paths on a single host unless the volume is MULTI_ capable? I think that discussion would be helpful to @davidz627, and you may be able to find that issue faster than me. Thanks!

@jieyu
Member

jieyu commented Jan 13, 2018

Related to #150

@akutz
Contributor

akutz commented Jan 13, 2018

Thanks @jieyu :)

@davidz627
Contributor

@akutz, I would be happy to collaborate!

I believe adding this 'MountDevice' call will be important for shifting reference-counting logic for MULTI_-capable volumes. That was what I believed the original use case was, anyway.

davidz627 pushed a commit to davidz627/spec that referenced this issue Jan 18, 2018
This patch handles issue container-storage-interface#119 by adding two new RPCs,
"NodePublishDevice" and "NodeUnpublishDevice". These RPCs MUST be called
by the CO if the Node Plugin advertises the "PUBLISH_UNPUBLISH_DEVICE"
capability. Plugins that advertise this capability SHOULD defer volume
reference counting to the CO.
@saad-ali
Member Author

Closed with #169
