NodeStageVolume/NodeUnstageVolume #169
Conversation
Force-pushed e5e86f7 to 2697a9c.
Still taking suggestions on naming. I have nothing against NodePublishDevice/NodeUnpublishDevice, I just think it could be better. I personally like: NodeStageVolume/NodeUnstageVolume. The naming is consistent with current calls as the "Node" part refers to the scope of the call and the "Volume" part follows the existing convention. "Stage" is the action word here that refers to the fact that this call is not finally publishing the volume for usage, but is performing an intermediary staging step to make it available for publishing. I am open for other suggestions on the naming. Other suggestions include: …
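For context, here is a minimal sketch of how the Node service would read under the staged naming (message bodies elided; the names follow the comment above, and the final shape was still under review in this PR):

```proto
syntax = "proto3";

// Sketch only: the staged-naming variant of the new Node RPCs.
// Request/response fields are elided here.
message NodeStageVolumeRequest {}
message NodeStageVolumeResponse {}
message NodeUnstageVolumeRequest {}
message NodeUnstageVolumeResponse {}

service Node {
  // Intermediary step: make the volume available on the node (for
  // example, attach and mount it at a node-global path) before it is
  // published for a specific workload.
  rpc NodeStageVolume (NodeStageVolumeRequest)
      returns (NodeStageVolumeResponse) {}

  // Tear down whatever NodeStageVolume set up on the node.
  rpc NodeUnstageVolume (NodeUnstageVolumeRequest)
      returns (NodeUnstageVolumeResponse) {}

  // The existing NodePublishVolume/NodeUnpublishVolume RPCs remain.
}
```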
completed first pass, left some comments/suggestions. wondering about a possible relationship between these changes and #166 (support for hypervisor runtimes).
spec.md (Outdated):
A Node Plugin MUST implement this RPC call if it has `PUBLISH_UNPUBLISH_DEVICE` node capability.
This RPC is called by the CO when a workload that wants to use the specified volume is placed (scheduled) on a node.
The Plugin SHALL assume that this RPC will be executed on the node where the volume will be used.
This RPC MUST be called by the CO a maximum of once per node, per volume.
suggest: s/MUST/SHOULD/ -- the RPCs are idempotent, so it's possible that this RPC would be executed more than once.
agreed
csi.proto (Outdated):
@@ -325,7 +331,8 @@ message ControllerPublishVolumeRequest {

 message ControllerPublishVolumeResponse {
   // The SP specific information that will be passed to the Plugin in
-  // the subsequent `NodePublishVolume` call for the given volume.
+  // the subsequent `NodePublishDevice` and `NodePublishVolume` calls
nit: spacing
csi.proto (Outdated):
  // request. The CO SHALL ensure uniqueness of global_target_path per
  // volume.
  // This is a REQUIRED field.
  string global_target_path = 4;
naming: if we renamed the RPCs to the "staging" variants (e.g. NodeStageVolume) then maybe we could call this staging_target_path? "global" seems a little far-reaching
csi.proto (Outdated):
@@ -41,6 +41,12 @@ service Controller {
 }

 service Node {
   rpc NodePublishDevice (NodePublishDeviceRequest)
naming: I prefer NodeStageVolume and NodeUnstageVolume
  // The path to which the volume will be published. It MUST be an
  // absolute path in the root filesystem of the process serving this
  // request. The CO SHALL ensure uniqueness of target_path per volume.
  // The CO SHALL ensure that the path exists, and that the process
  // serving the request has `read` and `write` permissions to the path.
  // This is a REQUIRED field.
- string target_path = 4;
+ string target_path = 5;
should we avoid renumbering proto fields now that we have a tagged/published version of the spec in the world?
My understanding is that this should be an OK change; it will introduce incompatibility between different CSI spec versions, but I think that's OK. That's why we have a supported_versions field, right?
The main issue with changing proto tags is when the protos are actually persisted somewhere and new code with different tags comes along and reads the old protos and gets confused. I don't believe that is an issue here.
@saad-ali may be able to provide more context and a more definitive answer
Client/server mismatch of proto field number can cause issues. However, for v0.2, we should be ok making breaking changes, if needed.
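To illustrate the hazard with a hypothetical, minimal pair of messages (not CSI code): renumbering only corrupts meaning when bytes serialized under the old numbering are decoded under the new one.

```proto
syntax = "proto3";

// Hypothetical writer-side message using the old numbering:
// target_path is serialized under tag 4.
message RequestV1 {
  string target_path = 4;
}

// Hypothetical reader-side message after the renumbering: the same
// tag-4 bytes would now decode as staging_target_path. Acceptable for
// version-negotiated RPCs (see supported_versions), but silently wrong
// if old messages were persisted and later re-read by new code.
message RequestV2 {
  string staging_target_path = 4;
  string target_path = 5;
}
```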
spec.md (Outdated):

| Condition | gRPC Code | Description | Recovery Behavior |
|-----------|-----------|-------------|-------------------|
| Volume does not exist | 5 NOT_FOUND | Indicates that a volume corresponding to the specified `volume_id` does not exist. | Caller MUST verify that the `volume_id` is correct and that the volume is accessible and has not been deleted before retrying with exponential back off. |
| Volume published but is incompatible | 6 ALREADY_EXISTS | Indicates that a volume corresponding to the specified `volume_id` has already been published at the specified `global_target_path` but is incompatible with the specified `volume_capability` flag. | Caller MUST fix the arguments before retrying. |
| Operation pending for volume | 10 ABORTED | Indicates that there is already an operation pending for the specified volume. In general the Cluster Orchestrator (CO) is responsible for ensuring that there is no more than one call "in-flight" per volume at a given time. However, in some circumstances, the CO MAY lose state (for example when the CO crashes and restarts), and MAY issue multiple calls simultaneously for the same volume. The Plugin SHOULD handle this as gracefully as possible, and MAY return this error code to reject secondary calls. | Caller SHOULD ensure that there are no other calls pending for the specified volume, and then retry with exponential back off. |
| Exceeds capabilities | 9 FAILED_PRECONDITION | Indicates that the CO has exceeded the volume's capabilities because the volume does not have MULTI_NODE capability. | Caller MAY choose to call `ValidateVolumeCapabilities` to validate the volume capabilities, or wait for the volume to be unpublished on the node. |
what's the relationship between supporting these staging calls and SINGLE vs MULTI capability?
these calls will only really be useful for MULTI capability plugins (ref counting moved to CO). There is no restriction on using them for SINGLE but the benefit they provide would just be organizational.
I realize I may have misunderstood your question. This call should be supported for both SINGLE and MULTI capability.
spec.md (Outdated):
The Plugin SHALL assume that this RPC will be executed on the node where the volume is being used.

This RPC is typically called by the CO when the workload using the volume is being moved to a different node, or all the workload using the volume on a node has finished.
s/is typically/MAY be/
s/all the workload/all the workloads/
s/has finished/have finished/
csi.proto (Outdated):
  // It MUST be set if the Node Plugin implements the
  // `PUBLISH_UNPUBLISH_DEVICE` node capability.
  // This is an OPTIONAL field.
  string global_target_path = 3;
why is the staging target path useful here - do we really need it for the use cases we have in mind?
the path that the volume is staged to will likely be a mountpoint; the unstage operation will need to be aware of the location of the mountpoint for this specific volume to "unmount" the staging point.
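To make that concrete, here is a sketch of the unstage request as it stood at this point in the PR (only global_target_path = 3 appears in the quoted diff; the version and volume_id fields and their numbering are assumed from the surrounding spec conventions):

```proto
syntax = "proto3";

message Version {}  // elided; stands in for the spec's Version message

message NodeUnpublishDeviceRequest {
  // The API version the CO is using.
  Version version = 1;

  // The ID of the volume to unstage.
  string volume_id = 2;

  // The node-global path the volume was staged to. The plugin uses this
  // to locate the per-volume staging mountpoint and unmount it.
  string global_target_path = 3;
}
```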
csi.proto (Outdated):
  VolumeCapability volume_capability = 5;
}

message NodePublishDeviceResponse {}
might there be a need for some opaque staging_volume_info that would be passed downstream to subsequent NodePublishVolume RPCs? if we had this, would we need the (global/staging)_target_path as a first class field in NodePublishVolume?
from what I understand staging_target_path would be the only field inside of staging_volume_info. I may not be aware of other possible use cases and I wouldn't be against making this more open-ended. Do you know of any other fields in particular that would need to be added?
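For comparison, the open-ended alternative under discussion might have looked like this hypothetical sketch (not adopted in this PR):

```proto
syntax = "proto3";

// Hypothetical: instead of the CO tracking the staging path as a
// first-class field, the plugin returns opaque key/value data that the
// CO relays verbatim in subsequent NodePublishVolume calls, analogous
// to publish_volume_info on ControllerPublishVolumeResponse.
message NodePublishDeviceResponse {
  map<string, string> staging_volume_info = 1;
}
```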
spec.md:
@@ -1021,6 +1056,104 @@ It is NOT REQUIRED for a controller plugin to implement the `LIST_VOLUMES` capab

### Node Service RPC
Might it be worth adding an "RPC Interactions" subsection after all of the node service RPCs (like we did for the controller RPCs: https://github.com/container-storage-interface/spec/blob/master/spec.md#rpc-interactions) in order to concisely describe the relationship between some node RPCs?
I'm suggesting this because it would be a convenient place to compare/contrast the staging volume calls with the publish volume calls: staging is once per node/volume vs publish.
RPC Interactions: NodeStageVolume, NodeUnstageVolume, NodePublishVolume, NodeUnpublishVolume

The following IS REQUIRED if the plugin advertises the STAGE_UNSTAGE_VOLUME capability.

NodeStageVolume MUST be called and return success once per volume per node before any NodePublishVolume MAY be called for the volume.

All NodeUnpublishVolume MUST be called and return success for a volume before NodeUnstageVolume MAY be called for the volume.

Suggestions?
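For reference, advertising the capability would follow the same shape as the other capability messages in csi.proto; a sketch only (the exact message layout and enum value are assumptions at this point in the discussion):

```proto
syntax = "proto3";

message NodeServiceCapability {
  message RPC {
    enum Type {
      UNKNOWN = 0;
      // The plugin implements NodeStageVolume/NodeUnstageVolume and
      // defers per-node, per-volume reference counting to the CO.
      STAGE_UNSTAGE_VOLUME = 1;
    }
    Type type = 1;
  }

  oneof type {
    RPC rpc = 1;
  }
}
```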
Hi @davidz627, Regarding naming, I personally like …
I believe the rationale for adding this as a separate call is so that the Storage Plugin will not have to reference count; this allows … Adding this new functionality will push the "decision" to make these staging/unstaging mounts to the CO so that it may call these functionalities in specific and separate RPCs. This frees up …
Hi @davidz627, You know what, I've been feeling unwell today and clearly I had a massive brain fart when I suggested that. It's obviously anathema to the intended behavior :) Mea culpa.

Regarding naming, I keep coming back to NodePublishDevice, but that ties it to device-backed storage, and honestly, it's a volume-centric API operation anyway. How about augmenting NodePublishVolume with a "source_path" field that, when specified, indicates that the operation should bind mount "source_path" to "target_path". Otherwise the volume should be directly mounted to "target_path" per the norm. The NodeUnpublishVolume call would not have to change since it would be called once for each corresponding publication call.

Thanks for the suggestion Andrew, we've added in staging_target_path in the publish call for the bind mount indication. It is an optional field, so if unspecified, the volume will be directly mounted to target_path. I have also not added it to NodeUnpublishVolume because of the reasoning you mentioned.
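Putting that together, the publish request ends up looking roughly like this (a sketch; the tag numbers follow the quoted diff and the other fields are elided):

```proto
syntax = "proto3";

message NodePublishVolumeRequest {
  // version, volume_id, publish_volume_info, etc. elided.

  // OPTIONAL: set only if the plugin advertises STAGE_UNSTAGE_VOLUME.
  // When set, the plugin bind mounts from this staged location to
  // target_path; when unset, the volume is mounted directly at
  // target_path.
  string staging_target_path = 4;

  // The per-workload path at which the volume is published.
  // This is a REQUIRED field.
  string target_path = 5;
}
```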
Force-pushed d59d122 to aaceb16.
xref #166
Hi @davidz627, I guess I'm not sure why we need to create a new RPC when the existing ones would work as-is, and the SP could make the decision based on whether or not the …
NodeStageVolume/NodeUnstageVolume are required to set up/tear down the staging_target_path mount.
Hi @davidz627, Yes, but the point I'm trying to make is that they do not have to be. If the staging path does not exist the first time a call is made to …
Hi All, FWIW, @davidz627 and I had a conversation offline where I came around to the PR's solution. I was approaching the PR from the point-of-view that an SP must support COs that may or may not implement ref counting. David pointed out that all COs must implement ref counting -- IFF an SP advertises the … Sorry for the confusion. I've had a touch of dumbassitis for the last several years and have been slow on the pickup it seems. :)
Force-pushed 53fd3f3 to 44e20d0.
According to the current spec, only volumes with MULTI_NODE_XXX capabilities potentially need this extra RPC (because for SINGLE_NODE_XXX volumes, …). So for plugins that are capable of dealing with MULTI_NODE volumes, how many of them will benefit from this extra RPC? Can you provide some concrete examples? I imagine that most of the fuse based plugins (MULTI_NODE capable) won't benefit from this extra RPC. I think my main worry about this change is that it introduces two ways for a plugin to implement the same functionality. A plugin can either choose to implement this new RPC, or do ref counting itself. I understand the benefit of this: allowing the plugin to be stateless with no need to do ref counting. But is that hard? In most cases I imagine it's just parsing the mount table. One more note: this change is probably related to #166. It'll make supporting #166 even harder.
@jieyu I know of a few Storage Plugin types that will/may benefit: … Yes, the plugin will have a "choice", but if this type of functionality is required it makes sense for them to offload the work to the CO so that the SP can just respond "dumbly" to requests instead of doing its own ref counting and failure recovery.
@davidz627 As far as I know, …

Many of the listed volumes support multi reader only.

@jieyu many of the volumes listed support …

@davidz627 I am wondering for …
@jieyu I've done some more research into this. FC, iSCSI, NVMe-oF, Lustre, RBD. Potentially Ceph and Rook (@sbezverk may be able to confirm). Most block devices have a non-trivial local attach step. Additionally, any SINGLE_NODE_ plugin will benefit if we clarify access modes to allow multiple workloads on a single node to access the volume. Currently, that access mode paradigm is the one adopted by Kubernetes and I think it is the correct way to go. Multiple workloads accessing a SINGLE_NODE_ volume on the same node can have several benefits (live backup and migration come to mind) and not too many drawbacks.
Hi @davidz627, Oh boy, now we're rehashing #119 and #150. Look, myself and @codenrhoden are both in favor of supporting non-… If we're reopening this discussion, I'll simply say that I again think the existence of a separate … Today there are several CSI plug-ins that implement private mount directories and stage … They work great! Don't need separate RPCs either. If we just added the capability to indicate to a CO that it needs to handle ref counting then the above SPs could simply stop handling it inside … Anyway, here we go again :)
Right, let's ignore the SINGLE_NODE_ volumes for now. The new RPC will still benefit the design: … The argument is that this new RPC is useful and doesn't just add unnecessary RPCs. The main benefit (shifting ref-counting responsibility to the CO) is valid for all the above storage plugins. So my argument is that it is useful for a significant number of SPs to have this new RPC. It does not add any additional complexity to existing SP workflows if they choose not to implement this optional RPC.
Hi @davidz627, To play devil's advocate: …

Technically the benefit is derived from the ability of the SP to advertise the capability …
From earlier:

> Hmm. If the consensus in #171 is that an SP should be aware of how it is running, then I reiterate my objection to the need for separate stage RPCs. An SP should know it advertises the related capability and do the appropriate logic in the existing RPCs. This is in line with @jdef's remarks about keeping things tight.

I want to make sure this is seen since no one commented on it. The more I consider this, the more I am confident that RPCs separate from the existing NodePublishVolume and NodeUnpublishVolume are not the solution. Some reasons:

- Multiple CO members have stated that an SP should be aware of how it is running/configured. Thus an SP knows if it is running with the STAGE_UNSTAGE_VOLUME capability and can handle the appropriate logic in the NodePublishVolume RPC.
- The spec states that a CO will do its best effort to ensure that RPCs are called in a serial fashion, but that if a CO or SP process crashes then it is the responsibility of the SP to handle recovery behaviors. To that end, it's very possible that a CO sends a NodeStageVolume request to an SP that fails to complete the request, and the CO also fails to receive the error or record the failed attempt. In this case the CO may send a NodePublishVolume request expecting the aforementioned NodeStageVolume was completed successfully. Based on previous conversations I expect that @saad-ali might say the SP should do its best effort to handle this behavior. I double-checked the PR's changes -- there is no error code defined for NodePublishVolume to return to a CO that indicates the request's staging_target_path field is set but it appears NodeStageVolume was not invoked. Without such an error, it would be easy to infer that the intent is that NodePublishVolume should try to do both the logic for NodeStageVolume *and* NodePublishVolume. If NodePublishVolume has to handle both in edge cases anyway, why not just consolidate it?

I agree that the capability STAGE_UNSTAGE_VOLUME is a good idea. I just don't see the need for the additional RPCs. The spec should remain as opaque and simple as possible, and CO members seem to agree to that much of the time. This is one of those cases: an opportunity to either expand the spec or keep it the way it is with minimal change to achieve the desired functionality. I do not see a compelling reason why we are not electing to use the simple, straight-forward solution now and expanding it later, and only if necessary.
At one point I think we agreed that we **could** keep the spec as-is, w/o support for staging RPCs. Doing so pushes the reference counting burden to the plugin. One of the goals of CSI is to reduce the burden on plugin writers, if we can. This goal sometimes competes with the goal of maintaining a set of minimal RPCs.

Somewhere in this discussion there's the mention of a specific use case, relating to an iSCSI plugin, that would benefit from this design.

I'd like to address the scenario you describe w/ the CO failing to receive an error, or record an error, from NodeStageVolume: if the RPC errors out, then the CO could retry because the RPC is idempotent; if the CO fails to checkpoint the failed RPC attempt and fails over, then that means it should have no memory of completing NodeStageVolume successfully and SHOULD subsequently re-attempt the RPC, right? If a plugin's NodePublishVolume RPC expects staging_target_path to be set, and it isn't, I'd expect the plugin to fail with, at least, an "invalid argument" response. Though perhaps we should tighten up the spec here and specifically name an error code (maybe a different one?) that clearly indicates that RPCs are probably being called out of order?
James, You missed something in my remarks. I said as-is with the new STAGE capability. The SP would still be relieved of ref counting; it would just handle the stage logic at the top of the existing RPCs.
Force-pushed 0dffa65 to a239782.
Left a couple of questions.
Thanks!
| Condition | gRPC Code | Description | Recovery Behavior |
|-----------|-----------|-------------|-------------------|
| Volume does not exist | 5 NOT_FOUND | Indicates that a volume corresponding to the specified `volume_id` does not exist. | Caller MUST verify that the `volume_id` is correct and that the volume is accessible and has not been deleted before retrying with exponential back off. |
| Volume published but is incompatible | 6 ALREADY_EXISTS | Indicates that a volume corresponding to the specified `volume_id` has already been published at the specified `staging_target_path` but is incompatible with the specified `volume_capability` flag. | Caller MUST fix the arguments before retrying. |
| Operation pending for volume | 10 ABORTED | Indicates that there is already an operation pending for the specified volume. In general the Cluster Orchestrator (CO) is responsible for ensuring that there is no more than one call "in-flight" per volume at a given time. However, in some circumstances, the CO MAY lose state (for example when the CO crashes and restarts), and MAY issue multiple calls simultaneously for the same volume. The Plugin SHOULD handle this as gracefully as possible, and MAY return this error code to reject secondary calls. | Caller SHOULD ensure that there are no other calls pending for the specified volume, and then retry with exponential back off. |
| Exceeds capabilities | 9 FAILED_PRECONDITION | Indicates that the CO has exceeded the volume's capabilities because the volume does not have MULTI_NODE capability. | Caller MAY choose to call `ValidateVolumeCapabilities` to validate the volume capabilities, or wait for the volume to be unpublished on the node. |
In what cases would the node plugin be aware of this state?
The Node Plugin should always be aware of its "NodeServiceCapability". The SP is the one that sets the STAGE_UNSTAGE_VOLUME capability in the first place. It will receive the request parameters in the request and should be able to check whether staging_target_path is empty/nil.
spec.md (Outdated):
  // request. The CO SHALL ensure uniqueness of staging_target_path per
  // volume.
  // This is a REQUIRED field.
  string staging_target_path = 4;
I get that this falls in line with the publish RPC, but I'm wondering if this is something that really works here.
Since the CO would not consume this path, I'm not sure it's the CO's responsibility to know anything about it.
Discussed this on slack: the description of this is really the path to publish for the node, and NodePublish's target_path is really the path for the single workload. This effectively treats the CO as a storage mechanism, which I am ok with for this case. Perhaps the staging_ prefix should be omitted in the stage RPC?
I think having the staging_ prefix makes it especially clear that this is the same path as the one that is used as staging_target_path in NodePublishVolume in later calls. Removing it would be naming the same variable differently in different contexts and may introduce some confusion.
The CO MUST guarantee that this RPC is called and returns a success before any `NodePublishVolume` is called for the given volume on the given node.

This operation MUST be idempotent.
If the volume corresponding to the `volume_id` is already staged to the `staging_target_path`, and is identical to the specified `volume_capability`, the Plugin MUST reply `0 OK`.
Can we clarify what to do when volume ID exists with a different staging path?
As per the NodeStageVolumeRequest description ("The CO SHALL ensure uniqueness of staging_target_path per volume."), this is the CO's job to ensure. If the SP were to receive a non-unique staging_target_path, this would be undefined behavior, and I am ok leaving it as that. I would be willing to define the behavior if you think this is not enough for the spec and have a suggestion for what should happen.
In this case, uniqueness of the path is not an issue; it's that the SP got a 2nd target path for a given volume ID.
I've clarified the wording to "The CO SHALL ensure that there is only one staging_target_path per volume." Hope that helps.
Force-pushed a239782 to f5f3333.
/lgtm Will let @cpuguy83 provide the second approval.
Hi All, I'm just making one last note that these new RPCs don't make things any easier on SP developers. The reason is that SP developers should still try to be as exhaustive as they are able when it comes to publication/unpublication on nodes. For example: …
Because developers will still need to implement exhaustive logic in both the Node Stage/Unstage and Pub/Unpub RPCs, separating them into two sets of RPCs doesn't make anything easier. If anything it makes things more complicated for developers that have to duplicate checks across multiple RPCs. But hey, I guess y'all have already decided that this is what's happening. I just don't understand the desire to add to the spec when opaqueness should be the guiding principle and the existing RPCs, in addition to perhaps a new capability and the …
One last thing, I know most of the project's voting members belong to COs. However, I'm a voice of someone working on both the CO side and developing SPs. I would ask that my arguments be given the consideration that dual focus might justify. Having worked on more than a half dozen CSI SPs so far, I'm saying that this change doesn't make things easier because ultimately you don't want to fail fast -- your goal for an SP is for it to succeed, even when that means doing some exhaustive discovery logic. This change just means that logic is duplicated / chopped up into two locations, making things more complex for SP devs. I hear @saad-ali arguing that this makes things easier on SP devs, but I'm an SP dev, and I'm saying it doesn't. The only members that seem to be arguing for this are CO members. That seems odd to me. I'd just ask that y'all look at CSI-VFS's …
@akutz I don't think the COs are arguing for this at all. I know this was brought up by the Kube team, but only at the request of some SPs... so maybe we need to re-sync on this topic. I definitely agree that a good plugin would not rely simply on the CO to ensure these are correct. State will definitely get out of sync.
Hi Brian, Yeah, this issue was filed by Saad. If SP devs raised it with him, I was unaware.
Hi Brian, Anyway, my point, as you so succinctly restated it (thank you), is that SPs must, or rather should, be capable of detecting drift and reconciling it to sync state. These additional RPCs simply introduce new locations that must implement nearly identical reconciliation logic, and thus add little value for me, an SP dev. Unless we also add a new RPC that allows a CO to validate its expected state against a Node or Controller, I prefer keeping the number of mutative RPCs to a bare minimum.
@akutz if … It is an optional call that has its use and makes development of many SPs easier, as outlined above. If it is easier for your SP to implement everything in … The arguments you made are all assuming that …
I'm arguing that I've implemented multiple types of SPs and that my advice that this doesn't actually make things easier should be given consideration. I know from experience that there is a common enough pattern between SPs that the same reasons I outlined above apply to the majority of SPs, and thus this change has minimal actual value and is perhaps even the opposite.
It doesn't have to. Per spec, all an SP has to do is make the specified device available at the specified path. Lots of volume plugins do just this in Kubernetes without extensive additional mount table verification (which is very expensive) and are deployed at scale in production workloads without issues.
Again, this is not true. The CO SHALL ensure that the path exists. And a plugin can simply issue a bind mount. If the bind mount already exists it will simply succeed and be idempotent.
This addition does not (and is not intended to) simplify all plugins. It is an optional addition designed to make a certain class of plugins, those that have a local attach/detach step, easier to implement. It should go without saying that you should not implement it if it doesn't make sense for your plugin or complicates the logic (as it does in your case). Some plugins may choose to implement more complicated logic, but if we can lower the bar for writing a class of plugins, that will be beneficial to the ecosystem as a whole. I believe @davidz627 has diligently addressed the feedback so far, and this PR is close to done. @cpuguy83 do you have any other concrete concerns? @davidz627 could you please rebase the PR.
@saad-ali I think the argument is that this makes the happy case simpler but does not help at all in the case that there are problems. Do we have some concrete examples of an SP that cannot work without this?
Force-pushed f5f3333 to 58316d1.
I no longer really have any dog in this fight, and while I stand by what I've said, I agree with @davidz627 that this is optional and doesn't harm SPs that don't wish to implement it. I just worry that the bigger picture with regards to state won't be as apparent to new SP devs and these new RPCs could be tricky to implement correctly. Either way, I'm going to approve the PR and wish y'all the best of luck!
@cpuguy83 the purpose of this call was always to lower the bar for writing a plugin, not to fix a blocking issue. Volume plugins can implement this logic in the existing … I disagree that plugins that use this will open themselves up for errors. If this is a concern, let's be more concrete about the error cases that would be more common. It is important to note that this is a model that many volume plugins already follow in Kubernetes without issue. "A workload referencing a specified volume is scheduled to a node for the first time" and "the last workload referencing a specified volume is removed from a node" are important events, and plugins should be able to hook into them if they choose.
LGTM
Needs a rebase.
This patch handles issue container-storage-interface#119 by adding two new RPCs, "NodePublishDevice" and "NodeUnpublishDevice". These RPCs MUST be called by the CO if the Node Plugin advertises the "PUBLISH_UNPUBLISH_DEVICE" capability. Plugins that advertise this capability SHOULD defer volume reference counting to the CO.
…deStageVolume/NodeUnstageVolume. Addressed jdef comments. Added RPC interactions and reference counting section. NodeStageVolume modifications: added publish_volume_info to response. Changed credential naming. Added credentials to unstage.
Force-pushed 58316d1 to 068ae27.
Rebased. Thanks everyone for the thorough reviews and questions!
LGTM
Fixes Issue: #119
Adds two new RPCs, NodeStageVolume/NodeUnstageVolume. These RPCs MUST be called by the CO if the Node Plugin advertises the STAGE_UNSTAGE_VOLUME capability. Plugins that advertise this capability SHOULD defer volume reference counting to the CO.
This PR supersedes #122; @akutz gets credit for the majority of the code, I'm just picking up where he left off and pushing this through.
cc @saad-ali @jieyu @clintkitson @jdef