
NodeStageVolume/NodeUnstageVolume #169

Merged

Conversation

@davidz627 (Contributor) commented Jan 18, 2018

Fixes Issue: #119

Adds two new RPCs, NodeStageVolume/NodeUnstageVolume. These RPCs MUST be called by the CO if the Node Plugin advertises the STAGE_UNSTAGE_VOLUME capability. Plugins that advertise this capability SHOULD defer volume reference counting to the CO.

This PR supersedes #122, @akutz gets credit for the majority of the code, I'm just picking up where he left off and pushing this through.

cc @saad-ali @jieyu @clintkitson @jdef
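To make the intended call order concrete, here is a rough CO-side sketch in Go against the generated bindings (`lib/go/csi`). The client variable, paths, and IDs are illustrative, and required fields such as `volume_capability` are elided for brevity:

```go
package main

import (
	"context"

	csi "github.com/container-storage-interface/spec/lib/go/csi"
)

// placeWorkloads shows the ordering a CO must follow when the Node Plugin
// advertises STAGE_UNSTAGE_VOLUME: stage once per volume per node, then
// publish once per workload. Names and paths here are illustrative.
func placeWorkloads(ctx context.Context, node csi.NodeClient, volID string, targets []string) error {
	staging := "/var/lib/co/staging/" + volID // one staging path per volume per node

	// Called once, before any NodePublishVolume for this volume on this node.
	if _, err := node.NodeStageVolume(ctx, &csi.NodeStageVolumeRequest{
		VolumeId:          volID,
		StagingTargetPath: staging,
	}); err != nil {
		return err
	}

	// Called once per workload that uses the volume on this node.
	for _, target := range targets {
		if _, err := node.NodePublishVolume(ctx, &csi.NodePublishVolumeRequest{
			VolumeId:          volID,
			StagingTargetPath: staging,
			TargetPath:        target,
		}); err != nil {
			return err
		}
	}
	return nil
}
```

Teardown runs in reverse: NodeUnpublishVolume for every target, then a single NodeUnstageVolume.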

@davidz627 davidz627 force-pushed the feature/MountDevice branch from e5e86f7 to 2697a9c Compare January 18, 2018 02:28
@davidz627 (Contributor, Author):

Still taking suggestions on naming. I have nothing against NodePublishDevice/NodeUnpublishDevice, I just think it could be better.

I personally like NodeStageVolume/NodeUnstageVolume. The naming is consistent with current calls: the "Node" part refers to the scope of the call and the "Volume" part follows the existing convention. "Stage" is the action word, referring to the fact that this call is not finally publishing the volume for usage but is performing an intermediary staging step to make it available for publishing. I am open to other suggestions on the naming.

Other suggestions include:
NodePrepareVolume/NodeReleaseVolume
[others?]

@jdef (Member) left a comment:

completed first pass, left some comments/suggestions. wondering about a possible relationship between these changes and #166 (support for hypervisor runtimes).

spec.md Outdated
A Node Plugin MUST implement this RPC call if it has `PUBLISH_UNPUBLISH_DEVICE` node capability.
This RPC is called by the CO when a workload that wants to use the specified volume is placed (scheduled) on a node.
The Plugin SHALL assume that this RPC will be executed on the node where the volume will be used.
This RPC MUST be called by the CO a maximum of once per node, per volume.
Member:

suggest: s/MUST/SHOULD/ -- the RPCs are idempotent, so it's possible that this RPC would be executed more than once.

Contributor Author (@davidz627):

agreed

csi.proto Outdated
@@ -325,7 +331,8 @@ message ControllerPublishVolumeRequest {

message ControllerPublishVolumeResponse {
// The SP specific information that will be passed to the Plugin in
// the subsequent `NodePublishVolume` call for the given volume.
// the subsequent `NodePublishDevice` and `NodePublishVolume` calls
Member:

nit: spacing

csi.proto Outdated
// request. The CO SHALL ensure uniqueness of global_target_path per
// volume.
// This is a REQUIRED field.
string global_target_path = 4;
Member:

naming: if we renamed the RPCs to the "staging" variants (e.g. NodeStageVolume) then maybe we could call this the staging_target_path? "global" seems a little far-reaching

csi.proto Outdated
@@ -41,6 +41,12 @@ service Controller {
}

service Node {
rpc NodePublishDevice (NodePublishDeviceRequest)
Member:

naming: I prefer NodeStageVolume and NodeUnstageVolume

// The path to which the volume will be published. It MUST be an
// absolute path in the root filesystem of the process serving this
// request. The CO SHALL ensure uniqueness of target_path per volume.
// The CO SHALL ensure that the path exists, and that the process
// serving the request has `read` and `write` permissions to the path.
// This is a REQUIRED field.
string target_path = 4;
string target_path = 5;
Member:

should we avoid renumbering proto fields now that we have a tagged/published version of the spec in the world?

Contributor Author (@davidz627), Jan 18, 2018:

My understanding is that this should be an OK change; it will introduce incompatibility between different CSI spec versions, but I think that's OK. That's why we have a supported_versions field, right?

The main issue with changing proto tags is when the protos are actually persisted somewhere and new code with different tags comes along and reads the old protos and gets confused. I don't believe that is an issue here.

@saad-ali may be able to provide more context and a more definitive answer

Member:

Client/server mismatch of proto field number can cause issues. However, for v0.2, we should be ok making breaking changes, if needed.

spec.md Outdated
| Volume does not exist | 5 NOT_FOUND | Indicates that a volume corresponding to the specified `volume_id` does not exist. | Caller MUST verify that the `volume_id` is correct and that the volume is accessible and has not been deleted before retrying with exponential back off. |
| Volume published but is incompatible | 6 ALREADY_EXISTS | Indicates that a volume corresponding to the specified `volume_id` has already been published at the specified `global_target_path` but is incompatible with the specified `volume_capability` flag. | Caller MUST fix the arguments before retrying. |
| Operation pending for volume | 10 ABORTED | Indicates that there is already an operation pending for the specified volume. In general the Cluster Orchestrator (CO) is responsible for ensuring that there is no more than one call "in-flight" per volume at a given time. However, in some circumstances, the CO MAY lose state (for example when the CO crashes and restarts), and MAY issue multiple calls simultaneously for the same volume. The Plugin SHOULD handle this as gracefully as possible, and MAY return this error code to reject secondary calls. | Caller SHOULD ensure that there are no other calls pending for the specified volume, and then retry with exponential back off. |
| Exceeds capabilities | 9 FAILED_PRECONDITION | Indicates that the CO has exceeded the volume's capabilities because the volume does not have MULTI_NODE capability. | Caller MAY choose to call `ValidateVolumeCapabilities` to validate the volume capabilities, or wait for the volume to be unpublished on the node. |
Member:

what's the relationship between supporting these staging calls and SINGLE vs MULTI capability?

Contributor Author (@davidz627):

These calls will only really be useful for MULTI capability plugins (ref counting moved to the CO). There is no restriction on using them for SINGLE, but the benefit they provide would just be organizational.

Contributor Author (@davidz627):

I realize I may have misunderstood your question. This call should be supported for both SINGLE and MULTI capability.

spec.md Outdated

The Plugin SHALL assume that this RPC will be executed on the node where the volume is being used.

This RPC is typically called by the CO when the workload using the volume is being moved to a different node, or all the workload using the volume on a node has finished.
Member:

s/is typically/MAY be/
s/all the workload/all the workloads/
s/has finished/have finished/

csi.proto Outdated
// It MUST be set if the Node Plugin implements the
// `PUBLISH_UNPUBLISH_DEVICE` node capability.
// This is an OPTIONAL field.
string global_target_path = 3;
Member:

why is the staging target path useful here - do we really need it for the use cases we have in mind?

Contributor Author (@davidz627):

The path that the volume is staged to will likely be a mountpoint; the unstage operation needs to know the location of that mountpoint for this specific volume in order to "unmount" the staging point.
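A minimal sketch of that unstage path, assuming the staging path is a mount point set up by NodeStageVolume (error handling is illustrative):

```go
package main

import (
	"context"

	csi "github.com/container-storage-interface/spec/lib/go/csi"
	"golang.org/x/sys/unix"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

type nodeServer struct{}

// NodeUnstageVolume tears down the staging mount. The CO hands back the same
// staging path it used when staging, which is why the field is needed here.
func (s *nodeServer) NodeUnstageVolume(ctx context.Context, req *csi.NodeUnstageVolumeRequest) (*csi.NodeUnstageVolumeResponse, error) {
	if err := unix.Unmount(req.StagingTargetPath, 0); err != nil && err != unix.EINVAL {
		// EINVAL usually means the path is not a mount point; treat that as
		// already unstaged so the call stays idempotent.
		return nil, status.Errorf(codes.Internal, "unmount %s: %v", req.StagingTargetPath, err)
	}
	return &csi.NodeUnstageVolumeResponse{}, nil
}
```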

csi.proto Outdated
VolumeCapability volume_capability = 5;
}

message NodePublishDeviceResponse {}
Member:

might there be a need for some opaque staging_volume_info that would be passed downstream to subsequent NodePublishVolume RPCs?

if we had this, would we need the (global/staging)_target_path as a first class field in NodePublishVolume?

Contributor Author (@davidz627):

from what I understand staging_target_path would be the only field inside of staging_volume_info. I may not be aware of other possible use cases and I wouldn't be against making this more open-ended. Do you know of any other fields in particular that would need to be added?

@@ -1021,6 +1056,104 @@ It is NOT REQUIRED for a controller plugin to implement the `LIST_VOLUMES` capab

### Node Service RPC
Member:

Might it be worth adding an "RPC Interactions" subsection after all of the node service RPCs (like we did for the controller RPCs: https://github.com/container-storage-interface/spec/blob/master/spec.md#rpc-interactions) in order to concisely describe the relationship between some node RPCs?

I'm suggesting this because it would be a convenient place to compare/contrast the staging volume calls with the publish volume calls: staging happens once per volume per node, whereas publish does not.

Contributor Author (@davidz627):

RPC Interactions

NodeStageVolume, NodeUnstageVolume, NodePublishVolume, NodeUnpublishVolume

The following IS REQUIRED if the plugin advertises the STAGE_UNSTAGE_VOLUME capability.
NodeStageVolume MUST be called and return success once per volume per node before any NodePublishVolume MAY be called for the volume.
All NodeUnpublishVolume calls for a volume MUST be completed successfully before NodeUnstageVolume MAY be called for the volume.

Suggestions?

@akutz (Contributor) commented Jan 19, 2018

Hi @davidz627,

Regarding naming, I personally like NodePublishVolume. I know it's already in use, but I wonder if it doesn't make sense to keep the two existing RPCs, NodePublishVolume and NodeUnpublishVolume, and simply add a field to the former that indicates the SP is responsible for reference counting?

@davidz627 (Contributor, Author):

@akutz

I believe the rationale for adding this as a separate call is so that the Storage Plugin will not have to reference count; this allows NodePublishVolume and NodeUnpublishVolume to have the same functionality each time they are called. Without the new NodeStageVolume/NodeUnstageVolume calls, many storage plugins would have to implement functionality that does an extra "staging mount" and "staging unmount" for the first NodePublishVolume and the last NodeUnpublishVolume called for each volume on each node.

Adding this new functionality pushes the "decision" to make these staging/unstaging mounts to the CO, which can invoke them as specific, separate RPCs. This frees up NodePublishVolume and NodeUnpublishVolume to be responsible only for the "last mile" of making the volume available to the consumer; they no longer have to keep track of whether this is the first or fifth NodePublishVolume called for the volume.
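For illustration, the ref counting an SP avoids looks roughly like this (all helpers here are hypothetical stand-ins for real attach/mount logic):

```go
package main

import "sync"

// What a MULTI_NODE-capable plugin would otherwise carry itself: per-volume
// ref counting so the first publish does the "staging mount" and the last
// unpublish does the "staging unmount".
type refCountingNode struct {
	mu     sync.Mutex
	active map[string]int // volume_id -> publishes currently live on this node
}

func newRefCountingNode() *refCountingNode {
	return &refCountingNode{active: make(map[string]int)}
}

func (n *refCountingNode) publish(volID, staging, target string) error {
	n.mu.Lock()
	defer n.mu.Unlock()
	if n.active[volID] == 0 {
		if err := stageDevice(volID, staging); err != nil { // first publish stages
			return err
		}
	}
	if err := bindMount(staging, target); err != nil {
		return err
	}
	n.active[volID]++
	return nil
}

func (n *refCountingNode) unpublish(volID, staging, target string) error {
	n.mu.Lock()
	defer n.mu.Unlock()
	if err := unmount(target); err != nil {
		return err
	}
	n.active[volID]--
	if n.active[volID] == 0 {
		return unstageDevice(volID, staging) // last unpublish unstages
	}
	return nil
}

// Hypothetical helpers; real implementations would attach devices and mount.
func stageDevice(volID, staging string) error   { return nil }
func unstageDevice(volID, staging string) error { return nil }
func bindMount(src, dst string) error           { return nil }
func unmount(path string) error                 { return nil }
```

With STAGE_UNSTAGE_VOLUME advertised, the counting above moves into the CO, and the SP's publish/unpublish handlers become single-purpose.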

@akutz (Contributor) commented Jan 19, 2018

Hi @davidz627,

You know what, I’ve been feeling unwell today and clearly I had a massive brain fart when I suggested that. It’s obviously anathema to the intended behavior :)

Mea culpa.

@akutz (Contributor) commented Jan 19, 2018

Regarding naming, I keep coming back to NodePublishDevice, but that ties it to device-backed storage, and honestly, it’s a volume-centric API operation anyway.

How about augmenting NodePublishVolume with a “source_path” field that, when specified, indicates that the operation should bind mount “source_path” to “target_path”. Otherwise the volume should be directly mounted to “target_path” per the norm. The NodeUnpublishVolume call would not have to change since it would be called once for each corresponding publication call.

@davidz627 (Contributor, Author):

Thanks for the suggestion Andrew, we've added in staging_target_path in the publish call for the bind mount indication. It is an optional field, so if unspecified, the volume will be directly mounted to target_path. I have also not added it to NodeUnpublishVolume because of the reasoning you mentioned.

@davidz627 davidz627 force-pushed the feature/MountDevice branch from d59d122 to aaceb16 Compare January 22, 2018 18:50
@jieyu (Member) commented Jan 22, 2018

xref #166

@davidz627 davidz627 changed the title NodePublishDevice/NodeUnpublishDevice NodeStageVolume/NodeUnstageVolume Jan 22, 2018
@akutz (Contributor) commented Jan 22, 2018

Hi @davidz627,

I guess I'm not sure why we need to create a new RPC when the existing ones would work as-is, and the SP could make the decision based on whether the staging_target_path field is empty?

@davidz627 (Contributor, Author):

NodeStageVolume/NodeUnstageVolume are required to set up and tear down the staging_target_path mount.

@akutz (Contributor) commented Jan 22, 2018

Hi @davidz627,

Yes, but the point I'm trying to make is that they do not have to be. If the staging path does not exist the first time a call is made to NodePublishVolume then that RPC could do all of the same work as NodeStageVolume.
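Sketched out, that folded-in alternative would look something like the following, where isMountPoint and stageDevice are hypothetical helpers:

```go
package main

import (
	"context"

	csi "github.com/container-storage-interface/spec/lib/go/csi"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

type nodeServer struct{}

// Folding the staging work into NodePublishVolume: stage lazily if the
// staging path is not yet a mount point, i.e. on the first publish.
func (s *nodeServer) NodePublishVolume(ctx context.Context, req *csi.NodePublishVolumeRequest) (*csi.NodePublishVolumeResponse, error) {
	if path := req.StagingTargetPath; path != "" {
		mounted, err := isMountPoint(path)
		if err != nil {
			return nil, status.Errorf(codes.Internal, "check %s: %v", path, err)
		}
		if !mounted {
			// Same work NodeStageVolume would otherwise do.
			if err := stageDevice(req.VolumeId, path); err != nil {
				return nil, err
			}
		}
	}
	// ... then bind mount the staging path (or device) to req.TargetPath ...
	return &csi.NodePublishVolumeResponse{}, nil
}

func isMountPoint(path string) (bool, error) { return false, nil } // hypothetical
func stageDevice(id, path string) error      { return nil }        // hypothetical
```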

@akutz (Contributor) commented Jan 22, 2018

Hi All,

FWIW, @davidz627 and I had a conversation offline where I came around to the PR's solution. I was approaching the PR from the point-of-view that an SP must support COs that may or may not implement ref counting. David pointed out that all COs must implement ref counting -- IFF an SP advertises the STAGE_UNSTAGE_VOLUME capability. I was arguing that if an SP must be able to do ref counting anyway if a CO doesn't then why not wrap it all in the existing NodePublishVolume. Once I realized it was up to the SP to determine if the CO handles ref counting (since all COs are required by the spec to do so if the SP asks them to), then I agreed that distinct RPCs make sense.

Sorry for the confusion. I've had a touch of dumbassitis for the last several years and have been slow on the pickup it seems. :)

@davidz627 davidz627 force-pushed the feature/MountDevice branch 5 times, most recently from 53fd3f3 to 44e20d0 Compare January 22, 2018 23:05
@jieyu (Member) commented Jan 23, 2018

According to the current spec, only volumes with MULTI_NODE_XXX capabilities potentially need this extra RPC (because for SINGLE_NODE_XXX volumes, NodePublishVolume won't be called for different target_path, thus the logic can be in NodePublishVolume).

So for plugins that are capable of dealing with MULTI_NODE volumes, how many of them will benefit from this extra RPC? Can you provide some concrete examples? I imagine that most of the FUSE-based plugins (MULTI_NODE capable) won't benefit from this extra RPC.

I think my main worry about this change is that it introduces two ways for a plugin to implement the same functionality. A plugin can either choose to implement this new RPC, or do ref counting itself. I understand the benefit: allowing the plugin to be stateless, with no need to do ref counting. But is that really so hard? In most cases I imagine it's just parsing the mount table.

One more note: this change is probably related to #166. It'll make supporting #166 even harder.

@davidz627 (Contributor, Author):

@jieyu I know of a few Storage Plugin types that will or may benefit:
AWS EBS, Azure DD, Fibre Channel, GCE PD, iSCSI, Photon, Rados Block Device, vSphere

Yes, the plugin will have a "choice", but if this type of functionality is required, it makes sense to offload the work to the CO so that the SP can just respond "dumbly" to requests instead of doing its own ref counting and failure recovery.

@jieyu (Member) commented Jan 24, 2018

@davidz627 As far as I know, AWS EBS should be categorized as SINGLE_NODE_WRITER because it cannot be published to multiple nodes. As a result, NodePublishVolume shouldn't be called multiple times according to the current spec. I expect many of the plugins that you mentioned above fall into the same category.

@davidz627 (Contributor, Author):

Many of the listed volumes support multi reader only.

@davidz627 (Contributor, Author):

@jieyu many of the volumes listed support MULTI_NODE_READER_ONLY. This PR will improve SP development for any volume that supports MULTI_NODE_READER_ONLY which is a commonly supported storage mode.

@jieyu (Member) commented Jan 24, 2018

@davidz627 I am wondering, for MULTI_NODE_READER_ONLY volumes, can the CO always call NodeStageVolume for each NodePublishVolume? In other words, can the logic in NodeStageVolume be merged into NodePublishVolume? I'll be a bit surprised if a volume can be published at multiple nodes but cannot perform the "NodeStage" step multiple times on a node.

@davidz627 (Contributor, Author) commented Jan 29, 2018

@jieyu I've done some more research into this.
I have confirmed that the following drivers require a local attach step (the NodeStageVolume/NodeUnstageVolume work) that must be done per volume per node and cannot be folded into NodePublishVolume/NodeUnpublishVolume (specifically, the unpublish side would require ref counting):

FC, ISCSI, NVMEOE, Lustre, RBD. Potentially Ceph and Rook (@sbezverk may be able to confirm)

Most block devices have a non-trivial local attach step.
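For illustration, the staged flow for such a device looks roughly like this (attachDevice and formatAndMount are hypothetical stand-ins for, e.g., an iSCSI login plus mkfs/mount):

```go
package main

import (
	"context"

	csi "github.com/container-storage-interface/spec/lib/go/csi"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

type nodeServer struct{}

// The per-node "local attach" step for a block-device plugin, done once in
// NodeStageVolume rather than on every publish.
func (s *nodeServer) NodeStageVolume(ctx context.Context, req *csi.NodeStageVolumeRequest) (*csi.NodeStageVolumeResponse, error) {
	devicePath, err := attachDevice(req.VolumeId) // e.g. iSCSI login yielding /dev/sdX
	if err != nil {
		return nil, status.Errorf(codes.Internal, "attach %s: %v", req.VolumeId, err)
	}
	// Mount the device once at the per-volume staging path; subsequent
	// NodePublishVolume calls only bind mount from here.
	if err := formatAndMount(devicePath, req.StagingTargetPath); err != nil {
		return nil, status.Errorf(codes.Internal, "mount %s: %v", devicePath, err)
	}
	return &csi.NodeStageVolumeResponse{}, nil
}

func attachDevice(volID string) (string, error)  { return "/dev/sdx", nil } // hypothetical
func formatAndMount(device, target string) error { return nil }             // hypothetical
```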

Additionally, any SINGLE_NODE_ plugin will benefit if we clarify access modes to allow multiple workloads on a single node to access the volume. That access-mode paradigm is the one currently adopted by Kubernetes, and I think it is the correct way to go. Multiple workloads accessing a SINGLE_NODE_ volume on the same node has several benefits (live backup and migration come to mind) and not too many drawbacks.

@akutz (Contributor) commented Jan 29, 2018

Hi @davidz627,

Oh boy, now we're rehashing #119 and #150. Look, @codenrhoden and I are both in favor of supporting non-MULTI volumes being published to multiple target paths on a single host. But we've had that discussion. @jieyu and others said "no." Travis and I see no reason it cannot work. It works today on K8s. However, that was the decision that was made.

If we're reopening this discussion, I'll simply say that I again think the existence of a separate NodeStageVolume RPC is purely a design choice. The advertised capability doesn't require a separate RPC. I know I said I agreed with the separate RPCs, and I have no new issues with them, I'm just pointing out they're not required and only an authorial choice.

Today there are several CSI plug-ins that implement private mount directories and stage SINGLE volumes:

They work great! Don't need separate RPCs either. If we just added the capability to indicate to a CO that it needs to handle ref counting then the above SPs could simply stop handling it inside NodePublishVolume.

Anyway, here we go again :)

@davidz627 (Contributor, Author):

Right, let's ignore the SINGLE_NODE_ volumes for now. The new RPC will still benefit the design of:
FC, ISCSI, NVMEOE, Lustre, RBD. Potentially Ceph and Rook.

The argument is that this new RPC is useful and doesn't just add unnecessary RPCs. The main benefit (shifting ref-counting responsibility to the CO) is valid for all the above storage plugins, so my argument is that this new RPC is useful for a significant number of SPs. It does not add any additional complexity to existing SP workflows if they choose not to implement this optional RPC.

@akutz (Contributor) commented Jan 29, 2018

Hi @davidz627,

To play devil's advocate:

The main benefit (shifting ref-counting responsibility to the CO) is valid for all the above storage plugins.

Technically the benefit is derived from the ability of the SP to advertise the capability STAGE_UNSTAGE_VOLUME, which indicates to COs that they should handle reference counting. The RPC is simply the method by which they do that. The same method could be folded into NodePublishVolume.

@akutz (Contributor) commented Feb 9, 2018

From earlier:

Hmm. If the consensus in #171 is that an SP should be aware of how it is running, then I reiterate my objection to the need for separate stage RPCs. An SP should know it advertises the related capability and do the appropriate logic in the existing RPCs. This is inline with @jdef's remarks about keeping things tight.

I want to make sure this is seen since no one commented on it. The more I consider this, the more I am confident that RPCs separate from the existing NodePublishVolume and NodeUnpublishVolume are not the solution. Some reasons:

  • Multiple CO members have stated that an SP should be aware of how it is running/configured. Thus an SP knows if it is running with the STAGE_UNSTAGE_VOLUME capability and can handle the appropriate logic in the NodePublishVolume RPC.
  • The spec states that a CO will make a best effort to ensure that RPCs are called in a serial fashion, but that if a CO or SP process crashes, it is the responsibility of the SP to handle recovery behaviors. To that end, it's very possible that a CO sends a NodeStageVolume request to an SP that fails to complete the request, while the CO also fails to receive the error or record the failed attempt. In this case the CO may send a NodePublishVolume request expecting that the aforementioned NodeStageVolume completed successfully. Based on previous conversations I expect that @saad-ali might say the SP should make a best effort to handle this behavior. I double-checked the PR's changes: there is no error code defined for NodePublishVolume to return to a CO indicating that the request's staging_target_path field is set but NodeStageVolume appears not to have been invoked. Without such an error, it would be easy to infer that the intent is for NodePublishVolume to attempt both the NodeStageVolume and NodePublishVolume logic. If NodePublishVolume has to handle both in edge cases anyway, why not just consolidate them?

I agree that the STAGE_UNSTAGE_VOLUME capability is a good idea. I just don't see the need for the additional RPCs. The spec should remain as opaque and simple as possible, and CO members seem to agree with that much of the time. This is one of those cases: an opportunity to either expand the spec or keep it the way it is, achieving the desired functionality with minimal change. I do not see a compelling reason why we are not electing to use the simple, straightforward solution now and expanding it later, and only if necessary.

@jdef (Member) commented Feb 9, 2018 via email

@akutz (Contributor) commented Feb 9, 2018

James,

You missed something in my remarks. I said "as is" with the new STAGE capability. The SP would still be relieved of ref counting; it would just handle the stage logic at the top of the existing RPCs.

@davidz627 (Contributor, Author):

I've reworded the RPC per @jdef's and @cpuguy83's suggestions to be clearer about what this RPC is doing.

I've also added a new error on NodePublishVolume specifically for the case where the volume has not been staged, as I believe this could use a specific call-out.
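A sketch of that new error path, with isMountPoint as a hypothetical helper:

```go
package main

import (
	"context"

	csi "github.com/container-storage-interface/spec/lib/go/csi"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

type nodeServer struct{}

// If the staging step is missing (e.g. lost to a crash on either side),
// reject with FAILED_PRECONDITION so the CO knows to (re)issue NodeStageVolume.
func (s *nodeServer) NodePublishVolume(ctx context.Context, req *csi.NodePublishVolumeRequest) (*csi.NodePublishVolumeResponse, error) {
	mounted, err := isMountPoint(req.StagingTargetPath)
	if err != nil {
		return nil, status.Errorf(codes.Internal, "check staging path: %v", err)
	}
	if !mounted {
		return nil, status.Error(codes.FailedPrecondition,
			"volume not staged: NodeStageVolume must be called first")
	}
	// ... bind mount req.StagingTargetPath to req.TargetPath ...
	return &csi.NodePublishVolumeResponse{}, nil
}

func isMountPoint(path string) (bool, error) { return false, nil } // hypothetical
```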

@cpuguy83 (Contributor) left a comment:

Left a couple of questions.

Thanks!

| Volume does not exist | 5 NOT_FOUND | Indicates that a volume corresponding to the specified `volume_id` does not exist. | Caller MUST verify that the `volume_id` is correct and that the volume is accessible and has not been deleted before retrying with exponential back off. |
| Volume published but is incompatible | 6 ALREADY_EXISTS | Indicates that a volume corresponding to the specified `volume_id` has already been published at the specified `staging_target_path` but is incompatible with the specified `volume_capability` flag. | Caller MUST fix the arguments before retrying. |
| Operation pending for volume | 10 ABORTED | Indicates that there is already an operation pending for the specified volume. In general the Cluster Orchestrator (CO) is responsible for ensuring that there is no more than one call "in-flight" per volume at a given time. However, in some circumstances, the CO MAY lose state (for example when the CO crashes and restarts), and MAY issue multiple calls simultaneously for the same volume. The Plugin SHOULD handle this as gracefully as possible, and MAY return this error code to reject secondary calls. | Caller SHOULD ensure that there are no other calls pending for the specified volume, and then retry with exponential back off. |
| Exceeds capabilities | 9 FAILED_PRECONDITION | Indicates that the CO has exceeded the volume's capabilities because the volume does not have MULTI_NODE capability. | Caller MAY choose to call `ValidateVolumeCapabilities` to validate the volume capabilities, or wait for the volume to be unpublished on the node. |
Contributor:

In what cases would the node plugin be aware of this state?

Contributor Author (@davidz627):

The Node Plugin should always be aware of its NodeServiceCapability; the SP is the one that sets the STAGE_UNSTAGE_VOLUME capability in the first place.

It also receives the parameters in the request, so it can check whether staging_target_path is empty/nil.

spec.md Outdated
// request. The CO SHALL ensure uniqueness of staging_target_path per
// volume.
// This is a REQUIRED field.
string staging_target_path = 4;
Contributor:

I get that this falls in line with the publish RPC, but I'm wondering if this is something that really works here.
Since the CO would not consume this path, I'm not sure it's the CO's responsibility to know anything about it.

Contributor:

Discussed this on slack, the description of this is really the path to publish for the node, and NodePublish's target_path is really the path for the single workload.
This effectively treats the CO as a storage mechanism which I am ok with for this case.

Perhaps the staging_ should be omitted in the stage RPC?

Contributor Author (@davidz627):

I think having the staging_ makes it especially clear that this is the same path as the one that is used as staging_target_path in NodePublishVolume in later calls. Removing it would be naming the same variable differently in different contexts and may introduce some confusion.

The CO MUST guarantee that this RPC is called and returns a success before any `NodePublishVolume` is called for the given volume on the given node.

This operation MUST be idempotent.
If the volume corresponding to the `volume_id` is already staged to the `staging_target_path`, and is identical to the specified `volume_capability` the Plugin MUST reply `0 OK`.
Contributor:

Can we clarify what to do when volume ID exists with a different staging path?

Contributor Author (@davidz627):

As per the NodeStageVolumeRequest description: "The CO SHALL ensure uniqueness of staging_target_path per volume." So this is the CO's job to ensure. If the SP were to receive a non-unique staging_target_path, that would be undefined behavior, and I am OK leaving it as that.

I'd be willing to define the behavior if you think this is not enough for the spec and you have a suggestion for what should happen.

Contributor:

In this case, uniqueness of the path is not an issue; it's that the SP got a second target path for a given volume ID.

Contributor Author (@davidz627):

I've clarified the wording to "The CO SHALL ensure that there is only one staging_target_path per volume." Hope that helps.
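For reference, the idempotency rules from the quoted spec text sketch out like this (isMountPoint and compatible are hypothetical helpers):

```go
package main

import (
	"context"

	csi "github.com/container-storage-interface/spec/lib/go/csi"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

type nodeServer struct{}

// Already staged with a compatible capability returns OK; staged but
// incompatible returns ALREADY_EXISTS, per the quoted error table.
func (s *nodeServer) NodeStageVolume(ctx context.Context, req *csi.NodeStageVolumeRequest) (*csi.NodeStageVolumeResponse, error) {
	mounted, err := isMountPoint(req.StagingTargetPath)
	if err != nil {
		return nil, status.Errorf(codes.Internal, "check %s: %v", req.StagingTargetPath, err)
	}
	if mounted {
		if !compatible(req.VolumeId, req.VolumeCapability) {
			return nil, status.Error(codes.AlreadyExists, "staged with incompatible capability")
		}
		return &csi.NodeStageVolumeResponse{}, nil // idempotent success
	}
	// ... attach and mount at req.StagingTargetPath ...
	return &csi.NodeStageVolumeResponse{}, nil
}

func isMountPoint(path string) (bool, error)                 { return false, nil } // hypothetical
func compatible(volID string, c *csi.VolumeCapability) bool  { return true }       // hypothetical
```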

@saad-ali (Member)

/lgtm

Will let @cpuguy83 provide the second approval.

@akutz (Contributor) commented Feb 15, 2018

Hi All,

I'm just making one last note that these new RPCs don't make things any easier on SP developers. The STAGE_UNSTAGE_VOLUME capability might help, but even that could be up for debate.

The reason is that SP developers should still try to be as exhaustive as they are able when it comes to publication/unpublication on nodes. For example:

  • A NodeUnstageVolume request cannot just check whether a directory exists. The request has to examine the mount table and verify that there isn't a trinary relationship in effect, perhaps even with a directory that doesn't exist but for which there still exists a bind mount entry in the mount table (an annoying, but possible, scenario).
  • A NodePublishVolume request should attempt to create missing path components and scan the bind mount relationships to determine whether the operation is idempotent.

Because developers will still need to implement exhaustive logic in both the Node Stage/Unstage and Publish/Unpublish RPCs, separating them into two sets of RPCs doesn't make anything easier. If anything, it makes things more complicated for developers, who have to duplicate checks across multiple RPCs.

But hey, I guess y'all have already decided that this is what's happening. I just don't understand the desire to add to the spec when opaqueness should be the guiding principle and the existing RPCs, plus perhaps a new capability and the staging_target_path, can already do the job.

cc @codenrhoden @clintkitson

@akutz (Contributor) commented Feb 15, 2018

One last thing: I know most of the project's voting members belong to COs. However, I'm a voice of someone working on both the CO side and developing SPs. I would ask that my arguments be given the consideration that dual focus might justify. Having worked on more than a half dozen CSI SPs so far, I'm saying that this change doesn't make things easier, because ultimately you don't want to fail fast: your goal for an SP is for it to succeed, even when that means doing some exhaustive discovery logic. This change just means that logic is duplicated and chopped up into two locations, making things more complex for SP devs.

I hear @saad-ali arguing that this makes things easier on SP devs, but I'm an SP dev, and I'm saying it doesn't. The only members that seem to be arguing for this are CO members. That seems odd to me.

I'd just ask that y'all look at CSI-VFS's NodePublishVolume and NodeUnpublishVolume RPC implementations. Note that things don't fail fast -- we try to reconnect broken links as it were to avoid a system getting into a bad state.

@cpuguy83 (Contributor)

@akutz I don't think the COs are arguing for this at all. I know this was brought up by the Kube team, but only at the request of some SPs... so maybe we need to re-sync on this topic.

I definitely agree that a good plugin would not rely simply on the CO to ensure these are correct. State will definitely get out of sync.

@akutz (Contributor) commented Feb 15, 2018

Hi Brian,

Yeah, this issue was filed by Saad. If SP devs raised it with him, I was unaware.

@akutz (Contributor) commented Feb 15, 2018

Hi Brian,

Anyway, my point, as you so succinctly restated it (thank you), is that SPs must, or rather should, be capable of detecting drift and reconciling it to sync state. These additional RPCs simply introduce new locations that must implement nearly identical reconciliation logic, and thus add little value for me, an SP dev.

Unless we also add a new RPC that allows a CO to validate its expected state against a Node or Controller, I prefer keeping the number of mutative RPCs to a bare minimum.

@davidz627 (Contributor, Author):

@akutz if NodeStageVolume does not suit the needs of your SP, or your SP is not architected in a way that NodeStageVolume would benefit, then the SP does not need to implement it.

It is an optional call that has its uses and makes development of many SPs easier, as outlined above.

If it is easier for your SP to implement everything in NodePublishVolume, even if that includes reference counting, the SP is welcome to do that. It just should not advertise the STAGE_UNSTAGE_VOLUME capability.

The arguments you made all assume that NodeStageVolume will be forced upon your SPs even though it may not be the best way for them to be architected. That is simply not the case.

@akutz (Contributor) commented Feb 15, 2018

I'm arguing that I've implemented multiple types of SPs and that my advice, that this doesn't actually make things easier, should be given consideration. I know from experience that there is a common enough pattern between SPs that the reasons I outlined above apply to the majority of them, and thus this change has minimal actual value, perhaps even the opposite.

@saad-ali (Member)

A NodeUnstageVolume request cannot just check to see if a directory exists or not. The request has to examine the mount table and verify that there isn't a trinary relationship in effect, perhaps even with a directory that doesn't exist but for which there still exists a bind mount entry in the mount table (an annoying, but possible scenario).

It doesn't have to. Per spec, all an SP has to do is make the specified device available at the specified path. Lots of volume plugins do just this in Kubernetes without extensive additional mount table verification (which is very expensive) and are deployed at scale in production workloads without issues.

A NodePublishVolume request should attempt to create missing path components and scan the bind mount relationships to determine if the operation is idempotent.

Again, this is not true. The CO SHALL ensure that the path exists. And a plugin can simply issue a bind mount. If the bind mount already exists it will simply succeed and be idempotent.
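For example, a publish along those lines reduces to a bind mount (a sketch using golang.org/x/sys/unix; the read-only remount detail is Linux-specific):

```go
package main

import "golang.org/x/sys/unix"

// bindMount publishes a staged volume to a workload's target path. The CO is
// responsible for creating targetPath beforehand, per the spec.
func bindMount(stagingPath, targetPath string, readOnly bool) error {
	if err := unix.Mount(stagingPath, targetPath, "", unix.MS_BIND, ""); err != nil {
		return err
	}
	if readOnly {
		// On Linux, making a bind mount read-only requires a remount.
		return unix.Mount("", targetPath, "", unix.MS_BIND|unix.MS_REMOUNT|unix.MS_RDONLY, "")
	}
	return nil
}
```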

I'm saying that this change doesn't make things easier because ultimately you don't want to fail fast -- your goal for an SP is for it to succeed, even when that means doing some exhaustive discovery logic.

This addition does not (and is not intended to) simplify all plugins. It is an optional addition designed to make a certain class of plugins, those that have a local attach/detach step, easier to implement. It should go without saying that you should not implement it if it doesn't make sense for your plugin or complicates the logic (as it does in your case).

Some plugins may choose to implement more complicated logic, but if we can lower the bar for writing a class of plugins, that will be beneficial to the ecosystem as a whole.

I believe @davidz627 has diligently addressed the feedback so far. And this PR is close to done. @cpuguy83 do you have any other concrete concerns?

@davidz627 could you please rebase the PR.

@cpuguy83 (Contributor)

@saad-ali I think the argument is that this makes the happy case simpler but does not help at all when there are problems.
I know that SPs are requesting this to offload reference counting onto the CO, but this can't really do that, not without opening up a lot of room for errors.

Do we have some concrete examples of an SP that cannot work without this?

@akutz (Contributor) left a comment:

I no longer really have any dog in this fight, and while I stand by what I've said, I agree with @davidz627 that this is optional and doesn't harm SPs that don't wish to implement it. I just worry that the bigger picture with regard to state won't be as apparent to new SP devs, and these new RPCs could be tricky to implement correctly. Either way, I'm going to approve the PR and wish y'all the best of luck!

@saad-ali (Member)

@cpuguy83 the purpose of this call was always to lower the bar for writing a plugin, not to fix a blocking issue. Volume plugins can implement this logic in the existing NodePublishVolume call, but, for a class of plugins, doing so requires more complicated logic in the plugin.

I disagree that plugins that use this will open themselves up to errors. If this is a concern, let's be more concrete about the error cases that would become more common. It is important to note that this is a model many volume plugins already follow in Kubernetes without issue.

"A workload referencing a specified volume is scheduled to a node for the first time" and "the last workload referencing a specified volume is removed from a node" are important events, and plugins should be able to hook into them if they choose.

@cpuguy83 (Contributor) left a comment:

LGTM

Needs a rebase.

akutz and others added 2 commits February 20, 2018 12:07
This patch handles issue container-storage-interface#119 by adding two new RPCs,
"NodePublishDevice" and "NodeUnpublishDevice". These RPCs MUST be called
by the CO if the Node Plugin advertises the "PUBLISH_UNPUBLISH_DEVICE"
capability. Plugins that advertise this capability SHOULD defer volume
reference counting to the CO.
…deStageVolume/NodeUnstageVolume. Addressed jdef comments. Added RPC interactions and reference counting section

NodeStageVolume modifications: added publish_volume_info to response. Changed credential naming. Added credentials to unstage.
@davidz627 (Contributor, Author):

Rebased. Thanks everyone for the thorough reviews and questions!

@saad-ali (Member) left a comment:

LGTM

@saad-ali saad-ali merged commit 2571bd3 into container-storage-interface:master Feb 20, 2018
@jdef jdef mentioned this pull request Feb 22, 2018