CSI: skip node unpublish on GC'd or down nodes #13301

tgross · 2022-06-08T20:48:58Z

If the node has been GC'd or is down, we can't send it a node
unpublish. The CSI spec requires that we don't send the controller
unpublish before the node unpublish, but in the case where a node is
gone we can't know the final fate of the node unpublish step.

The csi_hook on the client will unpublish if the allocation has
stopped, and if the host is terminated there's no mount for the volume
anyway. So we'll now assume that the node has unpublished on its
end. If it hasn't, any controller unpublish will potentially hang or
error and need to be retried.

(Note that while this behavior isn't ideal, it appears to match user
expectations and the behavior reported by k8s users.)
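As a rough illustration of the ordering described above, here is a minimal Go sketch. It is not the code from this PR: the `Node` type, the `StatusDown` constant, and both functions are hypothetical names invented for the example, and a `nil` node stands in for one that has been garbage collected.

```go
package main

import "fmt"

// Node is a hypothetical, simplified stand-in for a client node record.
type Node struct {
	ID     string
	Status string
}

// StatusDown is an assumed status value used only in this sketch.
const StatusDown = "down"

// shouldSendNodeUnpublish reports whether a node unpublish can be sent.
// A nil node models a node that has been garbage collected.
func shouldSendNodeUnpublish(node *Node) bool {
	if node == nil {
		return false // GC'd: there is nothing left to send the request to
	}
	if node.Status == StatusDown {
		return false // down: the request could never be delivered
	}
	return true
}

// unpublishSteps returns the ordered steps taken to free a claim in this
// sketch. The node step, when it is sent, always precedes the controller
// step, which preserves the ordering the CSI spec asks for.
func unpublishSteps(node *Node) []string {
	steps := []string{}
	if shouldSendNodeUnpublish(node) {
		steps = append(steps, "node-unpublish")
	}
	// If the node step was skipped, we assume the node already unpublished.
	// A controller unpublish that fails here would need to be retried.
	steps = append(steps, "controller-unpublish")
	return steps
}

func main() {
	fmt.Println(unpublishSteps(&Node{ID: "n1", Status: "ready"}))
	fmt.Println(unpublishSteps(&Node{ID: "n2", Status: StatusDown}))
	fmt.Println(unpublishSteps(nil))
}
```

The point of the sketch is that a missing or down node no longer blocks the claim from being freed; the cost is that the controller step may fail if the node never actually unmounted the volume.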

tgross · 2022-06-08T20:54:44Z

.changelog/13301.txt

@@ -0,0 +1,3 @@
+```release-note:bug
+csi: Fixed a bug where volume claims on lost or garbage collected nodes could not be freed
+```

Note for reviewers: I'm torn on whether to call this a bug or an improvement, but calling it a bug makes it something we can backport, so I'm leaning that way.

lgfa29

I was thinking about other possible node statuses (like initializing or draining), but those would be fine since the node is still around to handle the request, right?

tgross · 2022-06-09T15:33:13Z

> I was thinking about other possible node statuses (like initializing or draining), but those would be fine since the node is still around to handle the request, right?

Exactly! The disconnected state is a little weird as well, but in that case we're keeping the claims if the node can be marked disconnected, because the allocations will still be up and running, just temporarily unavailable.
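The status handling discussed in this exchange can be summarized as a small decision function. This is only a sketch of the reasoning in the thread, not Nomad's implementation: `claimAction`, its return values, and the status strings are assumptions chosen to mirror the words used in the comments.

```go
package main

import "fmt"

// claimAction returns what this sketch would do with a volume claim, given
// whether the node record still exists and its (assumed) status label.
func claimAction(nodeExists bool, status string) string {
	switch {
	case !nodeExists, status == "down":
		// GC'd or down: the node cannot receive a node unpublish, so skip it.
		return "skip-node-unpublish"
	case status == "disconnected":
		// The allocations are presumed to still be running, so keep the claim.
		return "keep-claim"
	default:
		// initializing, draining, ready: the node is around to handle the request.
		return "send-node-unpublish"
	}
}

func main() {
	for _, status := range []string{"initializing", "draining", "ready", "disconnected", "down"} {
		fmt.Printf("%-13s %s\n", status, claimAction(true, status))
	}
	fmt.Printf("%-13s %s\n", "(GC'd)", claimAction(false, ""))
}
```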

github-actions · 2022-12-24T02:12:07Z

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

tgross added the theme/storage label Jun 8, 2022

tgross added this to the 1.3.2 milestone Jun 8, 2022

tgross force-pushed the csi-discard-claims-on-gcd-nodes branch from eec9b4b to e1d6b40 Compare June 8, 2022 20:52

tgross added the type/enhancement label Jun 8, 2022

tgross force-pushed the csi-discard-claims-on-gcd-nodes branch from e1d6b40 to 9de444e Compare June 8, 2022 20:54

tgross commented Jun 8, 2022

tgross marked this pull request as ready for review June 9, 2022 14:17

tgross requested review from lgfa29 and shoenig June 9, 2022 14:18

lgfa29 approved these changes Jun 9, 2022

tgross merged commit dd1bbbe into main Jun 9, 2022

tgross deleted the csi-discard-claims-on-gcd-nodes branch June 9, 2022 15:33

This was referenced Jun 9, 2022

remove unbackportable test #13311

Merged

remove unbackportable test #13312

Merged

This was referenced Jun 17, 2022

Allocs using CSI stuck in pending after terminating client node #13416

Closed

Fail to detach ceph csi volume from a down node and migrate to another #13450

Closed

tgross added the backport/1.3.x backport to 1.3.x release line label Aug 23, 2022

tgross modified the milestones: 1.3.2, 1.3.4 Aug 23, 2022

hc-github-team-nomad-core mentioned this pull request Aug 23, 2022

Backport of CSI: skip node unpublish on GC'd or down nodes into release/1.3.x #14240

Merged

This was referenced Sep 27, 2022

backport to 1.2.x: CSI: skip node unpublish on GC'd or down nodes #14720

Merged

backport to 1.1.x: CSI: skip node unpublish on GC'd or down nodes #14721

Merged

github-actions bot locked as resolved and limited conversation to collaborators Dec 24, 2022
