CSI: persist previous mounts on client to restore during restart #17840

tgross · 2023-07-07T17:43:21Z

When claiming a CSI volume, we need to ensure the CSI node plugin is running before we send any CSI RPCs. This extends even to the controller publish RPC because it requires the storage provider's "external node ID" for the client. This primarily impacts client restarts because the client's initial fingerprint won't include restored plugins.

Unfortunately there's no mapping of volume to plugin ID available in the jobspec, so we don't have enough information to wait on plugins until we either get the volume from the server or retrieve the plugin ID from data we've persisted on the client.

If we always require getting the volume from the server before making the claim, a client restart for disconnected clients will cause all the allocations that need CSI volumes to fail. Even while connected, checking in with the server to verify the volume's plugin before trying to make a claim RPC is inherently racy, so we'll leave that case as-is and it will fail the claim if the node plugin needed to support a newly-placed allocation is flapping such that the node fingerprint is changing.

So instead of blocking for information from the server, this changeset persists a minimum subset of data about the volume and its plugin in the client state DB, and retrieves that data during the CSI hook's prerun to avoid re-claiming and remounting the volume unnecessarily. This fixes the client restart bug but also reduces Nomad RPC and CSI RPC traffic during client restarts.

This changeset also updates the RPC handler to use the external node ID from the claim whenever it is available.

Fixes: #13028

Note for reviewers: this looks like a bit of a monster of a PR because of all the plumbing required for adding a new client state DB object. The bulk of the non-plumbing changes are in client/allocrunner/csi_hook.go, which GitHub is helpfully folding. 😀 An unfortunate amount of the churn in that file is having to break up the claimVolume function a bit, but I think the top-level flow in Prerun is a lot more legible as a result. In addition to expanding the test coverage in the unit test, I've also end-to-end tested this with the AWS EBS plugin and restarting the client in various configurations. (Also note this will not do anything for #15415)

When claiming a CSI volume, we need to ensure the CSI node plugin is running before we send any CSI RPCs. This extends even to the controller publish RPC because it requires the storage provider's "external node ID" for the client. This primarily impacts client restarts but also is a problem if the node plugin exits (and fingerprints) while the allocation that needs a CSI volume claim is being placed. Unfortunately there's no mapping of volume to plugin ID available in the jobspec, so we don't have enough information to wait on plugins until we either get the volume from the server or retrieve the plugin ID from data we've persisted on the client. If we always require getting the volume from the server before making the claim, a client restart for disconnected clients will cause all the allocations that need CSI volumes to fail. Even while connected, checking in with the server to verify the volume's plugin before trying to make a claim RPC is inherently racy, so we'll leave that case as-is and it will fail the claim if the node plugin needed to support a newly-placed allocation is flapping such that the node fingerprint is changing. This changeset persists a minimum subset of data about the volume and its plugin in the client state DB, and retrieves that data during the CSI hook's prerun to avoid re-claiming and remounting the volume unnecessarily. This changeset also updates the RPC handler to use the external node ID from the claim whenever it is available. Fixes: #13028

jrasell

LGTM!

) When claiming a CSI volume, we need to ensure the CSI node plugin is running before we send any CSI RPCs. This extends even to the controller publish RPC because it requires the storage provider's "external node ID" for the client. This primarily impacts client restarts but also is a problem if the node plugin exits (and fingerprints) while the allocation that needs a CSI volume claim is being placed. Unfortunately there's no mapping of volume to plugin ID available in the jobspec, so we don't have enough information to wait on plugins until we either get the volume from the server or retrieve the plugin ID from data we've persisted on the client. If we always require getting the volume from the server before making the claim, a client restart for disconnected clients will cause all the allocations that need CSI volumes to fail. Even while connected, checking in with the server to verify the volume's plugin before trying to make a claim RPC is inherently racy, so we'll leave that case as-is and it will fail the claim if the node plugin needed to support a newly-placed allocation is flapping such that the node fingerprint is changing. This changeset persists a minimum subset of data about the volume and its plugin in the client state DB, and retrieves that data during the CSI hook's prerun to avoid re-claiming and remounting the volume unnecessarily. This changeset also updates the RPC handler to use the external node ID from the claim whenever it is available. Fixes: #13028

) (#17880) When claiming a CSI volume, we need to ensure the CSI node plugin is running before we send any CSI RPCs. This extends even to the controller publish RPC because it requires the storage provider's "external node ID" for the client. This primarily impacts client restarts but also is a problem if the node plugin exits (and fingerprints) while the allocation that needs a CSI volume claim is being placed. Unfortunately there's no mapping of volume to plugin ID available in the jobspec, so we don't have enough information to wait on plugins until we either get the volume from the server or retrieve the plugin ID from data we've persisted on the client. If we always require getting the volume from the server before making the claim, a client restart for disconnected clients will cause all the allocations that need CSI volumes to fail. Even while connected, checking in with the server to verify the volume's plugin before trying to make a claim RPC is inherently racy, so we'll leave that case as-is and it will fail the claim if the node plugin needed to support a newly-placed allocation is flapping such that the node fingerprint is changing. This changeset persists a minimum subset of data about the volume and its plugin in the client state DB, and retrieves that data during the CSI hook's prerun to avoid re-claiming and remounting the volume unnecessarily. This changeset also updates the RPC handler to use the external node ID from the claim whenever it is available. Fixes: #13028 Co-authored-by: Tim Gross <[email protected]>

) (#17879) When claiming a CSI volume, we need to ensure the CSI node plugin is running before we send any CSI RPCs. This extends even to the controller publish RPC because it requires the storage provider's "external node ID" for the client. This primarily impacts client restarts but also is a problem if the node plugin exits (and fingerprints) while the allocation that needs a CSI volume claim is being placed. Unfortunately there's no mapping of volume to plugin ID available in the jobspec, so we don't have enough information to wait on plugins until we either get the volume from the server or retrieve the plugin ID from data we've persisted on the client. If we always require getting the volume from the server before making the claim, a client restart for disconnected clients will cause all the allocations that need CSI volumes to fail. Even while connected, checking in with the server to verify the volume's plugin before trying to make a claim RPC is inherently racy, so we'll leave that case as-is and it will fail the claim if the node plugin needed to support a newly-placed allocation is flapping such that the node fingerprint is changing. This changeset persists a minimum subset of data about the volume and its plugin in the client state DB, and retrieves that data during the CSI hook's prerun to avoid re-claiming and remounting the volume unnecessarily. This changeset also updates the RPC handler to use the external node ID from the claim whenever it is available. Fixes: #13028 Co-authored-by: Tim Gross <[email protected]>

) (#17878) When claiming a CSI volume, we need to ensure the CSI node plugin is running before we send any CSI RPCs. This extends even to the controller publish RPC because it requires the storage provider's "external node ID" for the client. This primarily impacts client restarts but also is a problem if the node plugin exits (and fingerprints) while the allocation that needs a CSI volume claim is being placed. Unfortunately there's no mapping of volume to plugin ID available in the jobspec, so we don't have enough information to wait on plugins until we either get the volume from the server or retrieve the plugin ID from data we've persisted on the client. If we always require getting the volume from the server before making the claim, a client restart for disconnected clients will cause all the allocations that need CSI volumes to fail. Even while connected, checking in with the server to verify the volume's plugin before trying to make a claim RPC is inherently racy, so we'll leave that case as-is and it will fail the claim if the node plugin needed to support a newly-placed allocation is flapping such that the node fingerprint is changing. This changeset persists a minimum subset of data about the volume and its plugin in the client state DB, and retrieves that data during the CSI hook's prerun to avoid re-claiming and remounting the volume unnecessarily. This changeset also updates the RPC handler to use the external node ID from the claim whenever it is available. Fixes: #13028 Co-authored-by: Tim Gross <[email protected]>

vercel bot deployed to Preview – nomad-storybook-and-ui July 7, 2023 17:48 View deployment

tgross added theme/storage type/bug hcc/cst Admin - internal labels Jul 7, 2023

tgross added this to the 1.6.0 milestone Jul 7, 2023

tgross force-pushed the csi-agent-restart-13028 branch from 5c88926 to efa385a Compare July 7, 2023 17:59

vercel bot deployed to Preview – nomad-storybook-and-ui July 7, 2023 18:01 View deployment

tgross force-pushed the csi-agent-restart-13028 branch from efa385a to 89792f1 Compare July 7, 2023 18:11

vercel bot deployed to Preview – nomad-storybook-and-ui July 7, 2023 18:14 View deployment

tgross force-pushed the csi-agent-restart-13028 branch from 89792f1 to 7a4f0af Compare July 7, 2023 18:28

vercel bot deployed to Preview – nomad-storybook-and-ui July 7, 2023 18:31 View deployment

tgross marked this pull request as ready for review July 7, 2023 18:48

tgross requested review from jrasell and gulducat July 7, 2023 18:48

tgross mentioned this pull request Jul 7, 2023

Allocations with CSI volume fail during Nomad agent restart #13028

Closed

jrasell approved these changes Jul 10, 2023

View reviewed changes

tgross added backport/1.4.x backport to 1.4.x release line backport/1.5.x backport to 1.5.x release line backport/1.3.x backport to 1.3.x release line labels Jul 10, 2023

tgross merged commit 0ba7d00 into main Jul 10, 2023

tgross deleted the csi-agent-restart-13028 branch July 10, 2023 17:20

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CSI: persist previous mounts on client to restore during restart #17840

CSI: persist previous mounts on client to restore during restart #17840

tgross commented Jul 7, 2023 •

edited

Loading

jrasell left a comment

CSI: persist previous mounts on client to restore during restart #17840

CSI: persist previous mounts on client to restore during restart #17840

Conversation

tgross commented Jul 7, 2023 • edited Loading

jrasell left a comment

Choose a reason for hiding this comment

tgross commented Jul 7, 2023 •

edited

Loading