
dynamic host volumes: client state #24595

Merged: 7 commits from dhv-client-state merged into dynamic-host-volumes on Dec 3, 2024

Conversation

@gulducat (Member) commented Dec 2, 2024

Store dynamic host volume creations in client state, so they can be "restored" on agent restart. Restore works by repeating the same Create operation as initial creation, and expecting the plugin to be idempotent.

This is (potentially) especially important after host restarts, which may have dropped mount points or such.

One particular thing I'm not sure of is when to error fatally if there are state operation errors... I don't have good intuition about the failure states there, but it does seem Bad for state to drift at all?

There's a decent chunk of code here, but a lot of it is just satisfying interfaces.
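
For illustration, here is a minimal Go sketch of the restore-on-restart idea described above; the stateReader interface and the GetDynamicHostVolumes method are placeholders, not necessarily what this PR implements:

package hostvolumemanager

import (
	"context"
	"log/slog"
)

type CreateRequest struct{ ID, PluginID string }

type VolumeState struct {
	ID        string
	CreateReq *CreateRequest
}

// Plugin is assumed to have an idempotent Create: calling it again for a
// volume that already exists must succeed and re-establish mounts, etc.
type Plugin interface {
	Create(ctx context.Context, req *CreateRequest) (path string, err error)
}

type stateReader interface {
	GetDynamicHostVolumes() ([]*VolumeState, error)
}

type HostVolumeManager struct {
	log      *slog.Logger
	stateMgr stateReader
	plugins  map[string]Plugin
}

// restoreFromState replays Create for every volume saved in client state.
func (hvm *HostVolumeManager) restoreFromState(ctx context.Context) error {
	vols, err := hvm.stateMgr.GetDynamicHostVolumes()
	if err != nil {
		return err // the client state itself is unreadable
	}
	for _, vol := range vols {
		plug, ok := hvm.plugins[vol.CreateReq.PluginID]
		if !ok {
			hvm.log.Error("plugin not found for restore", "plugin_id", vol.CreateReq.PluginID)
			continue
		}
		// repeat the original Create; an idempotent plugin re-creates any
		// mount points or files lost across a host reboot
		if _, err := plug.Create(ctx, vol.CreateReq); err != nil {
			hvm.log.Error("failed to restore", "volume_id", vol.ID, "error", err)
		}
	}
	return nil
}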

client/client.go Outdated
cfg.HostVolumePluginDir,
cfg.AllocMountsDir)
if err != nil {
return nil, err // db TODO(1.10.0): don't fail the whole client if state restore fails?
Member commented:

It sort of depends on the error. When we restore allocations it's not fatal to fail restoring a single allocation, because we assume that the allocation may have been destroyed while we were down (ex. the host rebooted), and there's a process to recover (check back in with the server, restart the alloc). But if we can't read the client state at all, then we're in a fatal scenario and want to stop the client from starting and making things worse / undebuggable.
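
As an illustration of that split, here is a sketch that tags per-volume restore failures with a sentinel error so the caller can tell them apart from an unreadable state store; the error name and caller shape are hypothetical, not what the PR ends up doing:

package hostvolumemanager

import (
	"errors"
	"fmt"
)

// ErrVolumeRestore marks a per-volume restore failure: worth logging, but not
// a reason to stop the client from starting.
var ErrVolumeRestore = errors.New("host volume restore failed")

// wrapRestoreErr is a hypothetical helper the restore loop could use.
func wrapRestoreErr(volID string, err error) error {
	return fmt.Errorf("%w: volume %s: %v", ErrVolumeRestore, volID, err)
}

// The caller side (client startup) might then look like:
//
//	if err := hvm.restoreFromState(ctx); err != nil {
//		if errors.Is(err, ErrVolumeRestore) {
//			logger.Error("some host volumes failed to restore", "error", err)
//		} else {
//			return nil, err // client state unreadable: fatal
//		}
//	}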

Member Author commented:

error handling nailed down in 8043871

client/hostvolumemanager/host_volumes.go: 3 resolved comment threads (outdated)
}
if err := hvm.stateMgr.PutDynamicHostVolume(volState); err != nil {
hvm.log.Error("failed to save volume in state", "volume_id", req.ID, "error", err)
// db TODO: bail or nah?
Member commented:

If we create the volume on-disk but not in client state / server state, we'll end up with stray volumes we have no way of cleaning up. I'd try to implement a transaction-like semantic here -- if we can't fully complete the task, we should destroy the volume when we're done.
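
A rough sketch of that transaction-like cleanup, assuming the plugin also exposes a Delete that can roll back the on-disk volume; all type and function names here are illustrative:

package hostvolumemanager

import (
	"context"
	"fmt"
	"log/slog"
)

type CreateRequest struct{ ID, PluginID string }
type CreateResponse struct{ Path string }

type VolumeState struct {
	ID        string
	CreateReq *CreateRequest
}

type plugin interface {
	Create(ctx context.Context, req *CreateRequest) (*CreateResponse, error)
	Delete(ctx context.Context, req *CreateRequest) error
}

type stateWriter interface {
	PutDynamicHostVolume(*VolumeState) error
}

// createWithRollback creates the volume via the plugin, then persists it in
// client state; if persisting fails, it deletes the just-created volume so
// disk and state do not diverge.
func createWithRollback(ctx context.Context, log *slog.Logger,
	plug plugin, state stateWriter, req *CreateRequest) (*CreateResponse, error) {

	resp, err := plug.Create(ctx, req)
	if err != nil {
		return nil, err
	}
	volState := &VolumeState{ID: req.ID, CreateReq: req}
	if err := state.PutDynamicHostVolume(volState); err != nil {
		// roll back the on-disk volume; otherwise neither the client nor the
		// server knows it exists, and nothing could ever clean it up
		if delErr := plug.Delete(ctx, req); delErr != nil {
			log.Warn("failed to clean up volume after state error",
				"volume_id", req.ID, "error", delErr)
		}
		return nil, fmt.Errorf("failed to save volume state: %w", err)
	}
	return resp, nil
}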

// db TODO(1.10.0): save the client state!
if err := hvm.stateMgr.DeleteDynamicHostVolume(req.ID); err != nil {
hvm.log.Error("failed to delete volume in state", "volume_id", req.ID, "error", err)
// db TODO: bail or nah?
Member commented:

If we fail to delete the client state but don't return an error, then we'll try to restore the volume when the client restarts, leaving a stray volume that shouldn't exist but also doesn't exist in the server and therefore can't be deleted. So we should probably return an error here so the user can retry.
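
A small sketch of surfacing that failure instead of only logging it, so the RPC caller sees the error and the user can retry; names here are illustrative:

package hostvolumemanager

import (
	"fmt"
	"log/slog"
)

type stateDeleter interface {
	DeleteDynamicHostVolume(id string) error
}

// deleteVolumeState removes the volume from client state and returns the
// error rather than swallowing it; otherwise the client would try to
// "restore" a volume on restart that the server no longer knows about.
func deleteVolumeState(log *slog.Logger, state stateDeleter, volID string) error {
	if err := state.DeleteDynamicHostVolume(volID); err != nil {
		log.Error("failed to delete volume in state", "volume_id", volID, "error", err)
		return fmt.Errorf("failed to delete volume %s from client state: %w", volID, err)
	}
	return nil
}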

client/state/db_bolt.go: resolved comment thread (outdated)
client/hostvolumemanager/host_volumes_test.go: resolved comment thread (outdated)
client/client.go Outdated
cfg.HostVolumePluginDir,
cfg.AllocMountsDir)
if err != nil {
return nil, err // db TODO(1.10.0): don't fail the whole client if state restore fails?
Contributor commented:

What is our usual behavior when state restore fails on the client? I think we shouldn't fail the client start completely, but would be good to stay consistent.

Contributor commented:

ah I see Tim already commented #24595 (comment)
don't mind me

Member Author commented:

I ended up with this 8043871#diff-bd3a55a72186f59e2e63efb4951573b2f9e4a7cc98086e922b0859f8ccc1dd09R543-R547

let me know if you have other ideas!

@tgross force-pushed the dynamic-host-volumes branch from bef9714 to 8c3d8fe on December 3, 2024 19:11
@tgross (Member) left a review:

LGTM!

@@ -165,7 +164,7 @@ func (p *HostVolumePluginExternal) Create(ctx context.Context,
}

var pluginResp HostVolumePluginCreateResponse
err = json.Unmarshal(stdout, &pluginResp)
err = json.Unmarshal(stdout, &pluginResp) // db TODO(1.10.0): if this fails, then the volume may have been created, according to the plugin, but Nomad will not save it
Member commented:

The tricky thing here is that this means the plugin isn't behaving according to the spec, which makes it hard for us to say that it's ok. If it's not behaving according to the spec, maybe it isn't idempotent either, so a retry won't work. Not great. But so long as we log it (somewhere upstream of here is fine) and return the error to the user, I think that's about as good as we can do here.
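
As a sketch of 'log it and return the error to the user', one option is to include the raw plugin output in the error so the operator can see what the off-spec plugin actually printed; the response fields below are placeholders:

package hostvolumemanager

import (
	"encoding/json"
	"fmt"
)

// HostVolumePluginCreateResponse mirrors the real type in this PR in name
// only; the fields here are placeholders for whatever the plugin spec defines.
type HostVolumePluginCreateResponse struct {
	Path string `json:"path"`
}

// parsePluginCreateResponse decodes the plugin's stdout. If decoding fails,
// the plugin is off-spec and the volume may or may not exist on disk, so the
// best we can do is return an error that carries the raw output.
func parsePluginCreateResponse(stdout []byte) (*HostVolumePluginCreateResponse, error) {
	var resp HostVolumePluginCreateResponse
	if err := json.Unmarshal(stdout, &resp); err != nil {
		return nil, fmt.Errorf("plugin returned unparseable create response %q: %w", stdout, err)
	}
	return &resp, nil
}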

Comment on lines +81 to +84
if _, err := plug.Create(ctx, vol.CreateReq); err != nil {
// plugin execution errors are only logged
hvm.log.Error("failed to restore", "plugin_id", vol.CreateReq.PluginID, "volume_id", vol.ID, "error", err)
}
Member commented:

Something you'll want to look at when you get to volume fingerprinting is what we do about the fingerprint of a volume we can't restore.

@gulducat gulducat merged commit 70bacbd into dynamic-host-volumes Dec 3, 2024
17 checks passed
@gulducat gulducat deleted the dhv-client-state branch December 3, 2024 21:47
tgross pushed a commit that referenced this pull request Dec 9, 2024
store dynamic host volume creations in client state,
so they can be "restored" on agent restart. restore works
by repeating the same Create operation as initial creation,
and expecting the plugin to be idempotent.

this is (potentially) especially important after host restarts,
which may have dropped mount points or such.
tgross pushed commits that referenced this pull request on Dec 13, 2024 and Dec 19, 2024, with the same commit message as above.