
dynamic host volumes: client state #24595

Merged: 7 commits from dhv-client-state merged into dynamic-host-volumes on Dec 3, 2024

Conversation

@gulducat (Member) commented Dec 2, 2024

Store dynamic host volume creations in client state, so they can be "restored" on agent restart. Restore works by repeating the same Create operation as initial creation, and expecting the plugin to be idempotent.

This is (potentially) especially important after host restarts, which may have dropped mount points or such.

One particular thing I'm not sure of is when to error fatally if there are state operation errors... I don't have good intuition about the failure states there, but it does seem Bad for state to drift at all?

There's a decent chunk of code here, but a lot of it is just satisfying interfaces.
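
For illustration, here is a minimal Go sketch of the restore-on-restart idea described above; the stateReader interface and the GetDynamicHostVolumes method are placeholders, not necessarily what this PR implements:

package hostvolumemanager

import (
	"context"
	"log/slog"
)

type CreateRequest struct{ ID, PluginID string }

type VolumeState struct {
	ID        string
	CreateReq *CreateRequest
}

// Plugin is assumed to have an idempotent Create: calling it again for a
// volume that already exists must succeed and re-establish mounts, etc.
type Plugin interface {
	Create(ctx context.Context, req *CreateRequest) (path string, err error)
}

type stateReader interface {
	GetDynamicHostVolumes() ([]*VolumeState, error)
}

type HostVolumeManager struct {
	log      *slog.Logger
	stateMgr stateReader
	plugins  map[string]Plugin
}

// restoreFromState replays Create for every volume saved in client state.
func (hvm *HostVolumeManager) restoreFromState(ctx context.Context) error {
	vols, err := hvm.stateMgr.GetDynamicHostVolumes()
	if err != nil {
		return err // the client state itself is unreadable
	}
	for _, vol := range vols {
		plug, ok := hvm.plugins[vol.CreateReq.PluginID]
		if !ok {
			hvm.log.Error("plugin not found for restore", "plugin_id", vol.CreateReq.PluginID)
			continue
		}
		// repeat the original Create; an idempotent plugin re-creates any
		// mount points or files lost across a host reboot
		if _, err := plug.Create(ctx, vol.CreateReq); err != nil {
			hvm.log.Error("failed to restore", "volume_id", vol.ID, "error", err)
		}
	}
	return nil
}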

client/client.go Outdated
cfg.HostVolumePluginDir,
cfg.AllocMountsDir)
if err != nil {
return nil, err // db TODO(1.10.0): don't fail the whole client if state restore fails?
Member commented:

It sort of depends on the error. When we restore allocations it's not fatal to fail restoring a single allocation, because we assume that the allocation may have been destroyed while we were down (ex. the host rebooted), and there's a process to recover (check back in with the server, restart the alloc). But if we can't read the client state at all, then we're in a fatal scenario and want to stop the client from starting and making things worse / undebuggable.
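
As an illustration of that split, here is a sketch that tags per-volume restore failures with a sentinel error so the caller can tell them apart from an unreadable state store; the error name and caller shape are hypothetical, not what the PR ends up doing:

package hostvolumemanager

import (
	"errors"
	"fmt"
)

// ErrVolumeRestore marks a per-volume restore failure: worth logging, but not
// a reason to stop the client from starting.
var ErrVolumeRestore = errors.New("host volume restore failed")

// wrapRestoreErr is a hypothetical helper the restore loop could use.
func wrapRestoreErr(volID string, err error) error {
	return fmt.Errorf("%w: volume %s: %v", ErrVolumeRestore, volID, err)
}

// The caller side (client startup) might then look like:
//
//	if err := hvm.restoreFromState(ctx); err != nil {
//		if errors.Is(err, ErrVolumeRestore) {
//			logger.Error("some host volumes failed to restore", "error", err)
//		} else {
//			return nil, err // client state unreadable: fatal
//		}
//	}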

Member Author commented:

error handling nailed down in 8043871

client/hostvolumemanager/host_volumes.go: 3 resolved comment threads (outdated)
}
if err := hvm.stateMgr.PutDynamicHostVolume(volState); err != nil {
hvm.log.Error("failed to save volume in state", "volume_id", req.ID, "error", err)
// db TODO: bail or nah?
Member commented:

If we create the volume on-disk but not in client state / server state, we'll end up with stray volumes we have no way of cleaning up. I'd try to implement a transaction-like semantic here -- if we can't fully complete the task, we should destroy the volume when we're done.
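
A rough sketch of that transaction-like cleanup, assuming the plugin also exposes a Delete that can roll back the on-disk volume; all type and function names here are illustrative:

package hostvolumemanager

import (
	"context"
	"fmt"
	"log/slog"
)

type CreateRequest struct{ ID, PluginID string }
type CreateResponse struct{ Path string }

type VolumeState struct {
	ID        string
	CreateReq *CreateRequest
}

type plugin interface {
	Create(ctx context.Context, req *CreateRequest) (*CreateResponse, error)
	Delete(ctx context.Context, req *CreateRequest) error
}

type stateWriter interface {
	PutDynamicHostVolume(*VolumeState) error
}

// createWithRollback creates the volume via the plugin, then persists it in
// client state; if persisting fails, it deletes the just-created volume so
// disk and state do not diverge.
func createWithRollback(ctx context.Context, log *slog.Logger,
	plug plugin, state stateWriter, req *CreateRequest) (*CreateResponse, error) {

	resp, err := plug.Create(ctx, req)
	if err != nil {
		return nil, err
	}
	volState := &VolumeState{ID: req.ID, CreateReq: req}
	if err := state.PutDynamicHostVolume(volState); err != nil {
		// roll back the on-disk volume; otherwise neither the client nor the
		// server knows it exists, and nothing could ever clean it up
		if delErr := plug.Delete(ctx, req); delErr != nil {
			log.Warn("failed to clean up volume after state error",
				"volume_id", req.ID, "error", delErr)
		}
		return nil, fmt.Errorf("failed to save volume state: %w", err)
	}
	return resp, nil
}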

// db TODO(1.10.0): save the client state!
if err := hvm.stateMgr.DeleteDynamicHostVolume(req.ID); err != nil {
hvm.log.Error("failed to delete volume in state", "volume_id", req.ID, "error", err)
// db TODO: bail or nah?
Member commented:

If we fail to delete the client state but don't return an error, then we'll try to restore the volume when the client restarts, leaving a stray volume that shouldn't exist but also doesn't exist in the server and therefore can't be deleted. So we should probably return an error here so the user can retry.
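
A small sketch of surfacing that failure instead of only logging it, so the RPC caller sees the error and the user can retry; names here are illustrative:

package hostvolumemanager

import (
	"fmt"
	"log/slog"
)

type stateDeleter interface {
	DeleteDynamicHostVolume(id string) error
}

// deleteVolumeState removes the volume from client state and returns the
// error rather than swallowing it; otherwise the client would try to
// "restore" a volume on restart that the server no longer knows about.
func deleteVolumeState(log *slog.Logger, state stateDeleter, volID string) error {
	if err := state.DeleteDynamicHostVolume(volID); err != nil {
		log.Error("failed to delete volume in state", "volume_id", volID, "error", err)
		return fmt.Errorf("failed to delete volume %s from client state: %w", volID, err)
	}
	return nil
}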

client/state/db_bolt.go: resolved comment thread (outdated)
client/hostvolumemanager/host_volumes_test.go: resolved comment thread (outdated)
client/client.go Outdated
cfg.HostVolumePluginDir,
cfg.AllocMountsDir)
if err != nil {
return nil, err // db TODO(1.10.0): don't fail the whole client if state restore fails?
Contributor commented:

What is our usual behavior when state restore fails on the client? I think we shouldn't fail the client start completely, but would be good to stay consistent.

Contributor commented:

ah I see Tim already commented #24595 (comment)
don't mind me

Member Author commented:

I ended up with this 8043871#diff-bd3a55a72186f59e2e63efb4951573b2f9e4a7cc98086e922b0859f8ccc1dd09R543-R547

let me know if you have other ideas!

@tgross force-pushed the dynamic-host-volumes branch from bef9714 to 8c3d8fe on December 3, 2024 19:11
@tgross (Member) left a review:

LGTM!

@@ -165,7 +164,7 @@ func (p *HostVolumePluginExternal) Create(ctx context.Context,
}

var pluginResp HostVolumePluginCreateResponse
err = json.Unmarshal(stdout, &pluginResp)
err = json.Unmarshal(stdout, &pluginResp) // db TODO(1.10.0): if this fails, then the volume may have been created, according to the plugin, but Nomad will not save it
Member commented:

The tricky thing here is that this means the plugin isn't behaving according to the spec, which makes it hard for us to say that it's ok. If it's not behaving according to the spec, maybe it isn't idempotent either, so a retry won't work. Not great. But so long as we log it (somewhere upstream of here is fine) and return the error to the user, I think that's about as good as we can do here.
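
As a sketch of 'log it and return the error to the user', one option is to include the raw plugin output in the error so the operator can see what the off-spec plugin actually printed; the response fields below are placeholders:

package hostvolumemanager

import (
	"encoding/json"
	"fmt"
)

// HostVolumePluginCreateResponse mirrors the real type in this PR in name
// only; the fields here are placeholders for whatever the plugin spec defines.
type HostVolumePluginCreateResponse struct {
	Path string `json:"path"`
}

// parsePluginCreateResponse decodes the plugin's stdout. If decoding fails,
// the plugin is off-spec and the volume may or may not exist on disk, so the
// best we can do is return an error that carries the raw output.
func parsePluginCreateResponse(stdout []byte) (*HostVolumePluginCreateResponse, error) {
	var resp HostVolumePluginCreateResponse
	if err := json.Unmarshal(stdout, &resp); err != nil {
		return nil, fmt.Errorf("plugin returned unparseable create response %q: %w", stdout, err)
	}
	return &resp, nil
}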

Comment on lines +81 to +84
if _, err := plug.Create(ctx, vol.CreateReq); err != nil {
// plugin execution errors are only logged
hvm.log.Error("failed to restore", "plugin_id", vol.CreateReq.PluginID, "volume_id", vol.ID, "error", err)
}
Member commented:

Something you'll want to look at when you get to volume fingerprinting is what we do about the fingerprint of a volume we can't restore.

@gulducat gulducat merged commit 70bacbd into dynamic-host-volumes Dec 3, 2024
17 checks passed
@gulducat gulducat deleted the dhv-client-state branch December 3, 2024 21:47
tgross pushed a commit that referenced this pull request Dec 9, 2024
store dynamic host volume creations in client state,
so they can be "restored" on agent restart. restore works
by repeating the same Create operation as initial creation,
and expecting the plugin to be idempotent.

this is (potentially) especially important after host restarts,
which may have dropped mount points or such.
tgross pushed commits that referenced this pull request on Dec 13, 2024 and Dec 19, 2024, with the same commit message as above.