dynamic host volumes: client state #24595
Conversation
client/client.go
Outdated
cfg.HostVolumePluginDir,
cfg.AllocMountsDir)
if err != nil {
	return nil, err // db TODO(1.10.0): don't fail the whole client if state restore fails?
It sort of depends on the error. When we restore allocations it's not fatal to fail restoring a single allocation, because we assume that the allocation may have been destroyed while we were down (ex. the host rebooted), and there's a process to recover (check back in with the server, restart the alloc). But if we can't read the client state at all, then we're in a fatal scenario and want to stop the client from starting and making things worse / undebuggable.
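To illustrate that distinction, here is a minimal Go sketch of a restore path that treats a state-read failure as fatal but only logs per-volume restore failures. The `getVolumes` and `restoreVolume` functions and the `volState` type are hypothetical stand-ins, not the actual Nomad code.

```go
package main

import (
	"fmt"
	"log/slog"
)

type volState struct{ ID string }

// restoreAll distinguishes a fatal state-read failure (the client should not
// start) from per-volume restore failures (log and continue).
func restoreAll(getVolumes func() ([]volState, error), restoreVolume func(volState) error) error {
	vols, err := getVolumes()
	if err != nil {
		// can't read client state at all: stop the client from starting
		return fmt.Errorf("reading dynamic host volume state: %w", err)
	}
	for _, vol := range vols {
		if err := restoreVolume(vol); err != nil {
			// a single volume failing to restore is not fatal
			slog.Error("failed to restore volume", "volume_id", vol.ID, "error", err)
		}
	}
	return nil
}

func main() {
	_ = restoreAll(
		func() ([]volState, error) { return []volState{{ID: "vol-1"}}, nil },
		func(v volState) error { return nil },
	)
}
```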
error handling nailed down in 8043871
}
if err := hvm.stateMgr.PutDynamicHostVolume(volState); err != nil {
	hvm.log.Error("failed to save volume in state", "volume_id", req.ID, "error", err)
	// db TODO: bail or nah?
If we create the volume on-disk but not in client state / server state, we'll end up with stray volumes we have no way of cleaning up. I'd try to implement a transaction-like semantic here -- if we can't fully complete the task, we should destroy the volume when we're done.
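A minimal sketch of that transaction-like semantic, with hypothetical `createVolume`, `saveState`, and `destroyVolume` functions standing in for the real plugin and state-manager calls: create the volume, and if recording it in client state fails, destroy it rather than leaving a stray.

```go
package main

import (
	"errors"
	"fmt"
	"log/slog"
)

// createAndRecord creates the volume on disk, then persists it to client
// state; if persisting fails it rolls back by destroying the volume, so we
// don't strand a volume that neither client nor server state knows about.
func createAndRecord(createVolume, saveState, destroyVolume func() error) error {
	if err := createVolume(); err != nil {
		return err
	}
	if err := saveState(); err != nil {
		// roll back the on-disk volume rather than leaving a stray
		if derr := destroyVolume(); derr != nil {
			slog.Error("failed to roll back volume after state error", "error", derr)
			return errors.Join(err, derr)
		}
		return fmt.Errorf("saving volume to client state: %w", err)
	}
	return nil
}

func main() {
	err := createAndRecord(
		func() error { return nil },                               // create succeeds
		func() error { return errors.New("state write failed") },  // persist fails
		func() error { return nil },                               // rollback succeeds
	)
	fmt.Println(err)
}
```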
// db TODO(1.10.0): save the client state!
if err := hvm.stateMgr.DeleteDynamicHostVolume(req.ID); err != nil {
	hvm.log.Error("failed to delete volume in state", "volume_id", req.ID, "error", err)
	// db TODO: bail or nah?
If we fail to delete the client state but don't return an error, then we'll try to restore the volume when the client restarts, leaving a stray volume that shouldn't exist but also doesn't exist in the server and therefore can't be deleted. So we should probably return an error here so the user can retry.
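A sketch of that approach, with hypothetical `pluginDelete` and `deleteState` functions: surface the state error to the caller so the user can retry the delete.

```go
package main

import (
	"errors"
	"fmt"
)

// deleteAndForget runs the plugin delete, then removes the volume from client
// state; a failure to update state is returned to the caller so the user can
// retry, instead of silently leaving a record that the client would try to
// "restore" on its next restart.
func deleteAndForget(pluginDelete, deleteState func() error) error {
	if err := pluginDelete(); err != nil {
		return err
	}
	if err := deleteState(); err != nil {
		return fmt.Errorf("volume deleted but client state not updated (safe to retry): %w", err)
	}
	return nil
}

func main() {
	fmt.Println(deleteAndForget(
		func() error { return nil },
		func() error { return errors.New("state delete failed") },
	))
}
```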
client/client.go
Outdated
cfg.HostVolumePluginDir,
cfg.AllocMountsDir)
if err != nil {
	return nil, err // db TODO(1.10.0): don't fail the whole client if state restore fails?
What is our usual behavior when state restore fails on the client? I think we shouldn't fail the client start completely, but it would be good to stay consistent.
ah I see Tim already commented #24595 (comment)
don't mind me
I ended up with this 8043871#diff-bd3a55a72186f59e2e63efb4951573b2f9e4a7cc98086e922b0859f8ccc1dd09R543-R547
let me know if you have other ideas!
LGTM!
@@ -165,7 +164,7 @@ func (p *HostVolumePluginExternal) Create(ctx context.Context,
 	}
 
 	var pluginResp HostVolumePluginCreateResponse
-	err = json.Unmarshal(stdout, &pluginResp)
+	err = json.Unmarshal(stdout, &pluginResp) // db TODO(1.10.0): if this fails, then the volume may have been created, according to the plugin, but Nomad will not save it
The tricky thing here is that this means the plugin isn't behaving according to the spec, which makes it hard for us to say that it's ok. If it's not behaving according to the spec, maybe it isn't idempotent either, so a retry won't work. Not great. But so long as we log it (somewhere upstream of here is fine) and return the error to the user, I think that's about as good as we can do here.
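A sketch of "log it and return the error to the user", assuming a hypothetical `createResponse` type in place of the real HostVolumePluginCreateResponse (field names and JSON tags are illustrative):

```go
package main

import (
	"encoding/json"
	"fmt"
	"log/slog"
)

// createResponse is a stand-in for the real plugin create response type.
type createResponse struct {
	Path      string `json:"path"`
	SizeBytes int64  `json:"bytes"`
}

// parsePluginResponse decodes the plugin's stdout; if the output doesn't
// match the spec, log it for the operator and return the error to the user,
// since we can't safely record a volume the plugin didn't describe.
func parsePluginResponse(stdout []byte) (*createResponse, error) {
	var resp createResponse
	if err := json.Unmarshal(stdout, &resp); err != nil {
		slog.Error("plugin returned malformed create response",
			"stdout", string(stdout), "error", err)
		return nil, fmt.Errorf("parsing plugin create response: %w", err)
	}
	return &resp, nil
}

func main() {
	_, err := parsePluginResponse([]byte("not json"))
	fmt.Println(err)
}
```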
if _, err := plug.Create(ctx, vol.CreateReq); err != nil {
	// plugin execution errors are only logged
	hvm.log.Error("failed to restore", "plugin_id", vol.CreateReq.PluginID, "volume_id", vol.ID, "error", err)
}
Something you'll want to look at when you get to volume fingerprinting is what we do about the fingerprint of a volume we can't restore.
Store dynamic host volume creations in client state, so they can be "restored" on agent restart. Restore works by repeating the same Create operation as the initial creation and expecting the plugin to be idempotent.
This is (potentially) especially important after host restarts, which may have dropped mount points or such.
One particular thing I'm not sure of is when to error fatally if there are state operation errors... I don't have good intuition about the failure states there, but it does seem Bad for state to drift at all?
There's a decent chunk of code here, but a lot of it is just satisfying interfaces.
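For readers skimming those interfaces, the state-store surface implied by this change looks roughly like the sketch below. PutDynamicHostVolume and DeleteDynamicHostVolume appear in the diff, while GetDynamicHostVolumes, the HostVolumeState fields, and the in-memory implementation are assumptions for illustration, not the concrete Nomad types.

```go
package main

import "fmt"

// HostVolumeState is a stand-in for whatever the client persists per volume;
// the original create request is kept so restore can replay it.
type HostVolumeState struct {
	ID        string
	CreateReq any
}

// HostVolumeStater sketches the Put/Get/Delete surface implied by the diff.
type HostVolumeStater interface {
	PutDynamicHostVolume(*HostVolumeState) error
	GetDynamicHostVolumes() ([]*HostVolumeState, error)
	DeleteDynamicHostVolume(id string) error
}

// memState is a toy in-memory implementation, just enough to satisfy the
// interface for experimentation.
type memState struct {
	vols map[string]*HostVolumeState
}

func (m *memState) PutDynamicHostVolume(v *HostVolumeState) error {
	m.vols[v.ID] = v
	return nil
}

func (m *memState) GetDynamicHostVolumes() ([]*HostVolumeState, error) {
	out := make([]*HostVolumeState, 0, len(m.vols))
	for _, v := range m.vols {
		out = append(out, v)
	}
	return out, nil
}

func (m *memState) DeleteDynamicHostVolume(id string) error {
	delete(m.vols, id)
	return nil
}

func main() {
	var s HostVolumeStater = &memState{vols: map[string]*HostVolumeState{}}
	_ = s.PutDynamicHostVolume(&HostVolumeState{ID: "vol-1"})
	vols, _ := s.GetDynamicHostVolumes()
	fmt.Println(len(vols))
}
```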