csi: volume-consuming tasks fail after client restart #7200
I've figured out the source of this problem (but not yet a solution), and it's related to the delay in plugin fingerprinting discussed in #7296. Here's what the sequence of events looks like: the client starts back up and does the usual fingerprinting of CPU, memory, etc. Note that our dynamic plugins aren't in the mix here at all. And how could they be? We haven't restored our handles to the allocations that run them.
At this point fingerprinting is "complete", but we haven't fingerprinted any CSI plugins! So we start restoring allocs, as the log snippets below show.
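To make the ordering concrete, here's a minimal sketch of the startup sequence as described above. The names (`pluginRegistry`, `clientStart`, the `"hostpath-plugin"` ID) are hypothetical and are not Nomad's actual identifiers; the point is only that fingerprinting is marked complete before any alloc is restored, so the plugin registry is empty exactly when the restored CSI hooks go looking in it.

```go
// Hypothetical sketch of the client startup ordering; not Nomad's real code.
package main

import "fmt"

type pluginRegistry struct {
	healthy map[string]bool // plugin ID -> fingerprinted as healthy
}

func clientStart(reg *pluginRegistry, volumeAllocs []string) {
	// 1. Static fingerprinting (CPU, memory, devices, ...) runs and is marked
	//    complete. Dynamic CSI plugins can't be included yet: they run inside
	//    allocations we haven't restored.
	fmt.Println("fingerprint complete; CSI plugins registered:", len(reg.healthy))

	// 2. Alloc restore begins. Volume-consuming allocs run their CSI hooks,
	//    which look the plugin up in the still-empty registry.
	for _, alloc := range volumeAllocs {
		if !reg.healthy["hostpath-plugin"] {
			fmt.Printf("alloc %s: plugin %q not found, hook fails\n", alloc, "hostpath-plugin")
		}
	}

	// 3. Only after the plugin alloc's own task is running and probed does the
	//    dynamic plugin manager fingerprint it and add it to the registry.
	reg.healthy["hostpath-plugin"] = true
}

func main() {
	clientStart(&pluginRegistry{healthy: map[string]bool{}}, []string{"example"})
}
```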
The example app starts its CSI hook, but we haven't fingerprinted the plugin yet so this isn't going to go well...
We continue merrily along restoring allocs and firing their prestart hooks. We reach the plugin supervisor prestart hook, which has already run, so we skip it. Is that right? At first I thought this might be the place to fix the problem, but it would be racy with other jobs on the same client.
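For illustration, here's a rough sketch of that skip-if-already-done behaviour (the `prestartHook` interface and `hookState` type are hypothetical, not Nomad's actual taskrunner hook API): because the "done" marker survives the client restart, the plugin supervisor's prestart work never gets a second chance during restore, and unconditionally re-running it would race with other allocs on the client that share the plugin.

```go
// Illustrative sketch of skip-on-restore for prestart hooks; hypothetical names.
package hooks

type prestartHook interface {
	Name() string
	Prestart() error
}

type hookState struct {
	// done is persisted in the client's state store, so it survives a restart.
	done map[string]bool
}

func runPrestartHooks(hooks []prestartHook, st *hookState) error {
	for _, h := range hooks {
		if st.done[h.Name()] {
			// Already ran before the client restarted (e.g. the plugin
			// supervisor hook), so we skip it; restart-specific work such as
			// re-registering the plugin therefore never happens here.
			continue
		}
		if err := h.Prestart(); err != nil {
			return err
		}
		st.done[h.Name()] = true
	}
	return nil
}
```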
Our example app's CSI hook has finally hit the server and... uh oh, we're trying to re-claim the volume but the server doesn't know that our plugin on the client is healthy yet. Should we be re-claiming it? That doesn't quite make sense either, especially given we don't have heartyeet / #2185 yet.
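One plausible reading of what the server does with that re-claim, sketched with hypothetical names (`volumeState`, `claimVolume`; this is not Nomad's actual state store API): the claim from before the restart is still on the books, and the server has not yet seen a healthy fingerprint for the plugin on this client, so there is nothing for the fresh claim to succeed against.

```go
// Hedged sketch of the server-side claim check implied above; hypothetical names.
package state

import "errors"

type volumeState struct {
	maxClaims   int
	claims      map[string]bool // alloc ID -> claimed (the pre-restart claim is still here)
	pluginNodes map[string]bool // node ID -> plugin fingerprinted as healthy
}

func claimVolume(v *volumeState, allocID, nodeID string) error {
	if !v.pluginNodes[nodeID] {
		// The plugin on this node hasn't been fingerprinted yet.
		return errors.New("plugin not healthy on node")
	}
	if len(v.claims) >= v.maxClaims {
		// The old, not-yet-released claim still counts toward the limit.
		return errors.New("volume max claim reached")
	}
	v.claims[allocID] = true
	return nil
}
```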
We go on to run the plugin's poststart hooks, including the poststart for the plugin supervisor.
Finally the example app's CSI hook gets a message back from the server ("volume max claim reached"), and because the plugin hasn't been fingerprinted we're going to call this a failure and mark the example app's alloc for termination.
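The client-side decision at this point boils down to something like the following sketch (the function and parameter names are hypothetical, not the real CSI hook code): with no healthy local plugin to fall back on, there's no reason to believe a retry of the claim will help, so the error is treated as terminal.

```go
// Minimal sketch of the client-side handling described above; hypothetical names.
package csihook

// handleClaimError reports whether a claim error should be treated as fatal
// for the alloc, given whether the local plugin has been fingerprinted yet.
func handleClaimError(err error, pluginFingerprinted bool) (fatal bool) {
	if err == nil {
		return false
	}
	if !pluginFingerprinted {
		// No healthy local plugin: treat the claim failure as terminal and
		// let the scheduler replace the alloc.
		return true
	}
	// With a healthy plugin we could instead retry the claim.
	return false
}
```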
Ok, well this example task failed, so we run its stop hooks...
The plugin supervisor loop probes the plugin, but until the dynamic plugin manager does the capabilities check it isn't considered healthy for use:
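A sketch of that two-stage notion of health, with hypothetical names (`csiClient`, `managedPlugin`; not Nomad's real plugin manager types): the supervisor's periodic probe only tells us the plugin process is answering, while the dynamic plugin manager separately has to run its capabilities check and register the plugin before anything else on the client treats it as usable.

```go
// Sketch of probe vs. capabilities fingerprinting; hypothetical names.
package plugin

type csiClient interface {
	Probe() (ok bool, err error)
	Capabilities() ([]string, error) // e.g. controller/node capabilities
}

type managedPlugin struct {
	client     csiClient
	probed     bool // supervisor loop: process is answering
	registered bool // plugin manager: capabilities checked, fingerprint sent
}

// supervisorTick is what the per-task supervisor loop does.
func (p *managedPlugin) supervisorTick() {
	ok, err := p.client.Probe()
	p.probed = ok && err == nil
}

// fingerprint is what the dynamic plugin manager does, later and
// independently; only after this does the plugin count as healthy.
func (p *managedPlugin) fingerprint() error {
	caps, err := p.client.Capabilities()
	if err != nil || len(caps) == 0 {
		return err
	}
	p.registered = true
	return nil
}

// usable is the check a CSI hook would care about.
func (p *managedPlugin) usable() bool { return p.probed && p.registered }
```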
Meanwhile, the server is cleaning up the old claim and rescheduling the example app:
It takes us a few tries because the plugin still hasn't been fingerprinted. During client-GC, the plugin is "not found" yet (also, what's up with this error message?):
Finally the plugin fingerprinter says "oh hey, there's a plugin on this client!"
And the scheduler finally catches up with reality on the client and sends it a placement for the alloc:
This is an early in-development bug for CSI that I want to make sure we track down before we go into testing.
When a Nomad client is restarted, tasks consuming a volume on that client will fail when Nomad tries to reattach to them, saying it can't find the plugin. When the scheduler re-schedules the alloc, everything works fine. (I would suspect a race condition here between when the plugin alloc is reattached and the consuming alloc is, but it looks like at least in this case the plugin alloc was reattached first.)
Plugin job
Example volume-consuming job
Repro, running Nomad current to `f-csi-volumes` (as of this writing) under systemd on Linux:

Then restart Nomad with `sudo systemctl restart nomad`. The example job will fail:

Trace log outputs from journalctl