Jobs using CSI volume do not recover from the client failure without human intervention #12118
Hi @PinkLolicorn! Thanks for the detailed report on this! As it turns out the initial failure on the client is a known issue #11477, which I've already got a fix in-flight for in #12113. That'll ship in the upcoming Nomad 1.3.0 (and get backported to 1.2.x and 1.1.x). It looks like on the server you're seeing #11758 which is going to be fixed in #12114. I'm less sure about the error:
Because that's bubbling up from the Hetzner plugin you're using. You might be able to recover from the looping volumewatcher errors by ensuring the plugin is running on the old node.
@PinkLolicorn I'm going to close this as a duplicate because the underlying issue is tracked elsewhere and staged to be fixed, but I'm happy to keep chatting about how to fix up the volumewatcher error if possible.
It seems that a moment later the error was gone. I thought it might have been the VM that was turned off, but it looks like the action failed on two separate VMs with this one. My best guess is that this was the detach action on the Hetzner side not yet being cleared after the previous attempt.
Unfortunately this does not seem to work: the two nodes that were turned off have the task "running", and manually checking the containers shows nothing obviously wrong:
The cluster leader server is still logging the same errors:
The UI is still showing 2/4 controllers/nodes. This is a bit more confusing: 4 instances, but 2 per client with the same ID? There is only a single CSI plugin container running on each of the clients.
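For cross-checking what the UI shows, the plugin and volume state can also be queried from the CLI. A rough sketch (the plugin ID matches the csi_plugin block in my job file; the volume ID is a placeholder):

```sh
# Show controller/node counts and health for the plugin as Nomad sees them.
nomad plugin status csi.hetzner.cloud

# Show claims and allocations for a specific volume (volume ID is a placeholder).
nomad volume status mysql
```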
And the job itself:
That's great! Any schedule for the 1.3.0 release and backports?
Ok, the plugin allocation logs you have suggest that Nomad is sending the detach RPCs but they're failing. If you're not seeing the "server is locked" error anymore, are you seeing logs for the detach RPCs?
Your plugin is of type
I can't give dates for our releases, but I can say it's the currently active line of work across the whole team. In the meantime, feel free to check out the
My previous bugged state was "wiped" due to other experiments on the cluster, but after a quick dive into Hetzner's CSI plugin, I was able to figure out the meaning of the status:
Bringing back part of the log; my knowledge of the architecture is limited, but I think we can see:
So, to my understanding, it just means that any action in the scope of a specific server that was not yet cleared internally can yield this. A bit weird, but it would make sense: at first client-1/2 targeted an attach to client-2, so with 1 & 2 temporarily locked, then seeing two "server is locked" messages for detach actions targeting them makes sense. I wonder whether, if this had not been the case due to better handling in the CSI plugin or more delay between the attach and detach cycles, the cluster would have recovered correctly from the failure, or at least partially. That also makes me wonder why this is even possible; we have the answer
The concurrency spec for CSI is a little "fuzzy" in that both sides have responsibilities: the CO (Nomad) is supposed to avoid sending multiple in-flight requests for a given volume, and the plugin is supposed to handle that case gracefully anyway. I suspect the authors of the plugin are relying on the k8s control-loop architecture to just go ahead and eventually re-sync, which we didn't have in place in Nomad (but the Nomad 1.3.0 patches get us to equivalent behavior). I'd definitely be interested to pop in to any issue you open up on the driver repo, so feel free to ping me there!
Just tried this, but I'm having trouble getting my CSI job to run on the dev version. Were there any changes to the environment variables recently? Whether or not I set CSI_ENDPOINT, the plugin does not pick it up correctly. It worked just fine before with the same job file.
```go
// CreateListener creates and binds the unix socket in location specified by the CSI_ENDPOINT environment variable.
func CreateListener() (net.Listener, error) {
	endpoint := os.Getenv("CSI_ENDPOINT")
	if endpoint == "" {
		return nil, errors.New("you need to specify an endpoint via the CSI_ENDPOINT env var")
	}
	if !strings.HasPrefix(endpoint, "unix://") {
		return nil, errors.New("endpoint must start with unix://")
	}
	endpoint = endpoint[7:] // strip unix://
	if err := os.Remove(endpoint); err != nil && !os.IsNotExist(err) {
		return nil, errors.New(fmt.Sprintf("failed to remove socket file at %s: %s", endpoint, err))
	}
	return net.Listen("unix", endpoint)
}
```

```hcl
job "system-csi-hetzner" {
  region      = "eu-central"
  datacenters = ["fsn"]
  namespace   = "system"
  type        = "system"

  group "node" {
    task "plugin" {
      driver = "docker"

      config {
        image      = "hetznercloud/hcloud-csi-driver:1.6.0"
        privileged = true
      }

      env {
        CSI_ENDPOINT   = "unix:///csi/csi.sock"
        HCLOUD_TOKEN   = "xyz"
        ENABLE_METRICS = false
      }

      csi_plugin {
        id        = "csi.hetzner.cloud"
        type      = "monolith"
        mount_dir = "/csi"
      }

      resources {
        cpu        = 50
        memory     = 16
        memory_max = 256
      }
    }
  }
}
```

Also, are there any complete instructions on installing the required dependencies for the
Yeah, Nomad is setting the
There's a contributing guide but mostly what you'll want to do is run
I've opened #12257 with a patch for that, but I need to do some testing which I probably won't get to until tomorrow.
Endpoint env var is fixed in #12257. I ended up going with
Hey @PinkLolicorn, did you manage to get it up & running using Nomad 1.3.0-rc1? I'm also using Hetzner, but today I again noticed that a few jobs went down. I looked into the allocation and noticed this error:
It seems though as if it might be a Hetzner CSI related problem 🤔
Hey @rwojsznis, I actually tested this version yesterday with no success. However, I did not notice this exact error. Here is my setup:

nomad run csi-hetzner.nomad

```hcl
job "system-csi-hetzner" {
  region      = "eu-central"
  datacenters = ["fsn"]
  type        = "system"

  group "node" {
    task "plugin" {
      driver = "docker"

      config {
        image      = "hetznercloud/hcloud-csi-driver:1.6.0"
        privileged = true
      }

      env {
        CSI_ENDPOINT   = "unix:///csi/csi.sock"
        HCLOUD_TOKEN   = "xyz"
        ENABLE_METRICS = false
      }

      csi_plugin {
        id        = "csi.hetzner.cloud"
        type      = "monolith"
        mount_dir = "/csi"
      }

      resources {
        cpu        = 50
        memory     = 16
        memory_max = 256
      }
    }
  }
}
```

nomad volume register myvol.nomad
My volumes do mount, but I cannot get Nomad to reschedule them when a client goes down, and it is even worse than before, because a manual detach from the Hetzner UI does not help to recover. The only solution is to stop the job, deregister the volume, register it again, and then start the job again (spelled out as commands at the end of this comment). At least clients do not go into an unhealthy state forever now. It basically repeats forever that max claims are reached and it cannot unmount the volume:
Running
As stated before, even if the volume is unmounted from the Hetzner UI, Nomad does not recover from this state. I also spotted some funky-looking logs from the hetzner-csi-plugin:
This looks confusing because:
@tgross Nomad should move the volume to the new client after the job has been rescheduled due to a failure, right? 😆
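For completeness, the manual recovery sequence mentioned above as concrete commands. This is only a sketch; "mysql", mysql.nomad, and myvol.nomad are placeholders for my actual job and volume names:

```sh
# Manual recovery sketch; job and volume names are placeholders.
nomad job stop mysql                  # stop the job holding the claim
nomad volume deregister -force mysql  # drop the stuck volume registration
nomad volume register myvol.nomad     # register the volume again
nomad job run mysql.nomad             # start the job back up
```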
I have pretty much the same setup (volume & CSI plugin) and pretty much the same issue: the volume doesn't get rescheduled once the client goes down (1.3.0-rc1). I even have a systemd service that tries to do a clean Nomad shutdown (drain the node) before reboot/shutdown, but that seems to have no effect 🤔 Once I have a little bit more time I will try to provide more feedback, but it's good to hear (?) that I'm not the only one still with the same problem 😅
Hey folks! The old node must be live and have a running plugin on it in order for Nomad to cleanly unmount the volume. You should drain a node before shutting it down, so that the tasks are rescheduled and the plugins are left in place until all the other tasks have been moved. If you don't, you need to do
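As a sketch of that drain step (the node ID is a placeholder; -ignore-system lets the drain finish without stopping system job allocations, which is what keeps the CSI plugin in place):

```sh
# Drain the node but leave system jobs (including the CSI plugin) running.
nomad node drain -enable -ignore-system <node-id>
```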
@tgross ok, but shouldn't Nomad at least recover after the client is back online? This is not the case in my current setup. I noticed some errors like the following, but I'm unsure of their exact context; I think, however, that these were logged when the failed node was turned on again:
Although they don't seem to be important, because they're probably from the time before the CSI container started. I just feel that bringing the client back online shouldn't require a manual deregister for everything to go back to normal.
It shouldn't, so long as the plugin is alive and healthy. You can't crash out the client without also making sure the node plugin is live and expect Nomad to clean this situation up; Nomad can't do anything with the volumes without a plugin in place. That's a constraint of the CSI specification. It doesn't look like the plugin is healthy, given the log entry you're showing. I'm happy to revisit this issue, but there's not a lot I can do with one-liner log snippets. Can you provide specific steps to reproduce and the logs for client, server, and plugins? Allocation status for the plugins would also help.
@tgross in my current cluster state all clients are online and so are the plugins, the volume is unmounted in the Hetzner UI, but the allocation still can't be placed. I noticed that the volume shows 1 allocation, but none is listed, which would make sense considering the displayed error. Apparently I got it previously but did not expect it to be as important as it now appears (noted in a previous message):
I think this state was caused by running Plugin Storage Job
I tried this:
It appears to be stuck for now, with the other previously seen error.
Apparently this made it work? I think the difference is that the state is
After turning the client off (looks ok?):
After turning the client on (marked as failed by the client?):
After waiting some time, the job:
After running |
Hi @PinkLolicorn! Yeah, I think the "lost" state is probably the problem here. An allocation gets marked as lost once the client has missed a heartbeat (unless ...). When the client reconnects, we should see it marked as terminal, and that should allow the "volume watcher" control loop to clean it up without having to wait for GC. I've got an open issue (in a private repo, I'll try to get it moved over) to have the volume watcher detect events from allocations and not just volumes, but that's going to be a bit of a lift to complete. In any case, I would expect that once the allocation moves from lost to terminal (failed or complete), we'd be able to free the claim. I'll see if I can reproduce that specific behavior in a test rig. As you noticed though, the GC job does clean things up once it executes. The window for that is controlled by the
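If you don't want to wait for the periodic GC window, one option is to trigger a garbage collection run manually from the CLI. A sketch, assuming a management token if ACLs are enabled, and assuming (I haven't verified) that the forced run also covers the CSI claim cleanup:

```sh
# Force a server-side garbage collection run instead of waiting for the
# periodic GC job to come around on its own schedule.
nomad system gc
```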
Nomad version
Operating system and Environment details
Ubuntu 20.04.3 LTS running on CPX11 VMs from Hetzner
Issue
Jobs using a CSI volume do not recover from a client failure without human intervention and require the volume to be manually detached from the client using the provider's web UI. The client's storage controller/node health does not return to normal after the instance is back online. Moving the job to another node works with draining (excluding system jobs), but not on failure.
Reproduction steps
hetznercloud/hcloud-csi-driver:1.6.0 as system job
Additional steps
*Failed allocations are gone after some time.
Expected Result
Actual Result
**After a night these all returned back to running, but the CSI controllers/nodes are still not considered healthy.
Job file (CSI plugin)
Job file (MySQL using CSI)
Job failure message
CSI plugin logs
Nomad Server logs
Nomad server logs (next day)
It seems like the volume watcher is still trying to detach the volume, even though the volume currently shows only one running allocation, on the correct node. The log shows two errors, which would correspond to the two previously turned-off clients.
Related
https://discuss.hashicorp.com/t/csi-plugin-health-after-reboot/24723 - similar issue with 2/3 healthy after reboot and "context deadline exceeded" messages (using kubernetes-csi/csi-driver-nfs/)