CSI allocations may fail first placement after drain/reschedule #8609
While testing the patches for #10927 I've determined this isn't really specific to node drain with migrations, but can happen any time controller unpublish takes longer than the time it takes for an allocation to be rescheduled and placed on the new client node. The CSI specification has this to say about concurrency:
In Nomad's implementation of the unpublish workflow, we push unpublish RPCs from the client and then from the server once the volumewatcher notices a checkpoint. So we're violating the "should" aspects of the above, which the plugins need to (and do!) tolerate. But this becomes a problem when the scheduler tries to replace an existing allocation before the controller unpublish RPCs have completed. I reproduced this today using the following jobspec:

```hcl
job "httpd" {
  datacenters = ["dc1"]

  group "web" {
    count = 2

    volume "csi_data" {
      type            = "csi"
      read_only       = false
      source          = "webdata"
      access_mode     = "single-node-writer"
      attachment_mode = "file-system"
      per_alloc       = true
    }

    network {
      mode = "bridge"
      port "www" {
        to = 8001
      }
    }

    task "http" {
      driver = "docker"

      config {
        image   = "busybox:1"
        command = "httpd"
        args    = ["-v", "-f", "-p", "8001", "-h", "/srv"]
        ports   = ["www"]
      }

      volume_mount {
        volume      = "csi_data"
        destination = "/srv"
        read_only   = false
      }

      resources {
        cpu    = 128
        memory = 128
      }
    }
  }
}
```

On a node drain or job update that forces a reschedule, the replacement alloc will have this event from the
In the GCP CSI controller plugin logs for that disk, we see the following sequence:
controller logs
Seeing this in more detail makes me realize that in a typical case we should be able to simply retry the
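The exact call being retried is elided above, but assuming it is the volume claim made on behalf of the replacement allocation (failing while the previous allocation's controller unpublish is still in flight), a retry with capped backoff would look roughly like the sketch below. This is a toy model, not Nomad's code; `claimVolume`, `claimWithRetry`, and `errUnpublishInProgress` are made-up names.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

var errUnpublishInProgress = errors.New("volume is still being unpublished from the previous node")

// claimVolume stands in for the claim call; in this toy it succeeds only once
// the simulated controller unpublish has finished.
func claimVolume(volID string, unpublishDoneAt time.Time) error {
	if time.Now().Before(unpublishDoneAt) {
		return errUnpublishInProgress
	}
	return nil
}

// claimWithRetry retries only the "still unpublishing" failure, with a capped
// exponential backoff, instead of failing the placement outright.
func claimWithRetry(volID string, unpublishDoneAt time.Time) error {
	backoff := 250 * time.Millisecond
	for attempt := 0; attempt < 8; attempt++ {
		err := claimVolume(volID, unpublishDoneAt)
		if err == nil {
			return nil
		}
		if !errors.Is(err, errUnpublishInProgress) {
			return err // some other failure: don't retry
		}
		time.Sleep(backoff)
		if backoff < 4*time.Second {
			backoff *= 2
		}
	}
	return fmt.Errorf("giving up claim on volume %s after retries", volID)
}

func main() {
	// Simulate an unpublish that completes about a second after the first claim attempt.
	err := claimWithRetry("webdata[0]", time.Now().Add(time.Second))
	fmt.Println("claim result:", err)
}
```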
I've been asked some questions out-of-band on this issue and want to clarify some details.
My thinking at one point was to add a new "requesting claim" state that would trigger the volumewatcher, and then have the claim RPC poll for the update. That would serialize the operations, but not necessarily in the correct order! Working around that problem created a bunch of complexity around maintaining state across leader elections. K8s controller loops don't need to care about leader elections because they use external state (etcd), which simplifies this specific problem (mostly at the cost of performance). So the proposed fix I'm working on is in two parts:
Ok, looks like we actually do have this already, but it uses the same test as the feasibility checker, which has to be somewhat less strict because it needs to account for removing the previous version of the job. So we need to split the behaviors into a schedule-time check and a claim-time check. I've got a very rough first draft of this fix in
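To make the schedule-time vs. claim-time distinction concrete, here is a minimal sketch (this is not the draft referenced above, and none of the names are from Nomad): the schedule-time test discounts claims held by allocations the plan is about to replace, while the claim-time test counts every claim that has not yet been unpublished.

```go
package main

import "fmt"

// volumeState is a simplified stand-in for a volume's claim bookkeeping.
type volumeState struct {
	maxClaims      int
	activeClaims   map[string]bool // alloc ID -> holds a claim right now
	terminalAllocs map[string]bool // alloc IDs the plan is replacing
}

// feasibleAtScheduleTime has to be lenient: claims held by allocations that
// this plan is replacing are ignored, or the replacement could never be placed.
func feasibleAtScheduleTime(v *volumeState) bool {
	live := 0
	for alloc := range v.activeClaims {
		if !v.terminalAllocs[alloc] {
			live++
		}
	}
	return live < v.maxClaims
}

// allowedAtClaimTime is strict: every active claim counts, including ones held
// by allocations that are on their way out but not yet unpublished.
func allowedAtClaimTime(v *volumeState) bool {
	return len(v.activeClaims) < v.maxClaims
}

func main() {
	v := &volumeState{
		maxClaims:      1, // single-node-writer
		activeClaims:   map[string]bool{"old-alloc": true},
		terminalAllocs: map[string]bool{"old-alloc": true},
	}
	fmt.Println("schedulable now:", feasibleAtScheduleTime(v)) // true: old alloc is being replaced
	fmt.Println("claimable now:  ", allowedAtClaimTime(v))     // false: unpublish hasn't completed yet
}
```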
We have some existing issues around CSI and node drain which we should be closing out in the next few weeks, but this issue is for tracking a specific detail of the problem that we're not planning on fixing in the 0.12.x timeframe.
Migrating allocations during a node drain waits for the allocation to be placed on the new node before shutting down the drained allocation. Because the allocation on the old node is still claiming the volume (and in fact has it attached), the new allocation can't start, which will always result in a failure for single-attachment volumes. But once the old alloc has stopped and the volume has been detached, the next attempt to place will succeed. Note that migrating an application with a single-attachment volume to a new node would always create an outage for the application no matter what we could come up with in Nomad. So avoiding the error is a nice-to-have in that circumstance, but it doesn't fundamentally change anything.
The error will look like this, with some details depending on the specific storage provider:
The only way to avoid this problem is to change the migration code so that allocations with CSI volumes get handled specially and get stopped before migrating. This gets complicated for the update stanza and rolling updates. It also conflicts with the migrate option for ephemeral disk, so there are some design trade-offs to make here: an allocation wouldn't be able to have both CSI volumes and an ephemeral disk that migrates.
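As a rough illustration of that trade-off, a validation along these lines would reject the conflicting combination. The types and check are hypothetical, not Nomad's actual jobspec validation.

```go
package main

import (
	"errors"
	"fmt"
)

// taskGroup is a simplified stand-in for a job's group configuration.
type taskGroup struct {
	name                 string
	hasCSIVolumes        bool
	ephemeralDiskMigrate bool
}

// validateGroup rejects the combination the trade-off above would rule out:
// stop-before-migrate for CSI volumes can't coexist with an ephemeral disk
// that is supposed to migrate along with the allocation.
func validateGroup(tg taskGroup) error {
	if tg.hasCSIVolumes && tg.ephemeralDiskMigrate {
		return errors.New("group " + tg.name +
			": CSI volumes would require stop-before-migrate, which conflicts with ephemeral_disk migrate")
	}
	return nil
}

func main() {
	err := validateGroup(taskGroup{name: "web", hasCSIVolumes: true, ephemeralDiskMigrate: true})
	fmt.Println(err)
}
```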