CSI allocations may fail first placement after drain/reschedule #8609
While testing the patches for #10927 I've determined this isn't really specific to node drain with migrations, but can happen any time controller unpublish takes longer than the time it takes for an allocation to be rescheduled and placed on the new client node. The CSI specification has this to say about concurrency:
In Nomad's implementation of the unpublish workflow, we push unpublish RPCs from the client and then from the server once the volumewatcher notices a checkpoint. So we're violating the "should" aspects of the above, which the plugins need to (and do!) tolerate. But this becomes a problem when the scheduler tries to replace an existing allocation before the controller unpublish RPCs have completed. I reproduced this today using the following jobspec:

```hcl
job "httpd" {
  datacenters = ["dc1"]

  group "web" {
    count = 2

    volume "csi_data" {
      type            = "csi"
      read_only       = false
      source          = "webdata"
      access_mode     = "single-node-writer"
      attachment_mode = "file-system"
      per_alloc       = true
    }

    network {
      mode = "bridge"
      port "www" {
        to = 8001
      }
    }

    task "http" {
      driver = "docker"

      config {
        image   = "busybox:1"
        command = "httpd"
        args    = ["-v", "-f", "-p", "8001", "-h", "/srv"]
        ports   = ["www"]
      }

      volume_mount {
        volume      = "csi_data"
        destination = "/srv"
        read_only   = false
      }

      resources {
        cpu    = 128
        memory = 128
      }
    }
  }
}
```

On a node drain or job update that forces a reschedule, the replacement alloc will have this event from the
In the GCP CSI controller plugin logs for that disk, we see the following sequence:
controller logs
Seeing this in more detail makes me realize that in a typical case we should be able to simply retry the
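The exact call being retried is elided above, but assuming it is the volume claim made on behalf of the replacement allocation (failing while the previous allocation's controller unpublish is still in flight), a retry with capped backoff would look roughly like the sketch below. This is a toy model, not Nomad's code; `claimVolume`, `claimWithRetry`, and `errUnpublishInProgress` are made-up names.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

var errUnpublishInProgress = errors.New("volume is still being unpublished from the previous node")

// claimVolume stands in for the claim call; in this toy it succeeds only once
// the simulated controller unpublish has finished.
func claimVolume(volID string, unpublishDoneAt time.Time) error {
	if time.Now().Before(unpublishDoneAt) {
		return errUnpublishInProgress
	}
	return nil
}

// claimWithRetry retries only the "still unpublishing" failure, with a capped
// exponential backoff, instead of failing the placement outright.
func claimWithRetry(volID string, unpublishDoneAt time.Time) error {
	backoff := 250 * time.Millisecond
	for attempt := 0; attempt < 8; attempt++ {
		err := claimVolume(volID, unpublishDoneAt)
		if err == nil {
			return nil
		}
		if !errors.Is(err, errUnpublishInProgress) {
			return err // some other failure: don't retry
		}
		time.Sleep(backoff)
		if backoff < 4*time.Second {
			backoff *= 2
		}
	}
	return fmt.Errorf("giving up claim on volume %s after retries", volID)
}

func main() {
	// Simulate an unpublish that completes about a second after the first claim attempt.
	err := claimWithRetry("webdata[0]", time.Now().Add(time.Second))
	fmt.Println("claim result:", err)
}
```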
I've been asked some questions out-of-band on this issue and want to clarify some details.
My thinking at one point was to add a new "requesting claim" state that would trigger the volumewatcher, and then have the claim RPC poll for the update. That would serialize the operations, but not necessarily in the correct order! Working around that problem created a bunch of complexity around maintaining state across leader elections. K8s controller loops don't need to care about leader elections because they use external state (etcd), which simplifies this specific problem (mostly at the cost of performance). So the proposed fix I'm working on is in two parts:
Ok, looks like we actually do have this already, but it uses the same test as the feasibility checker, which has to be somewhat less strict because it needs to account for removing the previous version of the job. So we need to split the behaviors into a schedule-time check and a claim-time check. I've got a very rough first draft of this fix in
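To make the schedule-time vs. claim-time distinction concrete, here is a minimal sketch (this is not the draft referenced above, and none of the names are from Nomad): the schedule-time test discounts claims held by allocations the plan is about to replace, while the claim-time test counts every claim that has not yet been unpublished.

```go
package main

import "fmt"

// volumeState is a simplified stand-in for a volume's claim bookkeeping.
type volumeState struct {
	maxClaims      int
	activeClaims   map[string]bool // alloc ID -> holds a claim right now
	terminalAllocs map[string]bool // alloc IDs the plan is replacing
}

// feasibleAtScheduleTime has to be lenient: claims held by allocations that
// this plan is replacing are ignored, or the replacement could never be placed.
func feasibleAtScheduleTime(v *volumeState) bool {
	live := 0
	for alloc := range v.activeClaims {
		if !v.terminalAllocs[alloc] {
			live++
		}
	}
	return live < v.maxClaims
}

// allowedAtClaimTime is strict: every active claim counts, including ones held
// by allocations that are on their way out but not yet unpublished.
func allowedAtClaimTime(v *volumeState) bool {
	return len(v.activeClaims) < v.maxClaims
}

func main() {
	v := &volumeState{
		maxClaims:      1, // single-node-writer
		activeClaims:   map[string]bool{"old-alloc": true},
		terminalAllocs: map[string]bool{"old-alloc": true},
	}
	fmt.Println("schedulable now:", feasibleAtScheduleTime(v)) // true: old alloc is being replaced
	fmt.Println("claimable now:  ", allowedAtClaimTime(v))     // false: unpublish hasn't completed yet
}
```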
We have some existing issues around CSI and node drain which we should be closing out in the next few weeks, but this issue is for tracking a specific detail of the problem that we're not planning on fixing in the 0.12.x timeframe.
Migrating allocations during a node drain waits for the allocation to be placed on the new node before shutting down the drained allocation. Because the allocation on the old node is still claiming the volume (and in fact has it attached), the new allocation can't start, which will always result in a failure for single-attachment volumes. But once the old alloc has stopped and the volume has been detached, the next attempt to place will succeed. Note that migrating an application with a single-attachment volume to a new node would always create an outage for the application no matter what we could come up with in Nomad. So avoiding the error is a nice-to-have in that circumstance, but it doesn't fundamentally change anything.
The error will look like this, with some details depending on the specific storage provider:
The only way to avoid this problem is to change the migration code so that allocations with CSI volumes get handled specially and get stopped before migrating. This gets complicated for the update stanza and rolling updates. It also conflicts with the migrate option for ephemeral disk, so there are some design trade-offs to make here: an allocation wouldn't be able to have both CSI volumes and an ephemeral disk that migrates.
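As a rough illustration of that trade-off, a validation along these lines would reject the conflicting combination. The types and check are hypothetical, not Nomad's actual jobspec validation.

```go
package main

import (
	"errors"
	"fmt"
)

// taskGroup is a simplified stand-in for a job's group configuration.
type taskGroup struct {
	name                 string
	hasCSIVolumes        bool
	ephemeralDiskMigrate bool
}

// validateGroup rejects the combination the trade-off above would rule out:
// stop-before-migrate for CSI volumes can't coexist with an ephemeral disk
// that is supposed to migrate along with the allocation.
func validateGroup(tg taskGroup) error {
	if tg.hasCSIVolumes && tg.ephemeralDiskMigrate {
		return errors.New("group " + tg.name +
			": CSI volumes would require stop-before-migrate, which conflicts with ephemeral_disk migrate")
	}
	return nil
}

func main() {
	err := validateGroup(taskGroup{name: "web", hasCSIVolumes: true, ephemeralDiskMigrate: true})
	fmt.Println(err)
}
```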