Jobs using CSI volume do not recover from the client failure without human intervention #12118
Hi @PinkLolicorn! Thanks for the detailed report on this! As it turns out the initial failure on the client is a known issue #11477, which I've already got a fix in-flight for in #12113. That'll ship in the upcoming Nomad 1.3.0 (and get backported to 1.2.x and 1.1.x). It looks like on the server you're seeing #11758 which is going to be fixed in #12114. I'm less sure about the error:
Because that's bubbling up from the Hetzner plugin you're using. You might be able to recover from the looping volumewatcher errors by ensuring the plugin is running on the old node.
@PinkLolicorn I'm going to close this as a duplicate because the underlying issue is tracked elsewhere and staged to be fixed, but I'm happy to keep chatting about how to fix up the volumewatcher error if possible.
It seems that a moment later the error was gone. I thought it might have been the VM that was turned off, but it looks like the action failed on two separate VMs with this one. My best guess is that this was the detach action on the Hetzner side not yet being cleared after the previous attempt.
Unfortunately this does not seem to work: the two nodes that were turned off have the task "running", and manually checking the containers shows nothing obviously wrong:
The cluster leader server is still logging the same errors:
The UI is still showing 2/4 controllers/nodes. This is a bit more confusing: 4 instances, but 2 per client with the same ID? There is only a single CSI plugin container running on each of the clients.
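For cross-checking what the UI shows, the plugin and volume state can also be queried from the CLI. A rough sketch (the plugin ID matches the csi_plugin block in my job file; the volume ID is a placeholder):

```sh
# Show controller/node counts and health for the plugin as Nomad sees them.
nomad plugin status csi.hetzner.cloud

# Show claims and allocations for a specific volume (volume ID is a placeholder).
nomad volume status mysql
```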
And the job itself:
That's great! Any schedule for the 1.3.0 release and backports?
Ok, the plugin allocation logs you have suggest that Nomad is sending the detach RPCs but they're failing. If you're not seeing the "server is locked" error anymore, are you seeing logs for the detach RPCs?
Your plugin is of type
I can't give dates for our releases, but I can say it's the currently active line of work across the whole team. In the meantime, feel free to check out the
My previous bugged state was "wiped" due to other experiments on the cluster, but after a quick dive into Hetzner's CSI plugin, I was able to figure out the meaning of the status:
Bringing back part of the log; my knowledge of the architecture is limited, but I think we can see:
So, to my understanding, it just means that any action in the scope of a specific server that was not yet cleared internally can yield this. A bit weird, but it would make sense: at first client-1/2 targeted an attach to client-2, so with 1 & 2 temporarily locked, then seeing two "server is locked" messages for detach actions targeting them makes sense. I wonder whether, if this had not been the case due to better handling in the CSI plugin or more delay between the attach and detach cycles, the cluster would have recovered correctly from the failure, or at least partially. That also makes me wonder why this is even possible; we have the answer
The concurrency spec for CSI is a little "fuzzy" in that both sides have responsibilities: the CO (Nomad) is supposed to avoid sending multiple in-flight requests for a given volume, and the plugin is supposed to handle that case gracefully anyway. I suspect the authors of the plugin are relying on the k8s control-loop architecture to just go ahead and eventually re-sync, which we didn't have in place in Nomad (but the Nomad 1.3.0 patches get us to equivalent behavior). I'd definitely be interested to pop in to any issue you open up on the driver repo, so feel free to ping me there!
Just tried this, but I'm having trouble getting my CSI job to run on the dev version. Were there any changes to the environment variables recently? Whether or not I set CSI_ENDPOINT, the plugin does not pick it up correctly. It worked just fine before with the same job file.
```go
// CreateListener creates and binds the unix socket in location specified by the CSI_ENDPOINT environment variable.
func CreateListener() (net.Listener, error) {
	endpoint := os.Getenv("CSI_ENDPOINT")
	if endpoint == "" {
		return nil, errors.New("you need to specify an endpoint via the CSI_ENDPOINT env var")
	}
	if !strings.HasPrefix(endpoint, "unix://") {
		return nil, errors.New("endpoint must start with unix://")
	}
	endpoint = endpoint[7:] // strip unix://
	if err := os.Remove(endpoint); err != nil && !os.IsNotExist(err) {
		return nil, errors.New(fmt.Sprintf("failed to remove socket file at %s: %s", endpoint, err))
	}
	return net.Listen("unix", endpoint)
}
```

```hcl
job "system-csi-hetzner" {
  region      = "eu-central"
  datacenters = ["fsn"]
  namespace   = "system"
  type        = "system"

  group "node" {
    task "plugin" {
      driver = "docker"

      config {
        image      = "hetznercloud/hcloud-csi-driver:1.6.0"
        privileged = true
      }

      env {
        CSI_ENDPOINT   = "unix:///csi/csi.sock"
        HCLOUD_TOKEN   = "xyz"
        ENABLE_METRICS = false
      }

      csi_plugin {
        id        = "csi.hetzner.cloud"
        type      = "monolith"
        mount_dir = "/csi"
      }

      resources {
        cpu        = 50
        memory     = 16
        memory_max = 256
      }
    }
  }
}
```

Also, are there any complete instructions on installing the required dependencies for the
Yeah, Nomad is setting the
There's a contributing guide but mostly what you'll want to do is run
I've opened #12257 with a patch for that, but I need to do some testing which I probably won't get to until tomorrow.
Endpoint env var is fixed in #12257. I ended up going with
Hey @PinkLolicorn, did you manage to get it up & running using Nomad 1.3.0-rc1? I'm also using Hetzner, but today I again noticed that a few jobs went down. I looked into the allocation and noticed this error:
It seems though as if it might be a Hetzner CSI related problem 🤔
Hey @rwojsznis, I actually tested this version yesterday with no success. However, I did not notice this exact error. Here is my setup:

nomad run csi-hetzner.nomad

```hcl
job "system-csi-hetzner" {
  region      = "eu-central"
  datacenters = ["fsn"]
  type        = "system"

  group "node" {
    task "plugin" {
      driver = "docker"

      config {
        image      = "hetznercloud/hcloud-csi-driver:1.6.0"
        privileged = true
      }

      env {
        CSI_ENDPOINT   = "unix:///csi/csi.sock"
        HCLOUD_TOKEN   = "xyz"
        ENABLE_METRICS = false
      }

      csi_plugin {
        id        = "csi.hetzner.cloud"
        type      = "monolith"
        mount_dir = "/csi"
      }

      resources {
        cpu        = 50
        memory     = 16
        memory_max = 256
      }
    }
  }
}
```

nomad volume register myvol.nomad
My volumes do mount, but I cannot get Nomad to reschedule them when a client goes down, and it is even worse than before, because a manual detach from the Hetzner UI does not help to recover. The only solution is to stop the job, deregister the volume, register it again, and then start the job again (spelled out as commands at the end of this comment). At least clients do not go into an unhealthy state forever now. It basically repeats forever that max claims are reached and it cannot unmount the volume:
Running
As stated before, even if the volume is unmounted from the Hetzner UI, Nomad does not recover from this state. I also spotted some funky-looking logs from the hetzner-csi-plugin:
This looks confusing because:
@tgross Nomad should move the volume to the new client after the job has been rescheduled due to a failure, right? 😆
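For completeness, the manual recovery sequence mentioned above as concrete commands. This is only a sketch; "mysql", mysql.nomad, and myvol.nomad are placeholders for my actual job and volume names:

```sh
# Manual recovery sketch; job and volume names are placeholders.
nomad job stop mysql                  # stop the job holding the claim
nomad volume deregister -force mysql  # drop the stuck volume registration
nomad volume register myvol.nomad     # register the volume again
nomad job run mysql.nomad             # start the job back up
```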
I have pretty much the same setup (volume & CSI plugin) and pretty much the same issue: the volume doesn't get rescheduled once the client goes down (1.3.0-rc1). I even have a systemd service that tries to do a clean Nomad shutdown (drain the node) before reboot/shutdown, but that seems to have no effect 🤔 Once I have a little bit more time I will try to provide more feedback, but it's good to hear (?) that I'm not the only one still with the same problem 😅
Hey folks! The old node must be live and have a running plugin on it in order for Nomad to cleanly unmount the volume. You should drain a node before shutting it down, so that the tasks are rescheduled and the plugins are left in place until all the other tasks have been moved. If you don't, you need to do
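As a sketch of that drain step (the node ID is a placeholder; -ignore-system lets the drain finish without stopping system job allocations, which is what keeps the CSI plugin in place):

```sh
# Drain the node but leave system jobs (including the CSI plugin) running.
nomad node drain -enable -ignore-system <node-id>
```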
@tgross ok, but shouldn't Nomad at least recover after the client is back online? This is not the case in my current setup. I noticed some errors like the following, but I'm unsure of their exact context; I think, however, that these were logged when the failed node was turned on again:
Although they don't seem to be important, because they're probably from the time before the CSI container started. I just feel that bringing the client back online shouldn't require a manual deregister for everything to go back to normal.
It shouldn't, so long as the plugin is alive and healthy. You can't crash out the client without also making sure the node plugin is live and expect Nomad to clean this situation up; Nomad can't do anything with the volumes without a plugin in place. That's a constraint of the CSI specification. It doesn't look like the plugin is healthy, given the log entry you're showing. I'm happy to revisit this issue, but there's not a lot I can do with one-liner log snippets. Can you provide specific steps to reproduce and the logs for client, server, and plugins? Allocation status for the plugins would also help.
@tgross in my current cluster state all clients are online and so are the plugins, the volume is unmounted in the Hetzner UI, but the allocation still can't be placed. I noticed that the volume shows 1 allocation, but none is listed, which would make sense considering the displayed error. Apparently I got it previously but did not expect it to be as important as it now appears (noted in a previous message):
I think this state was caused by running Plugin Storage Job
I tried this:
It appears to be stuck for now, with the other previously seen error.
Apparently this made it work? I think the difference is that the state is
After turning the client off (looks ok?):
After turning the client on (marked as failed by the client?):
After waiting some time, the job:
After running |
Hi @PinkLolicorn! Yeah, I think the "lost" state is probably the problem here. An allocation gets marked as lost once the client has missed a heartbeat (unless ...). When the client reconnects, we should see it marked as terminal, and that should allow the "volume watcher" control loop to clean it up without having to wait for GC. I've got an open issue (in a private repo, I'll try to get it moved over) to have the volume watcher detect events from allocations and not just volumes, but that's going to be a bit of a lift to complete. In any case, I would expect that once the allocation moves from lost to terminal (failed or complete), we'd be able to free the claim. I'll see if I can reproduce that specific behavior in a test rig. As you noticed though, the GC job does clean things up once it executes. The window for that is controlled by the
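If you don't want to wait for the periodic GC window, one option is to trigger a garbage collection run manually from the CLI. A sketch, assuming a management token if ACLs are enabled, and assuming (I haven't verified) that the forced run also covers the CSI claim cleanup:

```sh
# Force a server-side garbage collection run instead of waiting for the
# periodic GC job to come around on its own schedule.
nomad system gc
```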
Nomad version
Operating system and Environment details
Ubuntu 20.04.3 LTS running on CPX11 VMs from Hetzner
Issue
Jobs using a CSI volume do not recover from a client failure without human intervention and require the volume to be manually detached from the client using the provider's web UI. The client's storage controller/node health does not return to normal after the instance is back online. Moving the job to another node works with draining (excluding system jobs), but not on failure.
Reproduction steps
hetznercloud/hcloud-csi-driver:1.6.0 as system job
Additional steps
*Failed allocations are gone after some time.
Expected Result
Actual Result
**After a night these all returned back to running, but the CSI controllers/nodes are still not considered healthy.
Job file (CSI plugin)
Job file (MySQL using CSI)
Job failure message
CSI plugin logs
Nomad Server logs
Nomad server logs (next day)
It seems like the volume watcher is still trying to detach the volume, even though the volume currently shows only one running allocation, on the correct node. The log shows two errors, which would correspond to the two previously turned-off clients.
Related
https://discuss.hashicorp.com/t/csi-plugin-health-after-reboot/24723 - similar issue with 2/3 healthy after reboot and "context deadline exceeded" messages (using kubernetes-csi/csi-driver-nfs/)