Fail to detach ceph csi volume from a down node and migrate to another #13450

Closed
enaftali2 opened this issue Jun 21, 2022 · 4 comments

Nomad version

Nomad v1.3.0

Operating system and Environment details

Ubuntu 18.04.6 LTS

Issue

Hi,
We are testing Ceph storage with the Nomad CSI volume plugin. For the POC I created 3 VMs on GCP running a Ceph cluster and a Nomad cluster, with both the client and server roles on all 3 VMs. The CSI plugin and volume creation and attachment work very well. I'm running a MySQL job, and when I restart the node running MySQL I can see the job migrate to another node with the volume and data.
FYI: to run the CSI plugin, MySQL, and volume creation I used the guide in the Ceph documentation: https://docs.ceph.com/en/latest/rbd/rbd-nomad/

The issue starts when I run shutdown -h now on the node running MySQL. After about 10 minutes the allocation is marked as Lost, and a new allocation tries to start and gets stuck in Pending status forever.
As you can see in the logs below, Nomad fails to detach the volume from the node that is currently down.
I also want to mention that even though Ceph also lost 1 node in the test I ran, it seems to be working and accessible, and there are no errors in the CSI plugin.

Reproduction steps

Shut down the machine running a job that uses a volume created from Ceph by the CSI plugin.

Expected Result

The job with the external volume migrates to another node with the same volume attached.

Actual Result

The job tries to migrate and gets stuck in Pending.

CSI Job files

ceph-csi-plugin-controller.nomad

job "ceph-csi-plugin-controller" {
  datacenters = ["dc1"]
  group "controller" {
    network {
      port "metrics" {}
    }
    task "ceph-controller" {
      template {
        data        = <<EOF
[{
    "clusterID": "b9127830-b0cc-4e34-aa47-9d1a2e9949a8",
    "monitors": [
        "10.155.0.16",
        "10.155.0.54",
        "10.155.0.59"
    ]
}]
EOF
        destination = "local/config.json"
        change_mode = "restart"
      }
      driver = "docker"
      config {
        image = "quay.io/cephcsi/cephcsi:v3.3.1"
        volumes = [
          "./local/config.json:/etc/ceph-csi-config/config.json"
        ]
        mounts = [
          {
            type     = "tmpfs"
            target   = "/tmp/csi/keys"
            readonly = false
            tmpfs_options = {
              size = 1000000 # size in bytes
            }
          }
        ]
        args = [
          "--type=rbd",
          "--controllerserver=true",
          "--drivername=rbd.csi.ceph.com",
          "--endpoint=unix://csi/csi.sock",
          "--nodeid=${node.unique.name}",
          "--instanceid=${node.unique.name}-controller",
          "--pidlimit=-1",
          "--logtostderr=true",
          "--v=5",
          "--metricsport=$${NOMAD_PORT_metrics}"
        ]
      }
      resources {
        cpu    = 500
        memory = 256
      }
      service {
        name = "ceph-csi-controller"
        port = "metrics"
        tags = [ "prometheus" ]
      }
      csi_plugin {
        id        = "ceph-csi"
        type      = "controller"
        mount_dir = "/csi"
      }
    }
  }
}

ceph-csi-plugin-nodes.nomad

job "ceph-csi-plugin-nodes" {
  datacenters = ["dc1"]
  type        = "system"
  group "nodes" {
    network {
      port "metrics" {}
    }
    task "ceph-node" {
      driver = "docker"
      template {
        data        = <<EOF
[{
    "clusterID": "b9127830-b0cc-4e34-aa47-9d1a2e9949a8",
    "monitors": [
        "10.155.0.16",
        "10.155.0.54",
        "10.155.0.59"
    ]
}]
EOF
        destination = "local/config.json"
        change_mode = "restart"
      }
      config {
        image = "quay.io/cephcsi/cephcsi:v3.3.1"
        volumes = [
          "./local/config.json:/etc/ceph-csi-config/config.json"
        ]
        mounts = [
          {
            type     = "tmpfs"
            target   = "/tmp/csi/keys"
            readonly = false
            tmpfs_options = {
              size = 1000000 # size in bytes
            }
          }
        ]
        args = [
          "--type=rbd",
          "--drivername=rbd.csi.ceph.com",
          "--nodeserver=true",
          "--endpoint=unix://csi/csi.sock",
          "--nodeid=${node.unique.name}",
          "--instanceid=${node.unique.name}-nodes",
          "--pidlimit=-1",
          "--logtostderr=true",
          "--v=5",
          "--metricsport=$${NOMAD_PORT_metrics}"
        ]
        privileged = true
      }
      resources {
        cpu    = 500
        memory = 256
      }
      service {
        name = "ceph-csi-nodes"
        port = "metrics"
        tags = [ "prometheus" ]
      }
      csi_plugin {
        id        = "ceph-csi"
        type      = "node"
        mount_dir = "/csi"
      }
    }
  }
}

Volume file

ceph-volume.hcl

id = "ceph-mysql2"
name = "ceph-mysql2"
external_id = "ceph-mysql2"
type = "csi"
plugin_id = "ceph-csi"
capacity_max = "5G"
capacity_min = "2G"

capability {
  access_mode     = "single-node-writer"
  attachment_mode = "file-system"
}

secrets {
  userID  = "nomad"
  userKey = "AQAlh9Rgg2vrDxAARy25T7KHabs6iskSHpAEAQ=="
}

context {
  clusterID = "b9127830-b0cc-4e34-aa47-9d1a2e9949a8"
  pool = "nomad"
  imageFeatures = "layering"
}

parameters {
  clusterID = "b9127830-b0cc-4e34-aa47-9d1a2e9949a8"
  pool = "nomad"
  imageFeatures = "layering"
}

MySQL job file

mysql.nomad

job "mysql-server2" {
  datacenters = ["dc1"]
  type        = "service"
  group "mysql-server" {
    count = 1
    volume "ceph-mysql2" {
      type      = "csi"
      attachment_mode = "file-system"
      access_mode     = "single-node-writer"
      read_only = false
      source    = "ceph-mysql2"
    }
    network {
      port "db" {
        to = 3306
      }
    }
    restart {
      attempts = 10
      interval = "5m"
      delay    = "25s"
      mode     = "delay"
    }
    task "mysql-server" {
      driver = "docker"
      volume_mount {
        volume      = "ceph-mysql2"
        destination = "/srv"
        read_only   = false
      }
      env {
        MYSQL_ROOT_PASSWORD = "password"
      }
      config {
        image = "hashicorp/mysql-portworx-demo:latest"
        args  = ["--datadir", "/srv/mysql"]
        ports = ["db"]
      }
      resources {
        cpu    = 500
        memory = 1024
      }
      service {
        name = "mysql-server"
        port = "db"
        check {
          type     = "tcp"
          interval = "10s"
          timeout  = "2s"
        }
      }
    }
  }
}

Nomad logs

2022-06-21T15:35:00.035Z [ERROR] nomad.fsm: CSIVolumeClaim failed: error="volume max claims reached"
2022-06-21T15:35:00.036Z [ERROR] client.rpc: error performing RPC to server: error="rpc error: volume max claims reached" rpc=CSIVolume.Claim server=10.155.0.16:4647
2022-06-21T15:35:00.036Z [ERROR] client.rpc: error performing RPC to server which is not safe to automatically retry: error="rpc error: volume max claims reached" rpc=CSIVolume.Claim server=10.155.0.16:4647
2022-06-21T15:35:57.747Z [ERROR] nomad.volumes_watcher: error releasing volume claims: namespace=default volume_id=ceph-mysql2
  error=
  | 1 error occurred:
  | 	* could not detach from node: No path to node
  |
@tgross
Member

tgross commented Jun 21, 2022

Hi @enaftali2! This issue has been fixed in #13301 and will ship in Nomad 1.3.2. Essentially the problem is that there's no way for the server to send a node unpublish command to the node plugin that's running on a down node without violating the CSI spec. We've decided to break strict compliance in order to make non-graceful shutdown work.

In the meantime, you can avoid this condition by draining a node before shutting it down.

@enaftali2
Author

enaftali2 commented Jun 26, 2022

Hi @tgross, thanks, you were very helpful. Since I saw the fix was merged to master, I built a new binary from master that includes it. The issue is fixed and the cluster is behaving as expected.
I have a question: after I ungracefully shut down the machine, I can see the alloc in the Lost state very quickly, but it takes about 5-6 minutes for the alloc to reach the Running state again. It stays in Pending all this time.
Is there a way to make this time shorter, or will there be in the future?
This is the log I see in nomad monitor:
2022-06-26T14:33:58.529Z [ERROR] nomad.fsm: CSIVolumeClaim failed: error="volume max claims reached"
Is there a way to increase the max claims for a volume?

@tgross
Member

tgross commented Jun 27, 2022

I have a question: after I ungracefully shut down the machine, I can see the alloc in the Lost state very quickly, but it takes about 5-6 minutes for the alloc to reach the Running state again. It stays in Pending all this time.
Is there a way to make this time shorter, or will there be in the future?

That timeout is governed by the client heartbeat timeout, which isn't currently configurable. You can also force your jobs to immediately stop by setting a stop_after_client_disconnect timeout, but note that this will impact cases where the client agent has simply restarted (which is normally fine). The best way to handle this case is to drain nodes before stopping them whenever possible.
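
For readers following along, here is a minimal sketch of where such a timeout would go in the mysql.nomad job from this issue. It assumes the Nomad 1.3 group-level jobspec parameter named stop_after_client_disconnect. The "1m" duration is an illustrative assumption, not a value suggested in this thread, and the task is trimmed for brevity.

```hcl
# Sketch only: "1m" is an assumed, illustrative duration.
job "mysql-server2" {
  datacenters = ["dc1"]
  type        = "service"

  group "mysql-server" {
    count = 1

    # A client that cannot reach the servers stops this group's
    # allocations once it has been disconnected for this long.
    stop_after_client_disconnect = "1m"

    volume "ceph-mysql2" {
      type            = "csi"
      attachment_mode = "file-system"
      access_mode     = "single-node-writer"
      read_only       = false
      source          = "ceph-mysql2"
    }

    network {
      port "db" {
        to = 3306
      }
    }

    task "mysql-server" {
      driver = "docker"

      volume_mount {
        volume      = "ceph-mysql2"
        destination = "/srv"
        read_only   = false
      }

      env {
        MYSQL_ROOT_PASSWORD = "password"
      }

      config {
        image = "hashicorp/mysql-portworx-demo:latest"
        args  = ["--datadir", "/srv/mysql"]
        ports = ["db"]
      }
    }
  }
}
```

As noted above, this also affects allocations when the client agent is merely restarting or briefly partitioned, so the duration is a trade-off and draining remains the preferred path.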

Is there a way to increase the max claims for a volume?

Use a non-single-node access_mode but note that this may not be available for the kind of volume you're using (this is controlled by the third-party storage provider) and may be unsafe for the consuming application as well if it doesn't have a way of coordinating multi-writer access.
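
As a rough illustration of what a non-single-node access mode looks like, the fragment below changes only the capability block of ceph-volume.hcl and the matching volume block of the job. The name multi-node-multi-writer is one of Nomad's CSI access modes. Whether the Ceph RBD plugin accepts it with the file-system attachment mode is not confirmed anywhere in this thread, and MySQL does not coordinate multiple writers on one data directory, so treat this as a syntax sketch rather than a recommended configuration for this workload.

```hcl
# ceph-volume.hcl (fragment): other fields unchanged from the volume file above.
# Assumption: the storage plugin supports this access mode for this volume type.
capability {
  access_mode     = "multi-node-multi-writer"
  attachment_mode = "file-system"
}
```

```hcl
# mysql.nomad (fragment): the group's volume block must request the same mode.
volume "ceph-mysql2" {
  type            = "csi"
  attachment_mode = "file-system"
  access_mode     = "multi-node-multi-writer"
  read_only       = false
  source          = "ceph-mysql2"
}
```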

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 26, 2022