Nomad is unable to create CSI plugin due to being unable to probe the CSI driver #7931

Closed
KallynGowdy opened this issue May 12, 2020 · 12 comments

Nomad version

Nomad v0.11.1 (b434570)

Operating system and Environment details

$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 20.04 LTS
Release:        20.04
Codename:       focal

Issue

Nomad is stuck probing an "unavailable" CSI plugin via gRPC.

For some background, I've been following a mixture of the Nomad stateful workloads tutorial and the Nomad Terraform AWS example project.

I've put a longer version down below, but the gist is that while the csi.sock files are being created both on the Nomad side and on the container (CSI driver) side, they seem unable to communicate. This causes Nomad to repeatedly try to probe the plugin but to no avail.

I've checked the normal culprits, namely that both Nomad and Docker (dockerd & containerd) have the correct permissions and that I'm using Nomad jobs from the tutorial. I've also tried restarting the Nomad service to see if it was an initialization issue.

Reproduction steps

Here's my setup so far:

  1. Packer builds a machine image with Docker, Nomad, and Consul. (source)
  2. Use Terraform to deploy a single server instance, configure it, and produce Job files.
  3. Use the Nomad CLI to run the Jobs.
    • Sources
    • Commands
      • $ nomad job run -address $NOMAD_ADDRESS -token $NOMAD_TOKEN ./out/aws-ebs-controller.hcl
      • $ nomad job run -address $NOMAD_ADDRESS -token $NOMAD_TOKEN ./out/aws-ebs-nodes.hcl
  4. Check the job status to ensure they're running:
    $ nomad job status -address $NOMAD_ADDRESS -token $NOMAD_TOKEN
    ID                         Type     Priority  Status   Submit Date
    plugin-aws-ebs-controller  service  50        running  2020-05-12T02:41:07-04:00
    plugin-aws-ebs-nodes       system   50        running  2020-05-12T02:44:02-04:00
    
  5. Check the plugin status to ensure it's running.
    • At this point, I get a result like this:
      $ nomad plugin status -address $NOMAD_ADDRESS -token $NOMAD_TOKEN
      Container Storage Interface
      No CSI plugins
      
    • This led me to check the task and Nomad logs for more information, but not much was available.
    • The system log (journalctl) also did not appear to contain anything useful.

Job files

aws-ebs-controller.hcl
# A nomad job to run the AWS EBS CSI plugin controller
# See https://learn.hashicorp.com/nomad/stateful-workloads/csi-volumes
job "plugin-aws-ebs-controller" {
  datacenters = ["us-east-1"]

  group "controller" {
      task "plugin" {
          driver = "docker"

          config {
              image = "amazon/aws-ebs-csi-driver:latest"

              args = [
              "controller",
              "--endpoint=unix://csi/csi.sock",
              "--logtostderr",
              "--v=5",
              ]
          }

          csi_plugin {
              id        = "aws-ebs0"
              type      = "controller"
              mount_dir = "/csi"
          }

          resources {
              cpu    = 500
              memory = 256
          }
      }
  }
}
aws-ebs-nodes.hcl
# A nomad job file that deploys an AWS EBS CSI plugin node on each machine
job "plugin-aws-ebs-nodes" {
    datacenters = ["us-east-1"]

    # you can run node plugins as service jobs as well, but this ensures
    # that all nodes in the DC have a copy.
    type = "system"

    group "nodes" {
        task "plugin" {
            driver = "docker"

            config {
                image = "amazon/aws-ebs-csi-driver:latest"

                args = [
                    "node",
                    "--endpoint=unix://csi/csi.sock",
                    "--logtostderr",
                    "--v=5",
                ]

                # node plugins must run as privileged jobs because they
                # mount disks to the host
                privileged = true
            }

            csi_plugin {
                id        = "aws-ebs0"
                type      = "node"
                mount_dir = "/csi"
            }

            resources {
                cpu    = 500
                memory = 256
            }
        }
    }
}

Nomad Client logs

Here are the logs from the Nomad jobs:

plugin-aws-ebs-controller

Logs from the plugin-aws-ebs-controller job.
They're from stderr since that is where the AWS EBS driver puts them. There are no logs in stdout.

I0512 06:41:08.405847       1 driver.go:62] Driver: ebs.csi.aws.com Version: v0.6.0-dirty
I0512 06:41:11.553676       1 driver.go:132] Listening for connections on address: &net.UnixAddr{Name:"/csi/csi.sock", Net:"unix"}
plugin-aws-ebs-nodes

Logs from the plugin-aws-ebs-nodes job.
They're from stderr since that is where the AWS EBS driver puts them. There are no logs in stdout.

I0512 06:44:02.497100       1 driver.go:62] Driver: ebs.csi.aws.com Version: v0.6.0-dirty
I0512 06:44:05.650222       1 mount_linux.go:163] Cannot run systemd-run, assuming non-systemd OS
I0512 06:44:05.650242       1 mount_linux.go:164] systemd-run failed with: exit status 1
I0512 06:44:05.650253       1 mount_linux.go:165] systemd-run output: Failed to create bus connection: No such file or directory
I0512 06:44:05.650436       1 driver.go:132] Listening for connections on address: &net.UnixAddr{Name:"/csi/csi.sock", Net:"unix"}

Additionally, I think it is useful to include the Nomad logs from journalctl:

nomad.service
May 12 06:19:51 ip-10-0-0-205 systemd[1]: Started "HashiCorp Nomad".
May 12 06:19:51 ip-10-0-0-205 nomad[2765]: ==> WARNING: Bootstrap mode enabled! Potentially unsafe operation.
May 12 06:19:51 ip-10-0-0-205 nomad[2765]: ==> Loaded configuration from /etc/nomad.d/nomad_config.hcl
May 12 06:19:51 ip-10-0-0-205 nomad[2765]: ==> Starting Nomad agent...
May 12 06:19:51 ip-10-0-0-205 nomad[2765]: ==> Nomad agent configuration:
May 12 06:19:51 ip-10-0-0-205 nomad[2765]:        Advertise Addrs: HTTP: 10.0.0.205:4646; RPC: 10.0.0.205:4647; Serf: 10.0.0.205:4648
May 12 06:19:51 ip-10-0-0-205 nomad[2765]:             Bind Addrs: HTTP: 0.0.0.0:4646; RPC: 0.0.0.0:4647; Serf: 0.0.0.0:4648
May 12 06:19:51 ip-10-0-0-205 nomad[2765]:                 Client: true
May 12 06:19:51 ip-10-0-0-205 nomad[2765]:              Log Level: INFO
May 12 06:19:51 ip-10-0-0-205 nomad[2765]:                 Region: global (DC: us-east-1)
May 12 06:19:51 ip-10-0-0-205 nomad[2765]:                 Server: true
May 12 06:19:51 ip-10-0-0-205 nomad[2765]:                Version: 0.11.1
May 12 06:19:51 ip-10-0-0-205 nomad[2765]: ==> Nomad agent started! Log data will stream in below:
May 12 06:19:51 ip-10-0-0-205 nomad[2765]:     2020-05-12T06:19:51.325Z [INFO]  agent: detected plugin: name=raw_exec type=driver plugin_version=0.1.0
May 12 06:19:51 ip-10-0-0-205 nomad[2765]:     2020-05-12T06:19:51.325Z [INFO]  agent: detected plugin: name=exec type=driver plugin_version=0.1.0
May 12 06:19:51 ip-10-0-0-205 nomad[2765]:     2020-05-12T06:19:51.325Z [INFO]  agent: detected plugin: name=qemu type=driver plugin_version=0.1.0
May 12 06:19:51 ip-10-0-0-205 nomad[2765]:     2020-05-12T06:19:51.325Z [INFO]  agent: detected plugin: name=java type=driver plugin_version=0.1.0
May 12 06:19:51 ip-10-0-0-205 nomad[2765]:     2020-05-12T06:19:51.325Z [INFO]  agent: detected plugin: name=docker type=driver plugin_version=0.1.0
May 12 06:19:51 ip-10-0-0-205 nomad[2765]:     2020-05-12T06:19:51.325Z [INFO]  agent: detected plugin: name=nvidia-gpu type=device plugin_version=0.1.0
May 12 06:19:51 ip-10-0-0-205 nomad[2765]:     2020-05-12T06:19:51.333Z [INFO]  nomad.raft: initial configuration: index=1 servers="[{Suffrage:Voter ID:10.0.0.205:4647 Address:10.0.0.205:4647}]"
May 12 06:19:51 ip-10-0-0-205 nomad[2765]:     2020-05-12T06:19:51.333Z [INFO]  nomad.raft: entering follower state: follower="Node at 10.0.0.205:4647 [Follower]" leader=
May 12 06:19:51 ip-10-0-0-205 nomad[2765]:     2020-05-12T06:19:51.334Z [INFO]  nomad: serf: EventMemberJoin: ip-10-0-0-205.global 10.0.0.205
May 12 06:19:51 ip-10-0-0-205 nomad[2765]:     2020-05-12T06:19:51.334Z [INFO]  nomad: starting scheduling worker(s): num_workers=2 schedulers=[system, service, batch, _core]
May 12 06:19:51 ip-10-0-0-205 nomad[2765]:     2020-05-12T06:19:51.334Z [INFO]  nomad: adding server: server="ip-10-0-0-205.global (Addr: 10.0.0.205:4647) (DC: us-east-1)"
May 12 06:19:51 ip-10-0-0-205 nomad[2765]:     2020-05-12T06:19:51.334Z [INFO]  client: using state directory: state_dir=/opt/nomad/data/client
May 12 06:19:51 ip-10-0-0-205 nomad[2765]:     2020-05-12T06:19:51.334Z [INFO]  client: using alloc directory: alloc_dir=/opt/nomad/data/alloc
May 12 06:19:51 ip-10-0-0-205 nomad[2765]:     2020-05-12T06:19:51.334Z [WARN]  nomad: serf: Failed to re-join any previously known node
May 12 06:19:51 ip-10-0-0-205 nomad[2765]:     2020-05-12T06:19:51.336Z [INFO]  client.fingerprint_mgr.cgroup: cgroups are available
May 12 06:19:51 ip-10-0-0-205 nomad[2765]:     2020-05-12T06:19:51.338Z [INFO]  client.fingerprint_mgr.consul: consul agent is available
May 12 06:19:51 ip-10-0-0-205 nomad[2765]:     2020-05-12T06:19:51.342Z [WARN]  client.fingerprint_mgr.network: unable to parse speed: path=/usr/sbin/ethtool device=ens5
May 12 06:19:51 ip-10-0-0-205 nomad[2765]:     2020-05-12T06:19:51.352Z [INFO]  client.plugin: starting plugin manager: plugin-type=csi
May 12 06:19:51 ip-10-0-0-205 nomad[2765]:     2020-05-12T06:19:51.352Z [INFO]  client.plugin: starting plugin manager: plugin-type=driver
May 12 06:19:51 ip-10-0-0-205 nomad[2765]:     2020-05-12T06:19:51.352Z [INFO]  client.plugin: starting plugin manager: plugin-type=device
May 12 06:19:51 ip-10-0-0-205 nomad[2765]:     2020-05-12T06:19:51.373Z [INFO]  client: started client: node_id=87c3dfd5-f174-bb45-e833-309d75904164
May 12 06:19:51 ip-10-0-0-205 nomad[2765]:     2020-05-12T06:19:51.390Z [INFO]  client.alloc_runner.task_runner.task_hook.csi_client: finished client unary call: alloc_id=7212a04c-0644-3c8b-7088-15a86d1e0953 plugin.name=aws-ebs0 plugin.type=controller task=plugin grpc.code=Unavailable duration=169.726µs grpc.service=csi.v1.Identity grpc.method=Probe
May 12 06:19:51 ip-10-0-0-205 nomad[2765]:     2020-05-12T06:19:51.392Z [INFO]  client.alloc_runner.task_runner.task_hook.csi_client: finished client unary call: alloc_id=291cbde8-dbb7-b2f0-d4f7-9985258ed4ec plugin.name=aws-ebs0 plugin.type=node task=plugin grpc.code=Unavailable duration=859.151µs grpc.service=csi.v1.Identity grpc.method=Probe
May 12 06:19:52 ip-10-0-0-205 nomad[2765]:     2020-05-12T06:19:52.645Z [WARN]  nomad.raft: heartbeat timeout reached, starting election: last-leader=
May 12 06:19:52 ip-10-0-0-205 nomad[2765]:     2020-05-12T06:19:52.645Z [INFO]  nomad.raft: entering candidate state: node="Node at 10.0.0.205:4647 [Candidate]" term=4
May 12 06:19:52 ip-10-0-0-205 nomad[2765]:     2020-05-12T06:19:52.649Z [INFO]  nomad.raft: election won: tally=1
May 12 06:19:52 ip-10-0-0-205 nomad[2765]:     2020-05-12T06:19:52.649Z [INFO]  nomad.raft: entering leader state: leader="Node at 10.0.0.205:4647 [Leader]"
May 12 06:19:52 ip-10-0-0-205 nomad[2765]:     2020-05-12T06:19:52.649Z [INFO]  nomad: cluster leadership acquired
May 12 06:19:52 ip-10-0-0-205 nomad[2765]:     2020-05-12T06:19:52.756Z [INFO]  client: node registration complete
May 12 06:19:56 ip-10-0-0-205 nomad[2765]:     2020-05-12T06:19:56.390Z [INFO]  client.alloc_runner.task_runner.task_hook.csi_client: finished client unary call: alloc_id=7212a04c-0644-3c8b-7088-15a86d1e0953 plugin.name=aws-ebs0 plugin.type=controller task=plugin grpc.code=Unavailable duration=156.641µs grpc.service=csi.v1.Identity grpc.method=Probe
May 12 06:19:56 ip-10-0-0-205 nomad[2765]:     2020-05-12T06:19:56.393Z [INFO]  client.alloc_runner.task_runner.task_hook.csi_client: finished client unary call: alloc_id=291cbde8-dbb7-b2f0-d4f7-9985258ed4ec plugin.name=aws-ebs0 plugin.type=node task=plugin grpc.code=Unavailable duration=108.303µs grpc.service=csi.v1.Identity grpc.method=Probe
May 12 06:19:59 ip-10-0-0-205 nomad[2765]:     2020-05-12T06:19:59.483Z [INFO]  client: node registration complete
May 12 06:20:01 ip-10-0-0-205 nomad[2765]:     2020-05-12T06:20:01.392Z [INFO]  client.alloc_runner.task_runner.task_hook.csi_client: finished client unary call: alloc_id=7212a04c-0644-3c8b-7088-15a86d1e0953 plugin.name=aws-ebs0 plugin.type=controller task=plugin grpc.code=Unavailable duration=146.932µs grpc.service=csi.v1.Identity grpc.method=Probe
May 12 06:20:01 ip-10-0-0-205 nomad[2765]:     2020-05-12T06:20:01.398Z [INFO]  client.alloc_runner.task_runner.task_hook.csi_client: finished client unary call: alloc_id=291cbde8-dbb7-b2f0-d4f7-9985258ed4ec plugin.name=aws-ebs0 plugin.type=node task=plugin grpc.code=Unavailable duration=137.413µs grpc.service=csi.v1.Identity grpc.method=Probe
May 12 06:20:06 ip-10-0-0-205 nomad[2765]:     2020-05-12T06:20:06.392Z [INFO]  client.alloc_runner.task_runner.task_hook.csi_client: finished client unary call: alloc_id=7212a04c-0644-3c8b-7088-15a86d1e0953 plugin.name=aws-ebs0 plugin.type=controller task=plugin grpc.code=Unavailable duration=181.363µs grpc.service=csi.v1.Identity grpc.method=Probe
May 12 06:20:06 ip-10-0-0-205 nomad[2765]:     2020-05-12T06:20:06.399Z [INFO]  client.alloc_runner.task_runner.task_hook.csi_client: finished client unary call: alloc_id=291cbde8-dbb7-b2f0-d4f7-9985258ed4ec plugin.name=aws-ebs0 plugin.type=node task=plugin grpc.code=Unavailable duration=161.475µs grpc.service=csi.v1.Identity grpc.method=Probe
May 12 06:20:11 ip-10-0-0-205 nomad[2765]:     2020-05-12T06:20:11.393Z [INFO]  client.alloc_runner.task_runner.task_hook.csi_client: finished client unary call: alloc_id=7212a04c-0644-3c8b-7088-15a86d1e0953 plugin.name=aws-ebs0 plugin.type=controller task=plugin grpc.code=Unavailable duration=103.28µs grpc.service=csi.v1.Identity grpc.method=Probe
May 12 06:20:11 ip-10-0-0-205 nomad[2765]:     2020-05-12T06:20:11.399Z [INFO]  client.alloc_runner.task_runner.task_hook.csi_client: finished client unary call: alloc_id=291cbde8-dbb7-b2f0-d4f7-9985258ed4ec plugin.name=aws-ebs0 plugin.type=node task=plugin grpc.code=Unavailable duration=109.305µs grpc.service=csi.v1.Identity grpc.method=Probe
May 12 06:20:16 ip-10-0-0-205 nomad[2765]:     2020-05-12T06:20:16.393Z [INFO]  client.alloc_runner.task_runner.task_hook.csi_client: finished client unary call: alloc_id=7212a04c-0644-3c8b-7088-15a86d1e0953 plugin.name=aws-ebs0 plugin.type=controller task=plugin grpc.code=Unavailable duration=135.958µs grpc.service=csi.v1.Identity grpc.method=Probe
May 12 06:20:16 ip-10-0-0-205 nomad[2765]:     2020-05-12T06:20:16.402Z [INFO]  client.alloc_runner.task_runner.task_hook.csi_client: finished client unary call: alloc_id=291cbde8-dbb7-b2f0-d4f7-9985258ed4ec plugin.name=aws-ebs0 plugin.type=node task=plugin grpc.code=Unavailable duration=127.133µs grpc.service=csi.v1.Identity grpc.method=Probe
May 12 06:20:21 ip-10-0-0-205 nomad[2765]:     2020-05-12T06:20:21.394Z [INFO]  client.alloc_runner.task_runner.task_hook.csi_client: finished client unary call: alloc_id=7212a04c-0644-3c8b-7088-15a86d1e0953 plugin.name=aws-ebs0 plugin.type=controller task=plugin grpc.code=Unavailable duration=112.27µs grpc.service=csi.v1.Identity grpc.method=Probe
May 12 06:20:21 ip-10-0-0-205 nomad[2765]:     2020-05-12T06:20:21.402Z [INFO]  client.alloc_runner.task_runner.task_hook.csi_client: finished client unary call: alloc_id=291cbde8-dbb7-b2f0-d4f7-9985258ed4ec plugin.name=aws-ebs0 plugin.type=node task=plugin grpc.code=Unavailable duration=157.461µs grpc.service=csi.v1.Identity grpc.method=Probe
May 12 06:20:26 ip-10-0-0-205 nomad[2765]:     2020-05-12T06:20:26.394Z [INFO]  client.alloc_runner.task_runner.task_hook.csi_client: finished client unary call: alloc_id=7212a04c-0644-3c8b-7088-15a86d1e0953 plugin.name=aws-ebs0 plugin.type=controller task=plugin grpc.code=Unavailable duration=123.503µs grpc.service=csi.v1.Identity grpc.method=Probe
May 12 06:20:26 ip-10-0-0-205 nomad[2765]:     2020-05-12T06:20:26.403Z [INFO]  client.alloc_runner.task_runner.task_hook.csi_client: finished client unary call: alloc_id=291cbde8-dbb7-b2f0-d4f7-9985258ed4ec plugin.name=aws-ebs0 plugin.type=node task=plugin grpc.code=Unavailable duration=112.719µs grpc.service=csi.v1.Identity grpc.method=Probe
May 12 06:20:31 ip-10-0-0-205 nomad[2765]:     2020-05-12T06:20:31.395Z [INFO]  client.alloc_runner.task_runner.task_hook.csi_client: finished client unary call: alloc_id=7212a04c-0644-3c8b-7088-15a86d1e0953 plugin.name=aws-ebs0 plugin.type=controller task=plugin grpc.code=Unavailable duration=389.405µs grpc.service=csi.v1.Identity grpc.method=Probe
May 12 06:20:31 ip-10-0-0-205 nomad[2765]:     2020-05-12T06:20:31.403Z [INFO]  client.alloc_runner.task_runner.task_hook.csi_client: finished client unary call: alloc_id=291cbde8-dbb7-b2f0-d4f7-9985258ed4ec plugin.name=aws-ebs0 plugin.type=node task=plugin grpc.code=Unavailable duration=227.113µs grpc.service=csi.v1.Identity grpc.method=Probe
May 12 06:20:36 ip-10-0-0-205 nomad[2765]:     2020-05-12T06:20:36.395Z [INFO]  client.alloc_runner.task_runner.task_hook.csi_client: finished client unary call: alloc_id=7212a04c-0644-3c8b-7088-15a86d1e0953 plugin.name=aws-ebs0 plugin.type=controller task=plugin grpc.code=Unavailable duration=139.46µs grpc.service=csi.v1.Identity grpc.method=Probe
May 12 06:20:36 ip-10-0-0-205 nomad[2765]:     2020-05-12T06:20:36.404Z [INFO]  client.alloc_runner.task_runner.task_hook.csi_client: finished client unary call: alloc_id=291cbde8-dbb7-b2f0-d4f7-9985258ed4ec plugin.name=aws-ebs0 plugin.type=node task=plugin grpc.code=Unavailable duration=154.106µs grpc.service=csi.v1.Identity grpc.method=Probe
May 12 06:20:41 ip-10-0-0-205 nomad[2765]:     2020-05-12T06:20:41.396Z [INFO]  client.alloc_runner.task_runner.task_hook.csi_client: finished client unary call: alloc_id=7212a04c-0644-3c8b-7088-15a86d1e0953 plugin.name=aws-ebs0 plugin.type=controller task=plugin grpc.code=Unavailable duration=183.23µs grpc.service=csi.v1.Identity grpc.method=Probe
May 12 06:20:41 ip-10-0-0-205 nomad[2765]:     2020-05-12T06:20:41.404Z [INFO]  client.alloc_runner.task_runner.task_hook.csi_client: finished client unary call: alloc_id=291cbde8-dbb7-b2f0-d4f7-9985258ed4ec plugin.name=aws-ebs0 plugin.type=node task=plugin grpc.code=Unavailable duration=112.348µs grpc.service=csi.v1.Identity grpc.method=Probe
May 12 06:20:46 ip-10-0-0-205 nomad[2765]:     2020-05-12T06:20:46.397Z [INFO]  client.alloc_runner.task_runner.task_hook.csi_client: finished client unary call: alloc_id=7212a04c-0644-3c8b-7088-15a86d1e0953 plugin.name=aws-ebs0 plugin.type=controller task=plugin grpc.code=Unavailable duration=117.051µs grpc.service=csi.v1.Identity grpc.method=Probe
May 12 06:20:46 ip-10-0-0-205 nomad[2765]:     2020-05-12T06:20:46.405Z [INFO]  client.alloc_runner.task_runner.task_hook.csi_client: finished client unary call: alloc_id=291cbde8-dbb7-b2f0-d4f7-9985258ed4ec plugin.name=aws-ebs0 plugin.type=node task=plugin grpc.code=Unavailable duration=108.27µs grpc.service=csi.v1.Identity grpc.method=Probe
May 12 06:20:51 ip-10-0-0-205 nomad[2765]:     2020-05-12T06:20:51.397Z [INFO]  client.alloc_runner.task_runner.task_hook.csi_client: finished client unary call: alloc_id=7212a04c-0644-3c8b-7088-15a86d1e0953 plugin.name=aws-ebs0 plugin.type=controller task=plugin grpc.code=Unavailable duration=113.075µs grpc.service=csi.v1.Identity grpc.method=Probe
May 12 06:20:51 ip-10-0-0-205 nomad[2765]:     2020-05-12T06:20:51.405Z [INFO]  client.alloc_runner.task_runner.task_hook.csi_client: finished client unary call: alloc_id=291cbde8-dbb7-b2f0-d4f7-9985258ed4ec plugin.name=aws-ebs0 plugin.type=node task=plugin grpc.code=Unavailable duration=104.191µs grpc.service=csi.v1.Identity grpc.method=Probe

Other research

Note that gRPC calls are being made for the Probe method and they're all returning with an Unavailable code.

A quick search online gives a list of gRPC status codes with this note about Unavailable:

The service is currently unavailable. This is most likely a transient condition, which can be corrected by retrying with a backoff. Note that it is not always safe to retry non-idempotent operations.

I'm assuming this means that Nomad is unable to establish a connection over the /csi/csi.sock socket, but I don't know how to test or validate such a scenario.
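
One way to validate it might be to dial the socket directly and issue the same csi.v1.Identity/Probe call that shows up in the Nomad logs. Here's a rough Go sketch against the CSI spec's generated client (it assumes google.golang.org/grpc and github.com/container-storage-interface/spec/lib/go/csi are available; this is not Nomad's code, just the same kind of call, using the socket path from this host):

package main

import (
	"context"
	"fmt"
	"log"
	"net"
	"time"

	csi "github.com/container-storage-interface/spec/lib/go/csi"
	"google.golang.org/grpc"
)

func main() {
	// Socket path from the Nomad state directory layout described below.
	sock := "/opt/nomad/data/client/csi/controller/aws-ebs0/csi.sock"

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// Dial the Unix domain socket directly. A permissions problem shows up
	// here as a connection error, which Nomad reports as grpc.code=Unavailable.
	conn, err := grpc.DialContext(ctx, "unix://"+sock,
		grpc.WithInsecure(),
		grpc.WithBlock(),
		grpc.WithContextDialer(func(ctx context.Context, _ string) (net.Conn, error) {
			return (&net.Dialer{}).DialContext(ctx, "unix", sock)
		}),
	)
	if err != nil {
		log.Fatalf("dial failed (check socket ownership/permissions): %v", err)
	}
	defer conn.Close()

	// The same RPC the client logs show: csi.v1.Identity/Probe.
	resp, err := csi.NewIdentityClient(conn).Probe(ctx, &csi.ProbeRequest{})
	if err != nil {
		log.Fatalf("Probe failed: %v", err)
	}
	fmt.Println("Probe succeeded, ready =", resp.GetReady().GetValue())
}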

I also searched around for where the logs are originating from in the Nomad source.

It seems that the Probe gRPC method is being invoked from the PluginProbe() method. This function is in turn called from supervisorLoopOnce(), which I believe is used by ensureSupervisorLoop() to help set up the plugin.

The log could also come from a call stack that includes the registerPlugin() function since both registerPlugin() and supervisorLoopOnce() specify csi_client as the logger name.

Finally, it seems like Nomad calculates the socket address by joining a mount point with the CSI Socket Name (csi.sock).

The mount point is calculated here, and is the result of joining the CSI "root directory" with the plugin type (controller/node) and id (aws-ebs0).

The root directory is in turn calculated here by joining the client "State Directory" with the literal string "csi". As we can see above, the state directory is /opt/nomad/data/client, and when I checked, there was a csi folder containing csi.sock "files" for both the controller and node subdirectories (with aws-ebs0).
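
To make the joins concrete, here is a tiny sketch of the same path arithmetic (these are not Nomad's actual helpers, just the joins described above):

package main

import (
	"fmt"
	"path/filepath"
)

func main() {
	stateDir := "/opt/nomad/data/client"                           // client "State Directory"
	csiRoot := filepath.Join(stateDir, "csi")                      // CSI "root directory"
	mountPoint := filepath.Join(csiRoot, "controller", "aws-ebs0") // plugin type + plugin id
	socket := filepath.Join(mountPoint, "csi.sock")                // CSI socket name

	fmt.Println(socket) // /opt/nomad/data/client/csi/controller/aws-ebs0/csi.sock
}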

When I try sending some dummy data to the socket I get the following output:

$ echo "hello" | sudo nc -U /opt/nomad/data/client/csi/controller/aws-ebs0/csi.sock
nc: unix connect failed: Connection refused
nc: /opt/nomad/data/client/csi/controller/aws-ebs0/csi.sock: Connection refused

I've also validated that the csi.sock file is being created inside the docker container.

From here, I'm not sure what the issue is. It seems like the socket files are being created, but the two sides are unable to communicate over them for some reason.

tgross (Member) commented May 12, 2020

hi @KallynGowdy! Thanks so much for the thorough report; it definitely helped me eliminate a couple of obvious possibilities right away. That also means I don't have an immediate answer for you on this but I'll dig into it shortly and come back with any questions.

tgross (Member) commented May 12, 2020

Spoke too soon. 😀

I notice you're using the amazon/aws-ebs-csi-driver:latest tag. In #7701 I had to update our e2e tests to use the amazon/aws-ebs-csi-driver:v0.5.0 tag because the upstream EBS driver was broken on master. Could you give it a try with that specific tag and see if that works around the problem for you?

KallynGowdy (Author) commented

Hey @tgross! Thanks for the quick response!

I've swapped the amazon/aws-ebs-csi-driver:latest image for amazon/aws-ebs-csi-driver:v0.5.0 in both the controller and nodes job files. (see below)

This did not appear to change the behavior of the system. The plugin still gets stuck without being able to connect. After updating the jobs, I restarted Nomad again to see if that would change anything but it seems to have had no effect.

Is there some obscure Linux permissions issue that would prevent the two sides from communicating over the socket file?

On that line of thought, here's some output from ls and stat on the controller socket file:

$ sudo ls -la /opt/nomad/data/client/csi/controller/aws-ebs0
total 8
drwx------ 2 nomad nomad 4096 May 12 20:47 .
drwx------ 3 nomad nomad 4096 May 12 05:40 ..
srwxr-xr-x 1 root  root     0 May 12 20:47 csi.sock

$ sudo stat -c "%a %n" /opt/nomad/data/client/csi/controller/aws-ebs0/csi.sock
755 /opt/nomad/data/client/csi/controller/aws-ebs0/csi.sock

Seems like the file is owned by root and the permissions don't seem to allow other users to write to it.

So what happens if I do this?

$ sudo su
root@ip-10-0-0-205:/home/ubuntu# echo "hello" | nc -U /opt/nomad/data/client/csi/controller/aws-ebs0/csi.sock
���@^C

Looks like I'm able to establish a connection! What happens when I run as the nomad user?

$ sudo su nomad
$ echo "hello" | nc -U /opt/nomad/data/client/csi/controller/aws-ebs0/csi.sock
nc: unix connect failed: Permission denied
nc: /opt/nomad/data/client/csi/controller/aws-ebs0/csi.sock: Permission denied

And finally, when I change the file permissions of the csi.sock file to 777, I am able to see the plugin:

$ sudo chmod 777 /opt/nomad/data/client/csi/controller/aws-ebs0/csi.sock
$ nomad plugin status -address $NOMAD_ADDRESS -token $NOMAD_TOKEN
Container Storage Interface
ID        Provider         Controllers Healthy/Expected  Nodes Healthy/Expected
aws-ebs0  ebs.csi.aws.com  1/1                           0/0

Just for fun, I decided to try and run the amazon/aws-ebs-csi-driver:latest image again to see if it made any difference. It looks like the csi.sock file was recreated (which makes sense), so I had to fix the permissions on it again.

Once I did, the controller was recognized as healthy:

$ nomad plugin status -address $NOMAD_ADDRESS -token $NOMAD_TOKEN
Container Storage Interface
ID        Provider         Controllers Healthy/Expected  Nodes Healthy/Expected
aws-ebs0  ebs.csi.aws.com  1/1                           0/0

Once I also updated the permissions for the node socket file it showed up in the status report:

$ sudo chmod 777 /opt/nomad/data/client/csi/node/aws-ebs0/csi.sock
$ nomad plugin status -address $NOMAD_ADDRESS -token $NOMAD_TOKEN
Container Storage Interface
ID        Provider         Controllers Healthy/Expected  Nodes Healthy/Expected
aws-ebs0  ebs.csi.aws.com  1/1                           1/1

As for a permanent solution to this issue, is this something handled by Docker or Nomad? Who is in charge of creating that socket file?

For now, a workaround is to simply run Nomad as root but it would be nice to see this fixed.

KallynGowdy (Author) commented

After some more research, it appears the csi.sock file is being created by the aws-ebs-csi-driver when it sets up its gRPC listener.

Because of how Docker handles permissions with shared volumes, files created inside the container keep the container user's ownership (root, in this case) when they show up on the host.

I validated this by running the aws-ebs-csi-driver docker container and giving it a custom mount point:

$ docker run -d -v "/home/ubuntu/csi:/csi" amazon/aws-ebs-csi-driver:latest controller --endpoint=unix://csi/csi.sock

When we check the host folder we get:

$ ls -la /home/ubuntu/csi/
total 8
drwxr-xr-x 2 root   root   4096 May 13 04:08 .
drwxr-xr-x 6 ubuntu ubuntu 4096 May 13 04:08 ..
srwxr-xr-x 1 root   root      0 May 13 04:08 csi.sock

It looks like this has been a known issue in Docker for a while. They say the solution to the issue is to use local volumes and specify the correct permissions as part of the driver options. However, this won't allow both the host and the container to access the volume.

According to this comment the only way to fix the permissions issue is to take control of any container-created files manually. (like chown or chmod)

There's a related issue about being able to mount shared volumes with specific permissions inside the container, but some of the discussion relates to host permissions as well.

I'm not sure why Unix sockets are preferred as opposed to normal networking. Indeed, it is possible to set up the aws-ebs-csi-driver to use IP networking (e.g. --endpoint=tcp://0.0.0.0:4567), and it seems likely that other drivers support it as well. For some reason, the CSI specification requires Unix sockets, but the wording is confusing.

It states:

Only UNIX Domain Sockets MAY be used as endpoints. This will likely change in a future version of this specification to support non-UNIX platforms.

However, the use of "MAY" in the statement seems to conflict with the RFC used for technical terminology. Seems like a bug/typo.

In any case, I think a fix would be to ensure that the correct permissions are applied to the csi.sock file on the initial connection in the supervisorLoopOnce() function. It already runs a stat on the file so it wouldn't be too much of a stretch to do something like chown it. Especially considering that the file is expected to only be accessed by Nomad.
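
As a rough sketch of that idea (the helper name and wiring here are made up for illustration; this is not Nomad's actual code), the supervisor could do something like:

package main

import (
	"log"
	"os"
)

// chownPluginSocket is a hypothetical helper: once the socket shows up in the
// supervisor's stat check, take ownership of it so a non-root Nomad client can
// connect to it.
func chownPluginSocket(socketPath string, uid, gid int) error {
	fi, err := os.Stat(socketPath)
	if err != nil {
		return err
	}
	if fi.Mode()&os.ModeSocket == 0 {
		// The plugin may not have created the socket yet; try again later.
		return nil
	}
	return os.Chown(socketPath, uid, gid)
}

func main() {
	path := "/opt/nomad/data/client/csi/controller/aws-ebs0/csi.sock"
	if err := chownPluginSocket(path, os.Getuid(), os.Getgid()); err != nil {
		log.Fatalf("failed to take ownership of plugin socket: %v", err)
	}
}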

What do you think?

Extra Background

Nomad handles communication with the CSI driver by creating a Docker shared volume with the CSI driver container.
It is mapped to the /opt/nomad/data/client/csi/controller/aws-ebs0 folder by adding a mount configuration during the CSI Plugin Supervisor Prestart() hook. This mount configuration is then processed when creating the docker container config.
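
For reference, that bind mount is roughly equivalent to the following Docker SDK mount spec (illustrative only; Nomad's own structs and field names differ):

package main

import (
	"fmt"

	"github.com/docker/docker/api/types/mount"
)

func main() {
	// Host side: the per-plugin directory under Nomad's client state dir.
	// Container side: the csi_plugin mount_dir from the job file ("/csi").
	m := mount.Mount{
		Type:     mount.TypeBind,
		Source:   "/opt/nomad/data/client/csi/controller/aws-ebs0",
		Target:   "/csi",
		ReadOnly: false,
	}
	fmt.Printf("%+v\n", m)
}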

tgross (Member) commented May 13, 2020

For now, a workaround is to simply run Nomad as root but it would be nice to see this fixed.

Ah, this is what I was missing from the background. Generally speaking, we recommend that Nomad clients run as root. Servers can and should run as unprivileged users. (I'm re-reading that bit of documentation and it could be clearer.)

The Nomad client performs a lot of different operations that can only be run by root above and beyond talking to the Docker socket: configuring network namespaces, mount points, etc. If you never want to use the exec or java task driver, or the Connect integration, or the upcoming CNI integration, you might be able to get this to work if you use user namespace remapping in dockerd's configuration so that root in the plugin maps to nomad on the host. But you're likely to just kick the problem to another place.

I'm not sure why Unix sockets are preferred as opposed to normal networking

I can't speak for the CSI spec authors directly. But most likely because it creates an added complication when dealing with network namespaces between the orchestrator and the plugin. If the plugin uses the host namespace, then it's open to any other process running on the host. Whereas if it's a Unix socket the added difficulty you're experiencing here with file permissions is a security measure -- mounting is a privileged operation (requiring CAP_SYS_ADMIN) and shouldn't be available to arbitrary processes on the host.

In any case, I think a fix would be to ensure that the correct permissions are applied to the csi.sock file on the initial connection in the supervisorLoopOnce() function. It already runs a stat on the file so it wouldn't be too much of a stretch to do something like chown it. Especially considering that the file is expected to only be accessed by Nomad.

Hm, I feel like this could be a configuration footgun. If the Nomad user chowns the file, it won't necessarily be accessible by the plugin unless the user has been careful to map the user namespaces.

tgross added the theme/docs (Documentation issues and enhancements) label and removed the stage/needs-investigation label on May 13, 2020
tgross self-assigned this on May 13, 2020
KallynGowdy (Author) commented

The Nomad client performs a lot of different operations that can only be run by root above and beyond

Yes, which is why I configured the nomad user to be part of the sudo group. In this case, Nomad has superuser capabilities but is unable to access the csi.sock file because it wasn't expecting to have to use them.

Hm, I feel like this could be a configuration footgun. If the Nomad user chowns the file, it won't necessarily be accessible by the plugin unless the user has been careful to map the user namespaces.

I understand, and really what I want out of this is an easier way to debug the issue. I only care a little that I have to run the Nomad client as the root user instead of a sudo-capable normal user, but to me the biggest issue is that the only log from Nomad was:

[INFO]  client.alloc_runner.task_runner.task_hook.csi_client: finished client unary call: alloc_id=291cbde8-dbb7-b2f0-d4f7-9985258ed4ec plugin.name=aws-ebs0 plugin.type=node task=plugin grpc.code=Unavailable duration=104.191µs grpc.service=csi.v1.Identity grpc.method=Probe

This indicates that the Probe method was unavailable, but gives no mention of why. Because I was unfamiliar with the system, I had to dig around to discover that it was just a permissions issue.

No worries though. It was a good learning experience.

tgross (Member) commented May 13, 2020

I understand, and really what I want out of this is an easier way to debug the issue. I only care a little that I have to run the Nomad client as the root user instead of a sudo-capable normal user, but to me the biggest issue is that the only log from Nomad was:

Totally understood on that! This has been a bit of a thorn in my side because of #7424. But this is a good example of a case where Nomad could provide a better error message even if the plugin can't help us out.

tgross (Member) commented May 13, 2020

Ok so I think we've come to a resolution of the underlying problem. I'm going to add this error message case to #7424 (comment) so as to improve the experience going forward. Thanks again @KallynGowdy for the thorough reporting on this issue!

tgross closed this as completed on May 13, 2020
tgross removed the theme/docs (Documentation issues and enhancements) label on May 13, 2020
tgross added this to the 0.11.2 milestone on May 13, 2020
tgross (Member) commented May 15, 2020

I've just merged #7965 which will provide better error messages in the case where Nomad can't establish the initial connection. This will ship in 0.11.3.

tgross modified the milestones: 0.11.2, 0.11.3 on May 15, 2020
toopay commented Oct 28, 2020

Spent a day to figure out this one. Used v0.12.5 - yet no useful error occurred. Once I swap the default nomad user to root, the CSI works.

tgross (Member) commented Oct 28, 2020

Spent a day to figure out this one. Used v0.12.5 - yet no useful error occurred. Once I swap the default nomad user to root, the CSI works.

That's a totally different error unfortunately, which is that you're not running the Nomad client under the appropriate level of privilege to do much at all. See the installation guide: https://www.nomadproject.io/docs/install/production/nomad-agent#permissions

github-actions (bot) commented

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked this as resolved and limited conversation to collaborators on Oct 29, 2022