Nomad NFS CSI Integration doesn't work with exec / containerd-driver #19165

Closed
116davinder opened this issue Nov 23, 2023 · 6 comments

Comments

@116davinder

Nomad version

Output from nomad version
Nomad server: 1.6.3
Nomad Client: 1.6.3

Operating system and Environment details

Ubuntu 20.04

Issue

NFS CSI Volume Mount Fails

failed to setup alloc: pre-run hook "csi_hook" failed: mounting volumes: rpc error: code = Unknown desc = Exception calling application: [Errno 2] No such file or directory: '/local/csi/per-alloc/d992204d-637c-2cde-2ba7-4633fd7688e1/kerberos-backup/rw-file-system-multi-node-multi-writer'

What I haven't understood so far is why Nomad asks for '/local/csi/per-alloc/d992204d-637c-2cde-2ba7-4633fd7688e1/kerberos-backup/rw-file-system-multi-node-multi-writer' instead of the /mnt/backups destination I set in the job spec.

Any pointers will be much appreciated.

Reproduction steps

  • Step-1 NFS Controller / Node Job --- Working
variable "nfs_server_address" {
  type = string
  default = "random-nfs-server-address"
}

variable "nfs_server_path" {
  type = string
  default = "/backup/nomad-dev-dynamic-volumes"
  description = "this path should exist in the nfs"
}

variable "controller_count" {
  type = number
  default = 1
}

job "csi-nfs-controller" {

  # remove the constraint
  constraint {
    attribute = "${attr.unique.hostname}"
    value     = "dev-kdc01"
  }

  group "nfs" {
    count = var.controller_count

    task "controller" {

      driver = "containerd-driver"

      csi_plugin {
        id   = "rocketduck-nfs"
        type = "monolith"
        mount_dir              = "/csi"
        health_timeout         = "30s"
        stage_publish_base_dir = "/local/csi"
      }

      config {
        image = "registry.gitlab.com/rocketduck/csi-plugin-nfs:0.7.0"
        args = [
          "--type=monolith",
          "--endpoint=${CSI_ENDPOINT}", # provided by csi_plugin{}
          "--node-id=${attr.unique.hostname}",
          "--nfs-server=${var.nfs_server_address}:${var.nfs_server_path}",
          "--mount-options=rw,mountproto=tcp,nfsvers=3,rsize=1048576,wsize=1048576,namlen=255,soft,retrans=5,relatime,nolock",
          "--allow-nested-volumes",
          "--log-level=DEBUG",
        ]
        privileged = true
        host_network = true
        cap_add = [
          "CAP_SYS_ADMIN",
          "CAP_CHOWN",
          "CAP_SYS_CHROOT"
        ]
      }
    }
  }
}
  • Step-2 Create Volume HCL ---- working
id = "kerberos-backup"
namespace = "default"
name = "kerberos-backup"
type = "csi"
plugin_id = "rocketduck-nfs"

capability {
  access_mode     = "multi-node-multi-writer"
  attachment_mode = "file-system"
}

parameters {
  mode = "777"
}

mount_options {
  fs_type = "ext4"
}
$ mount -l | grep tmpiumip_43
random-nfs-server:/backup on /tmp/tmpiumip_43 type nfs (rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=10.79.252.236,mountvers=3,mountport=300,mountproto=tcp,local_lock=none,addr=10.79.252.236)
$ ls -lh /tmp/tmpiumip_43/nomad-dev-dynamic-volumes/
total 2.5K
drwxr-xr-x 2 root root 0 Nov 23 22:11 kerberos-backup
  • Step-3 Attach the volume using exec/containerd driver based job ---- not working
job "kerberos" {

  datacenters = ["*"]
  type        = "service"


  group "primary" {

    constraint {
      attribute = "${attr.unique.hostname}"
      value     = "dev-kdc01"
    }

    volume "backup" {
      type            = "csi"
      source          = "kerberos-backup"
      read_only       = false
      attachment_mode = "file-system"
      access_mode     = "multi-node-multi-writer"
      per_alloc       = false
    }

    task "test" {

      driver = "exec"

      config {
        command = "sleep"
        args  = ["infinity"]
        cap_add = ["sys_chroot"]
      }

      resources {
        cpu    = 10
        memory = 32
      }

      volume_mount {
        volume      = "backup"
        destination = "/mnt/backups"
        propagation_mode = "bidirectional"
      }
    }
  }
}

Expected Result

Volume mount inside the exec driver chroot or job folder.

Actual Result

failed to setup alloc: pre-run hook "csi_hook" failed: mounting volumes: rpc error: code = Unknown desc = Exception calling application: [Errno 2] No such file or directory: '/local/csi/per-alloc/d992204d-637c-2cde-2ba7-4633fd7688e1/kerberos-backup/rw-file-system-multi-node-multi-writer'

Nomad Server logs (if appropriate)

N/A

Nomad Client logs (if appropriate)

{"@level":"trace","@message":"running pre-run hook","@module":"client.alloc_runner","@timestamp":"2023-11-23T22:34:11.986144Z","alloc_id":"d992204d-637c-2cde-2ba7-4633fd7688e1","name":"csi_hook","start":"2023-11-23T22:34:11.986142679Z"}
{"@level":"debug","@message":"found CSI plugin","@module":"client.alloc_runner.runner_hook.csi_hook","@timestamp":"2023-11-23T22:34:11.992235Z","alloc_id":"d992204d-637c-2cde-2ba7-4633fd7688e1","name":"rocketduck-nfs","type":"csi-node"}
{"@level":"info","@message":"finished client unary call","@module":"client.csi_manager.rocketduck-nfs","@timestamp":"2023-11-23T22:34:11.994634Z","duration":1867970,"grpc.code":2,"grpc.method":"NodePublishVolume","grpc.service":"csi.v1.Node"}
{"@level":"trace","@message":"finished pre-run hooks","@module":"client.alloc_runner","@timestamp":"2023-11-23T22:34:11.994945Z","alloc_id":"d992204d-637c-2cde-2ba7-4633fd7688e1","duration":15129701,"end":"2023-11-23T22:34:11.994944943Z"}
{"@level":"error","@message":"prerun failed","@module":"client.alloc_runner","@timestamp":"2023-11-23T22:34:11.995221Z","alloc_id":"d992204d-637c-2cde-2ba7-4633fd7688e1","error":"pre-run hook \"csi_hook\" failed: mounting volumes: rpc error: code = Unknown desc = Exception calling application: [Errno 2] No such file or directory: '/local/csi/per-alloc/d992204d-637c-2cde-2ba7-4633fd7688e1/kerberos-backup/rw-file-system-multi-node-multi-writer'"}
{"@level":"info","@message":"Task event","@module":"client.alloc_runner.task_runner","@timestamp":"2023-11-23T22:34:11.995494Z","alloc_id":"d992204d-637c-2cde-2ba7-4633fd7688e1","failed":true,"msg":"failed to setup alloc: pre-run hook \"csi_hook\" failed: mounting volumes: rpc error: code = Unknown desc = Exception calling application: [Errno 2] No such file or directory: '/local/csi/per-alloc/d992204d-637c-2cde-2ba7-4633fd7688e1/kerberos-backup/rw-file-system-multi-node-multi-writer'","task":"40-setup-kerberos-db","type":"Setup Failure"}
{"@level":"trace","@message":"next heartbeat","@module":"client","@timestamp":"2023-11-23T22:34:11.996121Z","period":10791125118}
{"@level":"info","@message":"Task event","@module":"client.alloc_runner.task_runner","@timestamp":"2023-11-23T22:34:11.997186Z","alloc_id":"d992204d-637c-2cde-2ba7-4633fd7688e1","failed":true,"msg":"failed to setup alloc: pre-run hook \"csi_hook\" failed: mounting volumes: rpc error: code = Unknown desc = Exception calling application: [Errno 2] No such file or directory: '/local/csi/per-alloc/d992204d-637c-2cde-2ba7-4633fd7688e1/kerberos-backup/rw-file-system-multi-node-multi-writer'","task":"kdc","type":"Setup Failure"}
{"@level":"trace","@message":"handling task state update","@module":"client.alloc_runner","@timestamp":"2023-11-23T22:34:11.997742Z","alloc_id":"d992204d-637c-2cde-2ba7-4633fd7688e1","done":false}
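The failing call can be picked out of JSON client logs like the ones above with a few lines of Python. This is only a sketch: the field names (`@level`, `grpc.code`, `grpc.method`, `grpc.service`) are taken from the log lines in this report, the helper name `failed_grpc_calls` is made up for illustration, and the two sample lines are shortened copies of the real ones. gRPC status code 2 is `UNKNOWN`, which matches the `code = Unknown` in the error message.

```python
import json

# Shortened copies of two client log lines from this report.
SAMPLE_LOG = """\
{"@level":"debug","@message":"found CSI plugin","name":"rocketduck-nfs","type":"csi-node"}
{"@level":"info","@message":"finished client unary call","grpc.code":2,"grpc.method":"NodePublishVolume","grpc.service":"csi.v1.Node"}
"""


def failed_grpc_calls(log_text):
    """Return (service, method, code) for every log line with a non-zero grpc.code."""
    failures = []
    for line in log_text.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip anything that is not a JSON log line
        if not isinstance(entry, dict):
            continue
        code = entry.get("grpc.code", 0)
        if code != 0:
            failures.append((entry.get("grpc.service"), entry.get("grpc.method"), code))
    return failures


if __name__ == "__main__":
    for service, method, code in failed_grpc_calls(SAMPLE_LOG):
        print(f"{service}/{method} failed with gRPC code {code}")
```

In this report the only failing call is `NodePublishVolume`, so the plugin's own logs for that call are the place to look next.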

NFS Controller/Node Logs (if appropriate)

2023-11-23 22:56:11,062:DEBUG:csi:Executing method '/csi.v1.Node/NodeGetCapabilities', with request:

2023-11-23 22:56:11,062:DEBUG:csi:Finished execution of method '/csi.v1.Node/NodeGetCapabilities', with response:

2023-11-23 22:56:20,694:DEBUG:csi:Executing method '/csi.v1.Node/NodePublishVolume', with request:
  volume_id: "kerberos-backup"
  target_path: "/local/csi/per-alloc/c675ac86-215a-7487-1522-a1416112f087/kerberos-backup/rw-file-system-multi-node-multi-writer"
  volume_capability {
    mount {
      fs_type: "ext4"
    }
    access_mode {
      mode: MULTI_NODE_MULTI_WRITER
    }
  }
  volume_context {
    key: "mode"
    value: "777"
  }

2023-11-23 22:56:20,694:INFO:node:Received mount request for 'kerberos-backup' at '/local/csi/per-alloc/c675ac86-215a-7487-1522-a1416112f087/kerberos-backup/rw-file-system-multi-node-multi-writer'
2023-11-23 22:56:20,696:DEBUG:csi:Finished execution of method '/csi.v1.Node/NodePublishVolume'
2023-11-23 22:56:20,696:ERROR:grpc._server:Exception calling application: [Errno 2] No such file or directory: '/local/csi/per-alloc/c675ac86-215a-7487-1522-a1416112f087/kerberos-backup/rw-file-system-multi-node-multi-writer'
Traceback (most recent call last):
  File "/opt/python/lib/python3.11/site-packages/grpc/_server.py", line 494, in _call_behavior
    response_or_iterator = behavior(argument, context)
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/python/lib/python3.11/site-packages/grpc_interceptor/server.py", line 63, in invoke_intercept_method
    return self.intercept(
           ^^^^^^^^^^^^^^^
  File "/opt/python/lib/python3.11/site-packages/csi_plugin_nfs/interceptor.py", line 21, in intercept
    response = method(request, context)
               ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/python/lib/python3.11/site-packages/csi_plugin_nfs/validators.py", line 45, in inner
    return func(self, request, context)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/python/lib/python3.11/site-packages/csi_plugin_nfs/node.py", line 32, in NodePublishVolume
    os.mkdir(request.target_path)
FileNotFoundError: [Errno 2] No such file or directory: '/local/csi/per-alloc/c675ac86-215a-7487-1522-a1416112f087/kerberos-backup/rw-file-system-multi-node-multi-writer'
2023-11-23 22:56:20,704:DEBUG:csi:Executing method '/csi.v1.Node/NodeUnpublishVolume', with request:
  volume_id: "kerberos-backup"
  target_path: "/local/csi/per-alloc/c675ac86-215a-7487-1522-a1416112f087/kerberos-backup/rw-file-system-multi-node-multi-writer"

2023-11-23 22:56:20,705:INFO:node:Received unmount request for 'kerberos-backup' at '/local/csi/per-alloc/c675ac86-215a-7487-1522-a1416112f087/kerberos-backup/rw-file-system-multi-node-multi-writer'
2023-11-23 22:56:20,705:WARNING:node:Target path '/local/csi/per-alloc/c675ac86-215a-7487-1522-a1416112f087/kerberos-backup/rw-file-system-multi-node-multi-writer' does not exist for 'kerberos-backup'
2023-11-23 22:56:20,705:DEBUG:csi:Finished execution of method '/csi.v1.Node/NodeUnpublishVolume', with response:

Other Information / References

  1. https://github.com/hashicorp/nomad/tree/main/demo/csi/nfs
  2. https://gitlab.com/rocketduck/csi-plugin-nfs
@116davinder
Author

I have checked the code for rocketduck/csi-plugin-nfs and it is dead simple: it expects the parent of the mount directory to already exist, but it receives target_path: "/local/csi/per-alloc/c675ac86-215a-7487-1522-a1416112f087/kerberos-backup/rw-file-system-multi-node-multi-writer". Since it doesn't create directories recursively, it fails with FileNotFoundError, which is the expected behaviour of Python's os.mkdir:

os.mkdir(path, mode=0o777, *, dir_fd=None)
Create a directory named path with numeric mode mode.
If the directory already exists, FileExistsError is raised. If a parent directory in the path does not exist, FileNotFoundError is raised.
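The difference is easy to reproduce outside the plugin. The sketch below is not the plugin's code: the function names are made up, and it uses a temporary directory with a placeholder allocation ID to show that `os.mkdir` fails when parent directories are missing, while `os.makedirs(..., exist_ok=True)` creates the whole chain. Whether the plugin should create parents itself or rely on the base directory already existing is not settled in this thread.

```python
import os
import tempfile


def try_mkdir(target):
    """Mimic the failing call: os.mkdir does not create missing parents."""
    try:
        os.mkdir(target)
        return "created"
    except FileNotFoundError:
        return "parent missing"


def mkdir_recursive(target):
    """Create target and any missing parents; succeed if it already exists."""
    os.makedirs(target, exist_ok=True)
    return os.path.isdir(target)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as base:
        # Same shape as the target_path in the logs, with a placeholder alloc ID.
        target = os.path.join(
            base,
            "per-alloc",
            "example-alloc-id",
            "kerberos-backup",
            "rw-file-system-multi-node-multi-writer",
        )
        print(try_mkdir(target))        # parent missing
        print(mkdir_recursive(target))  # True
```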

@116davinder
Author

I managed to resolve this error by setting stage_publish_base_dir = "/tmp/csi". For some reason the containerd driver doesn't allow creating directories under /local/csi, and that's why the CSI plugin fails.

Flow of mount process

  1. The volume is created (by the CSI controller only).
  2. When a job starts, the CSI node plugin mounts the NFS path inside its own container under stage_publish_base_dir.
  3. The mount is then made available to the task, depending on the driver the job uses.
    Example: docker driver
        "Mounts": [
            {
                "Type": "bind",
                "Source": "/opt/nomad/data/client/csi/monolith/rocketduck-nfs/per-alloc/77c4d1d8-0fb7-83e0-f35a-5c2cf35c35d7/kerberos-backup/rw-file-system-multi-node-multi-writer",
                "Destination": "/alloc/backups",
                "Mode": "",
                "RW": true,
                "Propagation": "rprivate"
            }
        ]

Example: exec driver. I don't know yet how it binds the NFS path from the plugin container, because I can't get the NFS mount working yet.
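The target_path in the plugin logs is a path inside the plugin's container, not the task's volume_mount destination, which is why /mnt/backups never shows up in the error. Judging only from the paths logged in this thread, it appears to be built from stage_publish_base_dir, the allocation ID, the volume ID, and a usage string. The sketch below reproduces that layout; it is a reconstruction from the logs, not Nomad's implementation. The function name is made up, and the `ro` prefix for read-only volumes is an assumption.

```python
import posixpath


def publish_target_path(base_dir, alloc_id, volume_id,
                        attachment_mode, access_mode, read_only=False):
    """Rebuild the per-alloc target path layout seen in the plugin logs."""
    # "rw" is what the logs show for read_only = false; "ro" is assumed.
    usage = "-".join(["ro" if read_only else "rw", attachment_mode, access_mode])
    return posixpath.join(base_dir, "per-alloc", alloc_id, volume_id, usage)


if __name__ == "__main__":
    print(publish_target_path(
        "/local/csi",
        "c675ac86-215a-7487-1522-a1416112f087",
        "kerberos-backup",
        "file-system",
        "multi-node-multi-writer",
    ))
```

With base_dir set to "/tmp/csi", the same call yields the path under /tmp/csi that the workaround above relies on.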

@116davinder
Author

As of now, when I run the CSI plugin with containerd-driver, the NFS mount is not exposed to the host system, although the NFS path is mounted within the plugin container. With the docker driver, the NFS mount is visible both inside and outside the container.
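One way to check this on a client node is to look at the mount's propagation flags in `/proc/self/mountinfo`: a mount tagged `shared:N` propagates to its peer group, while a mount with no optional fields is private. The sketch below parses mountinfo-formatted text. The function name and the sample lines are made up for illustration, and nothing in this thread confirms that mount propagation is the cause of the containerd-driver behaviour.

```python
# Hypothetical sample in /proc/self/mountinfo format (not taken from the report).
SAMPLE_MOUNTINFO = """\
36 25 0:32 / /tmp/csi/example rw,relatime shared:12 - nfs server:/backup rw,vers=3
37 25 0:33 / /tmp/private rw,relatime - nfs server:/backup rw,vers=3
"""


def mount_propagation(mountinfo_text, mount_point):
    """Return the optional (propagation) fields for mount_point.

    Returns None if mount_point is not listed, and an empty list if the
    mount is listed with no optional fields (a private mount).
    """
    result = None
    for line in mountinfo_text.splitlines():
        fields = line.split()
        try:
            # Optional fields sit between field 6 and the "-" separator.
            sep = fields.index("-", 6)
        except ValueError:
            continue  # not a well-formed mountinfo line
        if fields[4] != mount_point:
            continue
        result = fields[6:sep]  # the last matching entry wins
    return result


if __name__ == "__main__":
    # On a Linux host, pass open("/proc/self/mountinfo").read() instead.
    print(mount_propagation(SAMPLE_MOUNTINFO, "/tmp/csi/example"))
    print(mount_propagation(SAMPLE_MOUNTINFO, "/tmp/private"))
```

Running the same check inside the plugin container and on the host would show whether the NFS mount is present in both mount namespaces.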

@lgfa29
Contributor

lgfa29 commented Nov 24, 2023

Hi @116davinder 👋

Thanks for the report and the detailed info. Just so I understand the status here, is this a fair summary of things?

  1. You were initially unable to run the rocketduck/csi-plugin-nfs CSI plugin as a containerd task. Changing the value for csi_plugin.stage_publish_base_dir to a path outside the local directory fixed the problem.
  2. A task running the exec driver is not able to mount a CSI volume.

For some reason containerd driver doesn't allow creating folders at /local/csi and that's why csi-plugin fails.

Is https://github.com/Roblox/nomad-driver-containerd the plugin you're using? If so, that's a community plugin that I don't know well enough to provide any guidance on. Perhaps you could open an issue in that repo? Another thing to try would be to use the NOMAD_TASK_DIR environment variable instead of hardcoding /local.

@116davinder
Author

Since I use containerd and exec a lot in my stack, I am blocked by these missing features / issues.

Early Notes for Docker Driver with CSI NFS Plugin

  1. I can run CSI plugins fine.
  2. I can see NFS mounts being exposed to other drivers such as exec/docker.
  3. Missing piece: I saw "permissions for CSI volume mounted to exec driver don't allow task's user" #15540, but ignored it since I was focused on getting containerd working. Now I will try the docker driver one more time.

Lastly, I agree that the containerd-related issue should be moved to the Roblox/nomad-driver-containerd repo.

@116davinder
Author

I am closing this issue, since only the docker driver is supported and working with CSI NFS plugins.
