Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

csi: Task unable to run with csi_hook error "Device path not provided" #10432

Closed
jrlonan-gt opened this issue Apr 23, 2021 · 16 comments
Closed

Comments

@jrlonan-gt
Copy link

jrlonan-gt commented Apr 23, 2021

Nomad version

Nomad v1.0.2 (4c1d4fc)

Operating system and Environment details

  • RedHat Linux 4.18.0-240.10.1.el8_3.x86_64
  • amazon/aws-ebs-csi-driver v0.9.0

Issue

Stateful job fails with error " failed to setup alloc: pre-run hook "csi_hook" failed: rpc error: code = InvalidArgument desc = Device path not provided". Nomad cluster is running in AWS using EBS volumes and CSI plugin v0.9.0

Reproduction steps

Re-run jobspec with the same jobspec several times.

Expected Result

Task running

Actual Result

Task failed to run with the following error
Time Type Description
2021-04-23T10:34:52+08:00 Setup Failure failed to setup alloc: pre-run hook "csi_hook" failed: rpc error: code = InvalidArgument desc = Device path not provided

Job file (if appropriate)

task "prometheus" {
      driver = "docker"

      resources {
        cpu    = 250
        memory = 512
      }

      config {
        image      = "prom/prometheus:v2.25.0"
        force_pull = true
        ports = ["prometheus_ui"]
        network_mode = "host"


        mount {
          type = "volume"
          target = "/prometheus"
          source = "prometheus_2"
          readonly = false
        }

Nomad Server logs (if appropriate)

N/A

Nomad Client logs (if appropriate)

I0423 02:34:52.458147 1 node.go:234] NodeUnstageVolume: /csi/staging/prometheus_2/rw-file-system-single-node-writer target not mounted

I0423 02:34:52.444934 1 node.go:100] NodeStageVolume: called with args {VolumeId:vol-0460d38250e2ce4e8 PublishContext:map[] StagingTargetPath:/csi/staging/prometheus_2/rw-file-system-single-node-writer VolumeCapability:mount:<fs_type:"ext4" > access_mode:<mode:SINGLE_NODE_WRITER > Secrets:map[] VolumeContext:map[] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0} 

E0423 02:34:52.445112 1 driver.go:115] GRPC error: rpc error: code = InvalidArgument desc = Device path not provided 

I0423 02:34:52.457871 1 node.go:210] NodeUnstageVolume: called with args {VolumeId:vol-0460d38250e2ce4e8 StagingTargetPath:/csi/staging/prometheus_2/rw-file-system-single-node-writer XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}

Output: umount: /csi/per-alloc/551e73ca-adc7-2423-2515-5bba10d2065a/prometheus_2/rw-file-system-single-node-writer: no mount point specified.

Unmounting arguments: /csi/per-alloc/551e73ca-adc7-2423-2515-5bba10d2065a/prometheus_2/rw-file-system-single-node-writer

E0423 02:34:52.457340 1 driver.go:115] GRPC error: rpc error: code = Internal desc = Could not unmount "/csi/per-alloc/551e73ca-adc7-2423-2515-5bba10d2065a/prometheus_2/rw-file-system-single-node-writer": unmount failed: exit status 32

I0423 02:34:52.449659 1 mount_linux.go:202] Unmounting /csi/per-alloc/551e73ca-adc7-2423-2515-5bba10d2065a/prometheus_2/rw-file-system-single-node-writer

I0423 02:34:52.449651 1 node.go:378] NodeUnpublishVolume: unmounting /csi/per-alloc/551e73ca-adc7-2423-2515-5bba10d2065a/prometheus_2/rw-file-system-single-node-write

I0423 02:34:52.449608 1 node.go:367] NodeUnpublishVolume: called with args {VolumeId:vol-0460d38250e2ce4e8 TargetPath:/csi/per-alloc/551e73ca-adc7-2423-2515-5bba10d2065a/prometheus_2/rw-file-system-single-node-writer XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0} 
@jrlonan-gt jrlonan-gt changed the title csi: csi: Task unable to run with csi_hook error "Device path not provided" Apr 23, 2021
@tgross
Copy link
Member

tgross commented Apr 23, 2021

Hi @jrlonan-gt! The relevant error seems to be this bit:

I0423 02:34:52.457871 1 node.go:210] NodeUnstageVolume: called with args {VolumeId:vol-0460d38250e2ce4e8 StagingTargetPath:/csi/staging/prometheus_2/rw-file-system-single-node-writer XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
Output: umount: /csi/per-alloc/551e73ca-adc7-2423-2515-5bba10d2065a/prometheus_2/rw-file-system-single-node-writer: no mount point specified.
Unmounting arguments: /csi/per-alloc/551e73ca-adc7-2423-2515-5bba10d2065a/prometheus_2/rw-file-system-single-node-writer

Can you provide the volume specification you're using here?

@jrlonan-gt
Copy link
Author

@tgross

Can you provide the volume specification you're using here?

AWS EBS gp2 volume
500 GB
Filesystem: ext4
Access mode: single-node-writer
Attachment Mode: file-system

@jrlonan-gt
Copy link
Author

jrlonan-gt commented Apr 26, 2021

FYI, i applied changes to the config and updated the Nomad jobspec for my other job that has identical config (it was working fine), and encountered the same issue as above.

Were there any changes to the plugin within the last 2 weeks or so?

@durango
Copy link
Contributor

durango commented Apr 26, 2021

I have had this same problem on 1.0.2. Tried upgrading to 1.0.4 but still the same error. From the plugin https://github.com/kubernetes-sigs/aws-ebs-csi-driver I've tried versions: v0.9.1 to latest.

Following the mysql examples nomad https://learn.hashicorp.com/tutorials/nomad/stateful-workloads-csi-volumes?in=nomad/stateful-workloads

with no luck still

This link shows where we're hitting the error, I have not debugged the CSI plugin yet

https://discuss.hashicorp.com/t/ebs-csi-driver-csi-hook-failed-device-path-not-provided/22845/6

@jrlonan-gt
Copy link
Author

Checking the logs, my ebs-controller and ebs-nodes allocations were in a bad state, as at one point the containers were restarted but there was no Plugin became healthy plugin: aws-ebs0 message in the Event logs. Restarting the plugin controller and node jobs fixes the issue for 1 of the jobs

I keep seeing these error logs I0427 02:07:32.822804 1 controller.go:334] ControllerGetCapabilities: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0} before and after restarting the controller

@jessequinn
Copy link

jessequinn commented Apr 27, 2021

@jrlonan-gt same thing. I can restart the controller and nodes and possibly get ONE of my jobs to work when mounting a EBS volume. However, this doesnt always work. I have the same error logs that you mention for all my nodes.

csi plugin 0.10.1
nomad 1.0.4

What changed recently that caused this?

controller messages

I0427 04:46:56.632120       1 controller.go:360] ControllerGetCapabilities: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0427 04:47:26.632782       1 controller.go:360] ControllerGetCapabilities: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0427 04:47:56.634103       1 controller.go:360] ControllerGetCapabilities: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0427 04:48:26.634813       1 controller.go:360] ControllerGetCapabilities: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}

node messages

I0427 04:46:04.780411       1 node.go:498] NodeGetCapabilities: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0427 04:46:34.781220       1 node.go:498] NodeGetCapabilities: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0427 04:47:04.781932       1 node.go:498] NodeGetCapabilities: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0427 04:47:34.782659       1 node.go:498] NodeGetCapabilities: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0427 04:48:04.784919       1 node.go:498] NodeGetCapabilities: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0427 04:46:34.681205       1 node.go:498] NodeGetCapabilities: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0427 04:47:04.681874       1 node.go:498] NodeGetCapabilities: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0427 04:47:34.683278       1 node.go:498] NodeGetCapabilities: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0427 04:47:50.678816       1 node.go:105] NodeStageVolume: called with args {VolumeId:vol-0998a88f677e69072 PublishContext:map[] StagingTargetPath:/csi/staging/hml_redis/rw-file-system-single-node-writer VolumeCapability:mount:<> access_mode:<mode:SINGLE_NODE_WRITER >  Secrets:map[] VolumeContext:map[] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0427 04:47:50.678878       1 node.go:157] NodeStageVolume: volume="vol-0998a88f677e69072" operation finished
I0427 04:47:50.678888       1 inflight.go:69] Node Service: volume="vol-0998a88f677e69072" operation finished
E0427 04:47:50.678893       1 driver.go:119] GRPC error: rpc error: code = InvalidArgument desc = Device path not provided
I0427 04:47:50.681986       1 node.go:407] NodeUnpublishVolume: called with args {VolumeId:vol-0998a88f677e69072 TargetPath:/csi/per-alloc/5b6cf0f1-0250-8f1e-9388-0adde2d702e0/hml_redis/rw-file-system-single-node-writer XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0427 04:47:50.682023       1 node.go:425] NodeUnpublishVolume: unmounting /csi/per-alloc/5b6cf0f1-0250-8f1e-9388-0adde2d702e0/hml_redis/rw-file-system-single-node-writer
I0427 04:47:50.683521       1 node.go:421] NodeUnPublishVolume: volume="vol-0998a88f677e69072" operation finished
I0427 04:47:50.683538       1 inflight.go:69] Node Service: volume="vol-0998a88f677e69072" operation finished
E0427 04:47:50.683543       1 driver.go:119] GRPC error: rpc error: code = Internal desc = Could not unmount "/csi/per-alloc/5b6cf0f1-0250-8f1e-9388-0adde2d702e0/hml_redis/rw-file-system-single-node-writer": unmount failed: exit status 32
Unmounting arguments: /csi/per-alloc/5b6cf0f1-0250-8f1e-9388-0adde2d702e0/hml_redis/rw-file-system-single-node-writer
Output: umount: /csi/per-alloc/5b6cf0f1-0250-8f1e-9388-0adde2d702e0/hml_redis/rw-file-system-single-node-writer: no mount point specified.
I0427 04:47:50.683868       1 node.go:236] NodeUnstageVolume: called with args {VolumeId:vol-0998a88f677e69072 StagingTargetPath:/csi/staging/hml_redis/rw-file-system-single-node-writer XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0427 04:47:50.684090       1 node.go:268] [Debug] NodeUnstageVolume: /csi/staging/hml_redis/rw-file-system-single-node-writer target not mounted
I0427 04:47:50.684097       1 node.go:251] NodeUnStageVolume: volume="vol-0998a88f677e69072" operation finished
I0427 04:47:50.684103       1 inflight.go:69] Node Service: volume="vol-0998a88f677e69072" operation finished
I0427 04:48:04.684056       1 node.go:498] NodeGetCapabilities: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0427 04:48:34.685577       1 node.go:498] NodeGetCapabilities: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0427 04:49:04.686269       1 node.go:498] NodeGetCapabilities: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0427 04:49:34.687335       1 node.go:498] NodeGetCapabilities: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}

and example of a volume hcl

# volume registration
type = "csi"
id = "hml_redis"
name = "hml_redis"
external_id = "vol-0998a88f677e69072"
access_mode = "single-node-writer"
attachment_mode = "file-system"
plugin_id = "aws-ebs0"

@jrlonan-gt
Copy link
Author

jrlonan-gt commented Apr 27, 2021

@jessequinn For the other job that still failed after I restarted controller and all the node plugins, it returned a different error message (which i didn't capture, unfortunately, but implies the volume is not attached to the correct EC2 instance where the job runs), so I had to manually detach the volume from the Nomad client via AWS console and re-create the Nomad job, and it successfully re-attach the EBS volume to the allocation now.

Still would like to know the root cause, and hoping for a more stable release 🙏

@jessequinn
Copy link

@jrlonan-gt are you suggesting that the issue comes from the fact that the EBS volumes have not properly been removed from a previous attachment? That could make sense. let me get our cloud team to confirm that the EBSs arent still attached in some manner. However, it is weird that would be the case considering the volumes in nomad are "schedulable".

@jessequinn
Copy link

@jrlonan-gt btw, i cannot get any EBS volume to work now, even with restarting nodes and controllers, WHEN i have more than 1 job using a volume (not the same volumes).

@jessequinn
Copy link

FYI, i applied changes to the config and updated the Nomad jobspec for my other job that has identical config (it was working fine), and encountered the same issue as above.

Were there any changes to the plugin within the last 2 weeks or so?

like @durango i tested 0.9.0 to 0.10.1. all the same issues. we tested against nomad 1.0.0 and 1.0.4 same thing.

@jessequinn
Copy link

jessequinn commented Apr 27, 2021

in our situation we rebooted the machine and nomad/csi_plugin lost those connections; however, the EBSs were still attached. Detaching them as suggested by @jrlonan-gt worked.

@durango
Copy link
Contributor

durango commented Apr 27, 2021

I have rotated machines through AWS' ASG as well as manually attaching and detaching the EBS drive from each server and client throughout my cluster. Still same issue. This is also a new volume registered under a unique/different ID and external ID. I have had this same issue since 0.11.0.

@jessequinn
Copy link

I had to restart controllers and nodes, remove volumes and jobs from nomad, and do nomad system gc. In addition, detaching the EBSs.Once completed it all worked out.

@durango
Copy link
Contributor

durango commented Apr 27, 2021

I had to do the following to get this to work today:

  • Detach and deregister all drives from CSI
  • Remove all jobs + tasks related to CSI with -purge
  • nomad system gc
  • [Important] register the controller + nodes with a new, ever used before:
    • plugin ID
    • csi directory (and create this new directory within each client)
    • add user = "root" to the task stanza
    • add AWS_REGION=us-east-1 to the task's env stanza.. I was getting a "metadata not found" error within the logs
  • setup a new volume with a name, ID, and external ID that has never been introduced into the cluster before

If I used the previous plugin ID or CSI directory, I would run into the same issue (despite detaching/deregistering/purging jobs/etc). Hope this helps others down the road

@tgross tgross self-assigned this May 19, 2021
@tgross tgross removed their assignment May 19, 2021
@tgross tgross added the stage/needs-verification Issue needs verifying it still exists label Feb 3, 2022
@tgross
Copy link
Member

tgross commented Feb 3, 2022

Following up on this unfortunately-neglected issue with a few notes as I start on related work:

I'm going to close this one out but if we have similar issue come back up post-1.2.5 please feel free to comment or reopen with more info.

@tgross tgross closed this as completed Feb 3, 2022
@tgross tgross removed the stage/needs-verification Issue needs verifying it still exists label Feb 3, 2022
@github-actions
Copy link

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 11, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
Development

No branches or pull requests

5 participants