Nomad version

Nomad Server and Clients are both running the following build:

Nomad v0.11.1 (b43457070037800fcc8442c8ff095ff4005dab33)

Operating system and Environment details

Amazon Linux 2:
4.14.173-137.229.amzn2.x86_64

Issue
While running the EBS CSI plugin I have noticed that Nomad expects plugin tasks that have completed to still report as healthy:
$ nomad plugin status aws-ebs4
ID = aws-ebs4
Provider = ebs.csi.aws.com
Version = v0.6.0-dirty
Controllers Healthy = 1
Controllers Expected = 2
Nodes Healthy = 3
Nodes Expected = 4
Allocations
ID Node ID Task Group Version Desired Status Created Modified
738adb4b 46e6db9e controller 4 run running 33m59s ago 31m16s ago
a999a840 4470dc51 controller 3 stop complete 32m4s ago 31m25s ago
9290e85e 46e6db9e nodes 0 run running 42m23s ago 42m16s ago
eed3459a ec4c06b3 nodes 0 stop complete 42m23s ago 35m8s ago
d9ecfc6b 4470dc51 nodes 0 run running 42m23s ago 42m8s ago
ad2698aa eaac2f32 nodes 0 run running 37m49s ago 37m31s ago
This seems unusual, since a CSI plugin task that has completed should no longer be expected to be running and healthy. When this mismatch between healthy and expected plugin task counts occurs, all tasks that need to attach a CSI volume using the plugin in question are unable to do so. Instead of successfully mounting the volume, the following error occurs:
failed to setup alloc: pre-run hook "csi_hook" failed: rpc error: code = InvalidArgument desc = Device path not provided
The plugin returns this error when it is missing information in the PublishContext passed to a NodePublishVolume/NodeStageVolume RPC, as seen here.
The PublishContext is returned by a ControllerPublishVolume RPC; however, after checking the logs of my controller plugin, it turns out that ControllerPublishVolume is never called.
Again, this only occurs when there is a mismatch between healthy and expected counts. Otherwise, ControllerPublishVolume is called when a task requesting a CSI volume is scheduled, and the volume is attached successfully.
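For context, the workloads that fail are the ones claiming the volume through a group-level volume stanza and a task-level volume_mount, along these lines (a rough sketch only; the job, volume, and image names here are placeholders rather than my exact job):

job "mysql-server" {
  datacenters = ["dc1"]

  group "mysql-server" {
    # claim the CSI volume registered with Nomad (source is the volume ID)
    volume "mysql" {
      type      = "csi"
      source    = "mysql"
      read_only = false
    }

    task "mysql-server" {
      driver = "docker"

      config {
        image = "mysql:8"
      }

      # mount the claimed volume into the task; attaching it is what
      # triggers the ControllerPublishVolume / NodeStageVolume calls
      volume_mount {
        volume      = "mysql"
        destination = "/srv"
        read_only   = false
      }
    }
  }
}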
Reproduction steps
The easiest way to create a healthy/expected value mismatch is to increase the number of controller plugin tasks to 2 and then scale back down to 1.
1. Run the CSI controller plugin job:
job "plugin-aws-ebs-controller" {
datacenters = ["dc1"]
group "controller" {
task "plugin" {
driver = "docker"
config {
image = "amazon/aws-ebs-csi-driver:latest"
args = [
"controller",
"--endpoint=unix://csi/csi.sock",
"--logtostderr",
"--v=5",
]
}
csi_plugin {
id = "aws-ebs0"
type = "controller"
mount_dir = "/csi"
}
resources {
cpu = 500
memory = 256
}
# ensuring the plugin has time to shut down gracefully
kill_timeout = "2m"
}
}
}
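For reference, assuming the spec above is saved as plugin-aws-ebs-controller.nomad (the filename is arbitrary), it can be submitted with:

nomad job run plugin-aws-ebs-controller.nomad

The same pattern applies to the node plugin job below.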
2. Run the CSI node plugin job:
job "plugin-aws-ebs-nodes" {
datacenters = ["dc1"]
# you can run node plugins as service jobs as well, but this ensures
# that all nodes in the DC have a copy.
type = "system"
group "nodes" {
task "plugin" {
driver = "docker"
config {
image = "amazon/aws-ebs-csi-driver:latest"
args = [
"node",
"--endpoint=unix://csi/csi.sock",
"--logtostderr",
"--v=5",
]
# node plugins must run as privileged jobs because they
# mount disks to the host
privileged = true
}
csi_plugin {
id = "aws-ebs0"
type = "node"
mount_dir = "/csi"
}
resources {
cpu = 500
memory = 256
}
# ensuring the plugin has time to shut down gracefully
kill_timeout = "2m"
}
}
}
3. Create and register an EBS volume with Nomad, e.g. following https://learn.hashicorp.com/nomad/stateful-workloads/csi-volumes. A minimal volume spec is sketched below.
4. Optionally run the example MySQL job to verify that volumes can be attached successfully. Be sure to use constraints so the task using the volume runs in the same availability zone as your EBS volume.
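For step 3, a minimal volume spec might look roughly like this (a sketch for Nomad 0.11; the id, name, and external_id are placeholders for your own values, and plugin_id matches the csi_plugin id from the job specs above). It would be registered with nomad volume register volume.hcl:

# volume.hcl -- sketch only; replace the placeholder values
id              = "mysql"
name            = "mysql"
type            = "csi"
external_id     = "vol-0123456789abcdef0"  # the AWS EBS volume ID
plugin_id       = "aws-ebs0"
access_mode     = "single-node-writer"
attachment_mode = "file-system"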
5. Increase the count of the controller plugin group to 2 and wait for the new task to become healthy, then scale back down to 1 and wait for the extra task to complete (see the sketch below).
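For step 5, the only change to the controller job is the group's count (sketched below; the default count is 1, and everything else stays as in the spec above):

group "controller" {
  count = 2  # scale up to 2, wait for health, then set back to 1

  task "plugin" {
    # ... unchanged from the controller job spec above ...
  }
}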
6. Run nomad plugin status. You should see mismatched healthy/expected values for the controller plugin, as in the output at the top of this issue.

Additional Notes:

I am also seeing issues where plugins with no running jobs are not being garbage collected, as described in #7743. Not sure if this could be related, but I figured it was worth mentioning.
Hi @tydomitrovich! Thanks for the thorough reproduction!
This seems unusual since if a CSI plugin has completed it should no longer be expected to be running and healthy. When this mismatch between healthy and expected plugin task counts occurs, all tasks that need to attach a CSI volume using the plugin in question are unable to do so.
Yeah, agreed that this is totally a bug. That'll impact updates to plugins too, I think. I don't have a good workaround for you at the moment but I'll dig in and see if I can come up with a fix shortly.
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.