Nomad is unable to create CSI plugin due to being unable to probe the CSI driver #7931
hi @KallynGowdy! Thanks so much for the thorough report; it definitely helped me eliminate a couple of obvious possibilities right away. That also means I don't have an immediate answer for you on this, but I'll dig into it shortly and come back with any questions.
Spoke too soon. 😀 I notice you're using the
Hey @tgross! Thanks for the quick response! I've swapped that out; this did not appear to change the behavior of the system. The plugin still gets stuck without being able to connect. After updating the jobs, I restarted Nomad again to see if that would change anything, but it seems to have had no effect.

Is there some obscure Linux permission that would prevent socket files with different permissions from being able to communicate? On that line of thought, here's some output showing the socket file's ownership and permissions:
Seems like the file is owned by root, and the permissions don't allow other users to write to it. So what happens if I do this?
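The exact command is elided above, but it was along these lines (`socat` here is illustrative; the socket path is the controller path from my setup):

```sh
# Connect to the plugin socket as root; if permissions are the problem,
# this should succeed even when the nomad user can't connect.
sudo socat - UNIX-CONNECT:/opt/nomad/data/client/csi/controller/aws-ebs0/csi.sock
```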
Looks like I'm able to establish a connection! What happens when I run as the nomad user?
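Again illustrative, the same test as the unprivileged user:

```sh
# Run the same connection test as the nomad user;
# a "Permission denied" here confirms the permissions theory.
sudo -u nomad socat - UNIX-CONNECT:/opt/nomad/data/client/csi/controller/aws-ebs0/csi.sock
```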
And finally, when I change the file permissions of the socket file:
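For testing purposes only, something like this (a world-writable socket obviously isn't the real fix):

```sh
# Let other users (including nomad) read and write the socket file.
sudo chmod o+rw /opt/nomad/data/client/csi/controller/aws-ebs0/csi.sock
```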
Just for fun, I decided to try and run the Nomad agent as root. Once I did, the controller was recognized as healthy:

Once I also updated the permissions for the node socket file, it showed up in the status report:
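(The status output itself is elided; it came from the plugin status command, roughly:)

```sh
# Ask Nomad whether the CSI plugin's controller and node pieces are healthy.
nomad plugin status -address $NOMAD_ADDRESS -token $NOMAD_TOKEN aws-ebs0
```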
As for a permanent solution to this issue, is this something handled by Docker or Nomad? Who is in charge of creating that socket file? For now, a workaround is to simply run Nomad as root, but it would be nice to see this fixed.
After some more research, it appears the socket file is created by the CSI driver from inside the container. Because of how Docker handles permissions with shared volumes, files created from the host side get the same permissions as the Docker process, while files created from the container side get root permissions. I validated this by running the aws-ebs-csi-driver docker container and giving it a custom mount point:
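The exact invocation is elided above; it was roughly the following (the image tag and driver flags are assumptions based on the plugin's docs, not copied from my shell history):

```sh
# Run the EBS CSI driver with a host directory bind-mounted over /csi,
# so the socket it creates can be inspected from the host side.
mkdir -p /tmp/csi-test
docker run --rm -d --name csi-perms-test \
  -v /tmp/csi-test:/csi \
  amazon/aws-ebs-csi-driver:latest \
  --endpoint=unix:///csi/csi.sock
```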
When we check the host folder we get:
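(Output elided; it showed the socket owned by root with no group/world write access, roughly:)

```sh
ls -la /tmp/csi-test
# srwxr-xr-x 1 root root 0 ... csi.sock   <- illustrative, not the actual output
```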
It looks like this has been a known issue in Docker for a while. They say the solution to the issue is to use local volumes and specify the correct permissions as part of the driver options. However, this won't allow both the host and the container to access the volume. According to this comment, the only way to fix the permissions issue is to take control of any container-created files manually (like running `chown` on them after the fact).

There's a related issue about being able to mount shared volumes with specific permissions inside the container, but some of the discussion relates to host permissions as well.

I'm not sure why Unix sockets are preferred as opposed to normal networking. Indeed, it is possible to set up the aws-ebs-csi-driver to use IP routing. (e.g.

It states:
However, the use of "MAY" in the statement seems to conflict with RFC 2119, the RFC used for technical terminology. Seems like a bug/typo.

In any case, I think a fix would be to ensure that the correct permissions are applied to the socket file. What do you think?

Extra Background

Nomad handles communication with the CSI driver by creating a Docker shared volume with the CSI driver container.
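For the record, that shared volume is visible from the host side; a quick way to see the mounts Docker set up for the plugin task (the container ID is a placeholder):

```sh
# Show the bind mounts for the plugin's container, including the csi volume.
docker inspect -f '{{ json .Mounts }}' <plugin-container-id>
```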
Ah, this is what I was missing from the background. Generally speaking, we recommend that Nomad clients are run as root. Servers can and should be run as unprivileged users. (I'm re-reading that bit of documentation and it could be clearer.) The Nomad client performs a lot of different operations that can only be run by root, above and beyond talking to the Docker socket: configuring network namespaces, mount points, etc. If you never want to use the
I can't speak for the CSI spec authors directly, but most likely it's because IP networking creates an added complication when dealing with network namespaces between the orchestrator and the plugin. If the plugin uses the host namespace, then it's open to any other process running on the host. Whereas if it's a Unix socket, the added difficulty you're experiencing here with file permissions is a security measure -- mounting is a privileged operation (requiring
Hm, I feel like this could be a configuration footgun. If the Nomad user chowns the file, it won't necessarily be accessible by the plugin unless the user has been careful to map the user namespaces.
Yes, which is why I configured the
I understand, and really what I want out of this is an easier way to debug the issue. I only care a little that I have to run the Nomad client as the root user; the bigger issue was that the logs only indicated that the probe method was unavailable, with no mention of why. Because I was unfamiliar with the system, I had to dig around to discover that it was just a permissions issue. No worries though. It was a good learning experience.
Totally understood on that! This has been a bit of a thorn in my side because of #7424. But this is a good example of a case where Nomad could provide a better error message even if the plugin can't help us out.

Ok so I think we've come to a resolution of the underlying problem. I'm going to add this error message case to #7424 (comment) so as to improve the experience going forward. Thanks again @KallynGowdy for the thorough reporting on this issue!

I've just merged #7965, which will provide better error messages in the case where Nomad can't establish the initial connection. This will ship in 0.11.3.

That's a totally different error unfortunately, which is that you're not running the Nomad client under the appropriate level of privilege to do much at all. See the installation guide: https://www.nomadproject.io/docs/install/production/nomad-agent#permissions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
Nomad version
Nomad v0.11.1 (b434570)
Operating system and Environment details
Issue
Nomad is stuck probing an "unavailable" CSI plugin via gRPC.
For some background, I've been following a mixture of the Nomad stateful workloads tutorial and the Nomad Terraform AWS example project.
I've put a longer version down below, but the gist is that while the `csi.sock` files are being created both on the Nomad side and on the container (CSI driver) side, they seem unable to communicate. This causes Nomad to repeatedly try to probe the plugin, but to no avail.

I've checked the normal culprits, namely that both Nomad and Docker (dockerd & containerd) have the correct permissions and that I'm using Nomad jobs from the tutorial. I've also tried restarting the Nomad service to see if it was an initialization issue.
Reproduction steps
Here's my setup so far:
Ubuntu 20.04 LTS
Docker version 19.03.8, build afacb8b7f0
Consul v1.7.3
Nomad v0.11.1 (b43457070037800fcc8442c8ff095ff4005dab33)
$ cd modules/casualos-micro
$ terraform apply
$ nomad job run -address $NOMAD_ADDRESS -token $NOMAD_TOKEN ./out/aws-ebs-controller.hcl
$ nomad job run -address $NOMAD_ADDRESS -token $NOMAD_TOKEN ./out/aws-ebs-nodes.hcl
The system logs (via `journalctl`) also did not appear to contain anything useful.

Job files
aws-ebs-controller.hcl
aws-ebs-nodes.hcl
Nomad Client logs
Here are the logs from the Nomad jobs:
plugin-aws-ebs-controller

Logs from the `plugin-aws-ebs-controller` job. They're from stderr, since that is where the AWS EBS driver puts them. There are no logs in stdout.
plugin-aws-ebs-nodes

Logs from the `plugin-aws-ebs-nodes` job. They're from stderr, since that is where the AWS EBS driver puts them. There are no logs in stdout.
Additionally, I think it is useful to include the Nomad logs from `journalctl`:

nomad.service
Other research
Note that gRPC calls are being made for the `Probe` method and they're all returning with an `Unavailable` code. A quick search online gives a list of gRPC status codes with this note about `Unavailable`:

I'm assuming this means that Nomad is unable to establish a connection over the `/csi/csi.sock` socket, but I don't know how to test or validate such a scenario.

I also searched around for where the logs are originating from in the Nomad source.
It seems that the `Probe` gRPC method is being invoked from the `PluginProbe()` method. This function is in turn called from `supervisorLoopOnce()`, which I believe is used by `ensureSupervisorLoop()` to help set up the plugin.

The log could also come from a call stack that includes the `registerPlugin()` function, since both `registerPlugin()` and `supervisorLoopOnce()` specify `csi_client` as the logger name.

Finally, it seems like Nomad calculates the socket address by joining a mount point with the CSI socket name (`csi.sock`). The mount point is calculated here, and is the result of joining the CSI "root directory" with the plugin type (`controller`/`node`) and id (`aws-ebs0`). The root directory was in turn calculated here, by joining the client "State Directory" with the literal string `"csi"`. As we can see from above, the "state directory" is `/opt/nomad/data/client`, and when I checked, there was a `csi` folder containing `csi.sock` "files" for both the `controller` and `node` subdirectories (with `aws-ebs0`).

When I try sending some dummy data to the socket, I get the following output:
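For completeness, a proper probe of that socket can also be made with `grpcurl` (illustrative; it assumes the driver supports gRPC reflection, otherwise pass the CSI protos with `-proto`):

```sh
# Issue a CSI Identity.Probe RPC directly against the controller socket.
sudo grpcurl -plaintext -unix \
  /opt/nomad/data/client/csi/controller/aws-ebs0/csi.sock \
  csi.v1.Identity/Probe
```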
I've also validated that the `csi.sock` file is being created inside the docker container.

From here, I'm not sure what the issue is. Seems like the socket files are being created, but they are unable to communicate with each other for some reason.