core.sched: failed to GC plugin: plugin_id=<plugin> error="rpc error: Permission denied" #11162
Comments
Hi @urog, thanks for the report. Do you see any errors in the plugin logs, either on the node or the controller? Thank you.
Thanks @lgfa29. Some more logs below; there are no logs on either the CSI controller or the nodes that correspond to the errors on the Nomad servers.
Server:
CSI Controller:
CSI Client: These are the only logs that come through; the last few repeat every 30 seconds.
Thanks for the logs; unfortunately there's not much there. Would you mind increasing the plugin verbosity to see if that provides more clues? From the plugin source code it seems like you can raise the verbosity level quite a bit further. Thanks!
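Verbosity for drivers like this is typically raised through the plugin task's container arguments. A minimal sketch, assuming the GCE PD driver accepts klog-style flags; the image reference and flag values below are illustrative, not taken from this thread:

```hcl
# Inside the plugin task's Docker config block.
config {
  # Illustrative image reference for the GCE PD CSI driver.
  image = "<gcp-compute-persistent-disk-csi-driver-image>:v1.2.0"

  args = [
    "--endpoint=unix:///csi/csi.sock",
    # Assumed klog-style verbosity flag; higher values log more detail.
    "--v=5",
  ]
}
```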
I've run both the controller and the nodes with full verbosity. I don't actually see any corresponding events in the CSI controller or node logs when Nomad fails to release the volume. Here's how I've deployed the CSI controller and nodes:
Controller:
Nodes:
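The collapsed job specs didn't survive the copy above, so for context, a node-plugin deployment for this driver generally follows this shape; the plugin id, image reference, and mount path are illustrative placeholders, not the reporter's actual configuration:

```hcl
job "csi-gce-pd-node" {
  datacenters = ["dc1"]
  type        = "system" # run the node plugin on every eligible client

  group "node" {
    task "plugin" {
      driver = "docker"

      config {
        # Illustrative image reference.
        image      = "<gcp-compute-persistent-disk-csi-driver-image>:v1.2.0"
        privileged = true

        args = [
          "--endpoint=unix:///csi/csi.sock",
          "--v=5",
        ]
      }

      # Registers this task with Nomad as a CSI node plugin.
      csi_plugin {
        id        = "gce-pd" # hypothetical plugin id
        type      = "node"
        mount_dir = "/csi"
      }

      resources {
        cpu    = 100
        memory = 128
      }
    }
  }
}
```

The controller side is typically a regular service job with `type = "controller"` in its csi_plugin block.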
And these are the logs from the Nomad servers. Over and over:
Just doing some issue cleanup and saw this issue. I want to note that this error:
Should be fixed in #11891, which will ship in the upcoming Nomad 1.2.5. That's unrelated to the original problem in this issue, which is:
That error looks like evals have somehow been created with the wrong leader ACL token.
Just tested on Nomad 1.2.5 and it appears to be working. I will test some node draining / job migrations and report back.
@urog for plugins, not just volumes?
I have tested:
All resulted in successful volume mounting / claiming. One thing to note: a couple of times when a job or node was killed and Nomad was trying to place the job on another node, the following message appeared in the logs:
Even though there were nodes running in the same availability zone as the volume, and the CSI agent was also running on those nodes.
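For context, zone-aware placement of GCE PD volumes normally comes from the topology segments attached when the volume is registered. A hypothetical volume specification sketch, assuming a Nomad version with CSI topology support and using an assumed segment key for this driver; all names are placeholders:

```hcl
# volume.hcl - hypothetical registration of an existing GCE persistent disk
type        = "csi"
id          = "my-data"
name        = "my-data"
external_id = "projects/<project>/zones/<zone>/disks/<disk-name>"
plugin_id   = "gce-pd" # must match the plugin id used in the plugin jobs

capability {
  access_mode     = "single-node-writer"
  attachment_mode = "file-system"
}

# Pin scheduling to the zone the disk lives in.
topology_request {
  required {
    topology {
      segments {
        "topology.gke.io/zone" = "<zone>"
      }
    }
  }
}
```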
That's great! But I asked about plugin GC, which was the only open topic in this issue. It looks like we don't have any more data on plugin GC here, so I'm going to close this issue out so that we're not side-tracked. There are some open issues around plugin counts and health that I'm still working through, like #11758, #9810, #10073, and #11784. If folks have more data to add about plugins, those issues are the best place to add it. Thanks!
Sorry - I was a bit carried away by it all working!
These errors are still present:
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues. |
Nomad version
Output from nomad version:
Nomad v1.1.4 (acd3d7889328ad1df2895eb714e2cbe3dd9c6d82)
Operating system and Environment details
Issue
I am seeing these errors in my server logs over and over:
The plugin the error refers to is the CSI plugin pd.csi.storage.gke.io, and I've tried versions 0.7.0 through 1.2.0 - all yield the same result. Scheduling jobs with a CSI volume mount works just fine. The issue is that when the job is stopped/purged, the volume cannot be mounted by any other job because Nomad thinks it's still allocated to the previous job.
I've been experiencing this since Nomad version ~v0.12.* and was following this issue, hoping the related fixes would resolve it. Nothing has changed.
Nomad has ACLs enabled, and the anonymous policy is disabled.
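For reference, the claim pattern described here is the standard csi volume / volume_mount usage in a jobspec. A minimal sketch with placeholder names (not the reporter's actual job):

```hcl
group "app" {
  # Claim the registered CSI volume for this group's allocations.
  # In this sketch the access/attachment modes come from the volume's
  # registration; newer Nomad versions also accept them in this block.
  volume "data" {
    type      = "csi"
    source    = "my-data" # registered volume id (placeholder)
    read_only = false
  }

  task "app" {
    driver = "docker"

    config {
      image = "<app-image>" # placeholder
    }

    # Mount the claimed volume into the task filesystem.
    volume_mount {
      volume      = "data"
      destination = "/srv/data"
    }
  }
}
```

When the job is stopped or purged, Nomad is expected to release this claim and detach the volume; the behavior reported here is that the claim lingers and blocks re-use.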