CSI: Volume GC Evaluation Fails on Deregistered Volumes #8100
Hey @kainoaseto, we just released 0.11.3, which has some improvements to the GC loop. Can you give that a try to see if it can clean these up?
I'm running a mix of 0.11.3 and 0.12.0 currently. The leader is 0.12.0 at the moment, and I see in its logs:
Wanted to give a quick status update. I've landed a handful of PRs that will be released as part of the upcoming 0.12.2 release:
I believe these fixes combined should get us into pretty good shape, and #8584 will give you an escape hatch to manually detach the volume via
Thank you @tgross! This is a really exciting development, and we are really looking forward to testing out CSI again when 0.12.2 drops. We really appreciate the follow-up on these issues and all the work you have all done to stabilize CSI; this is what keeps us coming back to Nomad time and again.
Thank you! Just to note: the volume I see in the log message above doesn't appear in
Testing for 0.12.2 looks good. Going to close this issue out, and 0.12.2 will be shipped shortly.
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
Nomad version
Nomad servers and clients both running this version
Nomad v0.11.2 (807cfebe90d56f9e5beec3e72936ebe86acc8ce3)
Operating system and Environment details
Amazon Linux 2:
4.14.173-137.229.amzn2.x86_64
1 or 3 Nomad servers (have tested with both sizes of clusters)
Issue
After deregistering a volume, the CSIVolumeGC evaluation continues to run against that volume and fails with "volume not found". This happens consistently on the cluster I've been using for CSI testing, and the volume seems to be persisted somewhere in the Raft state: I've tried restarting the cluster, resizing the cluster, and even modifying the evaluation code to always pass on these volume failures, but upon restarting with the 0.11.2 code these old volumes continue to fail to be GC'd.
We noticed this when it took down our development servers: we had deregistered quite a few volumes, and on startup the leader tries to process all of them and exhausts its CPU.
Reproduction steps
Follow the guide here
Run
nomad volume deregister mysql
The Nomad server logs will periodically have the errors below with seemingly no way to stop them.
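The reproduction steps above can be sketched as a short CLI session. This is a hedged illustration, not output from the source: the volume name `mysql` follows the guide referenced above, and the log line is a paraphrase of the "volume not found" error described in this issue, so the exact wording on your cluster may differ.

```shell
# Register a CSI volume per the guide, then deregister it:
nomad volume deregister mysql

# After deregistration, the leader's logs periodically repeat an
# error along these lines (illustrative, not verbatim):
#   [ERROR] core.sched: CSI volume GC failed: error="volume not found"
# Restarting or resizing the cluster does not stop the errors.
```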
Nomad Server logs (if appropriate)