CSI: Volume GC Evaluation Fails on Deregistered Volumes #8100
Hey @kainoaseto, we just released 0.11.3, which has some improvements to the GC loop. Can you give that a try to see if it can clean these up?
I'm running a mix of 0.11.3 and 0.12.0 currently. The leader is 0.12.0 at the moment, and I see in its logs:
Wanted to give a quick status update. I've landed a handful of PRs that will be released as part of the upcoming 0.12.2 release:
I believe these fixes combined should get us into pretty good shape, and #8584 will give you an escape hatch to manually detach the volume via
Thank you @tgross! This is a really exciting development, and we are really looking forward to testing out CSI again when 0.12.2 drops. We really appreciate the follow-up on these issues and all the work you have all done to stabilize CSI; this is what keeps us coming back to Nomad time and again.
Thank you! Just to note: the volume I see in the log message above doesn't appear in
Testing for 0.12.2 looks good. Going to close this issue out, and 0.12.2 will be shipped shortly.
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
Nomad version
Nomad servers and clients both running this version
Nomad v0.11.2 (807cfebe90d56f9e5beec3e72936ebe86acc8ce3)
Operating system and Environment details
Amazon Linux 2:
4.14.173-137.229.amzn2.x86_64
1 or 3 Nomad servers (have tested with both sizes of clusters)
Issue
After deregistering a volume, the CSIVolumeGC evaluation continues to run against that volume and fails with "volume not found". This happens consistently on the cluster I've been using for CSI testing, and the volume seems to be persisted somewhere in the Raft state: I've tried restarting the cluster, resizing the cluster, and even modifying the evaluation code to always pass on these volume failures, but upon restarting with the 0.11.2 code these old volumes continue to fail to be GC'd.
We noticed this when it took down our development servers: we had deregistered quite a few volumes, and on startup the leader tries to process all of them and exhausts its CPU.
Reproduction steps
Follow the guide here
Run
nomad volume deregister mysql
The Nomad server logs will periodically have the errors below with seemingly no way to stop them.
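The reproduction steps above can be sketched as a short CLI session. This is a hedged illustration, not output from the source: the volume name `mysql` follows the guide referenced above, and the log line is a paraphrase of the "volume not found" error described in this issue, so the exact wording on your cluster may differ.

```shell
# Register a CSI volume per the guide, then deregister it:
nomad volume deregister mysql

# After deregistration, the leader's logs periodically repeat an
# error along these lines (illustrative, not verbatim):
#   [ERROR] core.sched: CSI volume GC failed: error="volume not found"
# Restarting or resizing the cluster does not stop the errors.
```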
Nomad Server logs (if appropriate)