csi: make volume GC in job deregister safely async #7632
Passing e2e:
This makes sense, and looks good to me
CSI plugins use the external volume ID for all operations, but the client CSI RPCs use the Nomad volume ID (human-friendly) for the mount paths. Pass the external ID as an argument in the RPC call so that the unpublish workflows have it without calling back to the server to look it up. The controller CSI plugins need the CSI node ID (that is, the storage provider's view of the node ID, such as the EC2 instance ID), not the Nomad node ID, to determine how to detach the external volume.
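The split between the two kinds of ID can be sketched roughly as follows. This is only an illustration: the `unpublishRequest` struct, its field names, and both helper functions are hypothetical and are not taken from the Nomad codebase.

```go
package main

import (
	"fmt"
	"path"
)

// unpublishRequest is a hypothetical RPC argument struct. It carries both
// IDs so the client never has to call back to the server to translate them.
type unpublishRequest struct {
	VolumeID   string // Nomad volume ID (human-friendly), used for mount paths
	ExternalID string // storage provider's volume ID, used for plugin calls
	AllocID    string // allocation that held the claim
	CSINodeID  string // provider's view of the node, e.g. an EC2 instance ID
}

// mountPath builds the client-side path from the Nomad volume ID.
// The directory layout here is an assumption made for the example.
func mountPath(root string, req unpublishRequest) string {
	return path.Join(root, req.VolumeID, req.AllocID)
}

// controllerDetachArgs returns the pair a controller plugin needs to detach
// a volume: the external volume ID and the CSI node ID (not the Nomad ones).
func controllerDetachArgs(req unpublishRequest) (volumeID, nodeID string) {
	return req.ExternalID, req.CSINodeID
}

func main() {
	req := unpublishRequest{
		VolumeID:   "my-volume",
		ExternalID: "vol-0123456789abcdef0",
		AllocID:    "alloc-1",
		CSINodeID:  "i-0123456789abcdef0",
	}
	fmt.Println(mountPath("/csi", req))
	fmt.Println(controllerDetachArgs(req))
}
```

Carrying both IDs in one request keeps the unpublish path free of extra server round trips, at the cost of a slightly larger RPC payload.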
The `Job.Deregister` call will block on the client CSI controller RPCs while the alloc still exists on the Nomad client node. So we need to make the volume claim reaping async from the `Job.Deregister`. This allows `nomad job stop` to return immediately. In order to make this work, this changeset changes the volume GC so that the GC jobs are on a by-volume basis rather than a by-job basis; we won't have to query the (possibly deleted) job at the time of volume GC. We smuggle the volume ID and whether it's a purge into the GC eval ID the same way we smuggled the job ID previously.
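A minimal sketch of what packing the volume ID and purge flag into a single ID string might look like. The `gcPrefix` value, the `:` separator, the `purge`/`no-purge` suffixes, and both function names are assumptions made for illustration, not the format this changeset actually uses.

```go
package main

import (
	"fmt"
	"strings"
)

// gcPrefix is a hypothetical marker identifying volume-claim GC evals.
const gcPrefix = "csi-volume-claim-gc"

// encodeVolumeGC packs the volume ID and the purge flag into one ID string.
func encodeVolumeGC(volID string, purge bool) string {
	suffix := "no-purge"
	if purge {
		suffix = "purge"
	}
	return gcPrefix + ":" + volID + ":" + suffix
}

// decodeVolumeGC reverses encodeVolumeGC. It strips the prefix and then
// splits on the last separator, so a volume ID containing ":" survives.
func decodeVolumeGC(id string) (volID string, purge bool, err error) {
	if !strings.HasPrefix(id, gcPrefix+":") {
		return "", false, fmt.Errorf("not a volume GC ID: %q", id)
	}
	rest := strings.TrimPrefix(id, gcPrefix+":")
	i := strings.LastIndex(rest, ":")
	if i <= 0 {
		return "", false, fmt.Errorf("malformed volume GC ID: %q", id)
	}
	switch rest[i+1:] {
	case "purge":
		return rest[:i], true, nil
	case "no-purge":
		return rest[:i], false, nil
	}
	return "", false, fmt.Errorf("unknown purge flag in %q", id)
}

func main() {
	id := encodeVolumeGC("my-volume", true)
	volID, purge, err := decodeVolumeGC(id)
	fmt.Println(id, volID, purge, err)
}
```

Because everything the GC needs travels in the ID itself, the worker does not have to look up a job that may already have been deleted.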
Partial fix for #7629
Includes commits from #7628
The `Job.Deregister` call will block on the client CSI controller RPCs while the alloc still exists on the Nomad client node. So we need to make the volume claim reaping async from the `Job.Deregister`. This allows `nomad job stop` to return immediately. In order to make this work, I've changed the volume GC so that the GC jobs are on a by-volume basis rather than a by-job basis; we won't have to query the (possibly deleted) job at the time of volume GC. I'm smuggling the volume ID and whether it's a purge into the GC eval ID the same way we smuggled the job ID previously.

This doesn't entirely fix #7629 because the first GC attempt for volumes with a controller will fail and be re-queued. I've decreased the client-side controller timeout to reduce the wait time as a stop-gap, but we'll want to revisit it.
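The deregister-then-reap flow, including the failed first attempt that gets re-queued, can be modeled with a toy queue. This is a simplified, single-threaded sketch: `gcQueue`, `deregisterJob`, `runOnce`, and `errStillAttached` are hypothetical names, and the real scheduler and eval broker are not modeled.

```go
package main

import (
	"errors"
	"fmt"
)

// errStillAttached stands in for a controller RPC failing while the alloc
// still exists on the client node.
var errStillAttached = errors.New("volume still attached")

// gcQueue is a toy stand-in for a queue of per-volume GC work.
type gcQueue struct{ pending []string }

func (q *gcQueue) enqueue(volID string) { q.pending = append(q.pending, volID) }

// deregisterJob only enqueues per-volume GC work and returns right away;
// it never waits on the release of any claim.
func deregisterJob(q *gcQueue, volIDs []string) {
	for _, id := range volIDs {
		q.enqueue(id)
	}
}

// runOnce attempts to release every pending volume once. Volumes whose
// release fails are re-queued. It returns how many remain pending.
func (q *gcQueue) runOnce(release func(volID string) error) int {
	batch := q.pending
	q.pending = nil
	for _, id := range batch {
		if err := release(id); err != nil {
			q.enqueue(id)
		}
	}
	return len(q.pending)
}

func main() {
	q := &gcQueue{}
	deregisterJob(q, []string{"vol-a", "vol-b"})

	attempts := map[string]int{}
	release := func(volID string) error {
		attempts[volID]++
		if attempts[volID] == 1 {
			return errStillAttached // first attempt fails and is retried
		}
		return nil
	}
	fmt.Println("pending after first pass:", q.runOnce(release))
	fmt.Println("pending after second pass:", q.runOnce(release))
}
```

In this model the cost of the failed first pass is one extra GC cycle per volume, which is why shortening the controller timeout reduces the wait without removing it.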
This leaves the E2E tests with `nomad stop -purge` flaky, depending on timing. For now I've changed them so as not to purge until we've corrected the problem more permanently.