csi: move volume claim release into volumewatcher #7794
Conversation
This changeset adds a subsystem to run on the leader, similar to the deployment watcher or node drainer. The `Watcher` performs a blocking query on updates to the `CSIVolumes` table and triggers reaping of volume claims. This avoids tying up scheduling workers: volume claim workloads get their own loop instead of blocking the scheduling workers in the core GC job on work such as talking to CSI controllers. (This commit does not include wiring up the leader or removing the old GC mechanism.)
Enable the volume watcher on leader step-up and disable it on leader step-down.
The volume claim GC mechanism now makes an empty claim RPC for the volume to trigger an index bump. That in turn unblocks the blocking query in the volume watcher so it can assess which claims can be released for a volume.
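For orientation, here is a minimal sketch of the shape that leader-side loop takes: a blocking query (simulated here with a timed stub) wakes the watcher when the volume state index advances, and each changed volume is then checked for claims to reap. The names (`volume`, `getVolumesAfter`, `watch`) are illustrative stand-ins, not Nomad's actual types.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// volume is a stand-in for a CSI volume with claims awaiting release.
type volume struct {
	ID         string
	PastClaims []string
}

// getVolumesAfter simulates a blocking query: it waits until the state
// index advances past minIndex, then returns the changed volumes.
func getVolumesAfter(ctx context.Context, minIndex uint64) ([]volume, uint64, error) {
	select {
	case <-ctx.Done():
		return nil, minIndex, ctx.Err()
	case <-time.After(100 * time.Millisecond): // pretend the index was bumped
		return []volume{{ID: "vol-1", PastClaims: []string{"alloc-1"}}}, minIndex + 1, nil
	}
}

// watch is the leader-side loop: block until volumes change, then reap.
func watch(ctx context.Context) {
	var lastIndex uint64
	for {
		vols, idx, err := getVolumesAfter(ctx, lastIndex)
		if err != nil {
			return // context canceled, e.g. on leader step-down
		}
		lastIndex = idx
		for _, vol := range vols {
			// the real subsystem hands this off to per-volume processing
			fmt.Printf("reaping %d past claims for %s\n", len(vol.PastClaims), vol.ID)
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 350*time.Millisecond)
	defer cancel()
	watch(ctx)
}
```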
@@ -0,0 +1,125 @@
package volumewatcher
Note for review: this file draws heavily on https://github.com/hashicorp/nomad/blob/master/nomad/deploymentwatcher/batcher.go
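As a rough illustration of the batching pattern that file borrows, the sketch below accumulates claim updates for a short window and flushes them as one deduplicated batch keyed by volume and namespace. The types and the window length are assumptions for the example, not the real batcher's definitions.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// claimUpdate is a stand-in for a volume claim update bound for raft.
type claimUpdate struct{ VolumeID, Namespace string }

// batcher collects claim updates for a short window and flushes them as a
// single batch, deduplicating by volume and namespace.
func batcher(ctx context.Context, in <-chan claimUpdate, flush func([]claimUpdate)) {
	const window = 250 * time.Millisecond
	var timerCh <-chan time.Time
	pending := map[string]claimUpdate{}

	for {
		select {
		case <-ctx.Done():
			return
		case upd := <-in:
			if timerCh == nil {
				timerCh = time.After(window) // start the window on the first update
			}
			pending[upd.VolumeID+upd.Namespace] = upd
		case <-timerCh:
			batch := make([]claimUpdate, 0, len(pending))
			for _, upd := range pending {
				batch = append(batch, upd)
			}
			flush(batch) // one raft apply for the whole window
			pending = map[string]claimUpdate{}
			timerCh = nil
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	in := make(chan claimUpdate)
	go batcher(ctx, in, func(b []claimUpdate) { fmt.Println("flushing", len(b), "claim(s)") })
	in <- claimUpdate{VolumeID: "vol-1", Namespace: "default"}
	in <- claimUpdate{VolumeID: "vol-1", Namespace: "default"} // deduplicated with the first
	time.Sleep(500 * time.Millisecond)
}
```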
@@ -1156,33 +1158,35 @@ func (n *nomadFSM) applyCSIVolumeDeregister(buf []byte, index uint64) interface{
return nil
}

func (n *nomadFSM) applyCSIVolumeClaim(buf []byte, index uint64) interface{} {
note for review: it's worth looking at this file without the diff... the diff really tangled this up because the `applyCSIVolumeBatchClaim` is similar to the `applyCSIVolumeClaim`. This should be entirely an addition.
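A minimal sketch of the relationship described here, using hypothetical types: the batch apply unpacks the batched raft message and runs each claim through the same code path the single-claim apply uses, so both messages converge on the same state transition.

```go
package main

import "fmt"

// claimRequest is a stand-in for a single volume claim request.
type claimRequest struct{ VolumeID, AllocID string }

// batchClaimRequest is a stand-in for the batched raft message.
type batchClaimRequest struct{ Claims []claimRequest }

// applyClaim is the single-claim code path.
func applyClaim(index uint64, req claimRequest) error {
	fmt.Printf("index %d: claim by %s on %s\n", index, req.AllocID, req.VolumeID)
	return nil
}

// applyBatchClaim unpacks the batch and reuses the single-claim path.
func applyBatchClaim(index uint64, req batchClaimRequest) error {
	for _, c := range req.Claims {
		if err := applyClaim(index, c); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	_ = applyBatchClaim(7, batchClaimRequest{Claims: []claimRequest{
		{VolumeID: "vol-1", AllocID: "alloc-1"},
		{VolumeID: "vol-2", AllocID: "alloc-2"},
	}})
}
```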
@@ -1,11 +0,0 @@
package nomad
Note for review: this only supported the now-removed `nomad/core_sched_test.go` file.
@@ -0,0 +1,28 @@
package volumewatcher
Note for review: these interfaces exist solely to give us handles for RPC mocks in the tests (so we don't have to set up clients with CSI plugins in unit tests).
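A small sketch of that narrow-interface pattern, with hypothetical names: production code satisfies the interface by wrapping the real RPC, while unit tests substitute a fake that just records calls.

```go
package main

import "fmt"

// ClaimUpdater is the narrow surface the watcher needs from the RPC layer.
type ClaimUpdater interface {
	UpdateClaim(volID, allocID string) error
}

// rpcClaimUpdater would wrap the real server RPC in production.
type rpcClaimUpdater struct{}

func (rpcClaimUpdater) UpdateClaim(volID, allocID string) error {
	// ...issue the real RPC here...
	return nil
}

// fakeClaimUpdater records calls so unit tests can assert on them without
// standing up clients or CSI plugins.
type fakeClaimUpdater struct{ calls []string }

func (f *fakeClaimUpdater) UpdateClaim(volID, allocID string) error {
	f.calls = append(f.calls, volID+"/"+allocID)
	return nil
}

// releaseClaim is watcher code that only depends on the interface.
func releaseClaim(u ClaimUpdater, volID, allocID string) error {
	return u.UpdateClaim(volID, allocID)
}

func main() {
	fake := &fakeClaimUpdater{}
	_ = releaseClaim(fake, "vol-1", "alloc-1")
	fmt.Println(fake.calls) // [vol-1/alloc-1]
}
```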
return nil
}

for _, claim := range vol.PastClaims {
Note for review: from here down in this file we're almost entirely copying the code over from the `nomad/core_sched.go` file that's been removed. Changes mostly include better trace logging and naming/comments.
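For context, a hedged sketch of the general shape of that claim-reaping loop, with hypothetical helpers standing in for the node/controller detach RPCs and the state checkpoint; the ordering and error handling in the real code may differ.

```go
package main

import "fmt"

// pastClaim is a stand-in for a claim an allocation no longer needs.
type pastClaim struct {
	AllocID string
	NodeID  string
}

// reapClaims walks a volume's past claims and releases each one: detach on
// the node, detach at the controller, then checkpoint the released claim so
// completed work isn't repeated after a crash or leader transition.
func reapClaims(volID string, claims []pastClaim,
	nodeDetach, controllerDetach func(volID, nodeID string) error,
	checkpoint func(volID, allocID string) error) error {

	for _, claim := range claims {
		if err := nodeDetach(volID, claim.NodeID); err != nil {
			return err
		}
		if err := controllerDetach(volID, claim.NodeID); err != nil {
			return err
		}
		if err := checkpoint(volID, claim.AllocID); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	noop := func(volID, id string) error {
		fmt.Println("release step for", volID, id)
		return nil
	}
	_ = reapClaims("vol-1", []pastClaim{{AllocID: "alloc-1", NodeID: "node-1"}}, noop, noop, noop)
}
```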
// TODO: this is currently dead code; we'll call a public remove
// method on the Watcher once we have a periodic GC job
// remove stops watching a volume and should only be called when locked.
ref #7825
Ok! I've left a couple of non-blocking comments through here. This is a pile of work! I think it looks good and the reasoning is pretty straightforward.
claims[upd.VolumeID+upd.RequestNamespace()] = upd
}
w.f <- future
case <-timerCh:
In general for this pattern, I'd expect that we'd want to have a batch max size or timer, where either of those would apply the batch. I know the raft lib has some recent support for splitting large commits, but it seems like we're missing batch max from this (and deployments). I think we're good to go ahead, but should consider using an interface to allow this and deployments to use the same batcher code later.
Good call. Spun out to #7838
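A sketch of the suggested refinement (now tracked in #7838), with hypothetical types: the batch is flushed by whichever comes first, the size cap or the window timer.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// sizeOrTimerBatcher flushes when either the batch reaches maxSize or the
// window timer fires, whichever happens first.
func sizeOrTimerBatcher(ctx context.Context, in <-chan string, maxSize int,
	window time.Duration, flush func([]string)) {

	var timerCh <-chan time.Time
	var batch []string

	emit := func() {
		if len(batch) > 0 {
			flush(batch)
			batch = nil
		}
		timerCh = nil
	}

	for {
		select {
		case <-ctx.Done():
			emit()
			return
		case item := <-in:
			if timerCh == nil {
				timerCh = time.After(window)
			}
			batch = append(batch, item)
			if len(batch) >= maxSize {
				emit() // size cap hit before the timer
			}
		case <-timerCh:
			emit() // window expired with a partial batch
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	in := make(chan string)
	go sizeOrTimerBatcher(ctx, in, 2, 250*time.Millisecond,
		func(b []string) { fmt.Println("flush:", b) })
	in <- "claim-1"
	in <- "claim-2" // triggers a size-based flush
	in <- "claim-3" // flushed later by the timer
	time.Sleep(500 * time.Millisecond)
}
```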
// Kill everything associated with the watcher
if w.exitFn != nil {
	w.exitFn()
is this safe to call on a follower?
Yeah.
- If the follower was recently a leader, we drop the blocking query it was making. Any in-flight work will be canceled except for client RPCs (which we can't yet cancel but are idempotent).
- If the follower was not recently a leader but we call `flush()` on it anyway for some spurious reason, it's safe to call a cancel function twice (see the sketch below).
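A tiny illustration of that last point: a `context` cancel function in Go is documented to do nothing after the first call, so a spurious second cancellation is harmless.

```go
package main

import (
	"context"
	"fmt"
)

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	cancel()               // no-op: subsequent calls to a CancelFunc do nothing
	fmt.Println(ctx.Err()) // context canceled
}
```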
I'd convinced myself it was already, sorry; that was a note I made on the first pass and meant to remove before I hit the button. On first read I thought `exitFn` looked like it was actually running the RPCs.
}

// Start the long lived watcher that scans for allocation updates
w.Start()
For the long term, I'm a little worried about starting a goroutine for every volume. It seems at least possible that some operators could set us up with more goroutines than we want running for these; maybe in a follow-up, if warranted, we could use a fixed pool?
I think that's a great idea. But it occurs to me we could also simply `Stop()` the goroutine when we've completed processing an incoming update (when we've determined either there are no more past claims to process or no more claims at all). This is a small change but will require some test reworking, so I'm going to spin it off into #7837.
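A minimal sketch of that stopping condition, with hypothetical types: the per-volume loop exits as soon as an update leaves nothing to reap, rather than waiting to be removed by a later GC pass.

```go
package main

import "fmt"

// volumeUpdate is a hypothetical stand-in for an incoming volume update.
type volumeUpdate struct {
	ID         string
	PastClaims int
}

// watchVolume processes updates for a single volume and returns as soon as
// an update shows no past claims left to process (which also covers the
// "no claims at all" case), instead of idling in its own goroutine.
func watchVolume(updates <-chan volumeUpdate, reap func(volumeUpdate)) {
	for upd := range updates {
		if upd.PastClaims == 0 {
			return // nothing left to do for this volume
		}
		reap(upd)
	}
}

func main() {
	updates := make(chan volumeUpdate, 2)
	updates <- volumeUpdate{ID: "vol-1", PastClaims: 1}
	updates <- volumeUpdate{ID: "vol-1", PastClaims: 0} // watcher exits here
	close(updates)
	watchVolume(updates, func(u volumeUpdate) { fmt.Println("reaping claims for", u.ID) })
}
```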
Following the new volumewatcher in #7794 and performance improvements to it that landed afterwards, there's no particular reason we should be threading claim releases through the GC eval rather than writing an empty `CSIVolumeClaimRequest` with the mode set to `CSIVolumeClaimRelease`, just as the GC evaluation would do. Also, by batching up these raft messages, we can reduce the amount of raft writes by 1 and cross-server RPCs by 1 per volume we release claims on.
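A hedged sketch of that replacement path: the request and mode names below mirror the ones mentioned in the text, but the fields and the batching shape are assumptions for illustration, not Nomad's exact definitions.

```go
package main

import "fmt"

// claimMode and CSIVolumeClaimRelease mirror the names mentioned above;
// the rest of this type is illustrative only.
type claimMode int

const CSIVolumeClaimRelease claimMode = iota

type CSIVolumeClaimRequest struct {
	VolumeID  string
	Namespace string
	Claim     claimMode
	// The allocation is intentionally left off: the "empty" release claim
	// only needs to bump the volume's index so the volumewatcher
	// re-evaluates which claims can be dropped.
}

// releaseClaims batches one release request per volume into a single apply
// instead of creating a GC evaluation per volume.
func releaseClaims(apply func([]CSIVolumeClaimRequest) error, vols map[string]string) error {
	batch := make([]CSIVolumeClaimRequest, 0, len(vols))
	for id, ns := range vols {
		batch = append(batch, CSIVolumeClaimRequest{
			VolumeID:  id,
			Namespace: ns,
			Claim:     CSIVolumeClaimRelease,
		})
	}
	return apply(batch)
}

func main() {
	_ = releaseClaims(func(b []CSIVolumeClaimRequest) error {
		fmt.Println("applying", len(b), "claim release(s)")
		return nil
	}, map[string]string{"vol-1": "default"})
}
```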
Fixes #7629
This changeset adds a subsystem to run on the leader, similar to the deployment watcher or node drainer. The `Watcher` performs a blocking query on updates to the `CSIVolumes` table and triggers reaping of volume claims. This avoids tying up scheduling workers: volume claim workloads get their own loop instead of blocking the scheduling workers in the core GC job on work such as talking to CSI controllers.
Notes: