
csi: move volume claim release into volumewatcher #7794

Merged: tgross merged 5 commits into master from f-csi-volume-watcher on Apr 30, 2020

Conversation

@tgross tgross (Member) commented Apr 23, 2020

Fixes #7629

This changeset adds a subsystem to run on the leader, similar to the deployment watcher or node drainer. The Watcher performs a blocking query on updates to the CSIVolumes table and triggers reaping of volume claims.

This will avoid tying up scheduling workers: volume claim work is sent straight into its own loop rather than blocking the scheduling workers in the core GC job on slow operations like talking to CSI controllers.
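
As a rough sketch of the shape of this subsystem (all names below are placeholders, not Nomad's actual types), the leader-side loop blocks on volume updates and releases any past claims it finds:

```go
// Illustrative only: a leader-side loop that blocks on volume updates and
// reaps past claims. getVolumes stands in for the blocking query against
// the volumes table; releaseClaims stands in for the claim-release work.
package main

import (
	"context"
	"log"
	"time"
)

type volume struct {
	ID         string
	PastClaims []string
}

// watchVolumes runs until ctx is canceled (for example, on leader step-down).
func watchVolumes(
	ctx context.Context,
	getVolumes func(ctx context.Context, minIndex uint64) ([]*volume, uint64, error),
	releaseClaims func(vol *volume) error,
) {
	var minIndex uint64 = 1
	for ctx.Err() == nil {
		// blocks until the table index passes minIndex or ctx is canceled
		vols, idx, err := getVolumes(ctx, minIndex)
		if err != nil {
			time.Sleep(time.Second) // crude backoff, enough for a sketch
			continue
		}
		minIndex = idx
		for _, vol := range vols {
			if len(vol.PastClaims) == 0 {
				continue // nothing to reap for this volume
			}
			if err := releaseClaims(vol); err != nil {
				log.Printf("failed to release claims for %s: %v", vol.ID, err)
			}
		}
	}
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	// A canned "blocking query": returns one volume with a past claim, then
	// blocks until the context is canceled.
	calls := 0
	getVolumes := func(ctx context.Context, minIndex uint64) ([]*volume, uint64, error) {
		if calls == 0 {
			calls++
			return []*volume{{ID: "vol1", PastClaims: []string{"alloc1"}}}, minIndex + 1, nil
		}
		<-ctx.Done()
		return nil, minIndex, ctx.Err()
	}
	releaseClaims := func(vol *volume) error {
		log.Printf("released %d claim(s) on %s", len(vol.PastClaims), vol.ID)
		cancel() // stop the sketch after one pass
		return nil
	}
	watchVolumes(ctx, getVolumes, releaseClaims)
}
```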


Notes:

@tgross tgross force-pushed the f-csi-volume-watcher branch 18 times, most recently from 60bb6f3 to 75ed335 on April 27, 2020 20:12
tgross added 2 commits April 28, 2020 16:03
This changeset adds a subsystem to run on the leader, similar to the
deployment watcher or node drainer. The `Watcher` performs a blocking
query on updates to the `CSIVolumes` table and triggers reaping of
volume claims.

This will avoid tying up scheduling workers by immediately sending
volume claim workloads into their own loop, rather than blocking the
scheduling workers in the core GC job doing things like talking to CSI
controllers

(This commit does not include wiring-up the leader or removing the old
GC mechanism.)
Enable the volume watcher on leader step-up and disable it on leader
step-down.
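
A minimal sketch of that leadership wiring, with placeholder types standing in for the real server and watcher:

```go
// Illustrative only: the watcher is enabled when this server gains
// leadership and disabled again when leadership is revoked.
package main

import "fmt"

type watcher struct{ enabled bool }

// SetEnabled is a stand-in for the watcher's real enable/disable hook.
func (w *watcher) SetEnabled(enabled bool) {
	w.enabled = enabled
	fmt.Println("volume watcher enabled:", enabled)
}

type server struct{ volumeWatcher *watcher }

// called from the leader step-up path
func (s *server) establishLeadership() { s.volumeWatcher.SetEnabled(true) }

// called from the leader step-down path
func (s *server) revokeLeadership() { s.volumeWatcher.SetEnabled(false) }

func main() {
	s := &server{volumeWatcher: &watcher{}}
	s.establishLeadership()
	s.revokeLeadership()
}
```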
@tgross tgross force-pushed the f-csi-volume-watcher branch from 8c2ac25 to e0b9f9d on April 28, 2020 20:03
@tgross tgross requested a review from langmartin April 28, 2020 20:26
@tgross tgross marked this pull request as ready for review April 28, 2020 20:27
The volume claim GC mechanism now makes an empty claim RPC for the
volume to trigger an index bump. That in turn unblocks the blocking
query in the volume watcher so it can assess which claims can be
released for a volume.
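
The mechanism can be illustrated with a toy blocking query (placeholder code, not Nomad's state store): any write that bumps the table index, even an otherwise-empty claim update, wakes the blocked reader so it can re-evaluate the volume's claims:

```go
// Illustrative only: a blocking query waits until the table index passes a
// threshold; bumping the index wakes it up.
package main

import (
	"fmt"
	"sync"
)

type table struct {
	mu    sync.Mutex
	cond  *sync.Cond
	index uint64
}

func newTable() *table {
	t := &table{index: 1}
	t.cond = sync.NewCond(&t.mu)
	return t
}

// blockingQuery returns once the table index exceeds minIndex.
func (t *table) blockingQuery(minIndex uint64) uint64 {
	t.mu.Lock()
	defer t.mu.Unlock()
	for t.index <= minIndex {
		t.cond.Wait()
	}
	return t.index
}

// write bumps the index and wakes blocked queries, even if the update
// itself changes nothing else (an "empty claim").
func (t *table) write() {
	t.mu.Lock()
	t.index++
	t.mu.Unlock()
	t.cond.Broadcast()
}

func main() {
	t := newTable()
	done := make(chan uint64)
	go func() { done <- t.blockingQuery(1) }()
	t.write() // the GC eval's empty claim RPC plays this role
	fmt.Println("watcher unblocked at index", <-done)
}
```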
@tgross tgross force-pushed the f-csi-volume-watcher branch from e0b9f9d to 588acb5 on April 28, 2020 20:30
@@ -0,0 +1,125 @@
package volumewatcher
@@ -1156,33 +1158,35 @@ func (n *nomadFSM) applyCSIVolumeDeregister(buf []byte, index uint64) interface{
return nil
}

func (n *nomadFSM) applyCSIVolumeClaim(buf []byte, index uint64) interface{} {
tgross (Member, Author):

Note for review: it's worth looking at this file without the diff. The diff is tangled here because applyCSIVolumeBatchClaim is similar to applyCSIVolumeClaim; the change should be entirely an addition.
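
As a rough sketch of that relationship (placeholder types, not the Nomad FSM), the batch apply just decodes a list of claim requests and reuses the same per-claim logic:

```go
// Illustrative only: applyBatchClaim loops over the same logic that
// applyClaim runs for a single request.
package main

import "fmt"

type claimRequest struct {
	VolumeID  string
	Namespace string
}

type batchClaimRequest struct {
	Claims []claimRequest
}

func applyClaim(req claimRequest, index uint64) error {
	// in the real FSM this would update the volume's claims in the state store
	fmt.Printf("applied claim for %s/%s at index %d\n", req.Namespace, req.VolumeID, index)
	return nil
}

func applyBatchClaim(batch batchClaimRequest, index uint64) error {
	for _, req := range batch.Claims {
		if err := applyClaim(req, index); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	batch := batchClaimRequest{Claims: []claimRequest{
		{VolumeID: "vol1", Namespace: "default"},
		{VolumeID: "vol2", Namespace: "default"},
	}}
	if err := applyBatchClaim(batch, 42); err != nil {
		fmt.Println("apply failed:", err)
	}
}
```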

@@ -1,11 +0,0 @@
package nomad
tgross (Member, Author):

Note for review: this only supported the now-removed nomad/core_sched_test.go file.

@@ -0,0 +1,28 @@
package volumewatcher
tgross (Member, Author):

Note for review: these interfaces exist solely to give us handles for RPC mocks in the tests (so we don't have to set up clients with CSI plugins in unit tests).
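
A sketch of that pattern with hypothetical names: a narrow interface over the claim-release RPC lets unit tests swap in a mock rather than standing up clients with CSI plugins:

```go
// Illustrative only: watcher code is written against a small interface so
// tests can substitute a recording mock for the real RPC client.
package main

import "fmt"

// ClaimReleaser is the handle the watcher uses to release volume claims.
type ClaimReleaser interface {
	ReleaseClaims(volID string) error
}

// mockClaimReleaser records calls and returns a canned error in tests.
type mockClaimReleaser struct {
	calls []string
	err   error
}

func (m *mockClaimReleaser) ReleaseClaims(volID string) error {
	m.calls = append(m.calls, volID)
	return m.err
}

// releaseAll stands in for watcher code that only sees the interface.
func releaseAll(rpc ClaimReleaser, volIDs []string) {
	for _, id := range volIDs {
		if err := rpc.ReleaseClaims(id); err != nil {
			fmt.Println("release failed:", err)
		}
	}
}

func main() {
	mock := &mockClaimReleaser{}
	releaseAll(mock, []string{"vol1", "vol2"})
	fmt.Println("mock saw calls for:", mock.calls)
}
```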

return nil
}

for _, claim := range vol.PastClaims {
@tgross tgross (Member, Author) commented Apr 28, 2020:
Note for review: from here down in this file we're almost entirely copying the code over from the nomad/core_sched.go file that's been removed. Changes mostly include better trace logging and naming/comments.


// TODO: this is currently dead code; we'll call a public remove
// method on the Watcher once we have a periodic GC job
// remove stops watching a volume and should only be called when locked.
tgross (Member, Author):
ref #7825

@langmartin langmartin (Contributor) left a comment:

Ok! I've left a couple of non-blocking comments through here. This is a pile of work! I think it looks good and the reasoning is pretty straightforward.

claims[upd.VolumeID+upd.RequestNamespace()] = upd
}
w.f <- future
case <-timerCh:
langmartin (Contributor):
In general for this pattern, I'd expect that we'd want to have a batch max size or timer, where either of those would apply the batch. I know the raft lib has some recent support for splitting large commits, but it seems like we're missing batch max from this (and deployments). I think we're good to go ahead, but should consider using an interface to allow this and deployments to use the same batcher code later.

tgross (Member, Author):
Good call. Spun out to #7838
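
For reference, a sketch of the batcher shape suggested above (placeholder types, not the existing volume or deployment batcher): flush on whichever comes first, a timer tick or a maximum batch size:

```go
// Illustrative only: a batcher that applies the batch either when the
// ticker fires or when the batch reaches maxSize.
package main

import (
	"fmt"
	"time"
)

type claimUpdate struct{ VolumeID string }

func runBatcher(updates <-chan claimUpdate, flush func([]claimUpdate), maxSize int, interval time.Duration) {
	var batch []claimUpdate
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	doFlush := func() {
		if len(batch) > 0 {
			flush(batch)
			batch = nil
		}
	}

	for {
		select {
		case upd, ok := <-updates:
			if !ok {
				doFlush() // channel closed: flush whatever is left
				return
			}
			batch = append(batch, upd)
			if len(batch) >= maxSize { // size-based flush
				doFlush()
			}
		case <-ticker.C: // time-based flush
			doFlush()
		}
	}
}

func main() {
	updates := make(chan claimUpdate)
	go func() {
		for i := 0; i < 5; i++ {
			updates <- claimUpdate{VolumeID: fmt.Sprintf("vol%d", i)}
		}
		close(updates)
	}()
	runBatcher(updates, func(b []claimUpdate) {
		fmt.Println("flushing", len(b), "claim update(s)")
	}, 2, 50*time.Millisecond)
}
```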


// Kill everything associated with the watcher
if w.exitFn != nil {
w.exitFn()
langmartin (Contributor):

Is this safe to call on a follower?

tgross (Member, Author):
Yeah.

  • If the follower was recently a leader, we drop the blocking query it was making. Any in-flight work will be canceled except for client RPCs (which we can't yet cancel but are idempotent).
  • If the follower was not recently a leader but we call the flush() on it anyway for some spurious reason, it's safe to call a cancel function twice (see the sketch below).
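
The second point rests on a standard property of Go's context package: the CancelFunc returned by context.WithCancel is a no-op after the first call, as this tiny example shows:

```go
package main

import (
	"context"
	"fmt"
)

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	cancel()               // calling a CancelFunc again does nothing
	fmt.Println(ctx.Err()) // "context canceled"
}
```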

langmartin (Contributor):

I'd convinced myself it was already; sorry, that was a note I made on the first pass and meant to remove before I hit the button. On first read I thought exitFn was actually running the RPCs.

}

// Start the long lived watcher that scans for allocation updates
w.Start()
langmartin (Contributor):

For the long term, I'm a little worried about starting a goroutine for every volume. It seems at least possible that some operators could set us up with more goroutines than we want running for these; maybe in a follow-up, if warranted, we could use a fixed pool?

tgross (Member, Author):

I think that's a great idea. But it occurs to me we could also simply Stop() the goroutine when we've completed processing an incoming update (when we've determined there are either no more past claims to process or no claims at all). This is a small change but will require some test reworking, so I'm going to spin it off into #7837.
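
A rough, self-contained sketch of that idea (placeholder names, not the actual watcher): the per-volume goroutine exits once an update leaves it with nothing left to process:

```go
// Illustrative only: instead of living for the volume's whole lifetime, the
// per-volume goroutine stops itself when there is no work remaining.
package main

import "fmt"

type volume struct {
	ID         string
	PastClaims []string
}

type perVolumeWatcher struct {
	updates chan *volume
	done    chan struct{}
}

func (w *perVolumeWatcher) run() {
	defer close(w.done)
	for vol := range w.updates {
		// releasing past claims would happen here (omitted in the sketch)
		fmt.Printf("processed %d past claim(s) for %s\n", len(vol.PastClaims), vol.ID)
		if len(vol.PastClaims) == 0 {
			return // no claims left to process: stop the goroutine
		}
	}
}

func main() {
	w := &perVolumeWatcher{updates: make(chan *volume, 1), done: make(chan struct{})}
	go w.run()
	w.updates <- &volume{ID: "vol1"} // an update with no remaining past claims
	<-w.done
}
```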

nomad/volumewatcher/volume_watcher.go (outdated review thread, resolved)
@tgross tgross merged commit 775de0d into master Apr 30, 2020
@tgross tgross deleted the f-csi-volume-watcher branch April 30, 2020 13:13
tgross added a commit that referenced this pull request May 20, 2020
Following the new volumewatcher in #7794 and performance improvements
to it that landed afterwards, there's no particular reason we should
be threading claim releases through the GC eval rather than writing an
empty `CSIVolumeClaimRequest` with the mode set to
`CSIVolumeClaimRelease`, just as the GC evaluation would do.

Also, by batching up these raft messages, we can reduce the amount of
raft writes by 1 and cross-server RPCs by 1 per volume we release
claims on.
tgross added a commit that referenced this pull request May 20, 2020 (same commit message as above)
tgross added a commit that referenced this pull request May 27, 2020 (same commit message as above)
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jan 8, 2023
Development

Successfully merging this pull request may close these issues:

csi: controller plugin timeouts