Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

csi: volume claim garbage collection #7125

Merged
merged 4 commits into from
Feb 19, 2020
Merged

csi: volume claim garbage collection #7125

merged 4 commits into from
Feb 19, 2020

Conversation

tgross
Copy link
Member

@tgross tgross commented Feb 11, 2020

When an alloc is marked terminal (and after node unstage/unpublish
have been called), the client syncs the terminal alloc state with the
server via Node.UpdateAlloc RPC.

For each job that has a terminal alloc, the Node.UpdateAlloc RPC
handler at the server will emit an eval for a new core job to garbage
collect CSI volume claims. When this eval is handled on the core
scheduler, it will call a volumeReap method to release the claims
for all terminal allocs on the job.

The volume reap will issue a ControllerUnpublishVolume RPC for any
alloc that has volumes with a controller plugin. Once this returns (or
is skipped), the volume reap will send a new CSIVolume.Claim RPC
that releases the volume claim for that allocation in the state store,
making it available for scheduling again.

This same volumeReap method will be called from the core job GC,
which gives us a second chance to reclaim volumes during GC if there
were controller RPC failures.

@tgross tgross added this to the 0.11.0 milestone Feb 11, 2020
@tgross tgross force-pushed the f-csi-volume-gc branch 2 times, most recently from 877e159 to ef11c8e Compare February 13, 2020 14:02
@tgross tgross force-pushed the f-csi-volume-gc branch 6 times, most recently from e213e55 to 1942fc3 Compare February 13, 2020 20:57
@tgross tgross marked this pull request as ready for review February 13, 2020 21:35
err = core.Process(eval)
require.NoError(t, err)

// Verify the claim was released
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@langmartin wanted to flag this for you in particular because you implemented the state store work. As far as I can tell there's no real reason to bother with tracking PastAllocs with this implementation, and we can swap out the implementation of structs.CSIVolume.ClaimRelease with that of structs.CSIVolume.GCAlloc. Do you have any thoughts here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PastAllocs at this point is there to serve as a debugging tool. If there's a user sensible gap of several minutes between the reapVolumes that would detect the ReadAlloc + Terminal state and the eval reap that deletes the allocations, it's worth keeping both stages.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PastAllocs at this point is there to serve as a debugging tool. If there's a user sensible gap of several minutes between the reapVolumes that would detect the ReadAlloc + Terminal state and the eval reap that deletes the allocations, it's worth keeping both stages.

I was trying to wrap up the RFC section to explain this, but PastAllocs as it stands right now doesn't get us anything. When we get Node.UpdateAlloc for a terminal alloc, we can't move that alloc out of Read/WriteAllocs because that would make it eligible for scheduling before we've released the claim. At the very least we'd need to have a PastReadAllocs and a PastWriteAllocs and check that during scheduling, but right now we're not looking in PastAllocs at all during scheduling.

Alternately, we could add the alloc to PastAllocs but not remove it from ReadAllocs/WriteAllocs until it's GC'd, but I'm not sure that helps us in any way that isn't better served by checking if the alloc is terminal.

Copy link
Contributor

@langmartin langmartin Feb 18, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the way you have it working is correct. Here's how I think it should go:

  1. UpdateAlloc informs the state store that the alloc is terminal
  2. The volume and eval are marked for GC
  3. If it's unclaimed, we ControllerUnpublish
  4. we ClaimRelease, and move the alloc to PastAllocs
  5. EvalGCThreshold later, the the eval is garbage collected, and on alloc GC we delete it from PastAllocs

The gap between 4 & 5 is the user perceivable bit that PastAllocs gets us, I think. I'm a bit fuzzy on the details of where the eval or job is marked for GC to actually reap the eval (and allocs) but I think that configured duration kicks in either way.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I think I get what you're proposing at least.

But when we GC the job/alloc we end up having to run the volume GC process anyways, because we don't have any guarantees they're not running concurrently (we can interleave transactions with the lengthy ControllerUnpublish). So the PastAllocs might be useful but could just as easily be removed instantly depending on timing, which makes for an unreliable debugging instrument that we have to pay for in extra raft transactions.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, without anything currently consuming PastAllocs I'm feeling extra-skeptical about its use at this stage of the design.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, that's fair. We can always re-introduce it if we need to.

nomad/core_sched.go Outdated Show resolved Hide resolved
nomad/core_sched.go Outdated Show resolved Hide resolved
nomad/core_sched.go Outdated Show resolved Hide resolved
@tgross tgross force-pushed the f-csi-volume-gc branch 2 times, most recently from c06d075 to 6b219bd Compare February 14, 2020 19:07
err = core.Process(eval)
require.NoError(t, err)

// Verify the claim was released
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PastAllocs at this point is there to serve as a debugging tool. If there's a user sensible gap of several minutes between the reapVolumes that would detect the ReadAlloc + Terminal state and the eval reap that deletes the allocations, it's worth keeping both stages.

nomad/core_sched.go Outdated Show resolved Hide resolved
nomad/core_sched.go Outdated Show resolved Hide resolved
When an alloc is marked terminal (and after node unstage/unpublish
have been called), the client syncs the terminal alloc state with the
server via `Node.UpdateAlloc RPC`.

For each job that has a terminal alloc, the `Node.UpdateAlloc` RPC
handler at the server will emit an eval for a new core job to garbage
collect CSI volume claims. When this eval is handled on the core
scheduler, it will call a `volumeReap` method to release the claims
for all terminal allocs on the job.

The volume reap will issue a `ControllerUnpublishVolume` RPC for any
node that has no alloc claiming the volume. Once this returns (or
is skipped), the volume reap will send a new `CSIVolume.Claim` RPC
that releases the volume claim for that allocation in the state store,
making it available for scheduling again.

This same `volumeReap` method will be called from the core job GC,
which gives us a second chance to reclaim volumes during GC if there
were controller RPC failures.
@tgross tgross merged commit 3cf905d into f-csi-volumes Feb 19, 2020
@tgross tgross deleted the f-csi-volume-gc branch February 19, 2020 14:05
tgross added a commit that referenced this pull request Mar 4, 2020
When an alloc is marked terminal (and after node unstage/unpublish
have been called), the client syncs the terminal alloc state with the
server via `Node.UpdateAlloc RPC`.

For each job that has a terminal alloc, the `Node.UpdateAlloc` RPC
handler at the server will emit an eval for a new core job to garbage
collect CSI volume claims. When this eval is handled on the core
scheduler, it will call a `volumeReap` method to release the claims
for all terminal allocs on the job.

The volume reap will issue a `ControllerUnpublishVolume` RPC for any
node that has no alloc claiming the volume. Once this returns (or
is skipped), the volume reap will send a new `CSIVolume.Claim` RPC
that releases the volume claim for that allocation in the state store,
making it available for scheduling again.

This same `volumeReap` method will be called from the core job GC,
which gives us a second chance to reclaim volumes during GC if there
were controller RPC failures.
tgross added a commit that referenced this pull request Mar 4, 2020
When an alloc is marked terminal (and after node unstage/unpublish
have been called), the client syncs the terminal alloc state with the
server via `Node.UpdateAlloc RPC`.

For each job that has a terminal alloc, the `Node.UpdateAlloc` RPC
handler at the server will emit an eval for a new core job to garbage
collect CSI volume claims. When this eval is handled on the core
scheduler, it will call a `volumeReap` method to release the claims
for all terminal allocs on the job.

The volume reap will issue a `ControllerUnpublishVolume` RPC for any
node that has no alloc claiming the volume. Once this returns (or
is skipped), the volume reap will send a new `CSIVolume.Claim` RPC
that releases the volume claim for that allocation in the state store,
making it available for scheduling again.

This same `volumeReap` method will be called from the core job GC,
which gives us a second chance to reclaim volumes during GC if there
were controller RPC failures.
tgross added a commit that referenced this pull request Mar 9, 2020
When an alloc is marked terminal (and after node unstage/unpublish
have been called), the client syncs the terminal alloc state with the
server via `Node.UpdateAlloc RPC`.

For each job that has a terminal alloc, the `Node.UpdateAlloc` RPC
handler at the server will emit an eval for a new core job to garbage
collect CSI volume claims. When this eval is handled on the core
scheduler, it will call a `volumeReap` method to release the claims
for all terminal allocs on the job.

The volume reap will issue a `ControllerUnpublishVolume` RPC for any
node that has no alloc claiming the volume. Once this returns (or
is skipped), the volume reap will send a new `CSIVolume.Claim` RPC
that releases the volume claim for that allocation in the state store,
making it available for scheduling again.

This same `volumeReap` method will be called from the core job GC,
which gives us a second chance to reclaim volumes during GC if there
were controller RPC failures.
langmartin pushed a commit that referenced this pull request Mar 13, 2020
When an alloc is marked terminal (and after node unstage/unpublish
have been called), the client syncs the terminal alloc state with the
server via `Node.UpdateAlloc RPC`.

For each job that has a terminal alloc, the `Node.UpdateAlloc` RPC
handler at the server will emit an eval for a new core job to garbage
collect CSI volume claims. When this eval is handled on the core
scheduler, it will call a `volumeReap` method to release the claims
for all terminal allocs on the job.

The volume reap will issue a `ControllerUnpublishVolume` RPC for any
node that has no alloc claiming the volume. Once this returns (or
is skipped), the volume reap will send a new `CSIVolume.Claim` RPC
that releases the volume claim for that allocation in the state store,
making it available for scheduling again.

This same `volumeReap` method will be called from the core job GC,
which gives us a second chance to reclaim volumes during GC if there
were controller RPC failures.
tgross added a commit that referenced this pull request Mar 16, 2020
When an alloc is marked terminal (and after node unstage/unpublish
have been called), the client syncs the terminal alloc state with the
server via `Node.UpdateAlloc RPC`.

For each job that has a terminal alloc, the `Node.UpdateAlloc` RPC
handler at the server will emit an eval for a new core job to garbage
collect CSI volume claims. When this eval is handled on the core
scheduler, it will call a `volumeReap` method to release the claims
for all terminal allocs on the job.

The volume reap will issue a `ControllerUnpublishVolume` RPC for any
node that has no alloc claiming the volume. Once this returns (or
is skipped), the volume reap will send a new `CSIVolume.Claim` RPC
that releases the volume claim for that allocation in the state store,
making it available for scheduling again.

This same `volumeReap` method will be called from the core job GC,
which gives us a second chance to reclaim volumes during GC if there
were controller RPC failures.
tgross added a commit that referenced this pull request Mar 23, 2020
When an alloc is marked terminal (and after node unstage/unpublish
have been called), the client syncs the terminal alloc state with the
server via `Node.UpdateAlloc RPC`.

For each job that has a terminal alloc, the `Node.UpdateAlloc` RPC
handler at the server will emit an eval for a new core job to garbage
collect CSI volume claims. When this eval is handled on the core
scheduler, it will call a `volumeReap` method to release the claims
for all terminal allocs on the job.

The volume reap will issue a `ControllerUnpublishVolume` RPC for any
node that has no alloc claiming the volume. Once this returns (or
is skipped), the volume reap will send a new `CSIVolume.Claim` RPC
that releases the volume claim for that allocation in the state store,
making it available for scheduling again.

This same `volumeReap` method will be called from the core job GC,
which gives us a second chance to reclaim volumes during GC if there
were controller RPC failures.
tgross added a commit that referenced this pull request Mar 23, 2020
When an alloc is marked terminal (and after node unstage/unpublish
have been called), the client syncs the terminal alloc state with the
server via `Node.UpdateAlloc RPC`.

For each job that has a terminal alloc, the `Node.UpdateAlloc` RPC
handler at the server will emit an eval for a new core job to garbage
collect CSI volume claims. When this eval is handled on the core
scheduler, it will call a `volumeReap` method to release the claims
for all terminal allocs on the job.

The volume reap will issue a `ControllerUnpublishVolume` RPC for any
node that has no alloc claiming the volume. Once this returns (or
is skipped), the volume reap will send a new `CSIVolume.Claim` RPC
that releases the volume claim for that allocation in the state store,
making it available for scheduling again.

This same `volumeReap` method will be called from the core job GC,
which gives us a second chance to reclaim volumes during GC if there
were controller RPC failures.
@github-actions
Copy link

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jan 17, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants