Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Backport of CSI: failed allocation should not block its own controller unpublish into release/1.1.x #14506

Conversation

hc-github-team-nomad-core
Copy link
Contributor

Backport

This PR is auto-generated from #14484 to be assessed for backporting due to the inclusion of the label backport/1.1.x.

WARNING automatic cherry-pick of commits failed. Commits will require human attention.

The below text is copied from the body of the original PR.


A Nomad user reported problems with CSI volumes associated with failed allocations, where the Nomad server did not send a controller unpublish RPC.

The controller unpublish is skipped if other non-terminal allocations on the same node claim the volume. The check has a bug where the allocation belonging to the claim being freed was included in the check incorrectly. During a normal allocation stop for job stop or a new version of the job, the allocation is terminal so that's ok. But allocations that fail are not yet marked terminal at the point in time when the client sends the unpublish RPC to the server.

For CSI plugins that support controller attach/detach, this means that the controller will not be able to detach the volume from the allocation's host and the replacement claim will fail until a GC is run. This changeset fixes the conditional so that the claim's own allocation is not included, and makes the logic easier to read. Include a test case covering this path.

This PR includes two other tiny bug fixes that were going to be a pain if I had to backport 3 different PRs. They're in their own commits:

  • Fix missing copies in the volume unpublish workflow. Entities we get from the state store should always be copied before altering. Ensure that we copy the volume in the top-level unpublish workflow before handing off to the steps.
  • The list stub object for volumes in nomad/structs did not match the stub object in api. The api package also did not include the current readers/writers fields that are expected by the UI. True up the two objects and add the previously undocumented fields to the docs.

@hc-github-team-nomad-core hc-github-team-nomad-core force-pushed the backport/b-csi-controller-unpublish/strictly-present-caiman branch from cc49a1c to 84297b6 Compare September 8, 2022 17:30
@tgross tgross force-pushed the backport/b-csi-controller-unpublish/strictly-present-caiman branch from 84297b6 to 2e9ba31 Compare September 8, 2022 18:45
…14484)

A Nomad user reported problems with CSI volumes associated with failed
allocations, where the Nomad server did not send a controller unpublish RPC.

The controller unpublish is skipped if other non-terminal allocations on the
same node claim the volume. The check has a bug where the allocation belonging
to the claim being freed was included in the check incorrectly. During a normal
allocation stop for job stop or a new version of the job, the allocation is
terminal. But allocations that fail are not yet marked terminal at the point in
time when the client sends the unpublish RPC to the server.

For CSI plugins that support controller attach/detach, this means that the
controller will not be able to detach the volume from the allocation's host and
the replacement claim will fail until a GC is run. This changeset fixes the
conditional so that the claim's own allocation is not included, and makes the
logic easier to read. Include a test case covering this path.

Also includes two minor extra bugfixes:

* Entities we get from the state store should always be copied before
altering. Ensure that we copy the volume in the top-level unpublish workflow
before handing off to the steps.

* The list stub object for volumes in `nomad/structs` did not match the stub
object in `api`. The `api` package also did not include the current
readers/writers fields that are expected by the UI. True up the two objects and
add the previously undocumented fields to the docs.
@tgross tgross force-pushed the backport/b-csi-controller-unpublish/strictly-present-caiman branch from 2e9ba31 to d5a9aae Compare September 8, 2022 18:56
@tgross tgross merged commit b756b2c into release/1.1.x Sep 8, 2022
@tgross tgross deleted the backport/b-csi-controller-unpublish/strictly-present-caiman branch September 8, 2022 19:43
@github-actions
Copy link

github-actions bot commented Jan 7, 2023

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jan 7, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants