
csi: fix plugin counts on node update #7844

Merged
tgross merged 8 commits into master from csi_plugin_count_cleanup on May 5, 2020

Conversation

tgross (Member) commented Apr 30, 2020

For #7817 and #7743. Best reviewed commit-by-commit, but unfortunately all the changes are required to finish out this multi-layered bug. The bulk of the new lines of code are tests... it was a bit of a pain to suss out this behavior!

In this changeset:

  • If a Nomad client node is running both a controller and a node plugin (which is a common case) and only the controller or only the node is removed, the plugin was not being updated with the correct counts (see the sketch after this list).
  • The existing test for plugin cleanup didn't go back to the state store, which is normally fine but is complicated in this case by denormalization, which changes the behavior. This commit makes the test more comprehensive.
  • Set "controller required" when a plugin has PUBLISH_READONLY. All known controllers that support PUBLISH_READONLY also support PUBLISH_UNPUBLISH_VOLUME, but we shouldn't assume this.
  • Only create plugins when the allocs for those plugins are healthy. If we allow a plugin to be created for the first time when the alloc is not healthy, then we'll recreate deleted plugins when the job's allocs all get marked terminal.
  • Terminal plugin alloc updates should clean up the plugin. The client fingerprint can't tell whether the plugin is unhealthy intentionally (for the case of updates or job stop). Allocations that are server-terminal should delete themselves from the plugin and trigger a plugin self-GC, the same as an unused node.
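To make the first bullet concrete, here is a minimal, self-contained sketch of the per-type counting involved. `CSIPlugin`, `CSIInfo`, and `AddPlugin` are names from the real code (see the review hunk below), but `DeleteNodeForType` as written here and the reduced field set are illustrative, not Nomad's actual implementation:

```go
package main

import "fmt"

// CSIInfo is a reduced stand-in for the health report a client sends
// for one plugin instance.
type CSIInfo struct {
	Healthy bool
}

// CSIPlugin tracks controller and node instances separately. A single
// client node often runs both a controller and a node plugin, so
// removing one type must not disturb the other type's counts.
type CSIPlugin struct {
	Controllers        map[string]*CSIInfo
	Nodes              map[string]*CSIInfo
	ControllersHealthy int
	NodesHealthy       int
}

// DeleteNodeForType removes only the controller *or* only the node
// instance for nodeID, decrementing just the matching healthy count.
func (p *CSIPlugin) DeleteNodeForType(nodeID string, controller bool) {
	instances := p.Nodes
	if controller {
		instances = p.Controllers
	}
	if prev, ok := instances[nodeID]; ok {
		if prev.Healthy {
			if controller {
				p.ControllersHealthy--
			} else {
				p.NodesHealthy--
			}
		}
		delete(instances, nodeID)
	}
}

func main() {
	p := &CSIPlugin{
		Controllers:        map[string]*CSIInfo{"node-1": {Healthy: true}},
		Nodes:              map[string]*CSIInfo{"node-1": {Healthy: true}},
		ControllersHealthy: 1,
		NodesHealthy:       1,
	}
	// Remove only the controller instance; the node counts must survive.
	p.DeleteNodeForType("node-1", true)
	fmt.Println(p.ControllersHealthy, p.NodesHealthy) // 0 1
}
```

The bug was the case exercised in `main`: removing one instance type left the other type's counts stale.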

tgross (Member, Author) commented May 4, 2020

End-to-end tested. Note this doesn't include the final GC step we'll need to clean up multi-node plugins without a job purge; that will come in a separate PR.

▶ nomad node status
ID        DC   Name              Class   Drain  Eligibility  Status
8655e86e  dc1  ip-172-31-89-154  <none>  false  eligible     ready
19caa509  dc1  ip-172-31-91-220  <none>  false  eligible     ready

▶ nomad job run ./csi/input/plugin-aws-ebs-controller.nomad
==> Monitoring evaluation "b851d4f5"
    Evaluation triggered by job "plugin-aws-ebs-controller"
    Evaluation within deployment: "92af4f65"
    Allocation "18e3203b" created: node "8655e86e", group "controller"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "b851d4f5" finished with status "complete"

▶ nomad job run ./csi/input/plugin-aws-ebs-nodes.nomad
==> Monitoring evaluation "d58d47e6"
    Evaluation triggered by job "plugin-aws-ebs-nodes"
    Allocation "4e2a2040" created: node "8655e86e", group "nodes"
    Allocation "ee39bc02" created: node "19caa509", group "nodes"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "d58d47e6" finished with status "complete"

▶ nomad plugin status
Container Storage Interface
ID        Provider         Controllers Healthy/Expected  Nodes Healthy/Expected
aws-ebs0  ebs.csi.aws.com  1/1                           2/2

▶ nomad job status plugin-aws-ebs-controller | tail -3
Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created   Modified
18e3203b  8655e86e  controller  0        run      running  1m8s ago  44s ago

▶ nomad node eligibility -disable 8655
Node "8655e86e-6619-6122-d924-2dd6694b7a4e" scheduling eligibility set: ineligible for scheduling

▶ # edit the controller job
▶ nomad job run ./csi/input/plugin-aws-ebs-controller.nomad
==> Monitoring evaluation "5222fdf8"
    Evaluation triggered by job "plugin-aws-ebs-controller"
    Evaluation within deployment: "b5e65a11"
    Allocation "4db1e0d2" created: node "19caa509", group "controller"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "5222fdf8" finished with status "complete"

▶ nomad plugin status
Container Storage Interface
ID        Provider         Controllers Healthy/Expected  Nodes Healthy/Expected
aws-ebs0  ebs.csi.aws.com  0/0                           2/2

▶ # after the replacement controller alloc becomes healthy:
▶ nomad plugin status
Container Storage Interface
ID        Provider         Controllers Healthy/Expected  Nodes Healthy/Expected
aws-ebs0  ebs.csi.aws.com  1/1                           2/2

▶ nomad job stop plugin-aws-ebs-controller
==> Monitoring evaluation "9047a8c6"
    Evaluation triggered by job "plugin-aws-ebs-controller"
    Evaluation within deployment: "b5e65a11"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "9047a8c6" finished with status "complete"

▶ nomad plugin status
Container Storage Interface
ID        Provider         Controllers Healthy/Expected  Nodes Healthy/Expected
aws-ebs0  ebs.csi.aws.com  0/0                           2/2

▶ nomad job stop plugin-aws-ebs-nodes
==> Monitoring evaluation "0ec1a123"
    Evaluation triggered by job "plugin-aws-ebs-nodes"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "0ec1a123" finished with status "complete"

▶ nomad plugin status
Container Storage Interface
ID        Provider         Controllers Healthy/Expected  Nodes Healthy/Expected
aws-ebs0  ebs.csi.aws.com  0/0                           0/0

tgross requested a review from langmartin May 4, 2020 20:55
tgross marked this pull request as ready for review May 4, 2020 20:56
langmartin (Contributor) left a comment:
This looks good. Thanks for the commit-by-commit breakdown; super easy to read.

@@ -729,7 +731,9 @@ func (p *CSIPlugin) AddPlugin(nodeID string, info *CSIInfo) error {

```go
			p.NodesHealthy -= 1
		}
	}
	p.Nodes[nodeID] = info
	if prev != nil || prev == nil && info.Healthy {
```
langmartin (Contributor) commented on this hunk:
nitpick: isn't this the same as `prev != nil || info.Healthy`?

tgross (Member, Author) replied:

Yes! Will fix.
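For reference, the equivalence holds because Go's `&&` binds tighter than `||`, so `prev != nil || prev == nil && info.Healthy` parses as `prev != nil || (prev == nil && info.Healthy)`. A quick exhaustive check (standalone sketch, with `prevNil` standing in for `prev == nil`):

```go
package main

import "fmt"

func main() {
	for _, prevNil := range []bool{true, false} {
		for _, healthy := range []bool{true, false} {
			original := !prevNil || prevNil && healthy // && binds tighter than ||
			simplified := !prevNil || healthy
			fmt.Println(original == simplified) // true for all four cases
		}
	}
}
```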

tgross (Member, Author) commented May 5, 2020

Looks like I've got a merge conflict from the earlier merge with the docstring fixes. I'll address the open item and that conflict, and this should be good to go.

tgross added 7 commits May 5, 2020 13:50
The existing test didn't go back to the state store, which normally is
ok but is complicated in this case by denormalization which changes
the behavior. This commit makes the test more comprehensive.

All known controllers that support `PUBLISH_READONLY` also support
`PUBLISH_UNPUBLISH_VOLUME` but we shouldn't assume this.
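A sketch of the rule this commit describes; the capability names are from the CSI spec's controller service capabilities, but this predicate and where the check lives are illustrative, not Nomad's actual code:

```go
// controllerRequired treats any publish-related controller capability
// as requiring a controller, rather than assuming that a plugin with
// PUBLISH_READONLY also implements PUBLISH_UNPUBLISH_VOLUME.
func controllerRequired(capabilities []string) bool {
	for _, c := range capabilities {
		switch c {
		case "PUBLISH_UNPUBLISH_VOLUME", "PUBLISH_READONLY":
			return true
		}
	}
	return false
}
```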
If a Nomad client node is running both a controller and a node
plugin (which is a common case), then if only the controller or the
node is removed, the plugin was not being updated with the correct
counts.

Plugins are first created when a Nomad client sends a node update RPC
that includes allocs with plugins. We use the same mechanism to update
plugin health. But if we allow a plugin to be created for the first
time when the alloc is not healthy, then we'll recreate deleted
plugins when the job's allocs all get marked terminal. This changeset
fixes that by only creating a plugin when the alloc is healthy.
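In sketch form (reusing the `CSIPlugin`/`CSIInfo` types from the earlier example; `upsertPluginFromFingerprint` is a hypothetical name, not Nomad's actual function), the create-vs-update rule looks like:

```go
// upsertPluginFromFingerprint only *creates* a plugin record when the
// reporting alloc is healthy; updates to an existing record proceed
// either way. This stops a terminal (unhealthy) fingerprint from
// resurrecting a plugin that was already deleted.
func upsertPluginFromFingerprint(plugins map[string]*CSIPlugin, pluginID, nodeID string, info *CSIInfo) {
	plug, ok := plugins[pluginID]
	if !ok {
		if !info.Healthy {
			return // never create a plugin from an unhealthy report
		}
		plug = &CSIPlugin{
			Controllers: map[string]*CSIInfo{},
			Nodes:       map[string]*CSIInfo{},
		}
		plugins[pluginID] = plug
	}
	// Update path: adjust the healthy count the same way AddPlugin does.
	if prev, ok := plug.Nodes[nodeID]; ok && prev.Healthy {
		plug.NodesHealthy--
	}
	plug.Nodes[nodeID] = info
	if info.Healthy {
		plug.NodesHealthy++
	}
}
```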
When an allocation that implements a CSI plugin becomes terminal the
client fingerprint can't tell if the plugin is unhealthy intentionally
(for the case of updates or job stop). Allocations that are
server-terminal should delete themselves from the plugin and trigger a
plugin self-GC, the same as an unused node.
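And the cleanup side of that commit, again as a sketch on the same types (the real code also has to make sure no volumes still reference the plugin before GC'ing it):

```go
// onTerminalPluginAlloc removes a server-terminal alloc's instances
// from its plugin and GCs the plugin once nothing backs it.
// Illustrative only; names and placement differ in the real code.
func onTerminalPluginAlloc(plugins map[string]*CSIPlugin, pluginID, nodeID string) {
	plug, ok := plugins[pluginID]
	if !ok {
		return
	}
	plug.DeleteNodeForType(nodeID, true)  // drop controller instance, if any
	plug.DeleteNodeForType(nodeID, false) // drop node instance, if any
	if len(plug.Controllers) == 0 && len(plug.Nodes) == 0 {
		delete(plugins, pluginID) // plugin "self-GC"
	}
}
```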
tgross force-pushed the csi_plugin_count_cleanup branch from 81c0702 to e339037 May 5, 2020 17:52
tgross merged commit 1531db8 into master May 5, 2020
tgross deleted the csi_plugin_count_cleanup branch May 5, 2020 19:39
tgross added a commit that referenced this pull request Nov 20, 2020
Plugin health for controllers should show "Node Only" in the UI only when both
conditions are true: controllers are not required, and no controllers have
registered themselves (0 expected controllers). This accounts for "monolith"
plugins which might register as both controllers and nodes but not necessarily
have `ControllerRequired = true` because they don't implement the Controller
RPC endpoints we need (this requirement was added in #7844).

This changeset includes the following fixes:

* Update the Plugins tab of the UI so that monolith plugins don't show "Node
  Only" once they've registered.
* Add the missing "Node Only" logic to the Volumes tab of the UI.
tgross added a commit that referenced this pull request Nov 25, 2020
…9416)

(same commit message as the November 20 commit above)
github-actions bot commented Jan 7, 2023

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators Jan 7, 2023