Bug 1823667: UPSTREAM: <carry>: Add mutex to DeleteNodes #149
Conversation
@elmiko: This pull request references Bugzilla bug 1823667, which is valid. The bug has been updated to refer to the pull request using the external bug tracker. 3 validation(s) were run on this bug
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
i had thought about using an RWMutex instead and attempting to make the access more granular, but thinking about our discussion this morning i opted to keep the patch simpler and more tightly focused on the gating behavior for DeleteNodes.
@enxebre @JoelSpeed ptal
Have you tested this patch and tried to repro the bug?
i had thought about using an RWMutex instead and attempting to make the access more granular, but thinking about our discussion this morning i opted to keep the patch simpler and more tightly focused on the gating behavior for DeleteNodes.
Personally, I like the idea of putting an RWMutex in and ensuring all NG methods are locked, but as you say, it does add some extra complexity to the code.
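For context, the trade-off being weighed looks roughly like this; the types and fields below are illustrative stand-ins, not the actual provider code. A plain sync.Mutex simply serializes DeleteNodes, whereas a sync.RWMutex lets read-only accessors run concurrently at the cost of lock-management code spread across the NodeGroup methods:

```go
package example

import "sync"

// Illustrative stand-in for the controller under discussion, not the real type.
type controller struct {
	mu    sync.RWMutex
	nodes map[string]bool
}

// DeleteNodes takes the write lock, so concurrent delete calls are serialized.
func (c *controller) DeleteNodes(names []string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for _, n := range names {
		delete(c.nodes, n)
	}
}

// Nodes takes only the read lock, so lookups may proceed in parallel with
// each other while still being excluded during a delete.
func (c *controller) Nodes() []string {
	c.mu.RLock()
	defer c.mu.RUnlock()
	out := make([]string, 0, len(c.nodes))
	for n := range c.nodes {
		out = append(out, n)
	}
	return out
}
```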
@@ -62,6 +63,7 @@ type machineController struct {
 	machineSetResource *schema.GroupVersionResource
 	machineResource *schema.GroupVersionResource
 	machineDeploymentResource *schema.GroupVersionResource
+	accessLock *sync.Mutex
Could make this a non-pointer, has the same effect, plus the zero value works as well so you don't need the initialisation either. I don't mind either way since both work
i'll change it in favor of a non-pointer reference
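For reference, the zero-value point is just this (a minimal sketch; only the accessLock field name comes from the diff above, the method is hypothetical):

```go
package example

import "sync"

type machineController struct {
	// As a value field, the zero value of sync.Mutex is an unlocked, usable
	// mutex, so the constructor needs no explicit initialisation.
	accessLock sync.Mutex
}

// With a pointer field (accessLock *sync.Mutex) the constructor would have to
// assign &sync.Mutex{}; calling Lock on a nil *sync.Mutex panics at runtime.
func (c *machineController) deleteNodesLocked() {
	c.accessLock.Lock()
	defer c.accessLock.Unlock()
	// ... work guarded by the lock ...
}
```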
i haven't had conclusive tests yet. i started working on an automated test and didn't try this patch out as deeply. today i should get to beat it up more.
so, my initial thought was to keep this change tightly scoped to what we had talked about. that said, i could see making this change more thorough by putting mutexes on all the methods if that's what we want. i'm open either way.
@JoelSpeed just wanted to confirm that after testing a bunch this morning, this patch in isolation does not fix the taint issue. it looked promising the first test run, but then went steadily downhill.
updated the pr to include @JoelSpeed 's comments. fwiw, i tested out putting an RWMutex in and then trying to isolate the various calls, but this resulted in some deadlock behavior during scaling events. i'm defaulting back to the original position that we should make these changes slowly, or at least focus on the lock behavior in another pull request.
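The deadlock is unsurprising in hindsight: Go's sync.Mutex and sync.RWMutex are not reentrant, so pushing locks down into methods that also call each other blocks a goroutine on a lock it already holds. A hypothetical illustration, not the actual autoscaler code:

```go
package example

import "sync"

type group struct {
	mu sync.RWMutex
}

// Nodes takes the read lock for its own access.
func (g *group) Nodes() []string {
	g.mu.RLock()
	defer g.mu.RUnlock()
	return nil
}

// DeleteNodes takes the write lock and then calls Nodes, which requests the
// read lock on the same RWMutex. RLock blocks while the write lock is held,
// so the goroutine deadlocks waiting on itself.
func (g *group) DeleteNodes() {
	g.mu.Lock()
	defer g.mu.Unlock()
	_ = g.Nodes()
}
```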
/retest
This change adds a mutex to the MachineController structure which is used to gate access to the DeleteNodes function. This is one in a series of PRs to mitigate kubernetes#3104. Once a solution has been reached, this will be contributed upstream.
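In outline, the gating described above amounts to the following; the field layout and signature are simplified assumptions rather than the exact provider code:

```go
package example

import "sync"

type machineController struct {
	accessLock sync.Mutex
	// other controller fields elided
}

type nodegroup struct {
	machineController *machineController
}

// DeleteNodes is gated by the controller-level mutex so that concurrent
// scale-down passes cannot interleave their deletions.
func (ng *nodegroup) DeleteNodes(nodeNames []string) error {
	ng.machineController.accessLock.Lock()
	defer ng.machineController.accessLock.Unlock()

	// ... existing deletion logic runs while the lock is held ...
	return nil
}
```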
/retest
i'm not sure why yet, but this test is now failing the double delete unit test. i am not seeing this failure locally though. i am also seeing some failures locally with the e2e tests that are not happening in CI. on the upside, in my "real world" testing this does appear to fix the issue of taints being left on nodes.
just wanted to confirm that with this change in place i am now able to pass the proposed test for this[0]
/retest |
/approve
Though I would like to see a follow-up eventually so that all access to the NodeGroups comes under lock to prevent clashes.
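One possible shape for that follow-up (purely hypothetical, not an agreed design) would be a small helper so every NodeGroup accessor takes the lock the same way:

```go
package example

import "sync"

type machineController struct {
	accessLock sync.Mutex
}

// withLock is a hypothetical helper a follow-up could use so that all
// NodeGroup access, not just DeleteNodes, runs under the controller lock.
func (c *machineController) withLock(fn func() error) error {
	c.accessLock.Lock()
	defer c.accessLock.Unlock()
	return fn()
}
```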
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: JoelSpeed. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
/lgtm
/retest
Please review the full test history for this PR and help us cut down flakes.
@elmiko: Some pull requests linked via external trackers have merged: openshift/kubernetes-autoscaler#147. The following pull requests linked via external trackers have not merged:
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.