Bug 1823667: UPSTREAM: <carry>: Add mutex to DeleteNodes #149

Merged: 1 commit, May 21, 2020

@elmiko commented May 4, 2020

This change adds a mutex to the MachineController structure which is
used to gate access to the DeleteNodes function.

This is one in a series of PRs to mitigate kubernetes#3104

@openshift-ci-robot openshift-ci-robot added the bugzilla/severity-medium Referenced Bugzilla bug's severity is medium for the branch this PR is targeting. label May 4, 2020
@openshift-ci-robot

@elmiko: This pull request references Bugzilla bug 1823667, which is valid. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.5.0) matches configured target release for branch (4.5.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

In response to this:

Bug 1823667: UPSTREAM: <carry>: Add mutex to DeleteNodes

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added the bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. label May 4, 2020
@elmiko commented May 4, 2020

i had thought about using an RWMutex instead and attempting to make the access more granular, but thinking about our discussion this morning i opted to keep the patch simpler and more tightly focused on the gating behavior for DeleteNodes.

@elmiko commented May 4, 2020

@enxebre @JoelSpeed ptal

@JoelSpeed left a comment

Have you tested this patch and tried to repro the bug?

> i had thought about using an RWMutex instead and attempting to make the access more granular, but thinking about our discussion this morning i opted to keep the patch simpler and more tightly focused on the gating behavior for DeleteNodes.

I like the idea of putting an RWMutex in and ensuring all NG methods are locked personally, but as you say, it does add some extra complexity to the code

@@ -62,6 +63,7 @@ type machineController struct {
 	machineSetResource        *schema.GroupVersionResource
 	machineResource           *schema.GroupVersionResource
 	machineDeploymentResource *schema.GroupVersionResource
+	accessLock                *sync.Mutex

Could make this a non-pointer, has the same effect, plus the zero value works as well so you don't need the initialisation either. I don't mind either way since both work
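
The difference between the two forms can be sketched as below. The struct names `withPointer` and `withValue` and the helper functions are hypothetical; only `accessLock` and `sync.Mutex` come from the diff. One caveat with the value form: a struct holding a `sync.Mutex` by value must not be copied after first use.

```go
package main

import (
	"fmt"
	"sync"
)

// withPointer mirrors the diff under review: the lock has to be
// allocated before it is used.
type withPointer struct {
	accessLock *sync.Mutex
}

// withValue needs no initialisation, because the zero value of
// sync.Mutex is an unlocked mutex.
type withValue struct {
	accessLock sync.Mutex
}

// useValue locks a zero-value mutex without any setup.
func useValue() bool {
	c := &withValue{}
	c.accessLock.Lock()
	defer c.accessLock.Unlock()
	return true
}

// lockNilPointer shows what happens when the pointer form is left
// uninitialised: Lock dereferences nil and panics.
func lockNilPointer() (panicked bool) {
	defer func() {
		if recover() != nil {
			panicked = true
		}
	}()
	c := &withPointer{} // accessLock is nil here
	c.accessLock.Lock()
	return false
}

func main() {
	fmt.Println(useValue())       // true
	fmt.Println(lockNilPointer()) // true
}
```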

@elmiko (author) replied:

i'll change it in favor of a non-pointer reference

@elmiko commented May 5, 2020

> Have you tested this patch and tried to repro the bug?

i haven't had conclusive tests yet. i started working on an automated test and didn't try this patch out as deeply. today i should get to beat it up more.

> I like the idea of putting an RWMutex in and ensuring all NG methods are locked personally, but as you say, it does add some extra complexity to the code

so, my initial thought was to keep this change tightly scoped to what we had talked about. that said, i could see making this change more thorough by putting mutexes on all the methods if that's what we want. i'm open either way.

@elmiko commented May 5, 2020

@JoelSpeed just wanted to confirm, that after testing a bunch this morning, this patch in isolation does not fix the taint issue. it looked promising the first test run, but then went steadily downhill.

@elmiko elmiko force-pushed the DeleteNodes-mutex branch from e40bd66 to c2e44a6 Compare May 5, 2020 19:14
@openshift-ci-robot openshift-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 5, 2020
@elmiko commented May 5, 2020

updated the pr to address @JoelSpeed's comments.

fwiw, i tested out putting an RWMutex in and then trying to isolate the various calls but this resulted in some deadlock behavior during scaling events. i'm defaulting back to the original position that we should make these changes slowly, or at least focus on the lock behavior in another pull request.
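
The thread does not say what caused the deadlock. One common way it can happen when locks are added to methods that call each other is that Go's `sync.RWMutex` is not reentrant: a goroutine that already holds the read lock will wait forever if it then asks for the write lock. The sketch below probes for that situation with `TryLock` (available since Go 1.18) so that it reports the problem instead of hanging. The `controller` type and its methods are hypothetical, and this is an assumption about the failure mode rather than a diagnosis of the behavior seen here.

```go
package main

import (
	"fmt"
	"sync"
)

// controller is a hypothetical stand-in, not the autoscaler's type.
type controller struct {
	lock sync.RWMutex
	size int
}

// writeWhileReading holds the read lock and then probes for the write
// lock. A blocking Lock at this point would never return, because the
// writer would be waiting on a reader that is the same goroutine.
func (c *controller) writeWhileReading() bool {
	c.lock.RLock()
	defer c.lock.RUnlock()

	if c.lock.TryLock() {
		c.lock.Unlock()
		return true
	}
	return false
}

// writeAlone takes the write lock with no other lock held.
func (c *controller) writeAlone() bool {
	if c.lock.TryLock() {
		c.size++
		c.lock.Unlock()
		return true
	}
	return false
}

func main() {
	c := &controller{}
	fmt.Println(c.writeWhileReading()) // false: a blocking Lock would hang
	fmt.Println(c.writeAlone())        // true
}
```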

@elmiko commented May 5, 2020

/retest

This change adds a mutex to the MachineController structure which is
used to gate access to the DeleteNodes function.

This is one in a series of PRs to mitigate kubernetes#3104
Once a solution has been reached, this will be contributed upstream.
@elmiko elmiko force-pushed the DeleteNodes-mutex branch from c2e44a6 to 9e22789 Compare May 5, 2020 19:26
@openshift-ci-robot openshift-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 5, 2020
@elmiko commented May 5, 2020

/retest

@elmiko commented May 5, 2020

i'm not sure why yet, but this test is now failing the double delete unit test. i am not seeing this failure locally though.

i am also seeing some failures locally with the e2e tests that are not happening in CI.

on the upside, in my "real world" testing this does appear to fix the issue of taints being left on nodes.

@elmiko commented May 6, 2020

just wanted to confirm that with this change in place i am now able to pass the proposed test for this[0]

[0] openshift/cluster-api-actuator-pkg#151

@elmiko commented May 18, 2020

/retest

@JoelSpeed left a comment

/approve

Though I would like to see a follow up eventually so that all access to the NodeGroups comes under lock to prevent clashes

@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: JoelSpeed

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 18, 2020
@michaelgugino

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label May 21, 2020
@openshift-bot

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-merge-robot openshift-merge-robot merged commit 9ac8ed2 into openshift:master May 21, 2020
@openshift-ci-robot

@elmiko: Some pull requests linked via external trackers have merged: openshift/kubernetes-autoscaler#147. The following pull requests linked via external trackers have not merged:

In response to this:

Bug 1823667: UPSTREAM: <carry>: Add mutex to DeleteNodes

@elmiko elmiko deleted the DeleteNodes-mutex branch June 3, 2020 17:58
6 participants