Bug 1823667: UPSTREAM: <carry>: Add mutex to DeleteNodes #149
Conversation
@elmiko: This pull request references Bugzilla bug 1823667, which is valid. The bug has been updated to refer to the pull request using the external bug tracker. 3 validation(s) were run on this bug
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
i had thought about using an RWMutex instead and attempting to make the access more granular, but thinking about our discussion this morning i opted to keep the patch simpler and more tightly focused on the gating behavior for DeleteNodes.
@enxebre @JoelSpeed ptal
Have you tested this patch and tried to repro the bug?
i had thought about using an RWMutex instead and attempting to make the access more granular, but thinking about our discussion this morning i opted to keep the patch simpler and more tightly focused on the gating behavior for DeleteNodes.
Personally, I like the idea of putting an RWMutex in and ensuring all NG methods are locked, but as you say, it does add some extra complexity to the code.
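For context, the trade-off being weighed looks roughly like this; the types and fields below are illustrative stand-ins, not the actual provider code. A plain sync.Mutex simply serializes DeleteNodes, whereas a sync.RWMutex lets read-only accessors run concurrently at the cost of lock-management code spread across the NodeGroup methods:

```go
package example

import "sync"

// Illustrative stand-in for the controller under discussion, not the real type.
type controller struct {
	mu    sync.RWMutex
	nodes map[string]bool
}

// DeleteNodes takes the write lock, so concurrent delete calls are serialized.
func (c *controller) DeleteNodes(names []string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for _, n := range names {
		delete(c.nodes, n)
	}
}

// Nodes takes only the read lock, so lookups may proceed in parallel with
// each other while still being excluded during a delete.
func (c *controller) Nodes() []string {
	c.mu.RLock()
	defer c.mu.RUnlock()
	out := make([]string, 0, len(c.nodes))
	for n := range c.nodes {
		out = append(out, n)
	}
	return out
}
```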
@@ -62,6 +63,7 @@ type machineController struct {
 	machineSetResource *schema.GroupVersionResource
 	machineResource *schema.GroupVersionResource
 	machineDeploymentResource *schema.GroupVersionResource
+	accessLock *sync.Mutex
Could make this a non-pointer, has the same effect, plus the zero value works as well so you don't need the initialisation either. I don't mind either way since both work
i'll change it in favor of a non-pointer reference
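For reference, the zero-value point is just this (a minimal sketch; only the accessLock field name comes from the diff above, the method is hypothetical):

```go
package example

import "sync"

type machineController struct {
	// As a value field, the zero value of sync.Mutex is an unlocked, usable
	// mutex, so the constructor needs no explicit initialisation.
	accessLock sync.Mutex
}

// With a pointer field (accessLock *sync.Mutex) the constructor would have to
// assign &sync.Mutex{}; calling Lock on a nil *sync.Mutex panics at runtime.
func (c *machineController) deleteNodesLocked() {
	c.accessLock.Lock()
	defer c.accessLock.Unlock()
	// ... work guarded by the lock ...
}
```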
i haven't had conclusive tests yet. i started working on an automated test and didn't try this patch out as deeply. today i should get to beat it up more.
so, my initial thought was to keep this change tightly scoped to what we had talked about. that said, i could see making this change more thorough by putting mutexes on all the methods if that's what we want. i'm open either way.
@JoelSpeed just wanted to confirm that after testing a bunch this morning, this patch in isolation does not fix the taint issue. it looked promising the first test run, but then went steadily downhill.
updated the pr to include @JoelSpeed 's comments. fwiw, i tested out putting an RWMutex in and then trying to isolate the various calls, but this resulted in some deadlock behavior during scaling events. i'm defaulting back to the original position that we should make these changes slowly, or at least focus on the lock behavior in another pull request.
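The deadlock is unsurprising in hindsight: Go's sync.Mutex and sync.RWMutex are not reentrant, so pushing locks down into methods that also call each other blocks a goroutine on a lock it already holds. A hypothetical illustration, not the actual autoscaler code:

```go
package example

import "sync"

type group struct {
	mu sync.RWMutex
}

// Nodes takes the read lock for its own access.
func (g *group) Nodes() []string {
	g.mu.RLock()
	defer g.mu.RUnlock()
	return nil
}

// DeleteNodes takes the write lock and then calls Nodes, which requests the
// read lock on the same RWMutex. RLock blocks while the write lock is held,
// so the goroutine deadlocks waiting on itself.
func (g *group) DeleteNodes() {
	g.mu.Lock()
	defer g.mu.Unlock()
	_ = g.Nodes()
}
```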
/retest
This change adds a mutex to the MachineController structure which is used to gate access to the DeleteNodes function. This is one in a series of PRs to mitigate kubernetes#3104. Once a solution has been reached, this will be contributed upstream.
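In outline, the gating described above amounts to the following; the field layout and signature are simplified assumptions rather than the exact provider code:

```go
package example

import "sync"

type machineController struct {
	accessLock sync.Mutex
	// other controller fields elided
}

type nodegroup struct {
	machineController *machineController
}

// DeleteNodes is gated by the controller-level mutex so that concurrent
// scale-down passes cannot interleave their deletions.
func (ng *nodegroup) DeleteNodes(nodeNames []string) error {
	ng.machineController.accessLock.Lock()
	defer ng.machineController.accessLock.Unlock()

	// ... existing deletion logic runs while the lock is held ...
	return nil
}
```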
/retest
i'm not sure why yet, but this test is now failing the double delete unit test. i am not seeing this failure locally though. i am also seeing some failures locally with the e2e tests that are not happening in CI. on the upside, in my "real world" testing this does appear to fix the issue of taints being left on nodes.
just wanted to confirm that with this change in place i am now able to pass the proposed test for this[0]
/retest |
/approve
Though I would like to see a follow-up eventually so that all access to the NodeGroups comes under lock to prevent clashes.
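One possible shape for that follow-up (purely hypothetical, not an agreed design) would be a small helper so every NodeGroup accessor takes the lock the same way:

```go
package example

import "sync"

type machineController struct {
	accessLock sync.Mutex
}

// withLock is a hypothetical helper a follow-up could use so that all
// NodeGroup access, not just DeleteNodes, runs under the controller lock.
func (c *machineController) withLock(fn func() error) error {
	c.accessLock.Lock()
	defer c.accessLock.Unlock()
	return fn()
}
```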
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: JoelSpeed. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
/lgtm
/retest
Please review the full test history for this PR and help us cut down flakes.
@elmiko: Some pull requests linked via external trackers have merged: openshift/kubernetes-autoscaler#147. The following pull requests linked via external trackers have not merged:
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.