-
Notifications
You must be signed in to change notification settings - Fork 47
Only Mark Node Schedulable If Agent Marked Unschedulable #176
Conversation
Can one of the admins verify this patch? |
11 similar comments
Can one of the admins verify this patch? |
Can one of the admins verify this patch? |
Can one of the admins verify this patch? |
Can one of the admins verify this patch? |
Can one of the admins verify this patch? |
Can one of the admins verify this patch? |
Can one of the admins verify this patch? |
Can one of the admins verify this patch? |
Can one of the admins verify this patch? |
Can one of the admins verify this patch? |
Can one of the admins verify this patch? |
Instance of jenkinsci/ghprb-plugin#292. Honestly, we should probably move CoreOS projects related to k8s over to prow now that they've finished moving upstream off jenkins. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think other than the ordering of some calls this looks good. Thank you for the contribution! Apologies that it took us so long to get to it
return err | ||
if !alreadyUnschedulable { | ||
glog.Info("Marking node as unschedulable") | ||
if err := k8sutil.Unschedulable(k.nc, k.node, true); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe there's a slight race here.
If the node is marked unscheduleable, but the annotation call fails (e.g. due to apiserver issues or this code failing), the failure condition is that the node will be unschedulable and never be made schedulable again.
I think that the k8sutil.SetNodeAnnotations
call several lines above should include this annotation if necessary.
I think that setting the annotation before marking it unschedulable leaves it in a state where it will ultimately do the right thing once it succeeds.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, right. Good catch. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like this was done at line 165.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this got updated after this comment and github just didn't mark it as resolved
glog.Info("Marking node as schedulable") | ||
if err := k8sutil.Unschedulable(k.nc, k.node, false); err != nil { | ||
return err | ||
if makeSchedulable { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should be moved to before the SetNodeAnnotationLabels
for MadeUnschedulable: False
. As is, I believe if it crashes or has issues with this call, it will not restore state correctly afterwards.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see your point but I'm concerned with moving it above if err := k.waitForNotOkToReboot(); err != nil {
. From the comment above that line, it seems like there might be trouble if we preemptively mark the node as schedulable before waiting for ok-to-reboot to clear. How about if I set the MadeUnschedulable
annotation to false after the if makeSchedulable
block (around line 129)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, inded, you're right. Moving the annotation down seems like it should fix the concern I have correctly.
ok to test |
In addition to @euank's comments, if you don't mind squashing your cleanup and merge commits together, that would be awesome. Probably just the first three commits you have should cover it pretty well. Just ping us whenever that is done and we will give it a final review. |
Can one of the admins verify this patch? |
Will do! A bit swamped but have this on my to-do list. |
Can one of the admins verify this patch? |
@hankjacobs any updates on this? just a friendly ping, no rush. |
still on my to-do list. sorry about that! I swear I'll get around to it in the next two weeks. |
1814fe4
to
1b765c7
Compare
Thanks for pushing the fix! I'll take a look at this again sometime soon. My mistake with the issue reference, I had the link in my copy buffer and accidentally put it in a comment on an unrelated issue. |
return fmt.Errorf("failed to get node %q: %v", node, getErr) | ||
} | ||
|
||
apiNode = n |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Seems like apiNode
could be set where n
is set. Not worth blocking over though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, OTOH doing it this way makes it easy to verify that apiNode
can only ever be set if we're done retrying.
return err | ||
if !alreadyUnschedulable { | ||
glog.Info("Marking node as unschedulable") | ||
if err := k8sutil.Unschedulable(k.nc, k.node, true); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like this was done at line 165.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks sane to me overall! Just some nits below.
Also, might be worth rebasing this PR to avoid the merge commit in this series.
return fmt.Errorf("failed to get node %q: %v", node, getErr) | ||
} | ||
|
||
apiNode = n |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, OTOH doing it this way makes it easy to verify that apiNode
can only ever be set if we're done retrying.
pkg/agent/agent.go
Outdated
} | ||
|
||
// Only make a node schedulable if a reboot was in progress. This prevents a node from being made schedulable | ||
// if it was made unschedulable by something other thin the agent |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor/optional: s/thin/than/
pkg/agent/agent.go
Outdated
} | ||
|
||
glog.Infof("Setting annotations %#v", anno) | ||
if err := k8sutil.SetNodeAnnotations(k.nc, k.node, anno); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We only need to do this if we're the ones who marked the node unschedulable, right? I.e. we can move this hunk to under the if makeSchedulable {
above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like it can be moved. I agree with this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hankjacobs sorry that it took so long for us to come back to this. It seems like with just a couple of minor changes it will be good to merge in. If you could do the rebasing that @jlebon mentioned above too, that would be awesome.
I would like to get this in before I release a new version of CLUO. However, if you aren't going to be able to get to it soon, that's totally fine, just let me know and I'll release the new version without this change, and then release another version later when we do get this change merged. Thanks!
pkg/agent/agent.go
Outdated
|
||
// Only make a node schedulable if a reboot was in progress. This prevents a node from being made schedulable | ||
// if it was made unschedulable by something other thin the agent | ||
makeSchedulable := node.Annotations[constants.AnnotationAgentMadeUnschedulable] == constants.True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When the agent runs for the first time, this annotation doesn't exist, so this value is set as false, and the log message that says "node was marked unschedulable by external source" is printed out, which could be confusing. We should probably use the map multi-return value to check for the key existence and use that in the if statement below.
pkg/agent/agent.go
Outdated
} | ||
|
||
glog.Infof("Setting annotations %#v", anno) | ||
if err := k8sutil.SetNodeAnnotations(k.nc, k.node, anno); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like it can be moved. I agree with this.
…a generic retry function that retries on any error.
…lable. This is accomplished using a new annotation
…annotation if agent made node schedulable.
@sdemos All set with the requested changes. I haven't had a chance to test these changes since I had torn down my original testing infrastructure a while back. I may have some time to set it up tomorrow but no guarantees. Would you be able to test? If not, no biggie. |
Changes look good to me. I can test it out. Thanks for the quick fixes! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tested out the functionality on a test cluster, everything works as expected. Thanks for your patience!
This PR changes the agent to mark a node schedulable only if the agent was responsible for marking it unschedulable.
Prior to this change, the agent would always mark a node as schedulable when starting up. This could result in a node that had been marked unschedulable by an external process (eg. by an admin using kubectl) being marked schedulable by it's update agent if the agent restarted for some reason.