Skip to content
This repository has been archived by the owner on Sep 18, 2020. It is now read-only.

Only Mark Node Schedulable If Agent Marked Unschedulable #176

Merged
merged 5 commits into from
Jun 13, 2018

Conversation

hankjacobs
Copy link
Contributor

This PR changes the agent to mark a node schedulable only if the agent was responsible for marking it unschedulable.

Prior to this change, the agent would always mark a node as schedulable when starting up. This could result in a node that had been marked unschedulable by an external process (eg. by an admin using kubectl) being marked schedulable by it's update agent if the agent restarted for some reason.

@coreosbot
Copy link

Can one of the admins verify this patch?

11 similar comments
@coreosbot
Copy link

Can one of the admins verify this patch?

@gojihotsauce
Copy link

Can one of the admins verify this patch?

@gojihotsauce
Copy link

Can one of the admins verify this patch?

@gojihotsauce
Copy link

Can one of the admins verify this patch?

@coreosbot
Copy link

Can one of the admins verify this patch?

@coreosbot
Copy link

Can one of the admins verify this patch?

@coreosbot
Copy link

Can one of the admins verify this patch?

@coreosbot
Copy link

Can one of the admins verify this patch?

@coreosbot
Copy link

Can one of the admins verify this patch?

@coreosbot
Copy link

Can one of the admins verify this patch?

@coreosbot
Copy link

Can one of the admins verify this patch?

@euank
Copy link
Contributor

euank commented Mar 19, 2018

Instance of jenkinsci/ghprb-plugin#292.

Honestly, we should probably move CoreOS projects related to k8s over to prow now that they've finished moving upstream off jenkins.

Copy link
Contributor

@euank euank left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think other than the ordering of some calls this looks good. Thank you for the contribution! Apologies that it took us so long to get to it

return err
if !alreadyUnschedulable {
glog.Info("Marking node as unschedulable")
if err := k8sutil.Unschedulable(k.nc, k.node, true); err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe there's a slight race here.

If the node is marked unscheduleable, but the annotation call fails (e.g. due to apiserver issues or this code failing), the failure condition is that the node will be unschedulable and never be made schedulable again.

I think that the k8sutil.SetNodeAnnotations call several lines above should include this annotation if necessary.

I think that setting the annotation before marking it unschedulable leaves it in a state where it will ultimately do the right thing once it succeeds.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, right. Good catch. Thanks!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this was done at line 165.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this got updated after this comment and github just didn't mark it as resolved

glog.Info("Marking node as schedulable")
if err := k8sutil.Unschedulable(k.nc, k.node, false); err != nil {
return err
if makeSchedulable {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be moved to before the SetNodeAnnotationLabels for MadeUnschedulable: False. As is, I believe if it crashes or has issues with this call, it will not restore state correctly afterwards.

Copy link
Contributor Author

@hankjacobs hankjacobs Mar 19, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see your point but I'm concerned with moving it above if err := k.waitForNotOkToReboot(); err != nil {. From the comment above that line, it seems like there might be trouble if we preemptively mark the node as schedulable before waiting for ok-to-reboot to clear. How about if I set the MadeUnschedulable annotation to false after the if makeSchedulable block (around line 129)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, inded, you're right. Moving the annotation down seems like it should fix the concern I have correctly.

@euank
Copy link
Contributor

euank commented Mar 19, 2018

ok to test

@sdemos
Copy link
Contributor

sdemos commented Mar 27, 2018

In addition to @euank's comments, if you don't mind squashing your cleanup and merge commits together, that would be awesome. Probably just the first three commits you have should cover it pretty well.

Just ping us whenever that is done and we will give it a final review.

@coreosbot
Copy link

Can one of the admins verify this patch?

@hankjacobs
Copy link
Contributor Author

Will do! A bit swamped but have this on my to-do list.

@coreosbot
Copy link

Can one of the admins verify this patch?

@sdemos
Copy link
Contributor

sdemos commented Apr 9, 2018

@hankjacobs any updates on this? just a friendly ping, no rush.

@hankjacobs
Copy link
Contributor Author

still on my to-do list. sorry about that! I swear I'll get around to it in the next two weeks.

@hankjacobs hankjacobs force-pushed the master branch 2 times, most recently from 1814fe4 to 1b765c7 Compare April 17, 2018 16:37
@hankjacobs
Copy link
Contributor Author

@euank @sdemos Good to go. Thanks for your patience!

@sdemos
Copy link
Contributor

sdemos commented May 7, 2018

Thanks for pushing the fix! I'll take a look at this again sometime soon.

My mistake with the issue reference, I had the link in my copy buffer and accidentally put it in a comment on an unrelated issue.

return fmt.Errorf("failed to get node %q: %v", node, getErr)
}

apiNode = n
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Seems like apiNode could be set where n is set. Not worth blocking over though.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, OTOH doing it this way makes it easy to verify that apiNode can only ever be set if we're done retrying.

return err
if !alreadyUnschedulable {
glog.Info("Marking node as unschedulable")
if err := k8sutil.Unschedulable(k.nc, k.node, true); err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this was done at line 165.

Copy link
Member

@jlebon jlebon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks sane to me overall! Just some nits below.

Also, might be worth rebasing this PR to avoid the merge commit in this series.

return fmt.Errorf("failed to get node %q: %v", node, getErr)
}

apiNode = n
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, OTOH doing it this way makes it easy to verify that apiNode can only ever be set if we're done retrying.

}

// Only make a node schedulable if a reboot was in progress. This prevents a node from being made schedulable
// if it was made unschedulable by something other thin the agent
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor/optional: s/thin/than/

}

glog.Infof("Setting annotations %#v", anno)
if err := k8sutil.SetNodeAnnotations(k.nc, k.node, anno); err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We only need to do this if we're the ones who marked the node unschedulable, right? I.e. we can move this hunk to under the if makeSchedulable { above.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like it can be moved. I agree with this.

Copy link
Contributor

@sdemos sdemos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hankjacobs sorry that it took so long for us to come back to this. It seems like with just a couple of minor changes it will be good to merge in. If you could do the rebasing that @jlebon mentioned above too, that would be awesome.

I would like to get this in before I release a new version of CLUO. However, if you aren't going to be able to get to it soon, that's totally fine, just let me know and I'll release the new version without this change, and then release another version later when we do get this change merged. Thanks!


// Only make a node schedulable if a reboot was in progress. This prevents a node from being made schedulable
// if it was made unschedulable by something other thin the agent
makeSchedulable := node.Annotations[constants.AnnotationAgentMadeUnschedulable] == constants.True
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When the agent runs for the first time, this annotation doesn't exist, so this value is set as false, and the log message that says "node was marked unschedulable by external source" is printed out, which could be confusing. We should probably use the map multi-return value to check for the key existence and use that in the if statement below.

}

glog.Infof("Setting annotations %#v", anno)
if err := k8sutil.SetNodeAnnotations(k.nc, k.node, anno); err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like it can be moved. I agree with this.

…a generic retry function that retries on any error.
@hankjacobs
Copy link
Contributor Author

hankjacobs commented Jun 13, 2018

@sdemos All set with the requested changes. I haven't had a chance to test these changes since I had torn down my original testing infrastructure a while back. I may have some time to set it up tomorrow but no guarantees. Would you be able to test? If not, no biggie.

@sdemos
Copy link
Contributor

sdemos commented Jun 13, 2018

Changes look good to me. I can test it out. Thanks for the quick fixes!

Copy link
Contributor

@sdemos sdemos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested out the functionality on a test cluster, everything works as expected. Thanks for your patience!

@sdemos sdemos merged commit 7996c46 into coreos:master Jun 13, 2018
@sdemos sdemos mentioned this pull request Jun 13, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants