Continue node drain after reboot #232

Merged: 2 commits, Jan 27, 2022

Changes from all commits
58 changes: 39 additions & 19 deletions pkg/daemon/daemon.go
@@ -498,20 +498,23 @@ func (dn *Daemon) nodeStateSyncHandler(generation int64) error {
}

if reqDrain {
if !dn.disableDrain {
ctx, cancel := context.WithCancel(context.TODO())
defer cancel()

glog.Infof("nodeStateSyncHandler(): get drain lock for sriov daemon")
done := make(chan bool)
go dn.getDrainLock(ctx, done)
<-done
}
if !dn.isNodeDraining() {
if !dn.disableDrain {
ctx, cancel := context.WithCancel(context.TODO())
defer cancel()

glog.Infof("nodeStateSyncHandler(): get drain lock for sriov daemon")
done := make(chan bool)
go dn.getDrainLock(ctx, done)
<-done

if utils.ClusterType == utils.ClusterTypeOpenshift {
glog.Infof("nodeStateSyncHandler(): pause MCP")
if err := dn.pauseMCP(); err != nil {
return err
}

if utils.ClusterType == utils.ClusterTypeOpenshift {
glog.Infof("nodeStateSyncHandler(): pause MCP")
if err := dn.pauseMCP(); err != nil {
return err
}
}
}

@@ -545,15 +548,17 @@ func (dn *Daemon) nodeStateSyncHandler(generation int64) error {
glog.Errorf("nodeStateSyncHandler(): fail to restart device plugin pod: %v", err)
return err
}
if anno, ok := dn.node.Annotations[annoKey]; ok && (anno == annoDraining || anno == annoMcpPaused) {
if dn.isNodeDraining() {
if err := dn.completeDrain(); err != nil {
glog.Errorf("nodeStateSyncHandler(): failed to complete draining: %v", err)
return err
}
} else if !ok {
if err := dn.annotateNode(dn.name, annoIdle); err != nil {
glog.Errorf("nodeStateSyncHandler(): failed to annotate node: %v", err)
return err
} else {
Collaborator:

We are now calling dn.annotateNode() in the else clause even if the annotation is already set; this will invoke an extra GET to the k8s API.

IMO this should be called, as before, only if the annotation is not set. WDYT?

Collaborator (Author):

There are only two places where we add the 'Idle' annotation: the line above and the 'completeDrain' function, so IMO it doesn't make sense to check for it.

We probably need to refactor the annotation-handling part of the code to make it clearer.

Collaborator (@adrianchiris, Jan 24, 2022):

It's just to save a call to the k8s API in annotateNode().

Think of a case where nodeStateSyncHandler runs but (for some reason) does not require a drain, and the sriov-node-state annotation is already set to idle; in this case we don't really need to call annotateNode.

Not sure how often we might hit this.

Collaborator (Author):

Got it, will fix it in the 'annotateNode' function.
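
A minimal sketch of what folding that check into the annotate call could look like. The helper name, signature, and merge-patch format below are assumptions for illustration only, not the operator's actual annotateNode implementation:

package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// annotateNodeIfChanged is a hypothetical helper: it patches the node only
// when the annotation value actually changes, saving the extra API round trip.
func annotateNodeIfChanged(ctx context.Context, c kubernetes.Interface, node *corev1.Node, key, value string) error {
	if cur, ok := node.Annotations[key]; ok && cur == value {
		// Desired annotation already present; nothing to do.
		return nil
	}
	patch := []byte(fmt.Sprintf(`{"metadata":{"annotations":{%q:%q}}}`, key, value))
	_, err := c.CoreV1().Nodes().Patch(ctx, node.Name, types.MergePatchType, patch, metav1.PatchOptions{})
	return err
}

(As the rest of the diff shows, the merged change ultimately keeps the check in the caller via nodeHasAnnotation() instead.)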

Collaborator:

Will there be a corner case where dn.node is not the latest if it is not fetched via the annotateNode call?

Collaborator (@adrianchiris, Jan 25, 2022):

> Will there be a corner case where dn.node is not the latest if it is not fetched via the annotateNode call?

dn.node will get updated via the informer; it will eventually be consistent (that's how the config daemon is designed).
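
A minimal sketch of that pattern, assuming a standard client-go shared informer; the watchNode helper below is illustrative, not the config daemon's actual code:

package main

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
)

// watchNode refreshes a locally cached node copy (e.g. dn.node) whenever the
// informer delivers an update event; reads made between the API-server write
// and this callback may still see the old annotations.
func watchNode(clientset kubernetes.Interface, nodeName string, onUpdate func(*corev1.Node)) cache.SharedIndexInformer {
	factory := informers.NewSharedInformerFactory(clientset, 0)
	informer := factory.Core().V1().Nodes().Informer()
	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(_, newObj interface{}) {
			node, ok := newObj.(*corev1.Node)
			if !ok || node.Name != nodeName {
				return
			}
			onUpdate(node.DeepCopy())
		},
	})
	return informer
}

The window between the API-server write and that UpdateFunc callback is exactly the delay discussed in the following comments.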

Collaborator (@adrianchiris, Jan 25, 2022):

> Got it, will fix it in the 'annotateNode' function.

I think this else clause aims to add the idle annotation when it does not exist on dn.node. I'd prefer to keep that logic (by adding a check such as hasSriovNodeStateAnnot(), or similar).

Collaborator:

> Will there be a corner case where dn.node is not the latest if it is not fetched via the annotateNode call?

> dn.node will get updated via the informer.

I was thinking there could be a delay between the informer receiving the node update event and dn.node being updated, or between the node being updated in the API server and the informer receiving the update event.

In that case dn.node is not the latest, which may result in an inaccurate return from nodeHasAnnotation? (Just thinking through the possibility; I didn't test it myself, though it may be easy to test.)

> It will eventually be consistent (that's how the config daemon is designed).

If nodeHasAnnotation returns incorrectly, I don't see how it can resolve by itself; I guess the node will keep its old annotation forever unless a new policy or state event arrives?

Collaborator:

> If nodeHasAnnotation returns incorrectly, I don't see how it can resolve by itself; I guess the node will keep its old annotation forever unless a new policy or state event arrives?

I don't see how this is different from what we had before; the change just moved the logic to a separate function.

Your concern is theoretically valid, but we have not encountered this, and it is general to the config daemon design.

if !dn.nodeHasAnnotation(annoKey, annoIdle) {
if err := dn.annotateNode(dn.name, annoIdle); err != nil {
glog.Errorf("nodeStateSyncHandler(): failed to annotate node: %v", err)
return err
}
}
}
glog.Info("nodeStateSyncHandler(): sync succeeded")
@@ -567,6 +572,21 @@ func (dn *Daemon) nodeStateSyncHandler(generation int64) error {
return nil
}

func (dn *Daemon) nodeHasAnnotation(annoKey string, value string) bool {
// Check if node already contains annotation
if anno, ok := dn.node.Annotations[annoKey]; ok && (anno == value) {
return true
}
return false
}

func (dn *Daemon) isNodeDraining() bool {
if anno, ok := dn.node.Annotations[annoKey]; ok && (anno == annoDraining || anno == annoMcpPaused) {
return true
}
return false
}

func (dn *Daemon) completeDrain() error {
if !dn.disableDrain {
if err := drain.RunCordonOrUncordon(dn.drainer, dn.node, false); err != nil {
@@ -810,7 +830,7 @@ func (dn *Daemon) getDrainLock(ctx context.Context, done chan bool) {
done <- true
return
}
glog.V(3).Info("getDrainLock(): other node is draining, wait...")
glog.V(2).Info("getDrainLock(): other node is draining, wait...")
}
},
OnStoppedLeading: func() {
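
For reference, the drain lock taken in getDrainLock above appears to be based on leader election (note the OnStoppedLeading callback). Below is a minimal, hypothetical sketch using client-go's leaderelection package; the lease name, timings, and helper signature are assumptions, not the operator's actual implementation:

package main

import (
	"context"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

// runDrainLock sketches a leader-election based drain lock: only the node
// currently holding the lease is allowed to proceed with its drain.
func runDrainLock(ctx context.Context, client kubernetes.Interface, namespace, nodeName string, done chan bool) {
	lock := &resourcelock.LeaseLock{
		LeaseMeta:  metav1.ObjectMeta{Name: "config-daemon-draining-lock", Namespace: namespace},
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: nodeName},
	}
	leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
		Lock:            lock,
		ReleaseOnCancel: true,
		LeaseDuration:   15 * time.Second,
		RenewDeadline:   10 * time.Second,
		RetryPeriod:     2 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				// We hold the lock; signal the caller that it may drain.
				done <- true
			},
			OnStoppedLeading: func() {
				// Lost the lease unexpectedly; bail out.
				os.Exit(1)
			},
		},
	})
}

In the actual daemon the callbacks do more work (presumably checking whether another node is already draining, per the "other node is draining, wait..." log), but the overall leader-election skeleton is similar.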