Better support for openshift single node #213
Conversation
This commit allows us to still pause the MCP when running on OCP. This is needed when there is only 1 node in the cluster and drainSkip is true. Signed-off-by: Sebastian Sch <[email protected]>
return reconcile.Result{}, err
}

disableDrain := len(nodeList.Items) == 1
We should check that the controlPlaneTopology = SingleReplica
The machineconfiguration.openshift.io/controlPlaneTopology: SingleReplica
is an MCO label. I want this feature to also work on plain k8s.
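For context, a minimal sketch of the OCP-specific check being referred to, assuming openshift/api's configv1 package exposes Infrastructure.Status.ControlPlaneTopology and the SingleReplicaTopologyMode constant as in recent OCP releases (this is not part of the PR, which deliberately counts nodes instead so plain k8s also works):

```go
// Sketch only; configv1 must be registered in the client's scheme.
package utils

import (
	"context"

	configv1 "github.com/openshift/api/config/v1"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// IsSingleReplicaControlPlane reads the cluster-scoped Infrastructure object
// named "cluster" and reports whether the control plane topology is SingleReplica.
// This only works on OpenShift; plain k8s has no such resource, which is why
// the PR counts nodes instead.
func IsSingleReplicaControlPlane(ctx context.Context, c client.Client) (bool, error) {
	infra := &configv1.Infrastructure{}
	if err := c.Get(ctx, types.NamespacedName{Name: "cluster"}, infra); err != nil {
		return false, err
	}
	return infra.Status.ControlPlaneTopology == configv1.SingleReplicaTopologyMode, nil
}
```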
@@ -78,6 +78,14 @@ func (r *SriovOperatorConfigReconciler) Reconcile(ctx context.Context, req ctrl.
	Name: constants.DEFAULT_CONFIG_NAME, Namespace: namespace}, defaultConfig)
if err != nil {
	if errors.IsNotFound(err) {
		// Check if we only have one node
This error won't be triggered unless the user deletes the default SriovOperatorConfig explicitly. Is this what we want? The default SriovOperatorConfig is created by the OperatorConfig controller: https://github.com/k8snetworkplumbingwg/sriov-network-operator/blob/master/main.go#L154
Do we need to consider the upgrade scenario, e.g. from a lower version to 4.10, which contains this fix? If not, I think adding the disableDrain setting at line 154 above should be sufficient.
@zshi-redhat thanks! I missed that file!
I would also like to handle updates (the upgrade scenario), so I put the change in both places.
The reason I put it in both places is that I was able to hit a race between installing the operator + a policy and the reconcile that adds the skip.
(force-pushed fb7f39c to cd10a0a)
// Update the disableDrain field if needed
if len(nodeList.Items) == 1 && !defaultConfig.Spec.DisableDrain {
	defaultConfig.Spec.DisableDrain = true
Please add logging here. We need to let users know that the configuration has changed.
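For illustration, one way to address this request (an illustrative fragment of the block shown above, not the merged change), reusing the glog logger and constants package that already appear in the diff:

```go
// Log before mutating the user-visible default configuration so cluster admins
// can see why disableDrain flipped.
if len(nodeList.Items) == 1 && !defaultConfig.Spec.DisableDrain {
	glog.Infof("Reconcile(): only one node in the cluster, setting disableDrain to true on %s", constants.DEFAULT_CONFIG_NAME)
	defaultConfig.Spec.DisableDrain = true
}
```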
this is a one-way switch, what if the cluster scales down to one node and then scales up to more than 1 node?
@adrianchiris do you think we should support this case?
This function only runs if the config doesn't exist, which means at install time or if the user removes the object for some reason.
If the user changes the cluster topology he can:
- update the SriovOperatorConfig
- remove the config and let the operator figure out the cluster topology
I would not like the SriovOperatorConfig reconcile to check the cluster status.
I would also not like to add a trigger for this object on node object changes; it would just spam the operator on a large cluster.
> This function only runs if the config doesn't exist, which means at install time or if the user removes the object for some reason.

This is not under the if block at L80, so it will be executed every time Reconcile is called.
Maybe the user sets some configuration, then the cluster scales down (to 1) and the operator restarts; this logic would then change the user's configuration. Later the cluster scales up and we end up with drain disabled.
So here are my thoughts:
1. we document that in the case of a single-node cluster, the user should explicitly set DisableDrain
2. we apply this logic only when (re)creating the default object (within the if statement block at L80 or in main.go)
I'm fine with either 1 or 2; 2 probably gives a better user experience (a sketch of that approach follows below).
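A minimal sketch of option 2, assuming a pkg/utils helper like IsSingleNodeCluster (discussed later in this review) and that the reconciler embeds a controller-runtime client as r.Client; the type and field names follow the existing sriovnetworkv1 API used in the diff, but this is not the PR's exact code:

```go
// Illustrative fragment of the Reconcile IsNotFound branch: the default is
// decided only once, when the object is (re)created, so an existing user
// configuration is never overwritten on later reconciles.
if errors.IsNotFound(err) {
	singleNode, err := utils.IsSingleNodeCluster(ctx, r.Client)
	if err != nil {
		return reconcile.Result{}, err
	}

	defaultConfig := &sriovnetworkv1.SriovOperatorConfig{
		ObjectMeta: metav1.ObjectMeta{
			Name:      constants.DEFAULT_CONFIG_NAME,
			Namespace: namespace,
		},
		Spec: sriovnetworkv1.SriovOperatorConfigSpec{
			// On a single node there is nowhere to reschedule workloads, so skip draining.
			DisableDrain: singleNode,
		},
	}
	if err := r.Create(ctx, defaultConfig); err != nil {
		return reconcile.Result{}, err
	}
	return reconcile.Result{}, nil
}
```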
I also prefer 2, changing the code as requested.
main.go
Outdated
@@ -232,13 +233,23 @@ func createDefaultOperatorConfig(cfg *rest.Config) error {
if err != nil {
	return fmt.Errorf("Couldn't create client: %v", err)
}

// Check if we only have one node
Actually, this comment doesn't align with the code below.
nit: I think reading L237 is pretty clear, you can drop this comment if you like.
done
done
done and not removed? :)
sorry about that
@@ -99,6 +100,25 @@ func (r *SriovOperatorConfigReconciler) Reconcile(ctx context.Context, req ctrl.
return reconcile.Result{}, err
}

// Check if we only have one node
The same comment as for the main.go change. Do we want to move this check to the pkg/utils module as a 'GetNodesCount' function?
done, can you please have another look?
nit: I think reading L104 is pretty clear, you can drop this comment if you like.
done
(force-pushed cd10a0a to 9d02c76)
Hi @e0ne @pliurh @adrianchiris, can you please give this another look?
pkg/daemon/daemon.go
Outdated
done := make(chan bool)
go dn.getDrainLock(ctx, done)
<-done
if utils.ClusterType != utils.ClusterTypeOpenshift {
should we move (and invert) this check to daemon.go L510, and only call pauseMCP on an OpenShift cluster?
sure, no problem
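For reference, a sketch of the inversion being agreed on here; the pauseMCP and logging calls are assumed from the surrounding diff and thread, not verified against the final code:

```go
// Illustrative fragment: pause the MachineConfigPool only on OpenShift;
// plain Kubernetes has no MCP, so there is nothing to pause there.
if utils.ClusterType == utils.ClusterTypeOpenshift {
	glog.Infof("nodeStateSyncHandler(): pause MCP before reconfiguring the node")
	if err := dn.pauseMCP(); err != nil {
		return err
	}
}
```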
pkg/utils/cluster.go
Outdated
}

if len(nodeList.Items) == 1 {
	glog.Infof("only one node found in the cluster marking disableDrain as true")
nit: I don't think this log is needed.
If you want to keep it, please update the message, as disableDrain is now out of context.
(IsSingleNodeCluster doesn't really care what operations may be done as a result of its invocation.)
I would like to leave the log, it makes debugging easier :)
I just changed the message.
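For completeness, a sketch of what such a pkg/utils helper could look like; the name, the controller-runtime client, and the kept log line are assumptions based on this thread, not the merged file:

```go
package utils

import (
	"context"

	"github.com/golang/glog"
	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// IsSingleNodeCluster returns true when exactly one node is registered in the
// cluster. The log line is kept on purpose to make debugging easier, phrased
// without referring to disableDrain since this helper may have other callers.
func IsSingleNodeCluster(ctx context.Context, c client.Client) (bool, error) {
	nodeList := &corev1.NodeList{}
	if err := c.List(ctx, nodeList); err != nil {
		return false, err
	}
	if len(nodeList.Items) == 1 {
		glog.Infof("IsSingleNodeCluster(): only one node found in the cluster")
		return true, nil
	}
	return false, nil
}
```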
@SchSeba might be a dumb question, but why do we want to disable drain on a single-node cluster? :)
Hi @adrianchiris, there are cases where the user has a 1-node cluster and deploys some applications using ...
(force-pushed 9d02c76 to 4d40435)
(force-pushed 4d40435 to 06d2fbe)
/lgtm
pkg/daemon/daemon.go
Outdated
glog.Infof("nodeStateSyncHandler(): get drain lock for sriov daemon")
done := make(chan bool)
go dn.getDrainLock(ctx, done)
For the Kubernetes cluster type we will now always try to get the drain lock, even if disableDrain is enabled; previously it skipped getting the lock.
Not sure it's a major issue, as I think disableDrain is used mainly for testing.
@zshi-redhat @pliurh WDYT?
I think it is a minor issue. If the cluster is a single-node cluster, there is no other node that needs to wait for the lock. For multi-node clusters, this flag is mainly for testing. The drawback is that, as the LeaseDuration is 5s, we will have to waste 5s on each node when it requires draining.
ack, thanks @pliurh! I'd prefer to condition on (disableDrain or OpenShift) to avoid the API calls for the lease and the 5s delay per node.
@adrianchiris done, please have a look. I also tested it internally.
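A sketch of the agreed conditioning, assuming the daemon exposes the operator's DisableDrain setting as dn.disableDrain; this is illustrative, not the merged diff:

```go
// Only take the drain lease on plain Kubernetes when a drain will actually
// happen; on OpenShift the MCP pause coordinates nodes, and with disableDrain
// there is nothing to serialize, so both cases skip the lease API calls and
// the ~5s LeaseDuration delay per node.
if utils.ClusterType != utils.ClusterTypeOpenshift && !dn.disableDrain {
	glog.Infof("nodeStateSyncHandler(): get drain lock for sriov daemon")
	done := make(chan bool)
	go dn.getDrainLock(ctx, done)
	<-done
}
```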
(force-pushed 06d2fbe to 58827f1)
Signed-off-by: Sebastian Sch <[email protected]>
(force-pushed 58827f1 to 0e08eb7)
/test-all
Failure related to #218
/test-all
Fix proposed in #219
/test-all
Hi @adrianchiris, is there anything we are missing here, or can we merge this PR?
Thanks for the reminder @SchSeba ! LGTM
This PR allows the operator to still pause the MCP when running on OCP, even on a single node with drain disabled.
Another change marks DisableDrain in the SriovOperatorConfig when there is only one node in the cluster.