
daemon: skip or perform node reboot based on rebootAction #2254

Closed
wants to merge 1 commit

Conversation

sinnykumari
Contributor

Today the MCO defaults to rebooting the node in order to successfully apply
changes to it. This is the safest way to apply a change, as we don't have to
worry about things like which changes are safe to apply without a reboot.

In certain environments, rebooting is expensive, and due to complex
hardware setups it can sometimes cause problems where the node won't boot up.

This will help avoid rebooting nodes in certain cases where the
MCO thinks it is safe to skip the reboot.

See - openshift/enhancements#159

Note: in the current state, this defaults to rebooting the node. Once the MCO learns which rebootAction to take based on the diff between the old and new rendered MachineConfigs, we can add that logic.
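
For illustration, a minimal sketch of the shape described above. rebootAction and rebootActionReloadCrio appear in the diff in this PR; calcRebootAction and the mcfgv1 import path are assumptions for the sketch, and the diff-based decision is stubbed out so that it always returns the reboot default.

package daemon

import (
	mcfgv1 "github.com/openshift/machine-config-operator/pkg/apis/machineconfiguration.openshift.io/v1"
)

// rebootAction enumerates how the daemon applies a config change.
// rebootActionReboot and rebootActionReloadCrio match identifiers used in the
// diff; everything else in this sketch is illustrative.
type rebootAction int

const (
	rebootActionReboot rebootAction = iota
	rebootActionReloadCrio
)

// calcRebootAction is a hypothetical helper: it would diff the old and new
// rendered MachineConfigs and pick the cheapest safe action. Until that diff
// logic exists, it always falls back to a full reboot, matching the default
// described above.
func calcRebootAction(oldConfig, newConfig *mcfgv1.MachineConfig) rebootAction {
	// TODO: inspect the diff (e.g. only a registries change -> reload crio).
	return rebootActionReboot
}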

@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sinnykumari

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot added the "approved" label (Indicates a PR has been approved by an approver from all required OWNERS files) on Nov 24, 2020.
@openshift-merge-robot
Contributor

@sinnykumari: The following tests failed, say /retest to rerun all failed tests:

Test name                      Commit   Rerun command
ci/prow/e2e-aws                84362bf  /test e2e-aws
ci/prow/e2e-gcp-op             84362bf  /test e2e-gcp-op
ci/prow/e2e-aws-serial         84362bf  /test e2e-aws-serial
ci/prow/e2e-aws-workers-rhel7  84362bf  /test e2e-aws-workers-rhel7
ci/prow/okd-e2e-aws            84362bf  /test okd-e2e-aws
ci/prow/e2e-agnostic-upgrade   84362bf  /test e2e-agnostic-upgrade

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

	case rebootActionReloadCrio:
		if err := reloadCrioConfig(); err != nil {
			dn.logSystem("Reloading crio configuration failed, rebooting: %v", err)
			dn.reboot(fmt.Sprintf("Node will reboot into config %s", configName))
Contributor

One thing I'm debating is whether we want to hard-reboot here or block + degrade. It seems to me that if we are explicitly supporting "rebootless updates", we should not reboot unexpectedly because of a failure. We should probably do e.g. what we do when a file write fails. This might also allow us not to drain (although recovery is another thing we need to consider). What do you think @sinnykumari?

Contributor

If you degrade, the customer can't really recover from it and do something to make it work on a retry... so degrading might mean that it can't proceed. Not sure how that cuts, but since it's an optimization it feels like rebooting should be the fallback.

Contributor

(at least to me, rebootless upgrades aren't a promise; they're a best-effort optimization for certain cases where it can be safely done)

Contributor Author

We can, but isn't it better to reboot (crio will restart on reboot) and let the node fix the problem by itself rather than degrade? I think we are not promising rebootless updates; rather, it is best effort, i.e. the MCO tries its best not to reboot for certain config changes, but if an error occurs it reboots. As this is a new feature, we can frame the documentation accordingly.

Contributor Author

We can definitely become smarter in the future.

Contributor

ha jinx! sinny and i both wrote our responses at the same time 😆

}

state.currentConfig = state.pendingConfig
}

if state.bootstrapping {
if err := dn.storeCurrentConfigOnDisk(state.currentConfig); err != nil {
Contributor

is this no longer needed?

Contributor Author

Good catch, seems like it got lost during the function split-up.

// be called as a special case for the "bootstrap pivot".
func (dn *Daemon) finalizeAndReboot(newConfig *mcfgv1.MachineConfig) (retErr error) {
func reloadCrioConfig() error {
_, err := runGetOut("pkill", "-HUP", "crio")
Contributor

It would be nice if we could use systemctl reload crio.service instead. That way, crio can specify how exactly it needs to be reloaded (ExecReload) without the MCD needing to know the details. I don't think this will work today, since crio.service doesn't define an ExecReload (systemd will restart the service in this case).

Contributor

Agreed, but I think for this release we are OK with this operation as it stands.

Contributor Author

As Jerry said, we can, but we are trying to keep the impact as minimal as possible. Since crio supports live configuration reload for the container registry, we are making use of it for our use case.
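
For illustration, a rough sketch of the systemctl-based reload suggested earlier in this thread. reloadCrioService is a hypothetical name, it uses os/exec directly rather than the daemon's runGetOut helper, and it assumes crio.service defines an ExecReload (which it did not at the time, hence the pkill -HUP approach in the diff above).

package daemon

import (
	"fmt"
	"os/exec"
)

// reloadCrioService asks systemd to reload crio, letting crio.service own the
// reload details via ExecReload instead of the MCD sending SIGHUP itself.
func reloadCrioService() error {
	out, err := exec.Command("systemctl", "reload", "crio.service").CombinedOutput()
	if err != nil {
		return fmt.Errorf("reloading crio.service failed: %s: %w", string(out), err)
	}
	return nil
}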

// For non-reboot actions, it applies the configuration and updates the node's config and state.
// At the end, it uncordons the node so that workloads can be scheduled.
// If at any point an error occurs, we reboot the node so that it has the correct configuration.
func (dn *Daemon) performRebootAction(action rebootAction, configName string) error {
Contributor

I would rename this so that it doesn't include "reboot" in the name (since we might not be rebooting). I'm also not sure that an enum is going to be expressive enough (feel free to tell me to stop over-complicating things and to come back in six months). I had envisioned this being expressed as a list of actions that need to be taken (e.g. [] (rebootActionNone), [reboot] (rebootActionReboot), [reload crio.service, reload foobar.service] (rebootActionReloadCrio + reload foobar)), so that we could build up the list as we enumerate the diffed config (e.g. any change to a particular set of paths triggers a reload of crio, any unrecognized change triggers a reboot).

Contributor

I've changed it to "PostConfigChangeAction" in an attempt to be more descriptive but that doesn't quite feel right either. Naming open to suggestions.

As for the enum, I've changed it to a list. I initially created the enum as a skeleton structure since we were handling very limited cases, but I agree that parsing a list will be needed in the future. I've updated that at: #2259

Contributor Author

Agree with better regrouping of actions.
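
To make the action-list idea above concrete, a minimal sketch under assumed names; the real implementation lives in #2259. postConfigChangeAction is borrowed from the comment above, while actionsForDiff and the registries.conf path are hypothetical examples.

package daemon

// postConfigChangeAction describes one step to take after the new config has
// been written to disk: either reload some systemd units or reboot the node.
type postConfigChangeAction struct {
	reboot      bool
	reloadUnits []string // e.g. "crio.service"
}

// actionsForDiff walks the file paths that changed between the old and new
// rendered configs and accumulates the cheapest safe actions. Any change it
// does not recognize collapses the whole list into a single reboot.
func actionsForDiff(changedPaths []string) []postConfigChangeAction {
	var actions []postConfigChangeAction
	for _, path := range changedPaths {
		switch path {
		case "/etc/containers/registries.conf":
			// crio can live-reload its registry configuration.
			actions = append(actions, postConfigChangeAction{reloadUnits: []string{"crio.service"}})
		default:
			// Unrecognized change: fall back to the safe default, a full reboot.
			return []postConfigChangeAction{{reboot: true}}
		}
	}
	return actions
}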

@yuqi-zhang
Contributor

PR that builds on this: #2259

@sinnykumari
Contributor Author

For better coordination, review, and testing, we have moved the changes from this PR into Jerry's PR #2259. Closing this, as further changes will be pushed directly into #2259.

Labels
approved (Indicates a PR has been approved by an approver from all required OWNERS files), team-mco
6 participants