Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Change Management and Maintenance Schedules #1571

Open
wants to merge 18 commits into
base: master
Choose a base branch
from

Conversation

jupierce
Copy link
Contributor

No description provided.

@jupierce jupierce changed the title Add Change Management and Maintenance Schedules WIP: Add Change Management and Maintenance Schedules Feb 23, 2024
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 23, 2024
Copy link
Contributor

@dhellmann dhellmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm concerned about the complexity of the first 2 workflows. I think everything here will work as described, but having 2 separate operators in the system evaluating schedules and rules about what updates are allowed when seems more complicated than having 1 thing toggling the on/off flag for those components. Centralizing the schedule management means we only need 1 thing to track time, and everything else can just watch it's input API for changes, including to that on/off flag, to know what it needs to be doing at any given time. It also gives us the option of delivering that centralized thing separately (in an operator, ACM, OCM, etc.) or the owner of self-managed clusters providing their own implementation applying some rules we wouldn't want to have to support.

When this expectation is violated because of problematic user workloads, the update process is
often called into question. For example, if an update appears stalled after 12 hours, a
user is likely to have a poor perception of the platform and open a support ticket before
successfully diagnosing an underlying undrainable workload.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels like a bug in our existing upgrade status reporting.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're working to simplify the experience for operations teams on several fronts:

  1. oc adm update status which will provide a single place to find detailed status information and SOPs.
  2. Another enhancement I will be writing where node drain timeouts are an easily enabled part of a fully managed update.
  3. This enhancement which:
  4. Creates the foundation for control-plane and worker-node separation which will exploited by other oc commands to support diverse personas.
  5. Makes this type of issue less disruptive. Even if a drain takes 12 hours, you don't have to sit there and watch it. When it finishes, if it is outside your maintenance schedule, your cluster will not be subject to any more disruption.


By acknowledging the need for scheduled maintenance in the platform, we reduce the need for Service
Delivery to develop and maintain custom tooling to manage the platform while
simultaneously reducing simplifying management for all customer facing similar challenges.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If those tools already exist, could we just take over maintenance of those tools instead of moving the implementation of the feature?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not an authority on its implementation, but I believe it has aspects spread across OCM, Hive, and MUO. I think the implementation would also differ for HCP.
Assuming we could take it over, it would not facilitate the separation of control-plane and worker-node updates since it only creates a window for an entire cluster update.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some also in managed-cluster-config (MCC) which is plumbed through hive, and it's not a great experience to manage. We have to pick a set of verisons when a configuration can apply and then react when the cluster version is changed. We're in the process of reworking MCC so this is less painful but ultimately still is not a great experience to manage. Having the ability to just apply a change to a MachineConfig etc would be far better in my opinion.

1. Because not all worker-nodes have been updated, changes are still reported as pending via metrics for NodePool. **TODO: Review with HyperShift. Pausing progress should be possible, but a metric indicating changes still pending may not since they interact only through CAPI.**
1. The HCP cluster runs with worker-nodes at mixed versions throughout the month. The N-1 skew between the old kubelet versions and control-plane is supported.
1. **TODO: Review with Service Delivery. If the user requested another minor bump to their control-plane, how does OCM prevent unsupported version skew today?**
1. On the next first Saturday, the worker-nodes updates are completed.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This detailed workflow narrative is amazing!


### Implement maintenance schedules via an external control system (e.g. ACM)
We do not have an offering in this space. ACM is presently devoted to cluster monitoring and does
not participate in cluster lifecycle.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could add it there, though, right? Or as an OLM-installed operator?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it could. The conclusion presented here was with guidance from Derek.

cluster roll out a bad configuration change without knowing exactly how to stop the damage
without causing further disruption. This is not a moment when you want to be figuring out how to format
a date string, calculating timezones, or copying around cluster configuration so that you can restore
it after you stop the bleeding.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is driving the request for this feature, managed services or users with self-managed clusters?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It unfolded as follows (see motivation section for more depth on each point):

  1. There is a negative perception of upgrades due to worker nodes being entangled with the end to end process. It also forces unnecessary maintenance on nodes and risks for customers.
  2. We can eliminate those problems and empower users by separating the control-plane and worker-node updates.
  3. @derekwaynecarr raised the concern that doing so would double operational burden. Also guidance that whatever improvements we make in standalone should be in HCP.
  4. Maintenance windows were identified as a tool which could prevent new operational burden for customers who may not benefit from the separation. Maintenance windows are also a common configuration in other managed kubernetes solutions.
  5. SRE, as part of all these conversations, indicates product level support for maintenance windows would help simplify their existing implementation.
  6. Exploration of all other benefits of change management & scheduling which simplify operations.

So, item (1) was the genesis (particularly in standalone since SRE manages to insulate customers from gaps today), but we found additional value along the path.

Comment on lines 183 to 184
resources, but material change to the cluster will not be initiated until a time permitted
by the Maintenance Schedule. Users do not require special insight into the implications of
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be clear on expectations, this includes the material change will not be applied if a node reboots for other reasons or a node is created (i.e. MHC replacement, scale up, new machineset / machine)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. I need to make this explicit -- there is no guarantee. The schedule assures that no material change is initiated, but the state of rebooted nodes / new nodes is dependent on the profile and state of rollout. For example, in standalone, if a new release payload is targeted by the CVO, but cluster change management is paused, the CVO will not roll out machine configs which ultimately inform new nodes joining the cluster. Thus, new nodes joining the cluster will receive the old configuration.
If, however, the MCP is halfway through updating old nodes when change management is paused, new nodes joining the cluster will receive the new configuration.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding something like the following to the summary in the next push:

Change management enforcement _does not_ attempt to define or control the detailed state of the
system. It only pertains to whether controllers which support change management 
will attempt to initiate material change themselves. For example, if changes are paused in the middle
of a cluster update and a node is manually rebooted, change management does not define
whether the node will rejoin the cluster with the new or old version.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make that more deterministic somehow? How hard would it be to apply a simple rule like "new nodes will always be deployed with the latest config" or "new nodes will always be deployed with a consistent configuration"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"new nodes will always be deployed with a consistent configuration"

That isn't hard. There will be a distinct moment where the MCS will start serving the new vs the old.

"new nodes will always be deployed with the latest config"

Since we need to implement this for multiple controllers, the safe bet is to leave that choice up to the implementation. Intuitively, any control loop can be paused, so, as written, the enhancement should be applicable across any controller. However, to provide this guarantee, we would need to say something like "no external sign of inconsistency can be visible until reconciliation is complete".

There is inherent friction with eventual consistency in that statement.

Consider the questions it raises in standalone:

  • In a paused state with MachineConfig changes pending, there is a clear difference between "new" vs "old" config. Since the "new" is not incorporated into an MCP yet, new machines booted into the cluster must necessarily receive the old.
  • Once the CVO unpauses and introduces a MachineConfig update, the MCP will be updated. If we pause at this point, what should a new machine receive? If you say "new", we've violated the proposed contract.
  • Once the MCO has updated 75% of the machines, should it still be serving the old config? Is consistency more important than the inevitable prolonging of the update when those new nodes running old config need to be drained & rebooted? Do we start serving new config while they newish nodes are being rebooted (if we do, its possible that we will serve old config indefinitely if new machines constantly scale into and out of the cluster)?

There are detailed answers we could give here that would make the standalone update process subtly more predictable. But is CAPI going to implement this contract? Would ?

In short, I think all can reasonably insist upon is the ability to pause your progress towards consistency. If we try to define what the behavior of the system must be when we pause you, things get complex.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It sounds like the rule needs to be more complex than the simple examples I gave. It still seems like we want to be able to explain clearly to users what they are going to get, though. How about "the most recent config that was being applied to nodes before the MCO was paused"? That is at least predictable, even if it depends on the timing of the pause step. And to cope with the timing, we could say that new nodes don't get a config until at least 1 existing node from the pool is updated to use it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't find any documented contract for the behavior in the existing pausing primitives provided by MCP or CAPI, so I'll need to leave this for SMEs to weigh in on the what is possible here.


By acknowledging the need for scheduled maintenance in the platform, we reduce the need for Service
Delivery to develop and maintain custom tooling to manage the platform while
simultaneously reducing simplifying management for all customer facing similar challenges.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some also in managed-cluster-config (MCC) which is plumbed through hive, and it's not a great experience to manage. We have to pick a set of verisons when a configuration can apply and then react when the cluster version is changed. We're in the process of reworking MCC so this is less painful but ultimately still is not a great experience to manage. Having the ability to just apply a change to a MachineConfig etc would be far better in my opinion.

2. Requiring the use of maintenance schedules for OpenShift upgrades (the changes should be compatible with various upgrade methodologies – including being manually triggered).
3. Allowing Standalone worker-nodes to upgrade to a different payload version than the control-plane (this is supported in HCP, but is not a goal for standalone).
4. Exposing maintenance schedule controls from the oc CLI. This may be a future goal but is not required by this enhancement.
5. Providing strict promises around the exact timing of upgrade processes. Maintenance schedules will be honored to a reasonable extent (e.g. upgrade actions will only be initiated during a window), but long-running operations may exceed the configured end of a maintenance schedule.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How far into a maintenance schedule's window will an update event trigger? I.e. if something stops the upgrade of workers to start, maybe the operator doing this work is degraded, but it then recovers.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How far into a maintenance schedule's window will an update event trigger?

Change can be initiated up to the last instant of the permissive window.

if something stops the upgrade of workers to start, maybe the operator doing this work is degraded, but it then recovers

I think of the implementation as pausing and resuming the reconciliation control loop(s). So relevant operators might go degraded, but only for reasons they go degraded prior to the introduction of change management.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Behavior for upgrades today is we will cancel if it takes too long to start. Would need to find out what "too long" actually is. We'll need to be careful about how this is communicated when we switch over to using this for managed. More thinking about the duration for control plane upgrades. I know we don't guarentee how long it will take, but perhaps something should consider how long those upgrades take and estimate if it would roughly fit into the "window" of the schedule.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Control plane averages about an hour. Are you thinking that OCM might want to prevent users from scheduling less than say, 2 hours to prevent their expectations from being violated?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For control plane I wouldn't set an expectation on an end time, just allow scheduling the start. Having an end implies an ability to stop the upgrade. This is not the intention of this EP.

3. Allowing Standalone worker-nodes to upgrade to a different payload version than the control-plane (this is supported in HCP, but is not a goal for standalone).
4. Exposing maintenance schedule controls from the oc CLI. This may be a future goal but is not required by this enhancement.
5. Providing strict promises around the exact timing of upgrade processes. Maintenance schedules will be honored to a reasonable extent (e.g. upgrade actions will only be initiated during a window), but long-running operations may exceed the configured end of a maintenance schedule.
6. Implementing logic to defend against impractical maintenance schedules (e.g. if a customer configures a 1-second maintenance schedule every year). Service Delivery may want to implement such logic to ensure upgrade progress can be made.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we (SD) need to put something in place to manage this it's going to mean keeping some code somewhere. Today scheduling is via OCM. If this enhancement is accepted this becomes a platform capbility. We (SD) have not talked about the implications of such a feature on how we enable customer to manage upgrades. Can we see a world where SD gets out of the business of managing schedules completely at some point in the future? If so, having the right controls in place to ensure sane configurations will generally be a good idea.

That being said, we only care about the version of the cluster in a few cases.

  1. if the cluster version is end of life (EOL)
  2. if the cluster's control plane is running on RH SRE infra

In the #1 scenario, the cluster goes into Limited Support.
In the #2 scenario, it only applys for HCP toploogies where the pods of the control plane are running on RH infrastructure. SRE will force a control plane upgrade if the control plane version for signifigent bugs that need remediated, there is a vulnerability we need to remediate, or the control plane version is now EOL.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we see a world where SD gets out of the business of managing schedules completely at some point in the future?

Yes. SD could let customers fully define their own maintenance schedules and simply override them when necessary (e.g. using disabledUntil). The reason SD might want to play a larger role in constraining the setting is to make sure, when necessary, that forced updates are performed within a customer's desired window. Your #2 scenario, for example: you have to do the update and ideally want users to be OK with when it happens. If you force them to choose a non-zero change management window each month, hopefully they are less surprised when forced changes happen during that window.

### Change Management Metrics
Cluster wide change management information will be made available through cluster metrics. Each resource
containing the stanza should expose the following metrics:
- The number of seconds until the next known permitted change window. 0 if changes can currently be initiated. -1 if changes are paused indefinitely. -2 if no permitted window can be computed.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why encode meaning into negative values vs having explicit state captured in another field? This can work of course but is mixing use: time until permitted change and status (paused).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My motivation was to simplify the promql necessary for useful alerting by an ops team. A simple rule might be warn "Invalid update schedule" when seconds_until_update < 0 or seconds_until_update_permitted > 60 days vs enumerating booleans for why things might be busted. SD would certainly know best here, so I'm open to alternatives. I just want any rule to survive the introduction of new error conditions.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the observability perspective it does reduce cardinality to use the value to encode the state. As noted it will work. It gets a bit brittle when you now have a mapping through various layers and have to know that encoded meaning in the metric value.

1. The update of this worker-node continues, but at 00:00 UTC Sunday, no further material changes are permitted by the change management policy and the worker-node update process is effectively paused.
1. Because not all worker-nodes have been updated, changes are still reported as pending via metrics for NodePool. **TODO: Review with HyperShift. Pausing progress should be possible, but a metric indicating changes still pending may not since they interact only through CAPI.**
1. The HCP cluster runs with worker-nodes at mixed versions throughout the month. The N-1 skew between the old kubelet versions and control-plane is supported.
1. **TODO: Review with Service Delivery. If the user requested another minor bump to their control-plane, how does OCM prevent unsupported version skew today?**
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should have skew protection (whatever it's called) disabled because we'll force a minor version upgrade of control planes in ROSA HCP when the control plane version is EOL. Validation of allowed skew is possible at other layers like OCM/CS, not sure what we permit at this time.

1. SRE can address the issue with a system configuration file applied in a MachineConfig.
1. SRE creates the MachineConfig for the customer and provides the customer the option to either (a) wait until their
configured maintenance schedule permits the material change from being initiated by the MachineConfigOperator
or (b) having SRE override the maintenance schedule and permitting its immediate application.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't expect SRE will override the schedule in this scenario. The change would be staged and customer notified of options. Customer can either wait for their next update window or the customer can permit the change to progress.

#### Service Delivery Emergency Patch
1. SRE determines that a significant new CVE threatens the fleet.
1. A new OpenShift release in each z-stream fixes the problem.
1. SRE plans to override customer maintenance schedules in order to rapidly remediate the problem across the fleet.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SRE have a couple of options and it depends on if it's HCP or not.

If HCP and the issue is signifigent enough, yes we would stage the update and force the change to land regardless of the customer maintenance schedule.

For Classic, where control plane is in customer's account, there is an important question to answer: does this impact SRE's ability to provide support and/or impact SLA with customer?

If yes, there is a support and/or SLA impcat, then SRE would do the same as with HCP and pushed the upgrade through.

If no, the change would be staged and customer schedule will pick it up on the next update. Customer will be notified and has an option to update earlier. Customer can defer the update as usual.

@jupierce jupierce force-pushed the maintenance_schedules_enhancement branch 2 times, most recently from 037fd22 to 6e4f207 Compare March 8, 2024 21:39
who wants to achieve the simple goal of ensuring no material changes take place outside a well-defined
maintenance schedule, do you want to have to the challenge of keeping every MachineConfigPool's
`changeManagement` stanza in perfect synchronization with the ClusterVersion's? What if a new MCP is created
without your knowledge?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MachineConfigPools consume inputs from many places, and ClusterVersion updates are only one aspect. https://amd64.ocp.releases.ci.openshift.org/ -> 4.16.0-ec.4 -> aws-ovn-serial -> Artifacts -> ... -> gather-extra artifacts:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.16-e2e-aws-ovn-serial/1767648317749071872/artifacts/e2e-aws-ovn-serial/gather-extra/artifacts/machineconfigs.json | jq -r '.items[].metadata | .creationTimestamp + " " + .name' | grep rendered-worker- | sort
2024-03-12T20:45:43Z rendered-worker-9724b3c3c79d2e58b093eb46d0c2c3ff
2024-03-12T20:51:58Z rendered-worker-33a6a11e659aca048b955f168eb770f3
2024-03-12T20:52:04Z rendered-worker-3e2eab9d46a1db0948347f34d268ea61
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.16-e2e-aws-ovn-serial/1767648317749071872/artifacts/e2e-aws-ovn-serial/gather-extra/artifacts/machineconfigpools.json | jq -c '.items[] | select(.metadata.name == "worker").status.configuration.source[]'
{"apiVersion":"machineconfiguration.openshift.io/v1","kind":"MachineConfig","name":"00-worker"}
{"apiVersion":"machineconfiguration.openshift.io/v1","kind":"MachineConfig","name":"01-worker-container-runtime"}
{"apiVersion":"machineconfiguration.openshift.io/v1","kind":"MachineConfig","name":"01-worker-kubelet"}
{"apiVersion":"machineconfiguration.openshift.io/v1","kind":"MachineConfig","name":"97-worker-generated-kubelet"}
{"apiVersion":"machineconfiguration.openshift.io/v1","kind":"MachineConfig","name":"98-worker-generated-kubelet"}
{"apiVersion":"machineconfiguration.openshift.io/v1","kind":"MachineConfig","name":"99-worker-generated-registries"}
{"apiVersion":"machineconfiguration.openshift.io/v1","kind":"MachineConfig","name":"99-worker-ssh"}

Instead of having it implicitly assumed that ClusterVersion is going to be the controlling resource taking precedence over local MachineConfigPool configuration, maybe changeManagement could explicitly declare that with references to higher-precedence managers?

Or, because that would make the system even more complicated, how likely is it that folks are creating new MachineConfigPools without looping in cluster lifecycle administrators? If cluster lifecycle administrators cannot agree amongst themselves, that is not a problem we can solve with tooling. If they do agree, but want unified control over the policy, I'm not as down on UpdatePolicy CRD as the current alternatives-section is, and possibly there's a way to mitigate the concerns around that while still avoiding implicit or explicit hierarchy complications.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There were two motivations to the current proposal:

  1. It is intended to be implementable with existing CAPI primitives. CAPI exposes "paused" on the two main resources Cluster and MachineDeployment. My understanding is that pausing the Cluster pauses the MachineDeployment but not vice versa. This should allow the HyperShift operator to fulfill the simple changeManagement hierarchy by pausing and unpausing these resources. If the enhancement calls for independence of changeManagement stanzas, we would need to be confident CAPI would accept a significant paradigm shift. I would only feel compelled to do that if we thought there was a significant advantage.
  2. The guiding principle of "simple things should be simple." Setting something globally, once, and not having to keep it in mind for an unrelated activity reduces the mental overhead associated with the product. A diligent enough team can always solve for with policy / clever automation / etc. But the more we ask this theoretical team to solve, the less pleasant the product is to use. "Cluster lifecycle administrator" is often going to be a single person trying to do lifecycle management along with everything else. Ideally, the enhancement will improve their QOL while also being flexible enough for plausible complicated use cases.

Comment on lines +108 to +146
This approach has proven extraordinarily successful in providing a fast and reliable
control-plane update, but, in rare cases, the highly opinionated update process leads
to less than ideal outcomes.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we re frame this section to combine more of what is in "Node Update Separation to Address Problems in User Perception" as the case here is more to give customers control on when to update worker nodes due to possible interruptions to their workloads.

Comment on lines +249 to +297
> "As an SRE, tasked with performing non-emergency corrective action, I want
> to be able to apply a desired configuration (e.g. PID limit change) and have that change roll out
> in a minimally disruptive way subject to the customer's configured maintenance schedule."
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typically for managed OpenShift we will provide the customer the ability to schedule this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That action specific scheduling would go away once this is implemented or just get layered on top?

  1. pid limit MC change goes at HH:MM mm/dd/yyyy
  2. change window opens after above time

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We would remove our implementation and replace it with a maintenance schedule.


1. A [Cluster Service Consumer](https://hypershift-docs.netlify.app/reference/concepts-and-personas/#personas) requests an HCP cluster via OCM.
1. To comply with their company policy, the service consumer configures a maintenance schedule through OCM.
1. Their first preference, no updates at all, is rejected by OCM policy, and they are referred to service
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For worker nodes this would be a valid strategy for a customer. At some point they will be notified that they must upgrade if control plane version skew is too far miss aligned and the running control plane version is going end of life.

Comment on lines 594 to 597
1. SRE applies a change to the relevant control-plane AND worker-node resource's `changeManagement` stanza
(both must be changed because of the change management hierarchy), setting `disabledUntil` to
a time 48 hours in the future. The configured change management schedule is ignored for 48 as the system
initiates all necessary node changes.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is interesting as SRE would have to ensure that the disabledUntil window is long enough to update all nodes. Typically we don't apply fleet wide changes like this immediately, we configure the rollout so they are applied after the next z-stream update.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right - this would be for emergency purposes only. machineconfiguration.openshift.io/bypass-change-management is another option if you didn't need to announce an end of maintenance guarantee to customers.

@jupierce jupierce force-pushed the maintenance_schedules_enhancement branch from 4aabfd6 to e4fe894 Compare May 14, 2024 23:55
Copy link
Contributor

@JoelSpeed JoelSpeed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a contender for longest EP, congrats 🎉

I've left a bunch of comments through this, and, from an API perspective at least, have a number of questions about the UX and some comments about it being a separate, referred resource. It's only as I got to the end that I see this is an alternative already discussed. I've got more homework now on seeing what prior discussion there was about this, my current view though is that embedding it within the object is going to create some awkward duplication that could potentially be avoided. Feels like there's a line though on where to separate

This definitely needs review from the MCO folks, there's significant overlap with a project they are currently working on, we need to make sure to de-duplicate the efforts.

There's also some assumptions about how the MCO works that I believe are incorrect, and some of the design may need to be revised to accommodate this

CC @sinnykumari @yuqi-zhang I think you both ought to review this one

an independent `UpdatePolicy` CRD where administrator preferences can be captured. This approach [was explored](https://github.com/jupierce/oc-update-design-ideas/commit/a6364ee2f2c1ebf84ed6d50bc277f9736bf793bd).
Ultimately, it felt less intuitive across profiles. Consider a CRD called `UpdatePolicy` that tries to be a central config.

1. Is it namespaced? If it is, that feels odd for an object that that should control the cluster. If it is not namespaced, the policy feels misplaced for HCP HostedCluster resources which live in a namespace.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does this work for MachineConfig in HCP, since they make sense to be cluster scoped, but not in the case of HCP? This feels the same

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MachineConfigs serialized as configmaps in the hosted cluster namespace: https://hypershift-docs.netlify.app/how-to/automated-machine-management/configure-machines/ .

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Eww 😅

I was expecting there to be a XYZ and a ClusterXYZ where XYZ is for managing the cluster you're on and ClusterXYZ is for managing guest clusters and is namespaced, naming could be confusing though 🤔

| date | * | * | Honor permit and exclude values, but only after the specified date. For example, permit: `null` and exclude: `null` implies the strategy is indefinitely permissive after the specified date. |


#### MachineConfigPool Assisted Strategy Configuration
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not seeing where a forced drain would be configured. In a managed context we need have this as an option, if only for our own clusters. We can decide what default value to use (today it is 1 hour I believe).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Forced drain is a separate EP that I hope will also reduce MUO: #1616 .

Copy link
Member

@sdodson sdodson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are parts I need to re-read to ensure I understand the details, but given a lot of my comments here have been pending for a while I'm going to go ahead and post them then come back for a second pass.

authors:
- @jupierce
reviewers:
- TBD
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sinnykumari @LalatenduMohanty @csrwng I think cover first pass reviewers
Probably want to run this by @rhamilto even if we potentially decide that we don't want a web console UX around this in cluster. To me I think it'd be good to at least expose node maintenance schedules.

Comment on lines +249 to +297
> "As an SRE, tasked with performing non-emergency corrective action, I want
> to be able to apply a desired configuration (e.g. PID limit change) and have that change roll out
> in a minimally disruptive way subject to the customer's configured maintenance schedule."
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That action specific scheduling would go away once this is implemented or just get layered on top?

  1. pid limit MC change goes at HH:MM mm/dd/yyyy
  2. change window opens after above time

- "Material Change". A longer definition is provided in the Summary, but in short, any configuration
change a platform operator wishes to apply which would necessitate the reboot or replacement of one
or more nodes is considered a material change. For example, updating the control-plane version is
a material change as it requires rebooting master nodes.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sometimes there can be control plane updates without associated reboots, it's rare given we update the kernel every other calendar week, but it's possible that a z-stream update could update only pod scoped resources. We'd still consider that to be a material change, correct?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As an added note, in practice I think I've only ever seen 1 z-stream update not have any OS changes. Changes to openshift-clients/hyperkube/crio/etc. happen pretty frequently so even if there were no kernel changes something in the OS gets rebuilt.

This does imply that we cannot use "did a reboot happen" as whether an update completed or not, but I don't think anything monitors for that.

therefore, reflect expected changes to the worker-node MCPs. Anticipating this change ahead of time is necessary for an operations
team to be able to set an alert with the semantics (worker-node-update changes are pending & time remaining until changes are permitted < 2d).
The MCO will expose its own metric for changes pending when manifests are updated. But this metric will only indicate when
there are machines in the pool that have not achieved the desired configuration. An operations team trying to implement the 2d
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whenever we say telemetry (What gets sent back to Red Hat) we're constrained by cardinality limits which would likely mean that we get one metric from MCO indicating the number of pools which are not currently in an Updated state. However metrics within the cluster could be more fine grained as proposed elsewhere in this doc, per pool and I'm fine with them including time since last fully reconciled and estimated time until next reconciliation whenever there are updates pending and it's not actively rolling out those changes.

15. As part of updating cluster manifests, MachineConfigs have been modified. The MachineConfigOperator (MCO) re-renders a
configuration for worker-nodes. However, because the MCP maintenance schedule precludes initiating material changes,
it will not begin to update Machines with that desired configuration.
16. The MCO will set a metric indicating that desired changes are pending.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't believe we have those metrics today. We'd be designing them as part of this enhancement.

Comment on lines 654 to 655
1. The cluster lifecycle administrator manually drains and reboots nodes in the cluster. As they come back online,
the MachineConfigServer offers them the desiredConfig requested by the manual policy.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this would just be the administrator annotates (labels?) each node with the new desiredConfig. MCO and MCD handle the rest from there. Administrator is responsible for annotating nodes in order they want and rate limiting that labeling however they want.

Copy link
Contributor

@yuqi-zhang yuqi-zhang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the late review. Just wanted to add some points/questions from the MCO perspective

- "Material Change". A longer definition is provided in the Summary, but in short, any configuration
change a platform operator wishes to apply which would necessitate the reboot or replacement of one
or more nodes is considered a material change. For example, updating the control-plane version is
a material change as it requires rebooting master nodes.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As an added note, in practice I think I've only ever seen 1 z-stream update not have any OS changes. Changes to openshift-clients/hyperkube/crio/etc. happen pretty frequently so even if there were no kernel changes something in the OS gets rebuilt.

This does imply that we cannot use "did a reboot happen" as whether an update completed or not, but I don't think anything monitors for that.


The design choice is informed by a thought experiment: As a cluster lifecycle administrator for a Standalone cluster,
who wants to achieve the simple goal of ensuring no material changes take place outside a well-defined
maintenance schedule, do you want to have to the challenge of keeping every MachineConfigPool's
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a somewhat similar question to Joel here. As I understand it, worker (or other custom pools) can set their independent changeManagements, and the global (cluster) one is AND'ed into it?

What if I wanted to have different cadences for my control plane and workers (e.g. control planes on Mon-Fri but not weekends, and workers on weekends but not Mon-Fri)?

And since master pool is separate, this also goes back to my earlier question on whether that's independent (i.e. the control plane nodes themselves being upgraded is a separate profile than the control plane components rolling out)

Comment on lines 654 to 655
1. The cluster lifecycle administrator manually drains and reboots nodes in the cluster. As they come back online,
the MachineConfigServer offers them the desiredConfig requested by the manual policy.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The description here more matches with what hosted control planes are doing (except that HCP would spin up an entirely new node)


#### Hypershift / Hosted Control Planes

In the HCP topology, the HostedCluster and NodePool resources are enhanced to support the change management strategies
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to call out that:

  1. HCP has both replace and inplace upgrade nodepools, and while that shouldn't matter for maint. schedule, they do upgrade differently, and
  2. HCP differs from standalone in cert management, where the certs there will trigger creation of new nodes, and no rotation will happen if the upgrades are paused (due to a maint. schedule), so you run more risk of certs expiring down the line (although I don't know if they actually have certificate rotation handling for all certs today)


#### Change Management on Master MachineConfigPool
In order to allow control-plane updates as a single material change, the MCO will only honor change the management configuration for the
master MachineConfigPool if user generated MachineConfigs are the cause for a pending change. To accomplish this,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't really have a mechanism for this today, since all configuration change calculations are done via rendered ones, and we just do a merge of all base configs.

I'm not sure I fully understand this section. Basically any pool-level configurations on the master pool are allowed the same way as worker, but they aren't honored if there's an incoming upgrade? But any user changes turn us back into worker calculation? (where the CVO level maint. schedule and the master pool are union'ed?)


When `nodeDisruptionPolicy` indicates that a `MachineConfiguration` should not trigger
a node reboot, it becomes a non-material change from the perspective of a maintenance
schedule. In other words, it can be applied immediately, even outside a permissive window.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that's safe, since disruption is not just reboots, but also node drains, etc. all of which you can specify via the policy. Besides, the MCO isn't that granular that it can break down a subset of MC changes to happen.

I think node disruption policy should only define how a change is acted on, but otherwise is gated by the maint. schedule like everyone else.

a node reboot, it becomes a non-material change from the perspective of a maintenance
schedule. In other words, it can be applied immediately, even outside a permissive window.

### Risks and Mitigations
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd also like to double check how we manage slow updates/half updated pools vs blocked upgrades/failing nodes, etc.

It reads to me that we'll have to reconsider how we do status calculation with the new window, since half-upgraded pools will become the norm, which means we'll have to figure out how/when/what we consider errors, and add extra pool status/operator status to reflect some of this

@jupierce jupierce force-pushed the maintenance_schedules_enhancement branch from ed1f465 to 6a0c121 Compare July 7, 2024 17:07
metdata:
# ChangeManagementPolicies are namespaced resources. They will normally reside in the
# namespace associated with the controller initiating material changes.
# For example, in Standalone namespace/openshift-machine-config for the MCO and
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: there's no openshift-machine-config namespace; all the MCO components are in openshift-machine-config-operator. E.g. in a 4.17.0-ec.1 CI run:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.17-e2e-aws-ovn-serial/1806673110347288576/artifacts/e2e-aws-ovn-serial/gather-extra/artifacts/namespaces.json | jq -r '.items[].metadata.name' | grep -1 machine-config
openshift-machine-api
openshift-machine-config-operator
openshift-marketplace

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @wking - update in next push.

# setting, it overrides the "policy" stanza.
# Date must be RFC3339.
# This field is modeled on HCP's existing HostedCluster.spec.pausedUntil which uses a string.
pausedUntil: <datetime|bool>str
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With disabledUntil as a "I want to touch this despite the lack of formal maintenance window" access hatch, I think it makes sense to have that one down on the covered resource in changeManagement. But I'm less clear on why pausedUntil's "despite the usual maintenance window, I have reasons to want to go slowly for a bit" would be down in changeManagement. Maybe pausedUntil belongs up on ChangeManagementPolicy? Or maybe exclude already covers that use case at the ChangeManagementPolicy level, and we don't need pausedUntil at all?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted pausedUntil to be a knob even without a ChangeManagementPolicy defined. It is in the resource itself because I believe:

  1. It offers a simple way for users to control roll-outs without having to get into the details of creating ChangeManagementPolicy definitions. I want them to be able to easily toggle pausing a rollout of an MCP without the rabbit hole of understanding ChangeManagementPolicy - that rabbit hole is only worth it for someone who wants a real MaintenanceSchedule.
  2. Moving pausedUntil into ChangeManagementPolicy makes the meaning even more challenging to convey / interpret. For example, how would you interpret "pausedUntil: true" for a Restrictive strategy vs a Permissive strategy? A paused restrictive strategy sounds like it would be evaluated as permissive.
  3. It permits CLI tools to pause and unpause rollouts regardless of whether an administrator has configured a ChangeManagementPolicy. I think this is critical in order to implement verbs like oc adm update worker-nodes start|stop|pause and ensure they have MCP granularity.

@openshift-bot
Copy link

Inactive enhancement proposals go stale after 28d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle stale.
Stale proposals rot after an additional 7d of inactivity and eventually close.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 8, 2024
@openshift-bot
Copy link

Stale enhancement proposals rot after 7d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle rotten.
Rotten proposals close after an additional 7d of inactivity.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci openshift-ci bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 15, 2024
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 21, 2024
Justin Pierce added 13 commits November 5, 2024 12:25
- All times are now UTC.
- Clarity on not defining system state -- only initiation.
- Change from "endTime" to "duration" to easily create windows that span
  days.
- Clarity on how/why CVO shows pending worker-node updates before
  machineconfig is updated.
- Change 1000 day search metric behavior.
@jupierce jupierce force-pushed the maintenance_schedules_enhancement branch from e1092ab to ba06346 Compare November 5, 2024 17:26
@openshift-bot
Copy link

Inactive enhancement proposals go stale after 28d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle stale.
Stale proposals rot after an additional 7d of inactivity and eventually close.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 3, 2025
@openshift-bot
Copy link

Stale enhancement proposals rot after 7d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle rotten.
Rotten proposals close after an additional 7d of inactivity.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci openshift-ci bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 10, 2025
@jupierce
Copy link
Contributor Author

/remove-lifecycle stale

Copy link
Contributor

openshift-ci bot commented Jan 14, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from jupierce. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copy link
Contributor

openshift-ci bot commented Jan 14, 2025

@jupierce: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.
Projects
None yet
Development

Successfully merging this pull request may close these issues.