Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MCO-1010: Add node disruption policies to MachineConfiguration CRD #1764

Merged

Conversation

yuqi-zhang
Copy link
Contributor

@yuqi-zhang yuqi-zhang commented Feb 9, 2024

Draft API of openshift/enhancements#1525

This extension of MachineConfiguration object allows uses to specify how their MachineConfig object changes affect node disruption, allowing for non-drain and non-reboot updates to some config files. The MachineConfigController and MachineConfigDaemon will ultimately be implementing and executing on this object.

Also currently based on #1672 by @djoshy since that should probably merge first

Major questions:

  1. re lists, we currently want users to specify policies such as follows:
userPolicies:
  - type: file
    value: "/etc/my-file"
    actions:
      - type: reload
        reload: my-service
      - type: daemon-reload
  - type: file
    value: "/etc/my-other-file"
    actions:
      - type: none
  - type: sshkey
    actions:
      - type: none

such that the list doesn't have unique keys (it would be type + value, but value isn't required for all types so it cannot be made into a listMapKey). Maybe it would be better to make it more MachineConfig like and have it instead be:

userPolicies:
  files:
    - path: "/etc/my-file"
      actions:
        - type: reload
          reload: my-service
        - type: daemon-reload
    - path: "/etc/my-other-file"
      actions:
        - type: none
  sshkey:
    actions:
      - type: none

for the actions, we have the same issue, as we'd like to allow users to specify multiple actions with the same key

  - type: file
    value: "/etc/my-other-file"
    actions:
      - type: reload
        reload: service-1
      - type: reload
        reload: service-2
      - type: daemon-reload

So there is not a unique key here either.

Currently the proposed set up is, actions are union discriminators (so if you reload, you have to specify the services to reload), type/value pairs are also validated, and otherwise there is no verification on the lists. The MachineConfigController will do the rest of the validation, so a split of responsibilities.

Alternatively, we could either:

  1. have no validation in the API, and have the MCC do all the validation
  2. have no validation in the MCC and have the API logic do all the heavy lifting

controller validation and status:

Currently we list out clusterDefaultPolicies in the spec, which will be populated by the MCController every time it updates. Users should not modify that sub-spec, but since the daemon will only apply what's in status (so validated version) the current approach has no additional gating and just has the controller overwrite any changes before applying to status.

Is there a way we would be able to let just the MCController (and not say, someone using the MCC service account or other admin privs) write that sub-object? It should be mutate-able, just not by anything other than the MCO/MCC

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Feb 9, 2024
@openshift-ci-robot
Copy link

openshift-ci-robot commented Feb 9, 2024

@yuqi-zhang: This pull request references MCO-1010 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.16.0" version, but no target version was set.

In response to this:

Draft API of openshift/enhancements#1525

This extension of MachineConfiguration object allows uses to specify how their MachineConfig object changes affect node disruption, allowing for non-drain and non-reboot updates to some config files. The MachineConfigController and MachineConfigDaemon will ultimately be implementing and executing on this object.

Also currently based on #1672 by @djoshy since that should probably merge first

Major questions:

  1. re lists, we currently want users to specify policies such as follows:
userPolicies:
 - type: file
   value: "/etc/my-file"
   actions:
     - type: reload
       reload: my-service
     - type: daemon-reload
 - type: file
   value: "/etc/my-other-file"
   actions:
     - type: none
 - type: sshkey
   actions:
     - type: none

such that the list doesn't have unique keys (it would be type + value, but value isn't required for all types so it cannot be made into a listMapKey).

Re: actions, we have the same issue, as we'd like to allow users to specify multiple actions with the same key

 - type: file
   value: "/etc/my-other-file"
   actions:
     - type: reload
       reload: service-1
     - type: reload
       reload: service-2
     - type: daemon-reload

So there is not a unique key here either.

Currently the proposed set up is, actions are union discriminators (so if you reload, you have to specify the services to reload), type/value pairs are also validated, and otherwise there is no verification on the lists. The MachineConfigController will do the rest of the validation, so a split of responsibilities.

Alternatively, we could either:

  1. have no validation in the API, and have the MCC do all the validation
  2. have no validation in the MCC and have the API logic do all the heavy lifting

Which will likely require us to modify the above approach, maybe to something like:

userPolicies:
 - action: reload
   reload: service-1
   file:
    - /etc/my-file
    - /etc/my-file-2
 - action: daemon-reload
   file:
    - /etc/my-file
- action: none
  sshkey: ""

a reverse mapping (although some of the other issues might still apply)

  1. controller validation and status
    Currently we list out clusterDefaultPolicies in the spec, which will be populated by the MCController every time it updates. Users should not modify that sub-spec, but since the daemon will only apply what's in status (so validated version) the current approach has no additional gating and just has the controller overwrite any changes before applying to status.

Is there a way we would be able to let just the MCController (and not say, someone using the MCC service account or other admin privs) write that sub-object? It should be mutate-able, just not by anything other than the MCO/MCC

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 9, 2024
Copy link
Contributor

openshift-ci bot commented Feb 9, 2024

Hello @yuqi-zhang! Some important instructions when contributing to openshift/api:
API design plays an important part in the user experience of OpenShift and as such API PRs are subject to a high level of scrutiny to ensure they follow our best practices. If you haven't already done so, please review the OpenShift API Conventions and ensure that your proposed changes are compliant. Following these conventions will help expedite the api review process for your PR.

Copy link
Contributor

openshift-ci bot commented Feb 9, 2024

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci openshift-ci bot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Feb 9, 2024
@openshift-ci-robot
Copy link

openshift-ci-robot commented Feb 13, 2024

@yuqi-zhang: This pull request references MCO-1010 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.16.0" version, but no target version was set.

In response to this:

Draft API of openshift/enhancements#1525

This extension of MachineConfiguration object allows uses to specify how their MachineConfig object changes affect node disruption, allowing for non-drain and non-reboot updates to some config files. The MachineConfigController and MachineConfigDaemon will ultimately be implementing and executing on this object.

Also currently based on #1672 by @djoshy since that should probably merge first

Major questions:

  1. re lists, we currently want users to specify policies such as follows:
userPolicies:
 - type: file
   value: "/etc/my-file"
   actions:
     - type: reload
       reload: my-service
     - type: daemon-reload
 - type: file
   value: "/etc/my-other-file"
   actions:
     - type: none
 - type: sshkey
   actions:
     - type: none

such that the list doesn't have unique keys (it would be type + value, but value isn't required for all types so it cannot be made into a listMapKey). Maybe it would be better to make it more MachineConfig like and have it instead be:

userPolicies:
 files:
   - path: "/etc/my-file"
     actions:
       - type: reload
         reload: my-service
       - type: daemon-reload
   - path: "/etc/my-other-file"
     actions:
       - type: none
 sshkey:
   actions:
     - type: none

for the actions, we have the same issue, as we'd like to allow users to specify multiple actions with the same key

 - type: file
   value: "/etc/my-other-file"
   actions:
     - type: reload
       reload: service-1
     - type: reload
       reload: service-2
     - type: daemon-reload

So there is not a unique key here either.

Currently the proposed set up is, actions are union discriminators (so if you reload, you have to specify the services to reload), type/value pairs are also validated, and otherwise there is no verification on the lists. The MachineConfigController will do the rest of the validation, so a split of responsibilities.

Alternatively, we could either:

  1. have no validation in the API, and have the MCC do all the validation

  2. have no validation in the MCC and have the API logic do all the heavy lifting

  3. controller validation and status
    Currently we list out clusterDefaultPolicies in the spec, which will be populated by the MCController every time it updates. Users should not modify that sub-spec, but since the daemon will only apply what's in status (so validated version) the current approach has no additional gating and just has the controller overwrite any changes before applying to status.

Is there a way we would be able to let just the MCController (and not say, someone using the MCC service account or other admin privs) write that sub-object? It should be mutate-able, just not by anything other than the MCO/MCC

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot
Copy link

openshift-ci-robot commented Feb 13, 2024

@yuqi-zhang: This pull request references MCO-1010 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.16.0" version, but no target version was set.

In response to this:

Draft API of openshift/enhancements#1525

This extension of MachineConfiguration object allows uses to specify how their MachineConfig object changes affect node disruption, allowing for non-drain and non-reboot updates to some config files. The MachineConfigController and MachineConfigDaemon will ultimately be implementing and executing on this object.

Also currently based on #1672 by @djoshy since that should probably merge first

Major questions:

  1. re lists, we currently want users to specify policies such as follows:
userPolicies:
 - type: file
   value: "/etc/my-file"
   actions:
     - type: reload
       reload: my-service
     - type: daemon-reload
 - type: file
   value: "/etc/my-other-file"
   actions:
     - type: none
 - type: sshkey
   actions:
     - type: none

such that the list doesn't have unique keys (it would be type + value, but value isn't required for all types so it cannot be made into a listMapKey). Maybe it would be better to make it more MachineConfig like and have it instead be:

userPolicies:
 files:
   - path: "/etc/my-file"
     actions:
       - type: reload
         reload: my-service
       - type: daemon-reload
   - path: "/etc/my-other-file"
     actions:
       - type: none
 sshkey:
   actions:
     - type: none

for the actions, we have the same issue, as we'd like to allow users to specify multiple actions with the same key

 - type: file
   value: "/etc/my-other-file"
   actions:
     - type: reload
       reload: service-1
     - type: reload
       reload: service-2
     - type: daemon-reload

So there is not a unique key here either.

Currently the proposed set up is, actions are union discriminators (so if you reload, you have to specify the services to reload), type/value pairs are also validated, and otherwise there is no verification on the lists. The MachineConfigController will do the rest of the validation, so a split of responsibilities.

Alternatively, we could either:

  1. have no validation in the API, and have the MCC do all the validation
  2. have no validation in the MCC and have the API logic do all the heavy lifting

controller validation and status:

Currently we list out clusterDefaultPolicies in the spec, which will be populated by the MCController every time it updates. Users should not modify that sub-spec, but since the daemon will only apply what's in status (so validated version) the current approach has no additional gating and just has the controller overwrite any changes before applying to status.

Is there a way we would be able to let just the MCController (and not say, someone using the MCC service account or other admin privs) write that sub-object? It should be mutate-able, just not by anything other than the MCO/MCC

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

type NodeDisruptionPolicyStatus struct {
// clusterPolicies is a merge of cluster default and user provided node disruption policies.
// +optional
ClusterPolicies []NodeDisruptionPolicyConfig `json:"clusterPolicies"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

status should use a different type because you're likely to grow different fields

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Type NodeDisruptionPolicyActionType `json:"type"`
// reload specifies the service to reload, only valid if type is reload
// +optional
Reload *string `json:"reload,omitempty"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably want a struct

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be a list so that user can specify multiple servicename together that needs to be reloaded for a NodeDisruptionPolicyType ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done. I opted them to be separate so a user can theoretically order them as they wish to apply in sequence with other actions

Value string `json:"value"`
// actions represents the series of commands to be executed on changes to the corresponding type and value
// +kubebuilder:validation:Required
Actions []NodeDisruptionPolicyAction `json:"actions"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should probably be atomic

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added

Comment on lines 45 to 47
// nodeDisruptionPolicySpec allows an admin to set granular node disruption actions for
// MachineConfig-based updates, such as drains, service reloads, etc. Specifying this will allow
// for less downtime when doing small configuration updates to the cluster.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to make this very very clear that this doesn't apply to cluster version upgrades

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added a line

// nodeDisruptionPolicySpec allows an admin to set granular node disruption actions for
// MachineConfig-based updates, such as drains, service reloads, etc. Specifying this will allow
// for less downtime when doing small configuration updates to the cluster.
// +openshift:enable:FeatureSets=TechPreviewNoUpgrade
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You'll need CustomNoUpgrade in here too

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done


// nodeDisruptionPolicyStatus status reflects what the latest cluster-validated policies are,
// and will be used by the Machine Config Daemon during future node updates.
// +openshift:enable:FeatureSets=TechPreviewNoUpgrade
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CustomNoUpgrade in here too please

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

// Service represents a NodeDisruption policy that is in effect for changes to a service.
Service NodeDisruptionPolicyType = "service"

// File represents a NodeDisruption policy that is in effect for changes to a kernel argument.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// File represents a NodeDisruption policy that is in effect for changes to a kernel argument.
// kernelArgument represents a NodeDisruption policy that is in effect for changes to a kernel argument.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed kargs as per review on the enhacement

Service NodeDisruptionPolicyType = "service"

// File represents a NodeDisruption policy that is in effect for changes to a kernel argument.
KernelArgument NodeDisruptionPolicyType = "kernelArgument"

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MCO doesn't have a way today to apply kargs without reboot. Are we thinking of adding a way to apply these kargs live? If not, perhaps we may want to skip letting user skip drain/reboot in the initial implementation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed

None NodeDisruptionPolicyActionType = "none"

// Special represents an action that is internal to the MCO, and is not allowed in user defined NodeDisruption policies.
Special NodeDisruptionPolicyActionType = "special"

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wondering if we have something in mind that MCO will utilize in the beginning?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm open to discussion on this, I don't intend this to be user specify-able

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since removing API later is hard, maybe we can add this or similar field later on depending upon usecase?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The use case right now is that we would like to display the current cluster defaults, so the user can see (and optionally override) the image registry logic. The initial set we decided on is to have this keyword and description (that you can see via oc describe, etc.) to explain what the default is.

Happy to change this if there's a better route though! Mostly just wanted to be able to show the current cluster settings.

This right now is also tech preview only so we should be able to change it before GA too if needed

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the context. Let's keep it then and we can update it if needed while API will be in TechPreview. Maybe MCOInternal, MCODefault etc would provide more clarity.

// +unionDiscriminator
// +kubebuilder:validation:Required
Type NodeDisruptionPolicyActionType `json:"type"`
// reload specifies the service to reload, only valid if type is reload

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Admin maybe benefited with service restart as well as I beleive sometime just reloading service is not enough to apply the config changes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added, thanks!

@yuqi-zhang yuqi-zhang force-pushed the add-node-disruption-policies branch from df6de7b to a76a939 Compare February 16, 2024 00:46
@yuqi-zhang yuqi-zhang force-pushed the add-node-disruption-policies branch from a76a939 to 9b02645 Compare February 26, 2024 23:14
@yuqi-zhang yuqi-zhang marked this pull request as ready for review February 26, 2024 23:14
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 26, 2024

// nodeDisruptionPolicySpec allows an admin to set granular node disruption actions for
// MachineConfig-based updates, such as drains, service reloads, etc. Specifying this will allow
// for less downtime when doing small configuration updates to the cluster. This is NOT intended
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might just change the way this is worded, "intended" to me sounds like we don't want you to do that, but it might work, perhaps

// This configuration has no effect on cluster upgrades which will still incur node disruption where required.

A node upgrade always reboots right? So maybe we don't even need the where required on that

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will change, a cluster upgrade in practice almost always has an associated OS update, even if it's just a minor package bump. In all of OCP 4 I think I've seen one z-stream bump not come with an associated update, so the overlap is 0

Comment on lines 176 to 178
// clusterDefaultPolicies is managed by the Machine Config Operator, and reflects the latest cluster defaults
// +optional
ClusterDefaultPolicies NodeDisruptionPolicyConfig `json:"clusterDefaultPolicies"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You probably don't want this in spec if it's expected to be set by the operator rather than the user right?

Copy link
Contributor Author

@yuqi-zhang yuqi-zhang Feb 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current design has it such that there is:

  1. a user spec, for the user to set
  2. a cluster spec, for the MCO to set, which may change depending on version
  3. a status which is the validated merge of the two

The expectation being that it is easier for the user to see what the cluster defaults are.

If we remove this from the spec, then the status will hold the cluster defaults + anything overriden by the user, which... I guess still mostly achieves our goal? If that's more aligned with API conventions, I'm happy to do it that way instead.

// +patchStrategy=merge
// +listType=map
// +listMapKey=type
Conditions []metav1.Condition `json:"conditions,omitempty" patchStrategy:"merge" patchMergeKey:"type"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By convention, this should be the first field in the status struct

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Related to the discussion we had in the call, I think this makes more sense living in the MachineConfigurationStatus object directly, which currently has StaticPodOperatorStatus that we were thinking of deprecating. Could you help me point to a resource on how that should be done?

}

// NodeDisruptionPolicyConfig is the overall spec definition for files/units/sshkeys
type NodeDisruptionPolicyConfig struct {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For discoverability on this API, we should drop omitempty, that way the API will be written out by the installer

nodeDisruptionPolicy:
  userPolicy:
    files: []
    units: []
    sshKey: ...

Comment on lines 172 to 174
// userPolicies define user-provided node disruption policies
// +optional
UserPolicies NodeDisruptionPolicyConfig `json:"userPolicies"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we do not have cluster scoped default policies in here, wondering if we need this extra indentation. What possible future options might live alongside these?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Off the top of my head any potential future extensions would be with the new image based workflow. I think an example could be I could specify some additional non-machineconfig content (image based) and specify something here to indicate how it can be applied.

But maybe for now we can just collapse this to the higher level policy and add a new field when that comes into play?


// ReloadService allows the user to specify the services to be reloaded
type ReloadService struct {
// ServiceName is the full name (e.g. crio.service) of the service to be reloaded
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be lower case at the start of the godoc. Similar questions above about service and the godoc here


// RestartService allows the user to specify the services to be restarted
type RestartService struct {
// ServiceName is the full name (e.g. crio.service) of the service to be restarted
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto for reload

}

// NodeDisruptionPolicyActionType is a string enum used in a NodeDisruptionPolicyAction object. They describe an action to be performed.
// +kubebuilder:validation:Enum:="reboot";"drain";"reload";"restart";"daemon-reload";"none";"special"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PascalCase these values please

Restart NodeDisruptionPolicyActionType = "restart"

// DaemonReload represents an action that TBD
DaemonReload NodeDisruptionPolicyActionType = "daemon-reload"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be DaemonReload

// None represents an action that no handling is required by the MCO.
None NodeDisruptionPolicyActionType = "none"

// Special represents an action that is internal to the MCO, and is not allowed in user defined NodeDisruption policies.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to make sure the spec for user defined does not allow this

@yuqi-zhang yuqi-zhang force-pushed the add-node-disruption-policies branch from 9b02645 to b59260a Compare February 28, 2024 23:23
@yuqi-zhang
Copy link
Contributor Author

Fixed based on comments and rebased on master

Path string `json:"path"`
// actions represents the series of commands to be executed on changes to the file at
// corresponding file path. This is an atomic list, which will be validated by
// the MachineConfigOperator, with any conflicts reflecting as an error in the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be interested to see what a conflict would actually look like in this list, could you provide and example?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, this would be more around if the user sets reboot and none, they shouldn't be able to set other options. And that they shouldn't be able to set special (but this point isn't really relevant since we won't expose it in spec anymore after removing cluster defaults)

So I guess we could instead have a validation for list items, and if reboot or none exists they need to be the only entry. The other consideration is that if we want to modify rules in the future and expand on the actions, we'll probably need to have some validation in the MCO? Although that's a bit vague.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed in slack, we can use CEL validation to enforce that reboot or none are singletons, something along the lines of the ternary below

+kubebuilder:validation:XValidation:rule="self.exists(x, x == 'Reboot') ? size(self) == 1 : true"

@yuqi-zhang yuqi-zhang force-pushed the add-node-disruption-policies branch from d583e3b to e59c247 Compare March 8, 2024 02:47
@yuqi-zhang
Copy link
Contributor Author

yuqi-zhang commented Mar 8, 2024

Pushed some changes based on the discussion above, now only the user spec should exist in spec, and status will reflect cluster defaults.

Added some more validation, and also removed the original StaticPodOperatorStatus. Will open a separate PR to tombstone those fields, so this might not pass CI atm.

@yuqi-zhang yuqi-zhang force-pushed the add-node-disruption-policies branch from e59c247 to f455c3e Compare March 8, 2024 06:42
@djoshy djoshy force-pushed the add-node-disruption-policies branch from e726779 to 4751043 Compare March 17, 2024 12:45
@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 17, 2024
@djoshy djoshy force-pushed the add-node-disruption-policies branch from 4751043 to 55178eb Compare March 17, 2024 12:53
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 17, 2024
@yuqi-zhang
Copy link
Contributor Author

yuqi-zhang commented Mar 20, 2024

@JoelSpeed I believe @djoshy has covered the latest round of reviews (thanks!). There's just the open question about tombstone-ing the static operator status. Could this PR merge without that or would that have to go in first?

FWIW I did another brief check and nothing is currently depending on this (the CRD itself was added to the MCO in 4.15 but no objects exist and no other repo is referring to it outside of the API and MCO)

(latest update was to remove a testfile which was accidentally added)

@yuqi-zhang yuqi-zhang force-pushed the add-node-disruption-policies branch from 55178eb to 5574447 Compare March 20, 2024 02:25
@hexfusion
Copy link
Contributor

/cc @hexfusion

@openshift-ci openshift-ci bot requested a review from hexfusion March 21, 2024 16:41
Comment on lines 182 to 185
- lastTransitionTime
- message
- reason
- status
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change to required fields, has this CRD been included in payload yet?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only place this is currently referenced is by the MCO code, where I think we technically bring in the CRD. (So I think it does exist in clusters 4.15? and up, but no objects)

What is making this generated field required now and is that an issue?

}

type MachineConfigurationStatus struct {
StaticPodOperatorStatus `json:",inline"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we revert this for now and handle separately? We can't merge this before the handling of removing the operator status in this hybrid state

// NodeDisruptionPolicyFile is a file entry and corresponding actions to take
type NodeDisruptionPolicyFile struct {
// path is the file path to a file on disk managed through a MachineConfig.
// Actions specified will be applied when changes to the file at the path
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not resolved

Comment on lines 422 to 426
// +kubebuilder:validation:XValidation:rule=`self.matches('\\.(service|socket|device|mount|automount|swap|target|path|timer|snapshot|slice|scope)$')`, message="Invalid ${SERVICETYPE} in service name. Expected format is ${NAME}${SERVICETYPE}, where ${SERVICETYPE} must be one of \".service\", \".socket\", \".device\", \".mount\", \".automount\", \".swap\", \".target\", \".path\", \".timer\",\".snapshot\", \".slice\" or \".scope\"."
// +kubebuilder:validation:XValidation:rule=`self.matches('^[a-zA-Z0-9:._\\\\-]+\\..')`, message="Invalid ${NAME} in service name. Expected format is ${NAME}${SERVICETYPE}, where {NAME} must be atleast 1 character long and can only consist of alphabets, digits, \":\", \"-\", \"_\", \".\", and \"\\\""
// +kubebuilder:validation:Required
// +kubebuilder:validation:MaxLength=255
ServiceName string `json:"serviceName"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is re-used several times, you could create a type alias for this and assign the validations to that instead, not blocking though. Similar to how you have done the action types

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack, moved to a new type

actions:
- type: DaemonReload
- type: Reload
expectedError: "Reload is required when type is reload, and forbidden otherwise"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the wrong way around, think about it from an end user perspective, they see the field as lower r and the type as upper r

Suggested change
expectedError: "Reload is required when type is reload, and forbidden otherwise"
expectedError: "reload is required when type is Reload, and forbidden otherwise"

- type: Drain
- type: Restart
restart:
serviceName: a.b.c.d.e.snapshot
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit, should include new line char as final character in every file

// +kubebuilder:validation:XValidation:rule=`self.matches('^[a-zA-Z0-9:._\\\\-]+\\..')`, message="Invalid ${NAME} in service name. Expected format is ${NAME}${SERVICETYPE}, where {NAME} must be atleast 1 character long and can only consist of alphabets, digits, \":\", \"-\", \"_\", \".\", and \"\\\""
// +kubebuilder:validation:Required
// +kubebuilder:validation:MaxLength=255
Name string `json:"name"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NodeDisruptionPolicyServiceName?

// +kubebuilder:validation:XValidation:rule=`self.matches('^[a-zA-Z0-9:._\\\\-]+\\..')`, message="Invalid ${NAME} in service name. Expected format is ${NAME}${SERVICETYPE}, where {NAME} must be atleast 1 character long and can only consist of alphabets, digits, \":\", \"-\", \"_\", \".\", and \"\\\""
// +kubebuilder:validation:Required
// +kubebuilder:validation:MaxLength=255
Name string `json:"name"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NodeDisruptionPolicyServiceName?

}

// +kubebuilder:validation:XValidation:rule="has(self.type) && self.type == 'Reload' ? has(self.reload) : !has(self.reload)",message="reload is required when type is Reload, and forbidden otherwise"
// +kubebuilder:validation:XValidation:rule="has(self.type) && self.type == 'Restart' ? has(self.restart) : !has(self.restart)",message="Restart is required when type is restart, and forbidden otherwise"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// +kubebuilder:validation:XValidation:rule="has(self.type) && self.type == 'Restart' ? has(self.restart) : !has(self.restart)",message="Restart is required when type is restart, and forbidden otherwise"
// +kubebuilder:validation:XValidation:rule="has(self.type) && self.type == 'Restart' ? has(self.restart) : !has(self.restart)",message="restart is required when type is Restart, and forbidden otherwise"

@yuqi-zhang yuqi-zhang force-pushed the add-node-disruption-policies branch from 6a403bb to 5609243 Compare March 21, 2024 18:21
Copy link
Contributor

@JoelSpeed JoelSpeed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

Will follow up separately on the static pod status removal, build out on this first and transition to metav1.Condition before we ship.

@openshift-ci openshift-ci bot added lgtm Indicates that a PR is ready to be merged. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Mar 21, 2024
@yuqi-zhang
Copy link
Contributor Author

/test verify

yuqi-zhang and others added 3 commits March 21, 2024 17:08
Add a new sub-spec/status to the MachineConfiguration operator object,
which will allow users to specify actions to take when small
MachineConfig updates happen to the cluster.

This will be behind a NodeDisruptionPolicy featuregate, and will be
managed and consumed by the Machine Config Operator in-cluster.
The new NodeDisruption status objects contain a "special" action
type that will only be used by the MCO's controller to indicate some
internal actions. They are not part of the NodeDisruptionPolicyConfig
object and cannot be set by the user.
 - don't remove staticpodoperatorstatus for now
 - update godocs to be more clear
 - add a type alias for serviceName
@yuqi-zhang yuqi-zhang force-pushed the add-node-disruption-policies branch from 5609243 to 4bdf2e3 Compare March 21, 2024 21:12
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Mar 21, 2024
@yuqi-zhang
Copy link
Contributor Author

Whoops, was failing verify since it wasn't rebased on master

Copy link
Contributor

@JoelSpeed JoelSpeed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Mar 22, 2024
Copy link
Contributor

openshift-ci bot commented Mar 22, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: JoelSpeed, yuqi-zhang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copy link
Contributor

openshift-ci bot commented Mar 23, 2024

@yuqi-zhang: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-merge-bot openshift-merge-bot bot merged commit 2252c7a into openshift:master Mar 23, 2024
18 checks passed
@openshift-bot
Copy link

[ART PR BUILD NOTIFIER]

This PR has been included in build ose-cluster-config-api-container-v4.16.0-202403230015.p0.g2252c7a.assembly.stream.el9 for distgit ose-cluster-config-api.
All builds following this will include this PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants