Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: OCPBUGS-28647: consume deferred updates from performance profile #1118

Closed

Conversation

ffromani
Copy link
Contributor

@ffromani ffromani commented Jul 22, 2024

Consume the deferred updates support from the performance profile. The idea is that if the annotation (the same as NTO) is found on performanceprofile, it is passed throught the generated tuned object.

Initially we don't support hypershift

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 22, 2024
@openshift-ci openshift-ci bot requested review from rbaturov and yanirq July 22, 2024 14:58
Copy link
Contributor

openshift-ci bot commented Jul 22, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ffromani

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 22, 2024
@ffromani ffromani force-pushed the tuned-deferred-updates-perfprof branch from e30ada3 to 1b0603e Compare July 23, 2024 12:31
@ffromani ffromani changed the title WIP: consume deferred updates from performance profile WIP: OCPBUGS-28647: consume deferred updates from performance profile Jul 23, 2024
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jul 23, 2024
@openshift-ci-robot
Copy link
Contributor

@ffromani: This pull request references Jira Issue OCPBUGS-28647, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.17.0) matches configured target version for branch (4.17.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @shajmakh

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

WIP TBD

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot added the jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. label Jul 23, 2024
@openshift-ci openshift-ci bot requested a review from shajmakh July 23, 2024 12:34
@ffromani
Copy link
Contributor Author

/hold

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 23, 2024
@ffromani ffromani force-pushed the tuned-deferred-updates-perfprof branch 2 times, most recently from fe45be9 to 218ed63 Compare July 24, 2024 09:21
@@ -595,6 +600,17 @@ func (r *PerformanceProfileReconciler) isMixedCPUsEnabled(object client.Object)
return profileutil.IsMixedCPUsEnabled(object.(*performancev2.PerformanceProfile))
}

func (r *PerformanceProfileReconciler) isTunedDeferredEnabled(object client.Object) bool {
if ntoconfig.InHyperShift() {
return false
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know it's still WIP so I might be judging too early.
We recently aligned all NTO/PAO features to work on Hypershift let's make sure to keep this spirit, unless there's a really good reason why not.
And even if there's, let's add a comment explaining why we cannot support it on Hypershift.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, good point. Will fix/clarify by the time the WIP tag is lifted.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to come up with a mechanism to keep this practice then

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yanirq For new features we're usually adding e2e tests.
Since the same tests are running on both OCP and HCP they likely to be failed on HCP in case no one took care to add support for the new feature on HCP

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 25, 2024
@ffromani ffromani force-pushed the tuned-deferred-updates-perfprof branch from 218ed63 to 40ae191 Compare July 26, 2024 07:30
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 26, 2024
@ffromani ffromani force-pushed the tuned-deferred-updates-perfprof branch from 40ae191 to d5731dd Compare August 5, 2024 08:24
@ffromani
Copy link
Contributor Author

ffromani commented Aug 5, 2024

need to clarify if we want to update at all in hypershift. Won't harm, but would it be useful?

@yanirq
Copy link
Contributor

yanirq commented Aug 5, 2024

need to clarify if we want to update at all in hypershift. Won't harm, but would it be useful?

Since we intend to downport this PR I am wondering if should actually keep the separation and have a followup PR for hypershift support

@ffromani
Copy link
Contributor Author

ffromani commented Aug 5, 2024

to backport the feature we will need code changes anyway (unfortunately). I think the hypershift support is not a concern in this case.

in case the performance profile is annotated with
the very same annotation which enabled tuned
deferred updates, then that annotation
is propagated to the generated tuned objects, effectively
enabling the tuned deferred update feature.

Deferred updates are not supported in hypershift yet

Signed-off-by: Francesco Romani <[email protected]>
@ffromani ffromani force-pushed the tuned-deferred-updates-perfprof branch from d5731dd to 31e133b Compare August 5, 2024 09:57
@Tal-or
Copy link
Contributor

Tal-or commented Aug 5, 2024

We'll need an e2e test for that, unless there's something as part of NTO tests

@ffromani
Copy link
Contributor Author

ffromani commented Aug 5, 2024

/hold cancel
/cc @MarSik

@openshift-ci openshift-ci bot requested a review from MarSik August 5, 2024 12:43
@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 5, 2024
@MarSik
Copy link
Contributor

MarSik commented Aug 5, 2024

I am bit worried about this one. PerfProfile does more than just create Tuned. We also post MC and KubeletConfigs that do not support deferred updates. And they need to stay in sync. Moreover, any change to cpu sets or kernel args in PP causes an MCP reboot.

So I wonder what can be changed in PP that it does not reboot the nodes and goes to Tuned.

@ffromani
Copy link
Contributor Author

ffromani commented Aug 5, 2024

I am bit worried about this one. PerfProfile does more than just create Tuned. We also post MC and KubeletConfigs that do not support deferred updates. And they need to stay in sync. Moreover, any change to cpu sets or kernel args in PP causes an MCP reboot.

So I wonder what can be changed in PP that it does not reboot the nodes and goes to Tuned.

each of these can be addressed individually. But it's allowed for a change in the perfprofile to fan out on multiple objects, and each of them can cause a reboot. And by design these objects are reconciled independently.

@ffromani
Copy link
Contributor Author

ffromani commented Aug 5, 2024

I am bit worried about this one. PerfProfile does more than just create Tuned. We also post MC and KubeletConfigs that do not support deferred updates. And they need to stay in sync. Moreover, any change to cpu sets or kernel args in PP causes an MCP reboot.
So I wonder what can be changed in PP that it does not reboot the nodes and goes to Tuned.

each of these can be addressed individually. But it's allowed for a change in the perfprofile to fan out on multiple objects, and each of them can cause a reboot. And by design these objects are reconciled independently.

uhm the linked bug mentions only tuned though. Fixing all the other reboot cause, assuming we can in the current architecture, is a much larger scope.

@ffromani
Copy link
Contributor Author

ffromani commented Aug 5, 2024

I am bit worried about this one. PerfProfile does more than just create Tuned. We also post MC and KubeletConfigs that do not support deferred updates. And they need to stay in sync. Moreover, any change to cpu sets or kernel args in PP causes an MCP reboot.
So I wonder what can be changed in PP that it does not reboot the nodes and goes to Tuned.

each of these can be addressed individually. But it's allowed for a change in the perfprofile to fan out on multiple objects, and each of them can cause a reboot. And by design these objects are reconciled independently.

uhm the linked bug mentions only tuned though. Fixing all the other reboot cause, assuming we can in the current architecture, is a much larger scope.

I addressed somehow this part in my last commit, PTAL @MarSik @yanirq

@yanirq
Copy link
Contributor

yanirq commented Aug 5, 2024

I am bit worried about this one. PerfProfile does more than just create Tuned. We also post MC and KubeletConfigs that do not support deferred updates. And they need to stay in sync. Moreover, any change to cpu sets or kernel args in PP causes an MCP reboot.
So I wonder what can be changed in PP that it does not reboot the nodes and goes to Tuned.

each of these can be addressed individually. But it's allowed for a change in the perfprofile to fan out on multiple objects, and each of them can cause a reboot. And by design these objects are reconciled independently.

uhm the linked bug mentions only tuned though. Fixing all the other reboot cause, assuming we can in the current architecture, is a much larger scope.

I addressed somehow this part in my last commit, PTAL @MarSik @yanirq

I'm good with the setup for having the option to have multiple objects. Lets get to an agreement with the current method presented here (set of objects,using wildcards) so we are consistent in documentation as well (u/s docs as well?)

The original intent of the deferred updates is to
address tuned-triggered reboots.

By design, performanceprofiles fans out objects
managed by different independent controllers.
Guaranteeing deferred updates for the performanceprofile
is thus a much harder problem than tuned.

To meet these conflicting needs, we augment the lower-level
tuned annotation to accept a comma-separated, lowercase,
list of components whose updates should be deferred.
So far we will only support tuned.

xref: https://issues.redhat.com/browse/OCPBUGS-28647

Signed-off-by: Francesco Romani <[email protected]>
@ffromani ffromani force-pushed the tuned-deferred-updates-perfprof branch from ddb8ad1 to d50fa39 Compare August 6, 2024 09:36
@ffromani
Copy link
Contributor Author

ffromani commented Aug 6, 2024

We'll need an e2e test for that, unless there's something as part of NTO tests

talking about classic OCP (non hypershift) we do have e2e tests for the tuned functionality and we have controller tests which ensure that the label is propagated correctly. Would this be sufficient coverage?

Copy link
Contributor

openshift-ci bot commented Aug 6, 2024

@ffromani: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@ffromani
Copy link
Contributor Author

ffromani commented Aug 6, 2024

tuned-level changes are sufficient to close this bug. But we need one more followup, to be filed soon.

@ffromani ffromani closed this Aug 6, 2024
@openshift-ci-robot
Copy link
Contributor

@ffromani: This pull request references Jira Issue OCPBUGS-28647. The bug has been updated to no longer refer to the pull request using the external bug tracker.

In response to this:

Consume the deferred updates support from the performance profile. The idea is that if the annotation (the same as NTO) is found on performanceprofile, it is passed throught the generated tuned object.

Initially we don't support hypershift

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants