OCPBUGS-32469: Remove tuned/rendered object #1036
Conversation
The NTO operand is controlled by the operator through updates to two resources: its corresponding k8s Tuned Profile resource and the tuned/rendered object, which contains a list of all TuneD (daemon) profiles. While this setup works in most cases, it is problematic when a cluster administrator changes the content of the current TuneD profile and at the same time switches to a new TuneD profile entirely. Then, depending on the k8s object update timing, we could see two TuneD daemon reloads instead of just one. Remove the tuned/rendered object and carry the TuneD (daemon) profiles directly in the Tuned Profile k8s objects.
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: jmencak The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment
/cc @ffromani
/test ci/prow/e2e-gcp-pao-workloadhints

ci/prow/e2e-gcp-pao is a real failure; need to adjust the test to exclude the rendered resource.
@jmencak: The specified target(s) for
The following commands are available to trigger optional jobs:
Use
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/test e2e-gcp-pao-workloadhints
initial review, still processing the flow changes
@@ -626,6 +630,46 @@ func (pc *ProfileCalculator) getNodePoolNameForNode(node *corev1.Node) (string,
	return nodePoolName, nil
}

// TunedProfiles returns a name-sorted TunedProfile slice out of
// a slice of Tuned objects.
func TunedProfiles(tunedSlice []*tunedv1.Tuned) []tunedv1.TunedProfile {
I think we should add a few unit tests for this function.
There are cases where time is better spent elsewhere than on writing unit tests, but I did provide one. Please take a look to see whether this is what you had in mind or whether it needs adjustments. Thank you!
Had another pass. Can't see anything obviously wrong with the proposed approach. Need to see it running though. I'll play with this PR while reworking my #1019

Thank you. I'll try to address the concerns ASAP. Please keep in mind that this might not be backportable. I'm open to discussion about this.

/payload 4.16 ci blocking
@jmencak: trigger 5 job(s) of type blocking for the ci release of OCP 4.16
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/db530cf0-fcb4-11ee-8c36-924515741273-0

/payload 4.16 ci blocking
@jmencak: trigger 5 job(s) of type blocking for the ci release of OCP 4.16
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/ddb268d0-fcf8-11ee-925a-ec8807a0928d-0
The test seems to be passing. I'll also test manually and compare the number of watches with a cluster prior to this PR.
I've run some performance testing to evaluate the impact of this PR on CPU/memory usage and the number of watches. I tested both during 1h of idle time and while running the NTO test-e2e tests. The number of watches collected via the "kubectl-dev_tool audit" command was very close. This is an example for the Intel system: With this PR

Without this PR

The slight difference could be due to the circa 4-minute difference in duration. For the AMD system, the results were nearly exactly the same. CPU utilization was measured for both the operator and the operand in user and kernel space. 1h during idle with this PR

1h during idle without this PR

While running the

While running the

To me, the numbers look very similar with/without this PR.
/retest
@jmencak: This pull request references Jira Issue OCPBUGS-32469, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
No GitHub users were found matching the public email listed for the QA contact in Jira ([email protected]), skipping review request. The bug has been updated to refer to the pull request using the external bug tracker. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Testing and performance testing done. Planning to go through the code one last time today. |
FYI, @liqcui |
I went through the code once again and fixed a few minor issues I noticed. At the moment I don't have any plans to re-review. Happy to fix issues other reviewers find.
/lgtm
/hold
to let other reviewers chime in. Feel free to remove
I believe reviewers had sufficient time to comment. Thank you for all the reviews!

/test e2e-hypershift

/retest-required

FWICS, the HyperShift test failures are not caused by NTO/this PR (and they already passed with the same code).

/retest

@jmencak: all tests passed! Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
@jmencak: Jira Issue OCPBUGS-32469: All pull requests linked via external trackers have merged: Jira Issue OCPBUGS-32469 has been moved to the MODIFIED state. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

[ART PR BUILD NOTIFIER] This PR has been included in build cluster-node-tuning-operator-container-v4.17.0-202404241149.p0.gdd2698c.assembly.stream.el9 for distgit cluster-node-tuning-operator.
@jmencak: new pull request created: #1109 In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.