OCPBUGS-32469: Remove tuned/rendered object #1036

jmencak · 2024-04-16T04:29:34Z

The NTO operand is controlled by the operator through updates to two resources: its corresponding k8s Tuned Profile resource and the tuned/rendered object, which contains a list of all TuneD (daemon) profiles.

While this setup works for most cases, it has a problem when a cluster administrator changes the content of the current TuneD profile and, at the same time, switches to a completely new TuneD profile. Then, depending on the timing of the k8s object updates, we could see two TuneD daemon reloads instead of just one.

Remove the tuned/rendered object and carry TuneD (daemon) profiles directly in the Tuned Profile k8s objects.

openshift-ci · 2024-04-16T04:30:08Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jmencak

Needs approval from an approver in each of these files:

~~OWNERS~~ [jmencak]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

jmencak · 2024-04-16T04:31:41Z

/cc @ffromani

jmencak · 2024-04-16T08:02:38Z

/test ci/prow/e2e-gcp-pao-workloadhints

ci/prow/e2e-gcp-pao is a real failure; test/e2e/performanceprofile/functests/6_mustgather_testing/mustgather.go needs to be adjusted to exclude the rendered resource.

openshift-ci · 2024-04-16T08:02:42Z

@jmencak: The specified target(s) for /test were not found.
The following commands are available to trigger required jobs:

/test e2e-aws-operator
/test e2e-aws-ovn
/test e2e-aws-ovn-techpreview
/test e2e-gcp-pao
/test e2e-gcp-pao-updating-profile
/test e2e-gcp-pao-workloadhints
/test e2e-hypershift
/test e2e-hypershift-pao
/test e2e-no-cluster
/test e2e-upgrade
/test images
/test unit
/test verify
/test vet

The following commands are available to trigger optional jobs:

/test e2e-telco5g-cnftests
/test lint

Use /test all to run the following jobs that were automatically triggered:

pull-ci-openshift-cluster-node-tuning-operator-master-e2e-aws-operator
pull-ci-openshift-cluster-node-tuning-operator-master-e2e-aws-ovn
pull-ci-openshift-cluster-node-tuning-operator-master-e2e-aws-ovn-techpreview
pull-ci-openshift-cluster-node-tuning-operator-master-e2e-gcp-pao
pull-ci-openshift-cluster-node-tuning-operator-master-e2e-gcp-pao-updating-profile
pull-ci-openshift-cluster-node-tuning-operator-master-e2e-gcp-pao-workloadhints
pull-ci-openshift-cluster-node-tuning-operator-master-e2e-hypershift
pull-ci-openshift-cluster-node-tuning-operator-master-e2e-hypershift-pao
pull-ci-openshift-cluster-node-tuning-operator-master-e2e-no-cluster
pull-ci-openshift-cluster-node-tuning-operator-master-e2e-upgrade
pull-ci-openshift-cluster-node-tuning-operator-master-images
pull-ci-openshift-cluster-node-tuning-operator-master-lint
pull-ci-openshift-cluster-node-tuning-operator-master-unit
pull-ci-openshift-cluster-node-tuning-operator-master-verify
pull-ci-openshift-cluster-node-tuning-operator-master-vet

jmencak · 2024-04-16T08:03:09Z

/test e2e-gcp-pao-workloadhints

ffromani

initial review, still processing the flow changes

pkg/operator/profilecalculator.go

ffromani · 2024-04-16T10:02:57Z

pkg/operator/profilecalculator.go

@@ -626,6 +630,46 @@ func (pc *ProfileCalculator) getNodePoolNameForNode(node *corev1.Node) (string,
 	return nodePoolName, nil
 }

+// TunedProfiles returns a name-sorted TunedProfile slice out of
+// a slice of Tuned objects.
+func TunedProfiles(tunedSlice []*tunedv1.Tuned) []tunedv1.TunedProfile {


I think we should add a few unit tests for this function

There are cases where time is better spent elsewhere than on writing unit tests, but I did provide one. Please take a look to see whether this is what you had in mind or whether it needs adjustments. Thank you!

ffromani · 2024-04-16T14:21:35Z

Had another pass. Can't see anything obviously wrong with the proposed approach. Need to see it running though. I'll play with this PR while reworking my #1019

jmencak · 2024-04-16T15:16:23Z

> Had another pass. Can't see anything obviously wrong with the proposed approach. Need to see it running though. I'll play with this PR while reworking my #1019

Thank you. I'll try to address the concerns ASAP. Please keep in mind that this might not be backportable. I'm open to discussion about this.

jmencak · 2024-04-17T12:20:18Z

/payload 4.16 ci blocking

openshift-ci · 2024-04-17T12:21:08Z

@jmencak: trigger 5 job(s) of type blocking for the ci release of OCP 4.16

periodic-ci-openshift-release-master-ci-4.16-upgrade-from-stable-4.15-e2e-aws-ovn-upgrade
periodic-ci-openshift-release-master-ci-4.16-upgrade-from-stable-4.15-e2e-azure-sdn-upgrade
periodic-ci-openshift-release-master-ci-4.16-e2e-gcp-ovn-upgrade
periodic-ci-openshift-release-master-ci-4.16-e2e-aws-sdn-serial
periodic-ci-openshift-hypershift-release-4.16-periodics-e2e-aws-ovn

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/db530cf0-fcb4-11ee-8c36-924515741273-0

jmencak · 2024-04-17T20:27:08Z

/payload 4.16 ci blocking

openshift-ci · 2024-04-17T20:27:17Z

@jmencak: trigger 5 job(s) of type blocking for the ci release of OCP 4.16

periodic-ci-openshift-release-master-ci-4.16-upgrade-from-stable-4.15-e2e-aws-ovn-upgrade
periodic-ci-openshift-release-master-ci-4.16-upgrade-from-stable-4.15-e2e-azure-sdn-upgrade
periodic-ci-openshift-release-master-ci-4.16-e2e-gcp-ovn-upgrade
periodic-ci-openshift-release-master-ci-4.16-e2e-aws-sdn-serial
periodic-ci-openshift-hypershift-release-4.16-periodics-e2e-aws-ovn

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/ddb268d0-fcf8-11ee-925a-ec8807a0928d-0

jmencak · 2024-04-18T04:53:52Z

> @jmencak: trigger 5 job(s) of type blocking for the ci release of OCP 4.16
>
> * periodic-ci-openshift-release-master-ci-4.16-upgrade-from-stable-4.15-e2e-aws-ovn-upgrade
> * periodic-ci-openshift-release-master-ci-4.16-upgrade-from-stable-4.15-e2e-azure-sdn-upgrade
> * periodic-ci-openshift-release-master-ci-4.16-e2e-gcp-ovn-upgrade
> * periodic-ci-openshift-release-master-ci-4.16-e2e-aws-sdn-serial
> * periodic-ci-openshift-hypershift-release-4.16-periodics-e2e-aws-ovn
>
> See details on https://pr-payload-tests.ci.openshift.org/runs/ci/ddb268d0-fcf8-11ee-925a-ec8807a0928d-0

The "[sig-arch][Late] operators should not create watch channels very often [apigroup:apiserver.openshift.io] [Suite:openshift/conformance/parallel]" test seems to be passing. I'll also test manually and compare the number of watches with a cluster prior to this PR.

jmencak · 2024-04-18T13:57:47Z

I've run some performance tests to evaluate the impact of this PR on CPU/memory usage and the number of watches.

I've tested both during 1h idle time and while running the NTO test-e2e tests.
I've tested on two SNO installs, one was a Dell PowerEdge R640 (Intel), the other one PowerEdge R6525 (AMD).

The number of watches collected via the "kubectl-dev_tool audit" command was very close.

This is an example for the Intel system:

With this PR

count: 165, first: 2024-04-18T12:39:22+02:00, last: 2024-04-18T13:37:18+02:00, duration: 57m55.911899s
17x                  tuned.openshift.io/tuneds
16x                  machineconfiguration.openshift.io/v1/machineconfigs
16x                  tuned.openshift.io/profiles
15x                  machineconfiguration.openshift.io/v1/machineconfigpools
15x                  v1/nodes
9x                   tuned.openshift.io/v1/profiles
9x                   machineconfiguration.openshift.io/v1/kubeletconfigs
8x                   config.openshift.io/v1/clusterversions
8x                   config.openshift.io/v1/clusteroperators
8x                   tuned.openshift.io/v1/tuneds

Without this PR

count: 155, first: 2024-04-18T10:41:38+02:00, last: 2024-04-18T11:35:38+02:00, duration: 54m0.239677s
15x                  tuned.openshift.io/tuneds
15x                  machineconfiguration.openshift.io/v1/machineconfigs
15x                  machineconfiguration.openshift.io/v1/machineconfigpools
15x                  tuned.openshift.io/profiles
14x                  v1/nodes
8x                   tuned.openshift.io/v1/profiles
8x                   apps/daemonsets
8x                   config.openshift.io/v1/featuregates
8x                   machineconfiguration.openshift.io/v1/containerruntimeconfigs
8x                   v1/configmaps

The slight difference could be due to the roughly 4-minute difference in duration. For the AMD system, the results were almost exactly the same.

CPU utilization was measured both for the operator and operand in user and kernel space.

1h during idle with this PR

                user        kernel
CPU (operator): 253         82
CPU (operand):  118         29

                VmSize      VmRSS
MEM (operator): 4243056     77376
MEM (operand):  2912040     58468

1h during idle without this PR

                user        kernel
CPU (operator): 255         81
CPU (operand):  81          34

                VmSize      VmRSS
MEM (operator): 4612492     79440
MEM (operand):  2912304     60532

While running the make test-e2e with this PR

                user        kernel
CPU (operator): 101         16
CPU (operand):  22          13

                VmSize      VmRSS
MEM (operator): 5204580     93656
MEM (operand):  2986796     61684

While running the make test-e2e without this PR

                user        kernel
CPU (operator): 105         13
CPU (operand):  25          12

                VmSize      VmRSS
MEM (operator): 5277808     96648
MEM (operand):  3281732     60532

To me, the numbers look very similar with/without this PR.

jmencak · 2024-04-19T05:45:48Z

/retest

openshift-ci-robot · 2024-04-19T07:45:44Z

@jmencak: This pull request references Jira Issue OCPBUGS-32469, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (4.16.0) matches configured target version for branch (4.16.0)
bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira ([email protected]), skipping review request.

The bug has been updated to refer to the pull request using the external bug tracker.

jmencak · 2024-04-19T07:48:58Z

Testing and performance testing done. Planning to go through the code one last time today.
/hold

jmencak · 2024-04-19T08:05:40Z

FYI, @liqcui tuned/rendered is going away.

jmencak · 2024-04-19T09:59:07Z

I went through the code once again and fixed a few minor issues I noticed. At the moment I don't plan another review pass of my own. Happy to fix issues other reviewers find.
/hold cancel

ffromani

/lgtm
/hold
to let other reviewers chime in. Feel free to remove

jmencak · 2024-04-23T11:07:20Z

I believe reviewers had sufficient time to comment. Thank you for all the reviews!
/hold cancel

openshift-ci-robot · 2024-04-23T12:58:31Z

/retest-required

Remaining retests: 0 against base HEAD 8a57e13 and 2 for PR HEAD 9e19c97 in total

ffromani · 2024-04-23T14:32:40Z

/test e2e-hypershift

ffromani · 2024-04-23T17:08:48Z

/retest-required

jmencak · 2024-04-23T19:42:11Z

FWICS, the HyperShift test failures are not caused by NTO/this PR (and they already passed with the same code).
/retest

jmencak · 2024-04-24T05:11:31Z

/retest

openshift-ci · 2024-04-24T09:50:05Z

@jmencak: all tests passed!

openshift-ci-robot · 2024-04-24T09:52:59Z

@jmencak: Jira Issue OCPBUGS-32469: All pull requests linked via external trackers have merged:

openshift/cluster-node-tuning-operator#1036

Jira Issue OCPBUGS-32469 has been moved to the MODIFIED state.

openshift-bot · 2024-04-24T16:41:58Z

[ART PR BUILD NOTIFIER]

This PR has been included in build cluster-node-tuning-operator-container-v4.17.0-202404241149.p0.gdd2698c.assembly.stream.el9 for distgit cluster-node-tuning-operator.
All builds following this will include this PR.

openshift-cherrypick-robot · 2024-07-11T13:26:10Z

@jmencak: new pull request created: #1109

In response to this:

/cherry-pick release-4.15

openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 16, 2024

openshift-ci bot requested review from MarSik and swatisehgal April 16, 2024 04:30

openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 16, 2024

openshift-ci bot requested a review from ffromani April 16, 2024 04:31

ffromani reviewed Apr 16, 2024

jmencak added 3 commits April 17, 2024 07:39

Fix e2e-gcp-pao-workloadhints, minor code improvements

8ab0111

Another "happy path" in TunedProfiles()

7accb90

Added a unit test for TunedProfiles()

8d10dd6

jmencak changed the title ~~WiP: Remove tuned/rendered object~~ OCPBUGS-32469: WiP: Remove tuned/rendered object Apr 19, 2024

openshift-ci-robot added the jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. label Apr 19, 2024

openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 19, 2024

openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. labels Apr 19, 2024

jmencak changed the title ~~OCPBUGS-32469: WiP: Remove tuned/rendered object~~ OCPBUGS-32469: Remove tuned/rendered object Apr 19, 2024

openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 19, 2024

No errors when TunedRenderedResourceName was already removed

9e19c97

openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 19, 2024

ffromani reviewed Apr 22, 2024

openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 22, 2024

openshift-ci bot assigned ffromani Apr 22, 2024

openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Apr 22, 2024

openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 23, 2024

openshift-merge-bot bot merged commit dd2698c into openshift:master Apr 24, 2024
16 checks passed

jmencak deleted the 4.16-no-rendered branch April 24, 2024 09:56

openshift-cherrypick-robot mentioned this pull request Jul 11, 2024

[release-4.15] OCPBUGS-36870: Remove tuned/rendered object #1109

Closed

jmencak mentioned this pull request Jul 11, 2024

OCPBUGS-36870: Remove tuned/rendered object #1110

Merged
