OCPBUGS-32469: Remove tuned/rendered object #1036

Merged
merged 5 commits into from
Apr 24, 2024

Conversation

jmencak
Contributor

@jmencak jmencak commented Apr 16, 2024

The operator controls the NTO operand through updates to two resources: the operand's corresponding k8s Tuned Profile resource and the tuned/rendered object, which contains a list of all TuneD (daemon) profiles.

While this setup works for most cases, it has a problem when a cluster administrator changes the content of the current TuneD profile and, at the same time, switches to a completely new TuneD profile. Then, depending on the k8s object update timing, we could see two TuneD daemon reloads instead of just one.

Remove the tuned/rendered object and carry TuneD (daemon) profiles directly in the Tuned Profile k8s objects.
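
For illustration only, here is a minimal Go sketch of the double-reload scenario described above. It is not taken from the NTO code base: the names operandState, countReloads, splitUpdates and combinedUpdate are hypothetical, and the model assumes that the operand reloads TuneD whenever the effective configuration (recommended profile name plus its content) changes. The split case shows just one possible update ordering.

```go
package main

import "fmt"

// operandState is a hypothetical, simplified view of what the operand acts on.
type operandState struct {
	recommended string            // name of the TuneD profile to apply
	profiles    map[string]string // TuneD profile name -> profile content
}

// effective returns the configuration the TuneD daemon would end up loading.
func (s operandState) effective() string {
	return s.recommended + "\n" + s.profiles[s.recommended]
}

// initialState returns a fresh starting point: profile "a" is applied.
func initialState() operandState {
	return operandState{recommended: "a", profiles: map[string]string{"a": "v1"}}
}

// countReloads applies the updates in order and counts how many times the
// effective configuration changes (assumed here to trigger a daemon reload).
func countReloads(state operandState, updates []func(*operandState)) int {
	reloads := 0
	last := state.effective()
	for _, apply := range updates {
		apply(&state)
		if cur := state.effective(); cur != last {
			reloads++
			last = cur
		}
	}
	return reloads
}

// splitUpdates models two objects: the profile contents arrive first and the
// switch to the new profile arrives in a separate, later update.
func splitUpdates() []func(*operandState) {
	return []func(*operandState){
		func(s *operandState) {
			s.profiles["a"] = "v2"
			s.profiles["b"] = "new"
		},
		func(s *operandState) {
			s.recommended = "b"
		},
	}
}

// combinedUpdate models a single object carrying both changes at once.
func combinedUpdate() []func(*operandState) {
	return []func(*operandState){
		func(s *operandState) {
			s.profiles["a"] = "v2"
			s.profiles["b"] = "new"
			s.recommended = "b"
		},
	}
}

func main() {
	fmt.Println("two objects:", countReloads(initialState(), splitUpdates()))
	fmt.Println("one object:", countReloads(initialState(), combinedUpdate()))
}
```

Under these assumptions the split case counts two reloads and the combined case counts one, which is the behaviour the change is after.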

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 16, 2024
@openshift-ci openshift-ci bot requested review from MarSik and swatisehgal April 16, 2024 04:30
Contributor

openshift-ci bot commented Apr 16, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jmencak


@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 16, 2024
@jmencak
Contributor Author

jmencak commented Apr 16, 2024

/cc @ffromani

@openshift-ci openshift-ci bot requested a review from ffromani April 16, 2024 04:31
@jmencak
Contributor Author

jmencak commented Apr 16, 2024

/test ci/prow/e2e-gcp-pao-workloadhints

ci/prow/e2e-gcp-pao is a real failure; I need to adjust test/e2e/performanceprofile/functests/6_mustgather_testing/mustgather.go to exclude the rendered resource.

Contributor

openshift-ci bot commented Apr 16, 2024

@jmencak: The specified target(s) for /test were not found.
The following commands are available to trigger required jobs:

  • /test e2e-aws-operator
  • /test e2e-aws-ovn
  • /test e2e-aws-ovn-techpreview
  • /test e2e-gcp-pao
  • /test e2e-gcp-pao-updating-profile
  • /test e2e-gcp-pao-workloadhints
  • /test e2e-hypershift
  • /test e2e-hypershift-pao
  • /test e2e-no-cluster
  • /test e2e-upgrade
  • /test images
  • /test unit
  • /test verify
  • /test vet

The following commands are available to trigger optional jobs:

  • /test e2e-telco5g-cnftests
  • /test lint

Use /test all to run the following jobs that were automatically triggered:

  • pull-ci-openshift-cluster-node-tuning-operator-master-e2e-aws-operator
  • pull-ci-openshift-cluster-node-tuning-operator-master-e2e-aws-ovn
  • pull-ci-openshift-cluster-node-tuning-operator-master-e2e-aws-ovn-techpreview
  • pull-ci-openshift-cluster-node-tuning-operator-master-e2e-gcp-pao
  • pull-ci-openshift-cluster-node-tuning-operator-master-e2e-gcp-pao-updating-profile
  • pull-ci-openshift-cluster-node-tuning-operator-master-e2e-gcp-pao-workloadhints
  • pull-ci-openshift-cluster-node-tuning-operator-master-e2e-hypershift
  • pull-ci-openshift-cluster-node-tuning-operator-master-e2e-hypershift-pao
  • pull-ci-openshift-cluster-node-tuning-operator-master-e2e-no-cluster
  • pull-ci-openshift-cluster-node-tuning-operator-master-e2e-upgrade
  • pull-ci-openshift-cluster-node-tuning-operator-master-images
  • pull-ci-openshift-cluster-node-tuning-operator-master-lint
  • pull-ci-openshift-cluster-node-tuning-operator-master-unit
  • pull-ci-openshift-cluster-node-tuning-operator-master-verify
  • pull-ci-openshift-cluster-node-tuning-operator-master-vet

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@jmencak
Contributor Author

jmencak commented Apr 16, 2024

/test e2e-gcp-pao-workloadhints

Contributor

@ffromani ffromani left a comment


initial review, still processing the flow changes

pkg/operator/profilecalculator.go (outdated, resolved)
pkg/operator/profilecalculator.go (outdated, resolved)
@@ -626,6 +630,46 @@ func (pc *ProfileCalculator) getNodePoolNameForNode(node *corev1.Node) (string,
	return nodePoolName, nil
}

// TunedProfiles returns a name-sorted TunedProfile slice out of
// a slice of Tuned objects.
func TunedProfiles(tunedSlice []*tunedv1.Tuned) []tunedv1.TunedProfile {
Contributor


I think we should add a few unit tests for this function

Contributor Author


There are cases where time is better spent elsewhere than on writing unit tests, but I did provide one. Please take a look and let me know whether this is what you had in mind or whether it needs adjustments. Thank you!
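
As a rough illustration of the kind of function under discussion, the sketch below flattens the profiles of several Tuned objects into one name-sorted slice. It is an assumption-laden stand-in, not the PR's implementation: the types tuned and tunedProfile and the function sortedProfiles are hypothetical simplifications of the tunedv1 types, and details such as nil handling or de-duplication in the real TunedProfiles are not visible in the hunk above.

```go
package main

import (
	"fmt"
	"sort"
)

// tunedProfile and tuned are stand-in types for this sketch only; they are
// hypothetical simplifications and not the real tunedv1 API types.
type tunedProfile struct {
	Name string
	Data string
}

type tuned struct {
	Profiles []tunedProfile
}

// sortedProfiles collects the profiles of all given objects into one slice
// and sorts it by profile name, so the result does not depend on input order.
func sortedProfiles(tunedSlice []*tuned) []tunedProfile {
	out := []tunedProfile{}
	for _, t := range tunedSlice {
		if t == nil {
			continue // skip missing objects (an assumption of this sketch)
		}
		out = append(out, t.Profiles...)
	}
	sort.SliceStable(out, func(i, j int) bool {
		return out[i].Name < out[j].Name
	})
	return out
}

func main() {
	in := []*tuned{
		{Profiles: []tunedProfile{{Name: "openshift-node", Data: "[main]"}}},
		nil,
		{Profiles: []tunedProfile{
			{Name: "custom", Data: "[sysctl]"},
			{Name: "openshift", Data: "[main]"},
		}},
	}
	for _, p := range sortedProfiles(in) {
		fmt.Println(p.Name)
	}
}
```

A deterministic order like this is easy to cover with a small table-driven unit test, since the expected slice can be written down directly.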

@ffromani
Contributor

Had another pass. Can't see anything obviously wrong with the proposed approach. Need to see it running though. I'll play with this PR while reworking my #1019

@jmencak
Contributor Author

jmencak commented Apr 16, 2024

Had another pass. Can't see anything obviously wrong with the proposed approach. Need to see it running though. I'll play with this PR while reworking my #1019

Thank you. I'll try to address the concerns ASAP. Please keep in mind that this might not be backportable. I'm open to discussion about this.

@jmencak
Contributor Author

jmencak commented Apr 17, 2024

/payload 4.16 ci blocking

Contributor

openshift-ci bot commented Apr 17, 2024

@jmencak: trigger 5 job(s) of type blocking for the ci release of OCP 4.16

  • periodic-ci-openshift-release-master-ci-4.16-upgrade-from-stable-4.15-e2e-aws-ovn-upgrade
  • periodic-ci-openshift-release-master-ci-4.16-upgrade-from-stable-4.15-e2e-azure-sdn-upgrade
  • periodic-ci-openshift-release-master-ci-4.16-e2e-gcp-ovn-upgrade
  • periodic-ci-openshift-release-master-ci-4.16-e2e-aws-sdn-serial
  • periodic-ci-openshift-hypershift-release-4.16-periodics-e2e-aws-ovn

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/db530cf0-fcb4-11ee-8c36-924515741273-0

@jmencak
Contributor Author

jmencak commented Apr 17, 2024

/payload 4.16 ci blocking

Contributor

openshift-ci bot commented Apr 17, 2024

@jmencak: trigger 5 job(s) of type blocking for the ci release of OCP 4.16

  • periodic-ci-openshift-release-master-ci-4.16-upgrade-from-stable-4.15-e2e-aws-ovn-upgrade
  • periodic-ci-openshift-release-master-ci-4.16-upgrade-from-stable-4.15-e2e-azure-sdn-upgrade
  • periodic-ci-openshift-release-master-ci-4.16-e2e-gcp-ovn-upgrade
  • periodic-ci-openshift-release-master-ci-4.16-e2e-aws-sdn-serial
  • periodic-ci-openshift-hypershift-release-4.16-periodics-e2e-aws-ovn

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/ddb268d0-fcf8-11ee-925a-ec8807a0928d-0

@jmencak
Contributor Author

jmencak commented Apr 18, 2024

@jmencak: trigger 5 job(s) of type blocking for the ci release of OCP 4.16

  • periodic-ci-openshift-release-master-ci-4.16-upgrade-from-stable-4.15-e2e-aws-ovn-upgrade
  • periodic-ci-openshift-release-master-ci-4.16-upgrade-from-stable-4.15-e2e-azure-sdn-upgrade
  • periodic-ci-openshift-release-master-ci-4.16-e2e-gcp-ovn-upgrade
  • periodic-ci-openshift-release-master-ci-4.16-e2e-aws-sdn-serial
  • periodic-ci-openshift-hypershift-release-4.16-periodics-e2e-aws-ovn

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/ddb268d0-fcf8-11ee-925a-ec8807a0928d-0

The

[sig-arch][Late] operators should not create watch channels very often [apigroup:apiserver.openshift.io] [Suite:openshift/conformance/parallel]

test seems to be passing. I'll also test manually and compare the number of watches with a cluster prior to this PR.

@jmencak
Contributor Author

jmencak commented Apr 18, 2024

I ran some performance tests to evaluate the impact of this PR on CPU and memory usage and on the number of watches.

I tested both during 1h of idle time and while running the NTO test-e2e tests.
I tested on two SNO installs: one on a Dell PowerEdge R640 (Intel), the other on a PowerEdge R6525 (AMD).

The number of watches collected via the "kubectl-dev_tool audit" command was very close.

This is an example for the Intel system:

With this PR

count: 165, first: 2024-04-18T12:39:22+02:00, last: 2024-04-18T13:37:18+02:00, duration: 57m55.911899s
17x                  tuned.openshift.io/tuneds
16x                  machineconfiguration.openshift.io/v1/machineconfigs
16x                  tuned.openshift.io/profiles
15x                  machineconfiguration.openshift.io/v1/machineconfigpools
15x                  v1/nodes
9x                   tuned.openshift.io/v1/profiles
9x                   machineconfiguration.openshift.io/v1/kubeletconfigs
8x                   config.openshift.io/v1/clusterversions
8x                   config.openshift.io/v1/clusteroperators
8x                   tuned.openshift.io/v1/tuneds

Without this PR

count: 155, first: 2024-04-18T10:41:38+02:00, last: 2024-04-18T11:35:38+02:00, duration: 54m0.239677s
15x                  tuned.openshift.io/tuneds
15x                  machineconfiguration.openshift.io/v1/machineconfigs
15x                  machineconfiguration.openshift.io/v1/machineconfigpools
15x                  tuned.openshift.io/profiles
14x                  v1/nodes
8x                   tuned.openshift.io/v1/profiles
8x                   apps/daemonsets
8x                   config.openshift.io/v1/featuregates
8x                   machineconfiguration.openshift.io/v1/containerruntimeconfigs
8x                   v1/configmaps

The slight difference could be due to the roughly 4-minute difference in duration. For the AMD system, the results were nearly identical.
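
To make the duration argument concrete, the totals above can be normalized to a per-minute rate. The short Go sketch below does that using only the counts and durations reported for the Intel system; the helper name watchesPerMinute is made up for this example.

```go
package main

import (
	"fmt"
	"time"
)

// watchesPerMinute normalizes a watch count by the length of the
// observation window, given in Go duration syntax (e.g. "54m0.2s").
func watchesPerMinute(count int, window string) (float64, error) {
	d, err := time.ParseDuration(window)
	if err != nil {
		return 0, err
	}
	if d <= 0 {
		return 0, fmt.Errorf("window must be positive, got %s", d)
	}
	return float64(count) / d.Minutes(), nil
}

func main() {
	samples := []struct {
		label  string
		count  int
		window string
	}{
		{"with this PR", 165, "57m55.911899s"},
		{"without this PR", 155, "54m0.239677s"},
	}
	for _, s := range samples {
		rate, err := watchesPerMinute(s.count, s.window)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%-16s %.2f watches/min\n", s.label, rate)
	}
}
```

This works out to roughly 2.85 watches per minute with the PR and 2.87 without it, so the gap in the raw totals largely disappears once the longer window is accounted for.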

CPU utilization was measured both for the operator and operand in user and kernel space.

1h during idle with this PR

                user        kernel
CPU (operator): 253         82
CPU (operand):  118         29

                VmSize      VmRSS
MEM (operator): 4243056     77376
MEM (operand):  2912040     58468

1h during idle without this PR

                user        kernel
CPU (operator): 255         81
CPU (operand):  81          34

                VmSize      VmRSS
MEM (operator): 4612492     79440
MEM (operand):  2912304     60532

While running the make test-e2e with this PR

                user        kernel
CPU (operator): 101         16
CPU (operand):  22          13

                VmSize      VmRSS
MEM (operator): 5204580     93656
MEM (operand):  2986796     61684

While running the make test-e2e without this PR

                user        kernel
CPU (operator): 105         13
CPU (operand):  25          12

                VmSize      VmRSS
MEM (operator): 5277808     96648
MEM (operand):  3281732     60532

To me, the numbers look very similar with/without this PR.

@jmencak
Contributor Author

jmencak commented Apr 19, 2024

/retest

@jmencak jmencak changed the title WiP: Remove tuned/rendered object OCPBUGS-32469: WiP: Remove tuned/rendered object Apr 19, 2024
@openshift-ci-robot openshift-ci-robot added the jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. label Apr 19, 2024
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 19, 2024
@openshift-ci-robot openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. labels Apr 19, 2024
@openshift-ci-robot
Contributor

@jmencak: This pull request references Jira Issue OCPBUGS-32469, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.16.0) matches configured target version for branch (4.16.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira ([email protected]), skipping review request.

The bug has been updated to refer to the pull request using the external bug tracker.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@jmencak jmencak changed the title OCPBUGS-32469: WiP: Remove tuned/rendered object OCPBUGS-32469: Remove tuned/rendered object Apr 19, 2024
@jmencak
Contributor Author

jmencak commented Apr 19, 2024

Testing and performance testing done. Planning to go through the code one last time today.
/hold

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 19, 2024
@jmencak
Contributor Author

jmencak commented Apr 19, 2024

FYI, @liqcui tuned/rendered is going away.

@jmencak
Contributor Author

jmencak commented Apr 19, 2024

I went through the code once again and fixed a few minor issues I noticed. At the moment I don't have any plans to re-review. Happy to fix any issues other reviewers find.
/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 19, 2024
Contributor

@ffromani ffromani left a comment


/lgtm
/hold
to let other reviewers chime in. Feel free to remove

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 22, 2024
@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Apr 22, 2024
@jmencak
Contributor Author

jmencak commented Apr 23, 2024

I believe reviewers had sufficient time to comment. Thank you for all the reviews!
/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 23, 2024
@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD 8a57e13 and 2 for PR HEAD 9e19c97 in total

@ffromani
Contributor

/test e2e-hypershift

@ffromani
Contributor

/retest-required

@jmencak
Contributor Author

jmencak commented Apr 23, 2024

FWICS, the HyperShift test failures are not caused by NTO/this PR (and they already passed with the same code).
/retest

@jmencak
Contributor Author

jmencak commented Apr 24, 2024

/retest

Contributor

openshift-ci bot commented Apr 24, 2024

@jmencak: all tests passed!


@openshift-merge-bot openshift-merge-bot bot merged commit dd2698c into openshift:master Apr 24, 2024
16 checks passed
@openshift-ci-robot
Contributor

@jmencak: Jira Issue OCPBUGS-32469: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-32469 has been moved to the MODIFIED state.

@jmencak jmencak deleted the 4.16-no-rendered branch April 24, 2024 09:56
@openshift-bot
Contributor

[ART PR BUILD NOTIFIER]

This PR has been included in build cluster-node-tuning-operator-container-v4.17.0-202404241149.p0.gdd2698c.assembly.stream.el9 for distgit cluster-node-tuning-operator.
All builds following this will include this PR.

@openshift-cherrypick-robot

@jmencak: new pull request created: #1109

In response to this:

/cherry-pick release-4.15

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged.