Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NO-JIRA: deferred updates: fix in-place update handling on reboot #1162

Conversation

ffromani
Copy link
Contributor

@ffromani ffromani commented Sep 9, 2024

There are conditions on which we should not set the reload flag. This avoid regression in the test
"Profile deferred when applied should trigger changes when applied fist, then deferred when edited, if tuned restart should be kept deferred"

@ffromani
Copy link
Contributor Author

ffromani commented Sep 9, 2024

/cc @jmencak @MarSik

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 9, 2024
@@ -1029,19 +1031,18 @@ func (c *Controller) changeSyncerTuneD(change Change) (synced bool, err error) {
// Cache the value written to tunedRecommendFile.
c.daemon.recommendedProfile = change.recommendedProfile
klog.V(1).Infof("recommended TuneD profile updated from %q to %q [inplaceUpdate=%v nodeRestart=%v]", prevRecommended, change.recommendedProfile, inplaceUpdate, change.nodeRestart)
changeRecommend = true

if change.deferredMode == util.DeferUpdate && !inplaceUpdate && c.daemon.recoveredRecommendedProfile == change.recommendedProfile {
klog.V(1).Infof("recommended TuneD profile changed; skip TuneD reload [deferred=%v recoveredRecommended=%v]", change.deferredMode, c.daemon.recoveredRecommendedProfile)
// Reset because we need only once the first time we process the TuneD k8s object. Let's avoid stale data.
c.daemon.recoveredRecommendedProfile = ""
} else {
klog.V(1).Infof("recommended TuneD profile changed; trigger TuneD reload [deferred=%v]", change.deferredMode)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The log still says tuned should be reloaded. I will confuse people working on this in the future.. I think.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, the flow will ultimately trigger a tuned reload: see lines around 1065.
But still: how can we improve the logging?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, right! So this is a fix that updates the fingerprint in addition to setting the reload flag. The reload part is not changing. Ok then.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we do want to not set the reload flag if very specific conditions are met, like the one captured in the test mentioned in the commit message, which is constantly failing without this fix

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Sep 9, 2024
Copy link
Contributor

openshift-ci bot commented Sep 9, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ffromani, MarSik

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ffromani
Copy link
Contributor Author

ffromani commented Sep 9, 2024

/hold

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 9, 2024
@ffromani
Copy link
Contributor Author

ffromani commented Sep 9, 2024

/retest

@ffromani ffromani force-pushed the deferred-updates-cleanup-20240909 branch from f9d84bc to f429cda Compare September 9, 2024 11:35
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Sep 9, 2024
add stricter check for the node condition when
validating OCPBUGS-38795

Signed-off-by: Francesco Romani <[email protected]>
There are conditions on which we should not set the
reload flag. This avoid regression in the test
"Profile deferred when applied should trigger changes when applied fist,
then deferred when edited, if tuned restart should be kept deferred"

Signed-off-by: Francesco Romani <[email protected]>
@ffromani ffromani force-pushed the deferred-updates-cleanup-20240909 branch from f429cda to c506826 Compare September 9, 2024 13:08
@ffromani
Copy link
Contributor Author

/hold cancel

managed to deflake further (on my env) but not completely

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 10, 2024
Copy link
Contributor

@jmencak jmencak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the changes, Francesco. They look good to me. I've added some comments which are not exactly fair, because they should've been added them during previous reviews. Nevertheless, I believe that adding better logging/comments what happens (especially) in the

change.deferredMode == util.DeferUpdate && !inplaceUpdate && c.daemon.recoveredRecommendedProfile == change.recommendedProfile

section would help a lot future maintainers.

pkg/tuned/controller.go Outdated Show resolved Hide resolved
pkg/tuned/controller.go Outdated Show resolved Hide resolved
@ffromani
Copy link
Contributor Author

/hold

want to rerun the e2e testsuite

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 11, 2024
@ffromani
Copy link
Contributor Author

/hold cancel
/retest

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 11, 2024
@jmencak
Copy link
Contributor

jmencak commented Sep 12, 2024

Thank you for the changes/clarification, Francesco! I've found some typos and things that (IMO) need further clarification. I'll post a suggested diff later on and we can work from there.
/hold

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 12, 2024
@jmencak
Copy link
Contributor

jmencak commented Sep 12, 2024

I tried to clarify the recently added comments and remove the typos. Hopefully I didn't introduce any false information/typos. Feel free to adjust as you see fit and happy to discuss here of over Slack:

diff --git a/pkg/tuned/controller.go b/pkg/tuned/controller.go
index e6095297..627f4271 100644
--- a/pkg/tuned/controller.go
+++ b/pkg/tuned/controller.go
@@ -1032,11 +1032,12 @@ func (c *Controller) changeSyncerTuneD(change Change) (synced bool, err error) {
 			c.daemon.recommendedProfile = change.recommendedProfile
 			klog.V(1).Infof("recommended TuneD profile updated from %q to %q [inplaceUpdate=%v nodeRestart=%v]", prevRecommended, change.recommendedProfile, inplaceUpdate, change.nodeRestart)
 
-			// if we get this far, it's either because we detected a node restart or the profile changed (not in-place, thus not-edit).
-			// So first: if this is a edit (replacing, not in-place), we do further logic. If it's a node restart, we just go ahead and process it.
-			// Handling then edits. Id we need to defer only edits, and we are re-processing a profile which we already applied, then we must
-			// not reload tuned, otherwise we're missing the point entirely about deferred updates.
-			// See the test "Profile deferred when applied should trigger changes when applied fist, then deferred when edited, if tuned restart should be kept deferred"
+			// If we get this far, it's either because we detected a node restart or the recommended profile changed.
+			// If it's a node restart, we just process the change. If it's the latter -- TuneD profile switch, i.e. not an in-place update -- we need further
+			// logic to distinguish between the "update" and "always" DeferMode.
+			// For the "update" DeferMode, we only defer profile edits. If we are re-processing a profile which we already applied, then we must
+			// not reload tuned, otherwise we're missing the point of deferred updates.
+			// See the test "Profile deferred when applied should trigger changes when applied first, then deferred when edited, if tuned restart should be kept deferred"
 			// See the commit 3655f22656d4a3aa9f471099305dcd78a9c80320
 			if !inplaceUpdate && change.deferredMode == util.DeferUpdate && c.daemon.recoveredRecommendedProfile == change.recommendedProfile {
 				klog.V(1).Infof("reprocessing profile already in effect; this seems a daemon reload. Skip TuneD reload [deferred=%v recoveredRecommended=%v]", change.deferredMode, c.daemon.recoveredRecommendedProfile)

@ffromani
Copy link
Contributor Author

LGTM, applying

document better the logic about processing edits

Signed-off-by: Francesco Romani <[email protected]>
@ffromani ffromani force-pushed the deferred-updates-cleanup-20240909 branch from b9947d5 to c3bb0c4 Compare September 12, 2024 09:06
@jmencak
Copy link
Contributor

jmencak commented Sep 12, 2024

/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 12, 2024
@jmencak
Copy link
Contributor

jmencak commented Sep 12, 2024

/lgtm
the e2e-gcp-pao and e2e-hypershift-pao failures are unrelated
/retest

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Sep 12, 2024
@jmencak
Copy link
Contributor

jmencak commented Sep 12, 2024

Hmm, I think e2e-gcp-pao test needs looking into. I'm getting the same failures (irqbalance.go) in #1163 and also when running the test locally.

@jmencak
Copy link
Contributor

jmencak commented Sep 16, 2024

Overriding 2 tests because of a known change in TuneD. The tests already passed with the previous version of TuneD and need to be adjusted.

/override ci/prow/e2e-gcp-pao
/override ci/prow/e2e-hypershift-pao

Copy link
Contributor

openshift-ci bot commented Sep 16, 2024

@jmencak: Overrode contexts on behalf of jmencak: ci/prow/e2e-gcp-pao, ci/prow/e2e-hypershift-pao

In response to this:

Overriding 2 tests because of a known change in TuneD. The tests already passed with the previous version of TuneD and need to be adjusted.

/override ci/prow/e2e-gcp-pao
/override ci/prow/e2e-hypershift-pao

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@jmencak
Copy link
Contributor

jmencak commented Sep 16, 2024

/jira refresh

@openshift-ci-robot
Copy link
Contributor

@jmencak: No Jira issue is referenced in the title of this pull request.
To reference a jira issue, add 'XYZ-NNN:' to the title of this pull request and request another refresh with /jira refresh.

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@yanirq
Copy link
Contributor

yanirq commented Sep 16, 2024

/retitle NO-JIRA: deferred updates: fix in-place update handling on reboot

to be added manually to #1149

@openshift-ci openshift-ci bot changed the title deferred updates: fix in-place update handling on reboot NO-JIRA: deferred updates: fix in-place update handling on reboot Sep 16, 2024
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Sep 16, 2024
@openshift-ci-robot
Copy link
Contributor

@ffromani: This pull request explicitly references no jira issue.

In response to this:

There are conditions on which we should not set the reload flag. This avoid regression in the test
"Profile deferred when applied should trigger changes when applied fist, then deferred when edited, if tuned restart should be kept deferred"

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot
Copy link
Contributor

/retest-required

Remaining retests: 0 against base HEAD f3aafdb and 2 for PR HEAD c3bb0c4 in total

@jmencak
Copy link
Contributor

jmencak commented Sep 16, 2024

/override ci/prow/e2e-gcp-pao
/override ci/prow/e2e-hypershift-pao

Copy link
Contributor

openshift-ci bot commented Sep 16, 2024

@jmencak: Overrode contexts on behalf of jmencak: ci/prow/e2e-gcp-pao, ci/prow/e2e-hypershift-pao

In response to this:

/override ci/prow/e2e-gcp-pao
/override ci/prow/e2e-hypershift-pao

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-merge-bot openshift-merge-bot bot merged commit 1612025 into openshift:master Sep 16, 2024
16 checks passed
Copy link
Contributor

openshift-ci bot commented Sep 16, 2024

@ffromani: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-bot
Copy link
Contributor

[ART PR BUILD NOTIFIER]

Distgit: cluster-node-tuning-operator
This PR has been included in build cluster-node-tuning-operator-container-v4.18.0-202409162307.p0.g1612025.assembly.stream.el9.
All builds following this will include this PR.

yanirq pushed a commit to yanirq/cluster-node-tuning-operator that referenced this pull request Sep 17, 2024
…enshift#1162)

* log: make sure to have high verbosity in tests

Signed-off-by: Francesco Romani <[email protected]>

* e2e: deferred: stricter testing

add stricter check for the node condition when
validating OCPBUGS-38795

Signed-off-by: Francesco Romani <[email protected]>

* deferred: tuned: fix reload trigger when inplace update

There are conditions on which we should not set the
reload flag. This avoid regression in the test
"Profile deferred when applied should trigger changes when applied fist,
then deferred when edited, if tuned restart should be kept deferred"

Signed-off-by: Francesco Romani <[email protected]>

* deferred: tuned: clarify comments and logs

document better the logic about processing edits

Signed-off-by: Francesco Romani <[email protected]>

---------

Signed-off-by: Francesco Romani <[email protected]>
@ffromani ffromani deleted the deferred-updates-cleanup-20240909 branch September 18, 2024 15:46
openshift-merge-bot bot pushed a commit that referenced this pull request Oct 2, 2024
)

* OCPBUGS-28647: tuned: distinguish deferred updates (#1129)

* OCPBUGS-28647: tuned: distinguish deferred updates

To fully support the usecase described in OCPBUGS-28647 and fix
the issue, we need to further distinguish between first-time
profile change and in-place profile change. This is required to
better support a GitOps flow.

The key distinction is if the recommended profile changes or not,
and there's a desire to defer application of changes only when a
profile is updated (e.g. sysctl modified), not the first time it
is applied.

Thus:
- first-time profile change is a change which triggers a change
  of the recommended profile.
- in-place profile update is a change which does NOT cause a
  switch to a TuneD profile with a different name. This involves
  changes to only the contents of the currently used profile.

We change the way the annotation is used. We now require a value,
which can be either
- always: every Tuned object annotated this way will have its
  application deferred.
- update: every Tuned object annotated this way will be processed
  as usual (and as it wasn't annotated) if it's a first-time profile
  change, but its in-place updates will be deferred.
- a new internal value "never" is also added to be used internally
  to mean the deferred feature is disabled entirely.
  User can use this value but it will explicitly disable the feature
  (which is disabled already by default), thus is redundant and not
  recommended.

Signed-off-by: Francesco Romani <[email protected]>

* e2e: drop tags, use labels

now that we have the more powerful ginkgo labels,
we can stop using tags for newer tests.

Signed-off-by: Francesco Romani <[email protected]>

---------

Signed-off-by: Francesco Romani <[email protected]>

* OCPBUGS-38795: Fix defer status during recommended profile change (#1142)

* OCPBUGS-38795: Fix defer status during recommended profile change

Process recommended profile change even when the profile itself
has not changed itself. The internal profile fingerprint tracking
was not updated during recommended profile update with no internal
changes.

* Add e2e test for defer=update

* NO-JIRA: deferred updates: fix in-place update handling on reboot (#1162)

* log: make sure to have high verbosity in tests

Signed-off-by: Francesco Romani <[email protected]>

* e2e: deferred: stricter testing

add stricter check for the node condition when
validating OCPBUGS-38795

Signed-off-by: Francesco Romani <[email protected]>

* deferred: tuned: fix reload trigger when inplace update

There are conditions on which we should not set the
reload flag. This avoid regression in the test
"Profile deferred when applied should trigger changes when applied fist,
then deferred when edited, if tuned restart should be kept deferred"

Signed-off-by: Francesco Romani <[email protected]>

* deferred: tuned: clarify comments and logs

document better the logic about processing edits

Signed-off-by: Francesco Romani <[email protected]>

---------

Signed-off-by: Francesco Romani <[email protected]>

* Omit unbackported logging support

---------

Signed-off-by: Francesco Romani <[email protected]>
Co-authored-by: Francesco Romani <[email protected]>
Co-authored-by: Martin Sivák <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants