Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OCPBUGS-44559: Manage TuneD profiles with the same name and different content #1216

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

jmencak
Copy link
Contributor

@jmencak jmencak commented Nov 14, 2024

It is possible to create Tuned CRs with TuneD profiles of the same name.
This is problematic when the duplicate TuneD profiles have different
contents. This can cause periodic switches between the duplicate TuneD
profiles and will lead to TuneD restarts.

This change makes the operator control loop stop to avoid switching
between the conflicting TuneD profiles of the same name and warns about
this misconfiguration via:

  • Tuned CR status and prometheus metric (nto_invalid_tuned_exist_info)
  • NTOInvalidTunedExist alert
  • ERROR message issued in the operator logs

Other changes:

  • Switch the ProfileStatusCondition condition type to
    StatusCondition. This is cosmetic only, all fields are preserved.
  • Obsolete pod label-based matching. Trigger an alert
    (NTOPodLabelsUsed) for this.

Fixes: OCPBUGS-44559.

@openshift-ci openshift-ci bot requested review from MarSik and rbaturov November 14, 2024 10:02
Copy link
Contributor

openshift-ci bot commented Nov 14, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jmencak

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 14, 2024
@jmencak
Copy link
Contributor Author

jmencak commented Nov 14, 2024

An example of creating duplicate profiles with conflicting content:
Profile 1

apiVersion: tuned.openshift.io/v1
kind: Tuned
metadata:
  name: openshift-profile-dup
  namespace: openshift-cluster-node-tuning-operator
spec:
  profile:
  - data: |
      [main]
      summary=Custom OpenShift profile
      include=openshift-node
      [sysctl]
      kernel.shmmni=8192
    name: openshift-profile
  recommend:
  - match:
    - label: profile
    priority: 20
    profile: openshift-profile

Profile 2

apiVersion: tuned.openshift.io/v1
kind: Tuned
metadata:
  name: openshift-profile-dup2
  namespace: openshift-cluster-node-tuning-operator
spec:
  profile:
  - data: |
      [main]
      summary=Custom OpenShift profile
      include=openshift-node
      [sysctl]
      kernel.shmmni=8191
    name: openshift-profile
  recommend:
  - match:
    - label: profile
    priority: 20
    profile: openshift-profile

$ oc get tuned
NAME                     VALID   AGE
default                  True    20h
openshift-profile-dup    False   20h
openshift-profile-dup2   False   20h

$ oc get tuned/openshift-profile-dup -o jsonpath='{.status}'        
{"conditions":[{"lastTransitionTime":"2024-12-02T10:58:39Z","message":"Duplicate TuneD profile \"openshift-profile\" with conflicting content detected.","reason":"InvalidConfiguration","status":"False","type":"Valid"}]}

pkg/operator/status.go Outdated Show resolved Hide resolved
@jmencak jmencak changed the title Set co/node-tuning Degraded when conflicting TuneD profiles exist WiP: Set co/node-tuning Degraded when conflicting TuneD profiles exist Nov 14, 2024
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 14, 2024
@jmencak jmencak force-pushed the 4.18-duplicate-profiles-diff-content branch 2 times, most recently from 8658666 to fad99f9 Compare November 14, 2024 14:42
@jmencak jmencak changed the title WiP: Set co/node-tuning Degraded when conflicting TuneD profiles exist Drop TuneD profiles with the same name and different content Nov 14, 2024
@jmencak jmencak changed the title Drop TuneD profiles with the same name and different content NO-JIRA: Drop TuneD profiles with the same name and different content Nov 14, 2024
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Nov 14, 2024
@openshift-ci-robot
Copy link
Contributor

@jmencak: This pull request explicitly references no jira issue.

In response to this:

It is possible to create Tuned CRs with TuneD profiles of the same name. This is problematic when the duplicate TuneD profiles have different content. While the problem is obvious when inspecting the operator logs, this configuration issue is serious enough it should be more visible.

This change makes the node-tuning ClusterOperator object Degraded when conflicting TuneD profiles are created. This in turn creates an alert for cluster administrators to deal with this misconfiguration.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 14, 2024
@jmencak
Copy link
Contributor Author

jmencak commented Nov 14, 2024

/hold
Unfortunately, as is, this will not fix the issue, not even partly. The Tuned CRs can be retrieved in random order.

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 14, 2024
@jmencak jmencak force-pushed the 4.18-duplicate-profiles-diff-content branch from fad99f9 to 47bacc5 Compare November 14, 2024 16:50
@jmencak
Copy link
Contributor Author

jmencak commented Nov 14, 2024

/unhold

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 14, 2024
@jmencak jmencak force-pushed the 4.18-duplicate-profiles-diff-content branch from 47bacc5 to 2c82188 Compare December 2, 2024 12:08
@jmencak jmencak changed the title NO-JIRA: Drop TuneD profiles with the same name and different content OCPBUGS-44559: Manage TuneD profiles with the same name and different content Dec 2, 2024
@openshift-ci-robot
Copy link
Contributor

@jmencak: This pull request references Jira Issue OCPBUGS-44559, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.19.0) matches configured target version for branch (4.19.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira ([email protected]), skipping review request.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

It is possible to create Tuned CRs with TuneD profiles of the same name.
This is problematic when the duplicate TuneD profiles have different
contents. This can cause periodic switches between the duplicate TuneD
profiles and will lead to TuneD restarts.

This change makes the operator control loop stop to avoid switching
between the conflicting TuneD profiles of the same name and warns about
this misconfiguration via:

  • Tuned CR status and prometheus metric (nto_invalid_tuned_exist_info)
  • NTOInvalidTunedExist alert
  • ERROR message issued in the operator logs

Other changes:

  • Switch the ProfileStatusCondition condition type to
    StatusCondition. This is cosmetic only, all fields are preserved.
  • Obsolete pod label-based matching. Trigger an alert
    (NTOPodLabelsUsed) for this.

Fixes: OCPBUGS-44559.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot added the jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. label Dec 2, 2024
It is possible to create Tuned CRs with TuneD profiles of the same name.
This is problematic when the duplicate TuneD profiles have different
contents.  This can cause periodic switches between the duplicate TuneD
profiles and will lead to TuneD restarts.

This change makes the operator control loop stop to avoid switching
between the conflicting TuneD profiles of the same name and warns about
this misconfiguration via:
  * Tuned CR status and prometheus metric (nto_invalid_tuned_exist_info)
  * NTOInvalidTunedExist alert
  * ERROR message issued in the operator logs

Other changes:
  * Switch the ProfileStatusCondition condition type to
    StatusCondition.  This is cosmetic only, all fields are preserved.
  * Obsolete pod label-based matching.  Trigger an alert
    (NTOPodLabelsUsed) for this.

Fixes: OCPBUGS-44559.
@jmencak jmencak force-pushed the 4.18-duplicate-profiles-diff-content branch from 2c82188 to a6c6575 Compare December 2, 2024 12:59
@jmencak
Copy link
Contributor Author

jmencak commented Dec 2, 2024

/retest

1 similar comment
@jmencak
Copy link
Contributor Author

jmencak commented Dec 2, 2024

/retest

Copy link
Contributor

openshift-ci bot commented Dec 3, 2024

@jmencak: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@jmencak
Copy link
Contributor Author

jmencak commented Dec 3, 2024

@MarSik , can you please take a look if this is the direction we want to go? I'd like to ask for a pre-merge testing from the QE.

@jmencak
Copy link
Contributor Author

jmencak commented Dec 13, 2024

/hold
Until I add some e2e testing.

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 13, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants