NO-ISSUE: Recalculate operator dependencies before validations #7227

Open: wants to merge 1 commit into base: master


jhernand
Contributor

Currently operator dependencies are only calculated when a cluster is created or updated. But certain dependencies are dynamic, and may change when new hosts are added. For example, if a cluster has the OpenShift AI operator installed, it will also require the NVIDIA GPU operator only if there are hosts that have NVIDIA GPUs. To support those dynamic dependencies this patch modifies the cluster monitor so that it recalculates the operator dependencies before checking validations.
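The dynamic dependency described above can be illustrated with a minimal sketch. This is not the code of the patch: the `Host` type, the `GPUVendors` field, the operator names `openshift-ai` and `nvidia-gpu`, and both functions are hypothetical names chosen only to show why the result depends on the hosts and so has to be recalculated when hosts are added.

```go
package main

import "fmt"

// Host is a hypothetical, simplified view of a cluster host.
type Host struct {
	Name       string
	GPUVendors []string // hypothetical inventory field
}

// hasNvidiaGPU reports whether any host lists an NVIDIA GPU.
func hasNvidiaGPU(hosts []Host) bool {
	for _, h := range hosts {
		for _, vendor := range h.GPUVendors {
			if vendor == "nvidia" {
				return true
			}
		}
	}
	return false
}

// resolveOperators returns the selected operators plus the dynamic
// dependency: the GPU operator is added only when the AI operator is
// selected and at least one host has an NVIDIA GPU.
func resolveOperators(selected []string, hosts []Host) []string {
	result := append([]string{}, selected...)
	seen := make(map[string]bool, len(selected))
	for _, name := range selected {
		seen[name] = true
	}
	if seen["openshift-ai"] && hasNvidiaGPU(hosts) && !seen["nvidia-gpu"] {
		result = append(result, "nvidia-gpu")
	}
	return result
}

func main() {
	selected := []string{"openshift-ai"}

	// Before any GPU host is added, nothing extra is required.
	fmt.Println(resolveOperators(selected, nil))

	// After a GPU host is added, the same selection resolves differently.
	hosts := []Host{{Name: "worker-0", GPUVendors: []string{"nvidia"}}}
	fmt.Println(resolveOperators(selected, hosts))
}
```

Because the second call returns a different list for the same selection, a resolution performed only at cluster creation or update time would miss the new dependency.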

List all the issues related to this PR

  • New Feature
  • Enhancement
  • Bug fix
  • Tests
  • Documentation
  • CI/CD

What environments does this code impact?

  • Automation (CI, tools, etc)
  • Cloud
  • Operator Managed Deployments
  • None

How was this code tested?

  • assisted-test-infra environment
  • dev-scripts environment
  • Reviewer's test appreciated
  • Waiting for CI to do a full test run
  • Manual (Elaborate on how it was tested)
  • No tests needed

Checklist

  • Title and description added to both, commit and PR.
  • Relevant issues have been associated (see CONTRIBUTING guide)
  • This change does not require a documentation update (docstring, docs, README, etc)
  • Does this change include unit-tests (note that code changes require unit-tests)

Reviewers Checklist

  • Are the title and description (in both PR and commit) meaningful and clear?
  • Is there a bug required (and linked) for this change?
  • Should this PR be backported?

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jan 27, 2025
@openshift-ci-robot

@jhernand: This pull request explicitly references no jira issue.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jan 27, 2025
@openshift-ci openshift-ci bot requested review from mlorenzofr and omertuc January 27, 2025 10:13

openshift-ci bot commented Jan 27, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jhernand

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 27, 2025
}
}
for _, addedOperator := range addedOperators {
err = c.db.Save(addedOperator).Error
Contributor Author

@rccrdpccl in the previous version of this pull request (now closed) @gamli75 was concerned that saving the cluster changes directly to the DB would cause inconsistent values in Elastic. I added code below to also update the feature set. What else needs to be done to ensure consistency?

Contributor

To keep the Elastic data consistent, we send notifications about important state changes to clusters, hosts and infraenvs. We only notify about the fields we deem important, because we want to keep the event volume contained. Unfortunately, there isn't a centralized place for this, so the code that does it is dispersed.
One example is updating hosts in some circumstances, but in this case it is the cluster state that we should update.
Related: https://github.com/openshift/assisted-service/blob/398cf47615a80714a11c2bd722681fc8c70a7cc4/internal/stream/notification_stream.go
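The pattern described in this comment, persisting a change and then notifying so that downstream consumers stay consistent, could be sketched as follows. The `Store` and `Notifier` interfaces, their methods and the in-memory implementations are hypothetical; they are not the API of the linked `notification_stream.go` file.

```go
package main

import (
	"errors"
	"fmt"
)

// Cluster is a hypothetical, simplified cluster record.
type Cluster struct {
	ID        string
	Operators []string
}

// Store and Notifier are hypothetical interfaces for this sketch only.
type Store interface {
	SaveCluster(c *Cluster) error
}

type Notifier interface {
	NotifyCluster(c *Cluster) error
}

// memoryStore keeps saved operator lists in a map.
type memoryStore struct{ saved map[string][]string }

func (s *memoryStore) SaveCluster(c *Cluster) error {
	s.saved[c.ID] = append([]string{}, c.Operators...)
	return nil
}

// recordingNotifier records the IDs of the clusters it was told about.
type recordingNotifier struct{ events []string }

func (n *recordingNotifier) NotifyCluster(c *Cluster) error {
	n.events = append(n.events, c.ID)
	return nil
}

// saveAndNotify persists the cluster first and sends the notification
// only if the save succeeded, so consumers never see unsaved state.
func saveAndNotify(store Store, notifier Notifier, c *Cluster) error {
	if c == nil {
		return errors.New("cluster is nil")
	}
	if err := store.SaveCluster(c); err != nil {
		return fmt.Errorf("failed to save cluster %q: %w", c.ID, err)
	}
	if err := notifier.NotifyCluster(c); err != nil {
		return fmt.Errorf("failed to notify about cluster %q: %w", c.ID, err)
	}
	return nil
}

func main() {
	store := &memoryStore{saved: map[string][]string{}}
	notifier := &recordingNotifier{}
	cluster := &Cluster{ID: "c1", Operators: []string{"openshift-ai", "nvidia-gpu"}}
	if err := saveAndNotify(store, notifier, cluster); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("saved:", store.saved["c1"], "events:", notifier.events)
}
```

Keeping the two steps in one helper is one way to reduce the dispersion mentioned above, although whether that fits the real code base is an open question.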

Contributor Author

Added code to send the notifications, please take another look @rccrdpccl.


codecov bot commented Jan 27, 2025

Codecov Report

Attention: Patch coverage is 31.06061% with 91 lines in your changes missing coverage. Please review.

Project coverage is 67.77%. Comparing base (8dd62c1) to head (6eb9b99).
Report is 7 commits behind head on master.

Files with missing lines Patch % Lines
internal/cluster/refresh_status_preprocessor.go 30.08% 73 Missing and 13 partials ⚠️
internal/operators/common/common.go 0.00% 5 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #7227      +/-   ##
==========================================
- Coverage   67.92%   67.77%   -0.15%     
==========================================
  Files         298      298              
  Lines       40710    40834     +124     
==========================================
+ Hits        27654    27677      +23     
- Misses      10580    10667      +87     
- Partials     2476     2490      +14     
Files with missing lines Coverage Δ
internal/cluster/cluster.go 65.94% <100.00%> (ø)
internal/operators/manager.go 79.56% <100.00%> (ø)
internal/operators/common/common.go 75.00% <0.00%> (-25.00%) ⬇️
internal/cluster/refresh_status_preprocessor.go 63.70% <30.08%> (-30.62%) ⬇️

... and 10 files with indirect coverage changes

func (r *refreshPreprocessor) recalculateOperatorDependencies(c *clusterPreprocessContext) error {
	// Calculate and save the operators that have been added, updated or deleted:
	previousOperators := c.cluster.MonitoredOperators
	currentOperators, err := r.operatorsAPI.ResolveDependencies(c.cluster, c.cluster.MonitoredOperators)
Contributor

@paul-maidment paul-maidment Jan 28, 2025

c.cluster.MonitoredOperators are the current registered operators for the cluster?

operatorsAPI.ResolveDependencies returns dependencies of the current cluster operators?

I think the terms previousOperators and currentOperators are a little confusing

If I understand this code block correctly, we are doing the following

  • Obtaining a list of the monitored operators for the cluster
  • Finding dependencies required for these monitored operators, this is again a list of MonitoredOperators
  • The resulting operators are then created, updated, deleted according to their presence in the previous operators

I wonder if operators and operatorDependencies (you refer to currentOperators as "operator dependencies" in the error message) or something along those lines would be better?

In its current form, it's quite hard to follow the intent of this code (but I think I got the basic gist?)

Contributor Author

c.cluster.MonitoredOperators are the current registered operators for the cluster?

Yes, I think so.

operatorsAPI.ResolveDependencies returns dependencies of the current cluster operators?

I believe that ResolveDependencies calculates the dependencies of the list of operators that is passed as a parameter. Then it merges that with the complete list of operators of the cluster, and returns the result. Note that it returns everything: the operators explicitly added by the user as well as the ones automatically added to resolve the dependencies.

I think the terms previousOperators and currentOperators are a little confusing

If I understand this code block correctly, we are doing the following

  • Obtaining a list of the monitored operators for the cluster
  • Finding dependencies required for these monitored operators, this is again a list of MonitoredOperators
  • The resulting operators are then created, updated, deleted according to their presence in the previous operators

I wonder if operators and operatorDependencies (you refer to currentOperators as "operator dependencies" in the error message) or something along those lines would be better?

The ideal names would be listOfOperatorsBeforeRecalculatingDependencies and listOfOperatorsAfterRecalculatingDependencies, but that is too long. I can apply the renaming that you suggest, but take into account that "current..." is not only the dependencies; it is everything.

In its current form, it's quite hard to follow the intent of this code (but I think I got the basic gist?)

Contributor Author

I am renaming the variables to operatorsBeforeResolve and operatorsAfterResolve. Hope that clarifies it a bit.
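As an illustration of the three steps discussed in this thread, the comparison between the two lists could look roughly like the sketch below. The `Operator` type with a single `Properties` field and the `diffOperators` function are hypothetical simplifications, not the code of the pull request; only the variable names `operatorsBeforeResolve` and `operatorsAfterResolve` come from the discussion.

```go
package main

import "fmt"

// Operator is a hypothetical, simplified stand-in for a monitored operator.
type Operator struct {
	Name       string
	Properties string
}

// diffOperators compares the operator list before and after dependency
// resolution. An operator is "added" if its name is new, "updated" if the
// name existed but something else changed, and "deleted" if the name is
// no longer present after resolution.
func diffOperators(before, after []Operator) (added, updated, deleted []Operator) {
	beforeByName := make(map[string]Operator, len(before))
	for _, op := range before {
		beforeByName[op.Name] = op
	}
	afterNames := make(map[string]bool, len(after))
	for _, op := range after {
		afterNames[op.Name] = true
		previous, existed := beforeByName[op.Name]
		switch {
		case !existed:
			added = append(added, op)
		case previous != op:
			updated = append(updated, op)
		}
	}
	for _, op := range before {
		if !afterNames[op.Name] {
			deleted = append(deleted, op)
		}
	}
	return added, updated, deleted
}

func main() {
	operatorsBeforeResolve := []Operator{
		{Name: "openshift-ai"},
		{Name: "odf", Properties: "v1"},
		{Name: "lvm"},
	}
	operatorsAfterResolve := []Operator{
		{Name: "openshift-ai"},
		{Name: "odf", Properties: "v2"},
		{Name: "nvidia-gpu"},
	}
	added, updated, deleted := diffOperators(operatorsBeforeResolve, operatorsAfterResolve)
	fmt.Println("added:", added)
	fmt.Println("updated:", updated)
	fmt.Println("deleted:", deleted)
}
```

Note that the "after" list contains everything, user-selected operators as well as resolved dependencies, which is why unchanged operators appear in neither of the three results.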

)
}
}
for _, updatedOperator := range updatedOperators {
Contributor

@paul-maidment paul-maidment Jan 28, 2025

I don't get the difference between an added and an updated operator in this code, it looks like exactly the same operations are performed on each?

specifically c.db.Save(updatedOperator).Error vs c.db.Save(addedOperator).Error

Why do we have this split if the behaviour is the same?

Edit: Reading further on I can see that this may be related to feature usage. (operators you add or remove may cause a change to feature flags) I still think we could do something more condensed in this area though.

Maybe something like this could allow a single loop to handle the storage for added and updated.

for _, operator := range append(addedOperators, updatedOperators...) {

I don't imagine these operator lists are really large enough for space or time complexity to be an issue with them, so this might be a nice way to compact things down a bit.

Contributor Author

Operators are not only a name: they also have other properties that could change during the recalculation of dependencies. updatedOperators contains the list of operators that already existed before the recalculation and have changed (other than the name) after the recalculation. Currently we happen to do almost the same for both: we save them to the database with the Save method (although internally that Save method does an INSERT for added operators and an UPDATE for updated operators). I say "almost" because the error message generated if that fails is also different: failed to add ... vs failed to update ....

When I wrote this initially I just wrote three independent loops because I didn't know at that point what I would be writing inside. The result ended up having only very minor differences. I think the split makes sense, even if there was no difference. But merging them is also fine. If you insist a wee bit I will do that.
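One possible middle ground between the two positions, shown here only as a sketch, is a single loop that still keeps the distinct "failed to add" and "failed to update" messages. The `Operator` and `change` types, `saveAll`, and the `failOn` helper are hypothetical; the `save` callback stands in for the database call and is not the real persistence code.

```go
package main

import (
	"errors"
	"fmt"
)

// Operator is a hypothetical, simplified operator record.
type Operator struct{ Name string }

// change pairs an operator with the action used in error messages.
type change struct {
	action   string
	operator Operator
}

// saveAll stores added and updated operators in one loop. It builds a
// fresh slice instead of appending to the added slice, so the caller's
// slices are never modified.
func saveAll(save func(Operator) error, added, updated []Operator) error {
	changes := make([]change, 0, len(added)+len(updated))
	for _, op := range added {
		changes = append(changes, change{action: "add", operator: op})
	}
	for _, op := range updated {
		changes = append(changes, change{action: "update", operator: op})
	}
	for _, ch := range changes {
		if err := save(ch.operator); err != nil {
			return fmt.Errorf("failed to %s operator %q: %w", ch.action, ch.operator.Name, err)
		}
	}
	return nil
}

// failOn returns a save function that fails for the named operator.
// It exists only to demonstrate the error messages.
func failOn(name string) func(Operator) error {
	return func(op Operator) error {
		if op.Name == name {
			return errors.New("simulated database error")
		}
		return nil
	}
}

func main() {
	added := []Operator{{Name: "nvidia-gpu"}}
	updated := []Operator{{Name: "odf"}}
	fmt.Println(saveAll(failOn("odf"), added, updated))
	fmt.Println(saveAll(failOn("none"), added, updated))
}
```

Building a new slice also sidesteps a subtlety of `append(addedOperators, updatedOperators...)`: if `addedOperators` has spare capacity, the append writes into its backing array.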

@jhernand jhernand force-pushed the recalculate_operator_dependencies_from_cluster_monitor_loop branch from 05a968f to 8b67fe0 Compare January 29, 2025 12:12
@openshift-ci openshift-ci bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jan 29, 2025
Currently operator dependencies are only calculated when a cluster is
created or updated. But certain dependencies are dynamic, and may
change when new hosts are added. For example, if a cluster has the
OpenShift AI operator installed, it will also require the NVIDIA GPU
operator only if there are hosts that have NVIDIA GPUs. To support those
dynamic dependencies this patch modifies the cluster monitor so that it
recalculates the operator dependencies before checking validations.

Signed-off-by: Juan Hernandez <[email protected]>
@jhernand jhernand force-pushed the recalculate_operator_dependencies_from_cluster_monitor_loop branch from 8b67fe0 to 6eb9b99 Compare January 29, 2025 12:24

openshift-ci bot commented Jan 29, 2025

@jhernand: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/edge-e2e-metal-assisted-4-18 6eb9b99 link true /test edge-e2e-metal-assisted-4-18
ci/prow/okd-scos-e2e-aws-ovn 6eb9b99 link false /test okd-scos-e2e-aws-ovn
ci/prow/edge-e2e-metal-assisted-mtv-4-17 6eb9b99 link true /test edge-e2e-metal-assisted-mtv-4-17
ci/prow/mce-images 6eb9b99 link true /test mce-images
ci/prow/edge-e2e-metal-assisted-cnv-4-17 6eb9b99 link true /test edge-e2e-metal-assisted-cnv-4-17
ci/prow/e2e-agent-compact-ipv4 6eb9b99 link true /test e2e-agent-compact-ipv4
ci/prow/edge-e2e-metal-assisted-lvm-4-18 6eb9b99 link true /test edge-e2e-metal-assisted-lvm-4-18
ci/prow/images 6eb9b99 link true /test images
ci/prow/edge-subsystem-kubeapi-aws 6eb9b99 link true /test edge-subsystem-kubeapi-aws
ci/prow/edge-e2e-metal-assisted-odf-4-17 6eb9b99 link true /test edge-e2e-metal-assisted-odf-4-17
ci/prow/edge-ci-index 6eb9b99 link true /test edge-ci-index
ci/prow/edge-e2e-ai-operator-ztp 6eb9b99 link true /test edge-e2e-ai-operator-ztp
ci/prow/edge-images 6eb9b99 link true /test edge-images
ci/prow/edge-subsystem-aws 6eb9b99 link true /test edge-subsystem-aws

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
4 participants