NO-ISSUE: Recalculate operator dependencies before validations #7227

jhernand · 2025-01-27T10:13:25Z

Currently operator dependencies are only calculated when a cluster is created or updated. But certain dependencies are dynamic, and may change when new hosts are added. For example, if a cluster has the OpenShift AI operator installed, it will also require the NVIDIA GPU operator only if there are hosts that have NVIDIA GPUs. To support those dynamic dependencies this patch modifies the cluster monitor so that it recalculates the operator dependencies before checking validations.
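To make the dynamic dependency concrete, here is a minimal sketch of the rule described above. It is not the code in this patch: the `Host` type, the operator names and `resolveDependencies` are simplified, hypothetical stand-ins for the real models and for the operators API.

```go
package main

import "fmt"

// Host is a simplified, hypothetical host model. The real service keeps the
// hardware inventory in a richer structure.
type Host struct {
	Name      string
	GPUVendor string // empty when the host has no GPU
}

// hasNVIDIAGPU reports whether at least one host carries an NVIDIA GPU.
func hasNVIDIAGPU(hosts []Host) bool {
	for _, h := range hosts {
		if h.GPUVendor == "nvidia" {
			return true
		}
	}
	return false
}

// resolveDependencies returns the requested operators plus the dependencies
// implied by the current hosts. The operator names are illustrative only.
func resolveDependencies(requested []string, hosts []Host) []string {
	resolved := append([]string{}, requested...)
	seen := make(map[string]bool, len(requested))
	for _, name := range requested {
		seen[name] = true
	}
	// The GPU operator is only needed when OpenShift AI is requested and
	// some host actually has an NVIDIA GPU.
	if seen["openshift-ai"] && hasNVIDIAGPU(hosts) && !seen["nvidia-gpu"] {
		resolved = append(resolved, "nvidia-gpu")
	}
	return resolved
}

func main() {
	requested := []string{"openshift-ai"}

	// At cluster creation there are no hosts yet, so nothing is added.
	fmt.Println(resolveDependencies(requested, nil))

	// Once a GPU host is discovered, the same call yields a different
	// result, which is why the calculation has to be repeated.
	hosts := []Host{{Name: "worker-0", GPUVendor: "nvidia"}}
	fmt.Println(resolveDependencies(requested, hosts))
}
```

Because the result depends on the hosts, calculating it only at cluster creation or update time would miss hosts that are discovered later.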

openshift-ci · 2025-01-27T10:14:01Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jhernand

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [jhernand]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

jhernand · 2025-01-27T10:17:54Z

internal/cluster/refresh_status_preprocessor.go

+		}
+	}
+	for _, addedOperator := range addedOperators {
+		err = c.db.Save(addedOperator).Error


@rccrdpccl in the previous version of this pull request (now closed) @gamli75 was concerned that saving the cluster changes directly to the database would cause inconsistent values in Elastic. I added code below to also update the feature set. What else needs to be done to ensure consistency?

In order to keep the Elastic data consistent, we send notifications about important state changes. That covers clusters, hosts and infraenvs, and only the fields we deem important, because we want to keep the event volume contained. Unfortunately, there isn't a centralized place for this, so the code that does it is dispersed.
An example could be updating hosts in some circumstances, but in this case we should update the cluster state.
Related: https://github.com/openshift/assisted-service/blob/398cf47615a80714a11c2bd722681fc8c70a7cc4/internal/stream/notification_stream.go

Added code to send the notifications, please take another look @rccrdpccl.
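For readers following this thread, the "save, then notify" pattern being discussed could look roughly like the sketch below. It does not reproduce the API of `internal/stream`: the `store` and `notifier` interfaces, `saveAndNotify` and the in-memory fakes are all hypothetical names used for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// Cluster is a simplified, hypothetical cluster model.
type Cluster struct {
	ID        string
	Operators []string
}

// store and notifier are hypothetical abstractions over the database and
// the notification stream.
type store interface {
	Save(c *Cluster) error
}

type notifier interface {
	NotifyCluster(c *Cluster) error
}

// saveAndNotify persists the cluster and then publishes the new state, so
// that downstream consumers do not keep a stale copy.
func saveAndNotify(s store, n notifier, c *Cluster) error {
	if err := s.Save(c); err != nil {
		return fmt.Errorf("failed to save cluster %q: %w", c.ID, err)
	}
	if err := n.NotifyCluster(c); err != nil {
		return fmt.Errorf("failed to notify about cluster %q: %w", c.ID, err)
	}
	return nil
}

// memStore and memNotifier are in-memory fakes used to run the sketch.
type memStore struct{ saved []string }

func (m *memStore) Save(c *Cluster) error {
	m.saved = append(m.saved, c.ID)
	return nil
}

type memNotifier struct {
	events []string
	fail   bool
}

func (m *memNotifier) NotifyCluster(c *Cluster) error {
	if m.fail {
		return errors.New("stream unavailable")
	}
	m.events = append(m.events, c.ID)
	return nil
}

func main() {
	s, n := &memStore{}, &memNotifier{}
	c := &Cluster{ID: "c1", Operators: []string{"openshift-ai", "nvidia-gpu"}}
	if err := saveAndNotify(s, n, c); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("saved:", s.saved, "notified:", n.events)
}
```

The point of the sketch is only the ordering: every write that changes state consumers care about is followed by a notification, and a failed notification is reported instead of being silently dropped.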

codecov · 2025-01-27T11:05:55Z

Codecov Report

Attention: Patch coverage is 31.06061% with 91 lines in your changes missing coverage. Please review.

Project coverage is 67.77%. Comparing base (8dd62c1) to head (6eb9b99).
Report is 7 commits behind head on master.

Files with missing lines	Patch %	Lines
internal/cluster/refresh_status_preprocessor.go	30.08%	73 Missing and 13 partials ⚠️
internal/operators/common/common.go	0.00%	5 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #7227      +/-   ##
==========================================
- Coverage   67.92%   67.77%   -0.15%     
==========================================
  Files         298      298              
  Lines       40710    40834     +124     
==========================================
+ Hits        27654    27677      +23     
- Misses      10580    10667      +87     
- Partials     2476     2490      +14

Files with missing lines	Coverage Δ
internal/cluster/cluster.go	`65.94% <100.00%> (ø)`
internal/operators/manager.go	`79.56% <100.00%> (ø)`
internal/operators/common/common.go	`75.00% <0.00%> (-25.00%)`	⬇️
internal/cluster/refresh_status_preprocessor.go	`63.70% <30.08%> (-30.62%)`	⬇️

... and 10 files with indirect coverage changes

paul-maidment · 2025-01-28T23:06:17Z

internal/cluster/refresh_status_preprocessor.go

+func (r *refreshPreprocessor) recalculateOperatorDependencies(c *clusterPreprocessContext) error {
+	// Calculate and save the operators that have been added, updated or deleted:
+	previousOperators := c.cluster.MonitoredOperators
+	currentOperators, err := r.operatorsAPI.ResolveDependencies(c.cluster, c.cluster.MonitoredOperators)


`c.cluster.MonitoredOperators` are the current registered operators for the cluster?

`operatorsAPI.ResolveDependencies` returns dependencies of the current cluster operators?

I think the terms `previousOperators` and `currentOperators` are a little confusing.

If I understand this code block correctly, we are doing the following:

1. Obtaining a list of the monitored operators for the cluster.
2. Finding dependencies required for these monitored operators; this is again a list of MonitoredOperators.
3. The resulting operators are then created, updated or deleted according to their presence in the previous operators.

I wonder if `operators` and `operatorDependencies` (you refer to `currentOperators` as "operator dependencies" in the error message) or something along those lines would be better?

In its current form, it's quite hard to follow the intent of this code (but I think I got the basic gist?)

> `c.cluster.MonitoredOperators` are the current registered operators for the cluster?

Yes, I think so.

> `operatorsAPI.ResolveDependencies` returns dependencies of the current cluster operators?

I believe that `ResolveDependencies` calculates the dependencies of the list of operators that is passed as a parameter. Then it merges that with the complete list of operators of the cluster, and returns the result. Note that it returns everything: the operators explicitly added by the user as well as the ones automatically added to resolve the dependencies.

> I think the terms `previousOperators` and `currentOperators` are a little confusing.
>
> If I understand this code block correctly, we are doing the following:
>
> 1. Obtaining a list of the monitored operators for the cluster.
> 2. Finding dependencies required for these monitored operators; this is again a list of MonitoredOperators.
> 3. The resulting operators are then created, updated or deleted according to their presence in the previous operators.
>
> I wonder if `operators` and `operatorDependencies` (you refer to `currentOperators` as "operator dependencies" in the error message) or something along those lines would be better?

The ideal names would be `listOfOperatorsBeforeRecalculatingDependencies` and `listOfOperatorsAfterRecalculatingDependencies`, but those are too long. I can apply the renaming that you suggest, but take into account that "current..." is not only the dependencies; it is everything.

> In its current form, it's quite hard to follow the intent of this code (but I think I got the basic gist?)

I am renaming the variables to `operatorsBeforeResolve` and `operatorsAfterResolve`. Hope that clarifies it a bit.
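The before/after comparison discussed in this thread can be sketched as follows. This is an illustration under simplifying assumptions, not the code in the patch: `Operator` is a hypothetical two-field struct standing in for the real monitored operator model, and `diffOperators` is a made-up helper name.

```go
package main

import "fmt"

// Operator is a simplified, hypothetical stand-in for a monitored operator.
// The name identifies it; the other fields may change during a resolve.
type Operator struct {
	Name      string
	Namespace string
}

// diffOperators compares the operators before and after resolving
// dependencies and classifies them as added, updated or deleted.
func diffOperators(before, after []Operator) (added, updated, deleted []Operator) {
	beforeByName := make(map[string]Operator, len(before))
	for _, op := range before {
		beforeByName[op.Name] = op
	}
	afterNames := make(map[string]bool, len(after))
	for _, op := range after {
		afterNames[op.Name] = true
		prev, existed := beforeByName[op.Name]
		switch {
		case !existed:
			added = append(added, op)
		case prev != op:
			// Same name, but some other property changed.
			updated = append(updated, op)
		}
	}
	for _, op := range before {
		if !afterNames[op.Name] {
			deleted = append(deleted, op)
		}
	}
	return added, updated, deleted
}

func main() {
	operatorsBeforeResolve := []Operator{
		{Name: "openshift-ai", Namespace: "redhat-ods-operator"},
		{Name: "lvm", Namespace: "openshift-storage"},
	}
	operatorsAfterResolve := []Operator{
		{Name: "openshift-ai", Namespace: "redhat-ods-operator"},
		{Name: "nvidia-gpu", Namespace: "nvidia-gpu-operator"},
	}
	added, updated, deleted := diffOperators(operatorsBeforeResolve, operatorsAfterResolve)
	fmt.Println("added:", added)
	fmt.Println("updated:", updated)
	fmt.Println("deleted:", deleted)
}
```

Note that the "after" list contains everything, as explained above: operators chosen by the user and operators added as dependencies. Only the comparison with the "before" list tells which ones need to be written or removed.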

paul-maidment · 2025-01-28T23:22:38Z

internal/cluster/refresh_status_preprocessor.go

+			)
+		}
+	}
+	for _, updatedOperator := range updatedOperators {


I don't get the difference between an added and an updated operator in this code; it looks like exactly the same operations are performed on each.

Specifically, `c.db.Save(updatedOperator).Error` vs `c.db.Save(addedOperator).Error`.

Why do we have this split if the behaviour is the same?

Edit: Reading further on I can see that this may be related to feature usage. (operators you add or remove may cause a change to feature flags) I still think we could do something more condensed in this area though.

Maybe something like this could allow a single loop to handle the storage for added and updated.

for _, operator := range append(addedOperators, updatedOperators...) {

I don't imagine these operator lists are really large enough for space or time complexity to be an issue with them, so this might be a nice way to compact things down a bit.

Operators are not only a name: they also have other properties that could change during the recalculation of dependencies. `updatedOperators` contains the operators that already existed before the recalculation and have changed (in something other than the name) after it. Currently we happen to do almost the same for both: we save them to the database with the `Save` method (although internally `Save` does an INSERT for added operators and an UPDATE for updated operators). I say "almost" because the error message generated on failure is also different: `failed to add ...` vs `failed to update ...`.

When I wrote this initially I just wrote three independent loops because I didn't know at that point what I would be writing inside them. The result ended up having only very minor differences. I think keeping them separate makes sense, even if there were no difference at all. But merging them is also fine. If you insist a wee bit I will do that.

Currently operator dependencies are only calculated when a cluster is created or updated. But certain dependencies are dynamic, and may change when new hosts are added. For example, if a cluster has the OpenShift AI operator installed, it will also require the NVIDIA GPU operator only if there are hosts that have NVIDIA GPUs. To support those dynamic dependencies this patch modifies the cluster monitor so that it recalculates the operator dependencies before checking validations.

Signed-off-by: Juan Hernandez <[email protected]>

openshift-ci · 2025-01-29T13:17:43Z

@jhernand: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
ci/prow/edge-e2e-metal-assisted-4-18	`6eb9b99`	link	true	`/test edge-e2e-metal-assisted-4-18`
ci/prow/okd-scos-e2e-aws-ovn	`6eb9b99`	link	false	`/test okd-scos-e2e-aws-ovn`
ci/prow/edge-e2e-metal-assisted-mtv-4-17	`6eb9b99`	link	true	`/test edge-e2e-metal-assisted-mtv-4-17`
ci/prow/mce-images	`6eb9b99`	link	true	`/test mce-images`
ci/prow/edge-e2e-metal-assisted-cnv-4-17	`6eb9b99`	link	true	`/test edge-e2e-metal-assisted-cnv-4-17`
ci/prow/e2e-agent-compact-ipv4	`6eb9b99`	link	true	`/test e2e-agent-compact-ipv4`
ci/prow/edge-e2e-metal-assisted-lvm-4-18	`6eb9b99`	link	true	`/test edge-e2e-metal-assisted-lvm-4-18`
ci/prow/images	`6eb9b99`	link	true	`/test images`
ci/prow/edge-subsystem-kubeapi-aws	`6eb9b99`	link	true	`/test edge-subsystem-kubeapi-aws`
ci/prow/edge-e2e-metal-assisted-odf-4-17	`6eb9b99`	link	true	`/test edge-e2e-metal-assisted-odf-4-17`
ci/prow/edge-ci-index	`6eb9b99`	link	true	`/test edge-ci-index`
ci/prow/edge-e2e-ai-operator-ztp	`6eb9b99`	link	true	`/test edge-e2e-ai-operator-ztp`
ci/prow/edge-images	`6eb9b99`	link	true	`/test edge-images`
ci/prow/edge-subsystem-aws	`6eb9b99`	link	true	`/test edge-subsystem-aws`

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jan 27, 2025

openshift-ci bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jan 27, 2025

openshift-ci bot requested review from mlorenzofr and omertuc January 27, 2025 10:13

openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 27, 2025

jhernand mentioned this pull request Jan 27, 2025

NO-ISSUE: Recalculate operator dependencies before validations #7206

Closed

20 tasks

jhernand commented Jan 27, 2025

View reviewed changes

jhernand mentioned this pull request Jan 27, 2025

NO-ISSUE: Add NVIDIA GPU operator only if there are NVIDIA GPUs #7218

Open

20 tasks

paul-maidment reviewed Jan 28, 2025

View reviewed changes

jhernand force-pushed the recalculate_operator_dependencies_from_cluster_monitor_loop branch from 05a968f to 8b67fe0 Compare January 29, 2025 12:12

openshift-ci bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jan 29, 2025

jhernand force-pushed the recalculate_operator_dependencies_from_cluster_monitor_loop branch from 8b67fe0 to 6eb9b99 Compare January 29, 2025 12:24
