Managed Infrastructure Maintenance Operator - Milestone 1 #3571
base: master
Conversation
Please rebase pull request.
I've started a review and reached my ingestion limit; I'll keep reviewing later.
if err, ok := err.(*cosmosdb.Error); ok && err.StatusCode == http.StatusConflict {
	err.StatusCode = http.StatusPreconditionFailed
}
Why are we overwriting the HTTP status code?
It looks to me like it's because of line 143: in case of a conflict, we change the status to one that will cause the cosmosdb Retry function to retry the request. If that's the case, I think a comment here would be helpful, in case the behaviour of the functions that use the cosmosdb Retry function changes in the future.
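For example, a sketch of the kind of comment that could be added (the reasoning in the comment is taken from this discussion, not verified against the retry helper itself):

// The cosmosdb retry helper only retries on StatusPreconditionFailed, so remap
// conflicts to that status here to force the whole read-modify-write cycle to
// be retried rather than surfacing the conflict to the caller.
if err, ok := err.(*cosmosdb.Error); ok && err.StatusCode == http.StatusConflict {
	err.StatusCode = http.StatusPreconditionFailed
}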
A lot of the code here is copy-pasted; we should probably open follow-ups to improve this consistently.
pkg/mimo/actuator/manager.go (outdated)
		docs, err := i.Next(ctx, -1)
		if err != nil {
			return false, err
		}
		if docs == nil {
			break
		}

		docList = append(docList, docs.MaintenanceManifestDocuments...)
	}

	manifestsToAction := make([]*api.MaintenanceManifestDocument, 0)

	sort.SliceStable(docList, func(i, j int) bool {
		if docList[i].MaintenanceManifest.RunAfter != docList[j].MaintenanceManifest.RunAfter {
			return docList[i].MaintenanceManifest.Priority < docList[j].MaintenanceManifest.Priority
		}

		return docList[i].MaintenanceManifest.RunAfter < docList[j].MaintenanceManifest.RunAfter
	})

	evaluationTime := a.now()

	// Check for manifests that have timed out first
	for _, doc := range docList {
		if evaluationTime.After(time.Unix(int64(doc.MaintenanceManifest.RunBefore), 0)) {
			// timed out, mark as such
			a.log.Infof("marking %v as outdated: %v older than %v", doc.ID, doc.MaintenanceManifest.RunBefore, evaluationTime.UTC())

			_, err := a.mmf.Patch(ctx, a.clusterID, doc.ID, func(d *api.MaintenanceManifestDocument) error {
				d.MaintenanceManifest.State = api.MaintenanceManifestStateTimedOut
				d.MaintenanceManifest.StatusText = fmt.Sprintf("timed out at %s", evaluationTime.UTC())
				return nil
			})
			if err != nil {
				a.log.Error(err)
			}
		} else {
			// not timed out, queue it for actioning
			manifestsToAction = append(manifestsToAction, doc)
		}
	}

	// Nothing to do, don't dequeue
	if len(manifestsToAction) == 0 {
		return false, nil
	}

	// Dequeue the cluster document
	oc, err := a.oc.Get(ctx, a.clusterID)
	if err != nil {
		return false, err
	}

	oc, err = a.oc.DoDequeue(ctx, oc)
	if err != nil {
		return false, err // This will include StatusPreconditionFailed errors
	}

	taskContext := newTaskContext(a.env, a.log, oc)

	// Execute on the manifests we want to action
	for _, doc := range manifestsToAction {
		// Look up the task registered for this manifest's maintenance set
		f, ok := a.tasks[doc.MaintenanceManifest.MaintenanceSetID]
		if !ok {
			a.log.Infof("not found %v", doc.MaintenanceManifest.MaintenanceSetID)
			continue
		}

		// Attempt to lease the manifest
		doc, err = a.mmf.Lease(ctx, a.clusterID, doc.ID)
		if err != nil {
			// log and continue if it doesn't work
			a.log.Error(err)
			continue
		}

		// if we've tried too many times, give up
		if doc.Dequeues > maxDequeueCount {
			err := fmt.Errorf("dequeued %d times, failing", doc.Dequeues)
			_, leaseErr := a.mmf.EndLease(ctx, doc.ClusterID, doc.ID, api.MaintenanceManifestStateTimedOut, to.StringPtr(err.Error()))
			if leaseErr != nil {
				a.log.Error(leaseErr)
			}
			continue
		}

		// Perform the task
		state, msg := f(ctx, taskContext, doc, oc)
		_, err = a.mmf.EndLease(ctx, doc.ClusterID, doc.ID, state, &msg)
		if err != nil {
			a.log.Error(err)
		}
	}

	// release the OpenShiftCluster
	_, err = a.oc.EndLease(ctx, a.clusterID, oc.OpenShiftCluster.Properties.ProvisioningState, api.ProvisioningStateMaintenance, nil)
	return true, err
}
Suggest splitting the logic into private funcs to improve readability, something like:
func (a *actuator) Process(ctx context.Context) (bool, error) {
	// Fetch manifests
	manifests, err := a.fetchManifests(ctx)
	if err != nil {
		return false, err
	}

	// Evaluate and segregate manifests
	expiredManifests, actionableManifests := a.evaluateManifests(manifests)

	// Handle expired manifests
	a.handleExpiredManifests(ctx, expiredManifests)

	// If no actionable manifests, return
	if len(actionableManifests) == 0 {
		return false, nil
	}

	// Dequeue the cluster document
	oc, err := a.oc.DequeueCluster(ctx, a.clusterID)
	if err != nil {
		return false, err
	}

	// Execute tasks
	taskContext := newTaskContext(a.env, a.log, oc)
	a.executeTasks(ctx, taskContext, actionableManifests)

	// Release the cluster lease
	return true, a.oc.EndClusterLease(ctx, a.clusterID, oc)
}

func (a *actuator) fetchManifests(ctx context.Context) ([]*api.MaintenanceManifestDocument, error) {
	// Fetch manifests logic here
}

func (a *actuator) evaluateManifests(manifests []*api.MaintenanceManifestDocument) ([]*api.MaintenanceManifestDocument, []*api.MaintenanceManifestDocument) {
	// Evaluation logic here
}

func (a *actuator) handleExpiredManifests(ctx context.Context, expiredManifests []*api.MaintenanceManifestDocument) {
	// Handling expired manifests logic here
}

func (a *actuator) executeTasks(ctx context.Context, taskContext tasks.TaskContext, manifests []*api.MaintenanceManifestDocument) {
	// Task execution logic here
}
pkg/mimo/actuator/manager.go (outdated)
// timed out, mark as such
a.log.Infof("marking %v as outdated: %v older than %v", doc.ID, doc.MaintenanceManifest.RunBefore, evaluationTime.UTC())

_, err := a.mmf.Patch(ctx, a.clusterID, doc.ID, func(d *api.MaintenanceManifestDocument) error {
Shall we implement retry logic here, just to make the patch action more robust?
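A rough sketch of what that could look like, reusing the identifiers from the diff above (the attempt count and fixed backoff are illustrative assumptions, not existing constants):

// Illustrative retry loop around the patch; the attempt count and backoff are assumptions.
var patchErr error
for attempt := 0; attempt < 3; attempt++ {
	_, patchErr = a.mmf.Patch(ctx, a.clusterID, doc.ID, func(d *api.MaintenanceManifestDocument) error {
		d.MaintenanceManifest.State = api.MaintenanceManifestStateTimedOut
		d.MaintenanceManifest.StatusText = fmt.Sprintf("timed out at %s", evaluationTime.UTC())
		return nil
	})
	if patchErr == nil {
		break
	}
	time.Sleep(time.Second) // fixed backoff for simplicity; exponential backoff would also work
}
if patchErr != nil {
	a.log.Error(patchErr)
}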
LGTM overall; left some comments for potential improvements, please have a check.
6ab5f5d to e5c05b6
Co-authored-by: Kipp Morris <[email protected]>
ee425c6 to e7b9b5e
Which issue this PR addresses:
Part of https://issues.redhat.com/browse/ARO-4895.
What this PR does / why we need it:
This PR is the initial feature branch for the MIMO M1 milestone.
Is there any documentation that needs to be updated for this PR?
Yes, see https://issues.redhat.com/browse/ARO-4895.
How do you know this will function as expected in production?
Telemetry, monitoring, and documentation will need to be fleshed out. See https://issues.redhat.com/browse/ARO-4895 for details.