Managed Infrastructure Maintenance Operator - Milestone 1 #3571

Open
wants to merge 89 commits into
base: master

Collaborator

@hawkowl hawkowl commented May 10, 2024

Which issue this PR addresses:

Part of https://issues.redhat.com/browse/ARO-4895.

What this PR does / why we need it:

This PR is the initial feature branch for the MIMO M1 milestone.

Is there any documentation that needs to be updated for this PR?

Yes, see https://issues.redhat.com/browse/ARO-4895 .

How do you know this will function as expected in production?

Telemetry, monitoring, and documentation will need to be fleshed out. See https://issues.redhat.com/browse/ARO-4895 for details.

This was referenced May 10, 2024
@github-actions github-actions bot added the needs-rebase branch needs a rebase label May 15, 2024

Please rebase pull request.

Contributor

@jaitaiwan jaitaiwan left a comment

I've started a review, and reached my ingestion limit. I'll keep reviewing later.

go.mod (outdated, resolved)
pkg/api/mimodocument.go (resolved)
pkg/database/mimo.go (outdated, resolved)
pkg/database/mimo.go (outdated, resolved)
pkg/database/mimo.go (outdated, resolved)
Comment on lines +94 to +78
if err, ok := err.(*cosmosdb.Error); ok && err.StatusCode == http.StatusConflict {
	err.StatusCode = http.StatusPreconditionFailed
}
Contributor

Why are we overwriting the http status condition?

Contributor

It looks to me like it's because of line 143. We're saying that, in case of a conflict, we want to change it to a status that will make the cosmosdb Retry function retry the request. If that is the case, I think a comment here would be helpful in case the behavior of functions that use the cosmosdb Retry function changes in the future.

Collaborator Author

I think a lot of the code here is copy-pasted, so we should probably open something up to improve this consistently.

pkg/database/mimo.go (outdated, resolved)
pkg/database/mimo.go (outdated, resolved)
pkg/database/mimo.go (outdated, resolved)
pkg/database/mimo.go (outdated, resolved)
@github-actions github-actions bot removed the needs-rebase branch needs a rebase label May 27, 2024
@github-actions github-actions bot added the needs-rebase branch needs a rebase label Jun 6, 2024
@github-actions github-actions bot added needs-rebase branch needs a rebase and removed needs-rebase branch needs a rebase labels Jun 11, 2024
Comment on lines 74 to 258
		docs, err := i.Next(ctx, -1)
		if err != nil {
			return false, err
		}
		if docs == nil {
			break
		}

		docList = append(docList, docs.MaintenanceManifestDocuments...)
	}

	manifestsToAction := make([]*api.MaintenanceManifestDocument, 0)

	sort.SliceStable(docList, func(i, j int) bool {
		if docList[i].MaintenanceManifest.RunAfter == docList[j].MaintenanceManifest.RunAfter {
			return docList[i].MaintenanceManifest.Priority < docList[j].MaintenanceManifest.Priority
		}

		return docList[i].MaintenanceManifest.RunAfter < docList[j].MaintenanceManifest.RunAfter
	})

	evaluationTime := a.now()

	// Check for manifests that have timed out first
	for _, doc := range docList {
		if evaluationTime.After(time.Unix(int64(doc.MaintenanceManifest.RunBefore), 0)) {
			// timed out, mark as such
			a.log.Infof("marking %v as outdated: %v older than %v", doc.ID, doc.MaintenanceManifest.RunBefore, evaluationTime.UTC())

			_, err := a.mmf.Patch(ctx, a.clusterID, doc.ID, func(d *api.MaintenanceManifestDocument) error {
				d.MaintenanceManifest.State = api.MaintenanceManifestStateTimedOut
				d.MaintenanceManifest.StatusText = fmt.Sprintf("timed out at %s", evaluationTime.UTC())
				return nil
			})
			if err != nil {
				a.log.Error(err)
			}
		} else {
			// not timed out, do something about it
			manifestsToAction = append(manifestsToAction, doc)
		}
	}

	// Nothing to do, don't dequeue
	if len(manifestsToAction) == 0 {
		return false, nil
	}

	// Dequeue the document
	oc, err := a.oc.Get(ctx, a.clusterID)
	if err != nil {
		return false, err
	}

	oc, err = a.oc.DoDequeue(ctx, oc)
	if err != nil {
		return false, err // This will include StatusPreconditionFaileds
	}

	taskContext := newTaskContext(a.env, a.log, oc)

	// Execute on the manifests we want to action
	for _, doc := range manifestsToAction {
		// here
		f, ok := a.tasks[doc.MaintenanceManifest.MaintenanceSetID]
		if !ok {
			a.log.Infof("not found %v", doc.MaintenanceManifest.MaintenanceSetID)
			continue
		}

		// Attempt a dequeue
		doc, err = a.mmf.Lease(ctx, a.clusterID, doc.ID)
		if err != nil {
			// log and continue if it doesn't work
			a.log.Error(err)
			continue
		}

		// if we've tried too many times, give up
		if doc.Dequeues > maxDequeueCount {
			err := fmt.Errorf("dequeued %d times, failing", doc.Dequeues)
			_, leaseErr := a.mmf.EndLease(ctx, doc.ClusterID, doc.ID, api.MaintenanceManifestStateTimedOut, to.StringPtr(err.Error()))
			if leaseErr != nil {
				a.log.Error(leaseErr)
			}
			continue
		}

		// Perform the task
		state, msg := f(ctx, taskContext, doc, oc)
		_, err = a.mmf.EndLease(ctx, doc.ClusterID, doc.ID, state, &msg)
		if err != nil {
			a.log.Error(err)
		}
	}

	// release the OpenShiftCluster
	_, err = a.oc.EndLease(ctx, a.clusterID, oc.OpenShiftCluster.Properties.ProvisioningState, api.ProvisioningStateMaintenance, nil)
	return true, err
}
Collaborator

I suggest splitting the logic into private funcs to improve readability, something like:

func (a *actuator) Process(ctx context.Context) (bool, error) {
    // Fetch manifests
    manifests, err := a.fetchManifests(ctx)
    if err != nil {
        return false, err
    }

    // Evaluate and segregate manifests
    expiredManifests, actionableManifests := a.evaluateManifests(manifests)

    // Handle expired manifests
    a.handleExpiredManifests(ctx, expiredManifests)

    // If no actionable manifests, return
    if len(actionableManifests) == 0 {
        return false, nil
    }

    // Dequeue the cluster document
    oc, err := a.oc.DequeueCluster(ctx, a.clusterID)
    if err != nil {
        return false, err
    }

    // Execute tasks
    taskContext := newTaskContext(a.env, a.log, oc)
    a.executeTasks(ctx, taskContext, actionableManifests)

    // Release the cluster lease
    return true, a.oc.EndClusterLease(ctx, a.clusterID, oc)
}

func (a *actuator) fetchManifests(ctx context.Context) ([]*api.MaintenanceManifestDocument, error) {
    // Fetch manifests logic here
}

func (a *actuator) evaluateManifests(manifests []*api.MaintenanceManifestDocument) ([]*api.MaintenanceManifestDocument, []*api.MaintenanceManifestDocument) {
    // Evaluation logic here
}

func (a *actuator) handleExpiredManifests(ctx context.Context, expiredManifests []*api.MaintenanceManifestDocument) {
    // Handling expired manifests logic here
}

func (a *actuator) executeTasks(ctx context.Context, taskContext tasks.TaskContext, manifests []*api.MaintenanceManifestDocument) {
    // Task execution logic here
}

pkg/api/mimo.go (resolved)
pkg/database/mimo.go (outdated, resolved)
pkg/database/mimo.go (outdated, resolved)
pkg/database/mimo.go (resolved)
pkg/deploy/generator/resources_rp.go (resolved)
pkg/mimo/actuator/manager.go (outdated, resolved)
// timed out, mark as such
a.log.Infof("marking %v as outdated: %v older than %v", doc.ID, doc.MaintenanceManifest.RunBefore, evaluationTime.UTC())

_, err := a.mmf.Patch(ctx, a.clusterID, doc.ID, func(d *api.MaintenanceManifestDocument) error {
Collaborator

Shall we implement retry logic here, just to make the patch action more robust?

ArrisLee previously approved these changes Jun 25, 2024
Collaborator

@ArrisLee ArrisLee left a comment

LGTM overall. I left some comments on potential improvements; please take a look.

Labels
enhancement New feature or request go Pull requests that update Go code ready-for-review size-large Size large skippy pull requests raised by member of Team Skippy
7 participants