Managed Infrastructure Maintenance Operator - Milestone 1 #3571

Merged on Dec 9, 2024 (89 commits)
4e42ef4
add MIMO to dev
hawkowl Feb 29, 2024
d376ddd
initial maint API
hawkowl Jul 15, 2024
d87b4be
update mimo DB to have a fetch which isn't just pending
hawkowl Jul 18, 2024
fb5eadc
update db fakes for maintmanifests
hawkowl Jul 18, 2024
96cb18f
mimo API conversions for admin API
hawkowl Jul 18, 2024
655d376
query + tests for maintenancemanifests
hawkowl Jul 18, 2024
f3cafa0
update with get
hawkowl Jul 18, 2024
551b9bc
test fixes
hawkowl Jul 19, 2024
b90e297
cancellation tests and impl
hawkowl Jul 19, 2024
a1b8950
renaming and tweaking
hawkowl Jul 22, 2024
c860dd7
static validation
hawkowl Jul 22, 2024
3fa2267
update frontend to have create
hawkowl Jul 22, 2024
9823b42
code for deleting
hawkowl Jul 22, 2024
be4a934
add deleting endpoint
hawkowl Jul 22, 2024
90d9b7c
move clusteroperator check code for reuse
hawkowl Jul 26, 2024
5089c83
MIMO task/set cleanups
hawkowl Jul 26, 2024
3a53965
mimo error code
hawkowl Jul 26, 2024
0653d2e
more work on sets
hawkowl Jul 26, 2024
702b923
tls tasks work
hawkowl Jul 26, 2024
73b3a1c
update for cleanups
hawkowl Aug 5, 2024
eff66f6
move into the main CLI endpoint
hawkowl Aug 16, 2024
308666d
add a task for updating the operator flags, for testing
hawkowl Sep 17, 2024
8fb61f7
add healthz endpoints for MIMO actuator
hawkowl Sep 18, 2024
f580b7f
makefile target for running actuator locally
hawkowl Sep 18, 2024
8101203
add mimo actuator steps in e2e helper
hawkowl Sep 18, 2024
53462a9
start mimo in e2e
hawkowl Sep 18, 2024
19cff72
fix build
hawkowl Sep 18, 2024
9b9b1cf
go generate
hawkowl Sep 18, 2024
9483f13
lint
hawkowl Sep 18, 2024
06e890e
updates for basic mimo e2e
hawkowl Sep 19, 2024
bc71574
e2e testing
hawkowl Sep 19, 2024
67644b9
try and see what e2e is breaking with
hawkowl Sep 23, 2024
53a9f99
initial doc frame
hawkowl Sep 24, 2024
b941d1b
ARO-9263: Add ACR Token expiry
edisonLcardenas Aug 19, 2024
0cb471e
ARO-9263: Add ACR Token Expiry Checker
edisonLcardenas Aug 20, 2024
62a26b9
ARO-9263: Add unit test for checker
edisonLcardenas Aug 21, 2024
8506979
ARE-9263: Rename function
edisonLcardenas Aug 22, 2024
b383a20
ARO-9263: Restore missing "ProvisioningStateMaintenance"
edisonLcardenas Aug 22, 2024
90e6d67
ARO-9263: Fix import CI check failures
edisonLcardenas Aug 26, 2024
18dda40
ARO-9263: Refactoring tests and revising logic to check expiry date.
edisonLcardenas Sep 12, 2024
929fcc7
ARO-9263: Add another condition to check if expiry date is nil
edisonLcardenas Sep 12, 2024
0eed5bc
refactor: update package groupings and error messages to resolve issu…
edisonLcardenas Sep 12, 2024
236cd26
ARO-9263: Change expiry to the date the token was issued.
edisonLcardenas Sep 16, 2024
ab4158e
ARO-9263: Revise logic to check issue date instead of expiry
edisonLcardenas Sep 17, 2024
96ae17c
ARO-9263: Add constants to reduce redunant values
edisonLcardenas Sep 17, 2024
dfcb3b4
ARO-9263: Update test to check issue date in constant time to avoid f…
edisonLcardenas Sep 18, 2024
9c5630d
ARO-9263: Change or remove any references about expiry to issue date.
edisonLcardenas Sep 19, 2024
862a621
ARO-9263: Fix lint issues
edisonLcardenas Sep 20, 2024
5041fd0
ARO-9263: Revise error message and reorder return statement
edisonLcardenas Sep 24, 2024
5096cea
ARO-9263: Fix unit test
edisonLcardenas Sep 24, 2024
39bc422
fix e2e, hopefully
hawkowl Sep 26, 2024
7fd221e
pls
hawkowl Sep 26, 2024
8aaaa51
add the maintmanifests client to the RP frontend/backend in dev
hawkowl Oct 2, 2024
58eb98a
reset the cluster flags to stop other tests failing
hawkowl Oct 2, 2024
0992130
Bump test file
hawkowl Oct 2, 2024
932156a
Update actuator_test.go
hawkowl Oct 2, 2024
64ea0c7
fix the ARM resource deploying the partition key
hawkowl Oct 3, 2024
c0cd952
regen
hawkowl Oct 3, 2024
90361ac
fixes for e2e
hawkowl Oct 3, 2024
88e7bce
add the ability to add a debug flag
hawkowl Oct 4, 2024
d77f9bb
e2e fix
hawkowl Oct 4, 2024
a92e7bc
renames and fixes
hawkowl Oct 9, 2024
172d153
go mod tidy
hawkowl Oct 10, 2024
457d42c
add some documentation
hawkowl Oct 11, 2024
831ae8a
review cleanups and neatening things up
hawkowl Oct 17, 2024
02bd417
try and fix e2e race condition
hawkowl Oct 18, 2024
1e9aa10
fix name
hawkowl Oct 28, 2024
7fd07b6
more docs
hawkowl Oct 28, 2024
2bd811b
more docs
hawkowl Oct 28, 2024
8de6cc5
log err
hawkowl Oct 29, 2024
843f0dd
comments
hawkowl Oct 31, 2024
48adea7
minor cleanups
hawkowl Nov 1, 2024
52e09fa
add missing platformworkloadidentitydocuments
hawkowl Nov 7, 2024
dac4b67
allow deleting maintenancemanifests even if the cluster is deleted
hawkowl Nov 7, 2024
a5a26ff
fix up naming since it's not all metrics
hawkowl Nov 7, 2024
d14362f
try and fix e2e
hawkowl Nov 7, 2024
dc5b3d4
database and admin API updates for queue things
hawkowl Nov 7, 2024
0947e08
emit mimo queue length metrics
hawkowl Nov 7, 2024
3bd9d7f
db test code
hawkowl Nov 7, 2024
6de9f5c
clean up admin based on review, and add a queue check for all clusters
hawkowl Nov 7, 2024
1bef2dc
Update pkg/mimo/actuator/manager.go
hawkowl Nov 11, 2024
acd8773
fix to wrap error
hawkowl Nov 11, 2024
ee9552a
de-flake e2e maybe
hawkowl Nov 18, 2024
0232ca7
Add conversion of the issueDate into the admin API, and add a comment
hawkowl Nov 19, 2024
5aa5916
minor ACR token refactoring, direct unit tests
hawkowl Nov 19, 2024
d7fbf5c
fixup
hawkowl Nov 19, 2024
b302afd
cleanups to do with service duplication
hawkowl Nov 19, 2024
6761a1b
fix lint
hawkowl Nov 19, 2024
e93a3f1
more e2e deflake attempts
hawkowl Nov 21, 2024
6 changes: 6 additions & 0 deletions .pipelines/e2e.yml
@@ -70,6 +70,8 @@ jobs:

- script: |
export CI=true
# Tell the E2E binary to run the MIMO tests
export ARO_E2E_MIMO=true
. secrets/env
. ./hack/e2e/run-rp-and-e2e.sh

@@ -84,6 +86,9 @@ jobs:
run_selenium
validate_selenium_running

run_mimo_actuator
validate_mimo_actuator_running

run_rp
validate_rp_running

@@ -128,6 +133,7 @@ jobs:

delete_e2e_cluster
kill_rp
kill_mimo_actuator
kill_selenium
kill_podman
kill_vpn
10 changes: 7 additions & 3 deletions Makefile
@@ -2,7 +2,7 @@ SHELL = /bin/bash
TAG ?= $(shell git describe --exact-match 2>/dev/null)
COMMIT = $(shell git rev-parse --short=7 HEAD)$(shell [[ $$(git status --porcelain) = "" ]] || echo -dirty)
ARO_IMAGE_BASE = ${RP_IMAGE_ACR}.azurecr.io/aro
E2E_FLAGS ?= -test.v --ginkgo.v --ginkgo.timeout 180m --ginkgo.flake-attempts=2 --ginkgo.junit-report=e2e-report.xml
E2E_FLAGS ?= -test.v --ginkgo.vv --ginkgo.timeout 180m --ginkgo.flake-attempts=2 --ginkgo.junit-report=e2e-report.xml
E2E_LABEL ?= !smoke&&!regressiontest
GO_FLAGS ?= -tags=containers_image_openpgp,exclude_graphdriver_btrfs,exclude_graphdriver_devicemapper
OC ?= oc
@@ -68,7 +68,7 @@ aro: check-release generate

.PHONY: runlocal-rp
runlocal-rp:
go run -ldflags "-X github.com/Azure/ARO-RP/pkg/util/version.GitCommit=$(VERSION)" ./cmd/aro rp
go run -ldflags "-X github.com/Azure/ARO-RP/pkg/util/version.GitCommit=$(VERSION)" ./cmd/aro ${ARO_CMD_ARGS} rp

.PHONY: az
az: pyenv
@@ -197,7 +197,11 @@ proxy:

.PHONY: runlocal-portal
runlocal-portal:
go run -ldflags "-X github.com/Azure/ARO-RP/pkg/util/version.GitCommit=$(VERSION)" ./cmd/aro portal
go run -ldflags "-X github.com/Azure/ARO-RP/pkg/util/version.GitCommit=$(VERSION)" ./cmd/aro ${ARO_CMD_ARGS} portal

.PHONY: runlocal-actuator
runlocal-actuator:
go run -ldflags "-X github.com/Azure/ARO-RP/pkg/util/version.GitCommit=$(VERSION)" ./cmd/aro ${ARO_CMD_ARGS} mimo-actuator

.PHONY: build-portal
build-portal:
4 changes: 4 additions & 0 deletions cmd/aro/main.go
@@ -28,6 +28,7 @@ func usage() {
fmt.Fprintf(flag.CommandLine.Output(), " %s operator {master,worker}\n", os.Args[0])
fmt.Fprintf(flag.CommandLine.Output(), " %s update-versions\n", os.Args[0])
fmt.Fprintf(flag.CommandLine.Output(), " %s update-role-sets\n", os.Args[0])
fmt.Fprintf(flag.CommandLine.Output(), " %s mimo-actuator\n", os.Args[0])
flag.PrintDefaults()
}

@@ -74,6 +75,9 @@ func main() {
case "update-role-sets":
checkArgs(1)
err = updatePlatformWorkloadIdentityRoleSets(ctx, log)
case "mimo-actuator":
checkArgs(1)
err = mimoActuator(ctx, log)
default:
usage()
os.Exit(2)
103 changes: 103 additions & 0 deletions cmd/aro/mimoactuator.go
@@ -0,0 +1,103 @@
package main

// Copyright (c) Microsoft Corporation.
// Licensed under the Apache License 2.0.

import (
	"context"
	"os"
	"os/signal"
	"syscall"

	"github.com/sirupsen/logrus"

	"github.com/Azure/ARO-RP/pkg/database"
	"github.com/Azure/ARO-RP/pkg/env"
	"github.com/Azure/ARO-RP/pkg/metrics/statsd"
	"github.com/Azure/ARO-RP/pkg/metrics/statsd/golang"
	"github.com/Azure/ARO-RP/pkg/mimo/actuator"
	"github.com/Azure/ARO-RP/pkg/mimo/tasks"
	"github.com/Azure/ARO-RP/pkg/proxy"
	"github.com/Azure/ARO-RP/pkg/util/encryption"
)

func mimoActuator(ctx context.Context, log *logrus.Entry) error {
	stop := make(chan struct{})

	_env, err := env.NewEnv(ctx, log, env.COMPONENT_MIMO_ACTUATOR)
	if err != nil {
		return err
	}

	keys := []string{}
	if !_env.IsLocalDevelopmentMode() {
		keys = []string{
			"MDM_ACCOUNT",
			"MDM_NAMESPACE",
		}
	}

	if err = env.ValidateVars(keys...); err != nil {
		return err
	}

	m := statsd.New(ctx, log.WithField("component", "actuator"), _env, os.Getenv("MDM_ACCOUNT"), os.Getenv("MDM_NAMESPACE"), os.Getenv("MDM_STATSD_SOCKET"))

	g, err := golang.NewMetrics(_env.Logger(), m)
	if err != nil {
		return err
	}
	go g.Run()

	aead, err := encryption.NewAEADWithCore(ctx, _env, env.EncryptionSecretV2Name, env.EncryptionSecretName)
	if err != nil {
		return err
	}

	dbc, err := database.NewDatabaseClientFromEnv(ctx, _env, log, m, aead)
	if err != nil {
		return err
	}

	dbName, err := env.DBName(_env)
	if err != nil {
		return err
	}

	clusters, err := database.NewOpenShiftClusters(ctx, dbc, dbName)
	if err != nil {
		return err
	}

	manifests, err := database.NewMaintenanceManifests(ctx, dbc, dbName)
	if err != nil {
		return err
	}

	dbg := database.NewDBGroup().
		WithOpenShiftClusters(clusters).
		WithMaintenanceManifests(manifests)

	go database.EmitMIMOMetrics(ctx, log, manifests, m)

	dialer, err := proxy.NewDialer(_env.IsLocalDevelopmentMode())
	if err != nil {
		return err
	}

	a := actuator.NewService(_env, _env.Logger(), dialer, dbg, m)
	a.SetMaintenanceTasks(tasks.DEFAULT_MAINTENANCE_TASKS)

	sigterm := make(chan os.Signal, 1)
	done := make(chan struct{})
	signal.Notify(sigterm, syscall.SIGTERM)

	go a.Run(ctx, stop, done)

	<-sigterm
	log.Print("received SIGTERM")
	close(stop)
	<-done

	return nil
}
11 changes: 10 additions & 1 deletion cmd/aro/rp.go
@@ -154,7 +154,7 @@ func rp(ctx context.Context, log, audit *logrus.Entry) error {
return err
}

go database.EmitMetrics(ctx, log, dbOpenShiftClusters, metrics)
go database.EmitOpenShiftClustersMetrics(ctx, log, dbOpenShiftClusters, metrics)

feAead, err := encryption.NewMulti(ctx, _env.ServiceKeyvault(), env.FrontendEncryptionSecretV2Name, env.FrontendEncryptionSecretName)
if err != nil {
@@ -172,6 +172,15 @@ func rp(ctx context.Context, log, audit *logrus.Entry) error {
WithPlatformWorkloadIdentityRoleSets(dbPlatformWorkloadIdentityRoleSets).
WithSubscriptions(dbSubscriptions)

// MIMO only activated in development for now
if _env.IsLocalDevelopmentMode() {
dbMaintenanceManifests, err := database.NewMaintenanceManifests(ctx, dbc, dbName)
if err != nil {
return err
}
dbg.WithMaintenanceManifests(dbMaintenanceManifests)
}

f, err := frontend.NewFrontend(ctx, audit, log.WithField("component", "frontend"), _env, dbg, api.APIs, metrics, clusterm, feAead, hiveClusterManager, adminactions.NewKubeActions, adminactions.NewAzureActions, adminactions.NewAppLensActions, clusterdata.NewParallelEnricher(metrics, _env))
if err != nil {
return err
22 changes: 22 additions & 0 deletions docs/mimo/README.md
@@ -0,0 +1,22 @@
# MIMO Documentation

The Managed Infrastructure Maintenance Operator, or MIMO, is a component of the Azure Red Hat OpenShift Resource Provider (ARO-RP) which is responsible for automated maintenance of clusters provisioned by the platform.
MIMO specifically focuses on "managed infrastructure", the parts of ARO that are deployed and maintained by the RP and ARO Operator instead of by OCP (in-cluster) or Hive (out-of-cluster).

MIMO consists of two main components, the [Actuator](./actuator.md) and the [Scheduler](./scheduler.md). The primary interface to MIMO is the [Admin API](./admin-api.md).

## A Primer On MIMO

The smallest thing that you can tell MIMO to run is a **Task** (see [`pkg/mimo/tasks/`](../../pkg/mimo/tasks/)).
A Task is composed of reusable **Steps** (see [`pkg/mimo/steps/`](../../pkg/mimo/steps/)), reusing the framework utilised by AdminUpdate/Update/Install methods in `pkg/cluster/`.
A Task only runs in the scope of a single cluster.
These Steps are run in sequence and can return either **Terminal** errors (which cause the Task to fail without being retried) or **Transient** errors (which indicate that the Task can be retried later).

Tasks are executed by the **Actuator** by way of creation of a **Maintenance Manifest**.
This Manifest is created with the cluster ID (which is elided from the cluster-scoped Admin APIs), the Task ID (which is currently a UUID), and optional priority, "start after", and "start before" times which are filled in with defaults if not provided.
The Actuator will treat these Maintenance Manifests as a work queue, taking ones which are past their "start after" time and executing them in order of earliest start-after and priority.
After running each, the Actuator writes a state into the Manifest (with optional free-form status text) recording the result of the Task.
Manifests past their start-before times are marked with a "timed out" state and are not run.
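
The queueing rules above can be sketched roughly as follows. This is a minimal illustration and not the Actuator's actual code: the `manifest` struct, its field names, `selectRunnable`, `demoQueue`, and the assumption that a lower priority number runs first are all hypothetical.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// manifest is a hypothetical, pared-down stand-in for a Maintenance Manifest.
type manifest struct {
	ID        string
	Priority  int       // assumption: a lower number is more urgent
	RunAfter  time.Time // the "start after" time
	RunBefore time.Time // the "start before" time
	State     string
}

// selectRunnable marks manifests that are past their start-before time as
// timed out, and returns those that are past their start-after time,
// ordered by earliest start-after and then by priority.
func selectRunnable(all []*manifest, now time.Time) []*manifest {
	var runnable []*manifest
	for _, m := range all {
		switch {
		case now.After(m.RunBefore):
			m.State = "TimedOut"
		case !now.Before(m.RunAfter):
			runnable = append(runnable, m)
		}
	}
	sort.SliceStable(runnable, func(i, j int) bool {
		if !runnable[i].RunAfter.Equal(runnable[j].RunAfter) {
			return runnable[i].RunAfter.Before(runnable[j].RunAfter)
		}
		return runnable[i].Priority < runnable[j].Priority
	})
	return runnable
}

// demoQueue builds a small queue and returns the IDs in execution order.
func demoQueue() []string {
	now := time.Date(2024, 12, 9, 12, 0, 0, 0, time.UTC)
	queue := []*manifest{
		{ID: "late", RunAfter: now.Add(-3 * time.Hour), RunBefore: now.Add(-time.Hour), State: "Pending"},
		{ID: "b", Priority: 1, RunAfter: now.Add(-time.Hour), RunBefore: now.Add(time.Hour), State: "Pending"},
		{ID: "a", Priority: 0, RunAfter: now.Add(-2 * time.Hour), RunBefore: now.Add(time.Hour), State: "Pending"},
		{ID: "future", RunAfter: now.Add(time.Hour), RunBefore: now.Add(2 * time.Hour), State: "Pending"},
	}
	var ids []string
	for _, m := range selectRunnable(queue, now) {
		ids = append(ids, m.ID)
	}
	return ids
}

func main() {
	fmt.Println(demoQueue())
}
```

In this sketch the "late" manifest is timed out rather than run, and the "future" manifest is left untouched until its start-after time has passed.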

Currently, Manifests are created by the Admin API.
In the future, the Scheduler will create some of these Manifests depending on cluster state/version and wall-clock time, providing the ability to perform tasks like rotations of secrets autonomously.
30 changes: 30 additions & 0 deletions docs/mimo/actuator.md
@@ -0,0 +1,30 @@
# Managed Infrastructure Maintenance Operator: Actuator

The Actuator is the MIMO component that performs execution of tasks.
The process of running tasks looks like this:

```mermaid
graph TD;
START((Start))-->QUERY;
QUERY[Fetch all State = Pending] -->SORT;
SORT[Sort tasks by RUNAFTER and PRIORITY]-->ITERATE[Iterate over tasks];
ITERATE-- Per Task -->ISEXPIRED;
subgraph PerTask[ ]
ISEXPIRED{{Is now past RUNBEFORE?}}-- Yes --> STATETIMEDOUT([State = TimedOut]) --> CONTINUE[Continue];
ISEXPIRED-- No --> DEQUEUECLUSTER;
DEQUEUECLUSTER[Claim lease on OpenShiftClusterDocument] --> DEQUEUE;
DEQUEUE[Actuator dequeues task]--> ISRETRYLIMIT;
ISRETRYLIMIT{{Have we retried the task too many times?}} -- Yes --> STATERETRYEXCEEDED([State = RetriesExceeded]) --> CONTINUE;
ISRETRYLIMIT -- No -->STATEINPROGRESS;
STATEINPROGRESS([State = InProgress]) -->RUN[[Task is run]];
RUN -- Success --> SUCCESS
RUN-- Terminal Error-->TERMINALERROR;
RUN-- Transient Error-->TRANSIENTERROR;
SUCCESS([State = Completed])-->DELEASECLUSTER
TERMINALERROR([State = Failed])-->DELEASECLUSTER;
TRANSIENTERROR([State = Pending])-->DELEASECLUSTER;
DELEASECLUSTER[Release Lease on OpenShiftClusterDocument] -->CONTINUE;
end
CONTINUE-->ITERATE;
ITERATE-- Finished -->END;
```
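
One way to read the per-task portion of this flow is as a function from a manifest's deadline, retry count, and run result to the state that gets written back. The sketch below is illustrative only: `transientError`, `maxRetries`, `nextState`, and `demoStates` are hypothetical names, the retry limit is a made-up value, and both the cluster lease handling and the intermediate `InProgress` state are omitted.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// State names follow the diagram above.
const (
	statePending         = "Pending"
	stateTimedOut        = "TimedOut"
	stateRetriesExceeded = "RetriesExceeded"
	stateCompleted       = "Completed"
	stateFailed          = "Failed"
)

// maxRetries is a hypothetical limit; the real value is not stated here.
const maxRetries = 3

// transientError is a hypothetical marker for errors that may be retried.
type transientError struct{ err error }

func (e transientError) Error() string { return e.err.Error() }

// nextState decides the state written back to a manifest. run is only
// called when the manifest has neither timed out nor exhausted its retries.
func nextState(now, runBefore time.Time, attempts int, run func() error) string {
	if now.After(runBefore) {
		return stateTimedOut
	}
	if attempts >= maxRetries {
		return stateRetriesExceeded
	}
	err := run()
	var te transientError
	switch {
	case err == nil:
		return stateCompleted
	case errors.As(err, &te):
		return statePending // transient: goes back on the queue
	default:
		return stateFailed // terminal: not retried
	}
}

// demoStates runs nextState over a few representative cases.
func demoStates() []string {
	now := time.Date(2024, 12, 9, 12, 0, 0, 0, time.UTC)
	deadline := now.Add(time.Hour)
	ok := func() error { return nil }
	flaky := func() error { return transientError{errors.New("timeout")} }
	broken := func() error { return errors.New("object missing") }
	return []string{
		nextState(now, now.Add(-time.Minute), 0, ok),
		nextState(now, deadline, maxRetries, ok),
		nextState(now, deadline, 0, ok),
		nextState(now, deadline, 0, flaky),
		nextState(now, deadline, 0, broken),
	}
}

func main() {
	for _, s := range demoStates() {
		fmt.Println(s)
	}
}
```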
30 changes: 30 additions & 0 deletions docs/mimo/admin-api.md
@@ -0,0 +1,30 @@
# Admin API

All endpoints require the `api-version=admin` query parameter.

## GET /admin/RESOURCE_ID/maintenanceManifests

Returns a list of MIMO maintenance manifests.

## PUT /admin/RESOURCE_ID/maintenanceManifests

Creates a new manifest. Returns the created manifest.

### Example

```sh
curl -X PUT -k "https://localhost:8443/admin/subscriptions/fe16a035-e540-4ab7-80d9-373fa9a3d6ae/resourcegroups/v4-westeurope/providers/microsoft.redhatopenshift/openshiftclusters/abrownmimom1test/maintenanceManifests?api-version=admin" -d '{"maintenanceTaskID": "b41749fc-af26-4ab7-b5a1-e03f3ee4cba6"}' --header "Content-Type: application/json"
```
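
For callers that prefer Go over `curl`, the same request could be built along these lines. This is a sketch under assumptions: `newManifestRequest` and `demoRequest` are hypothetical helpers, the subscription ID and cluster name in the demo are placeholders, and only the `maintenanceTaskID` field shown in the example above is sent. The request is built but not sent; a client talking to a local RP with a self-signed certificate would need TLS settings equivalent to `curl -k`.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// newManifestRequest builds the PUT request that creates a manifest.
// resourceID is the cluster's resource ID, starting with "/subscriptions/".
func newManifestRequest(baseURL, resourceID, taskID string) (*http.Request, error) {
	body, err := json.Marshal(map[string]string{"maintenanceTaskID": taskID})
	if err != nil {
		return nil, err
	}
	url := baseURL + "/admin" + resourceID + "/maintenanceManifests?api-version=admin"
	req, err := http.NewRequest(http.MethodPut, url, bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

// demoRequest returns the method and URL of a request for a placeholder cluster.
func demoRequest() (string, string, error) {
	req, err := newManifestRequest(
		"https://localhost:8443",
		"/subscriptions/00000000-0000-0000-0000-000000000000/resourcegroups/example-rg/providers/microsoft.redhatopenshift/openshiftclusters/example",
		"b41749fc-af26-4ab7-b5a1-e03f3ee4cba6",
	)
	if err != nil {
		return "", "", err
	}
	return req.Method, req.URL.String(), nil
}

func main() {
	method, url, err := demoRequest()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(method, url)
}
```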

## GET /admin/RESOURCE_ID/maintenanceManifests/MANIFEST_ID

Returns a manifest.

## DELETE /admin/RESOURCE_ID/maintenanceManifests/MANIFEST_ID

Deletes a manifest. This is only to be used as a last resort.

## POST /admin/RESOURCE_ID/maintenanceManifests/MANIFEST_ID/cancel

Cancels the manifest (the state becomes CANCELLED). It does not stop a task that is currently executing.
6 changes: 6 additions & 0 deletions docs/mimo/local-dev.md
@@ -0,0 +1,6 @@
# Local Development

1. Ensure that you have remade your databases (so that you have the MIMO ones), see [Prepare Your Dev Environment](../prepare-your-dev-environment.md).
1. Run the local RP as usual.
1. Run `make runlocal-actuator` to spawn the actuator.
1. Perform queries against the Admin API to queue/monitor MIMO manifests.
3 changes: 3 additions & 0 deletions docs/mimo/scheduler.md
@@ -0,0 +1,3 @@
# MIMO Scheduler

The MIMO Scheduler is a planned component, but is not yet implemented.
48 changes: 48 additions & 0 deletions docs/mimo/writing-tasks.md
@@ -0,0 +1,48 @@
# Writing MIMO Tasks

Writing a MIMO task consists of three major steps:

1. Writing the new functions in [`pkg/mimo/steps/`](../../pkg/mimo/steps/) which implement the specific behaviour (e.g. rotating a certificate), along with tests.
2. Writing the new Task in [`pkg/mimo/tasks/`](../../pkg/mimo/tasks/) which combines the Step you have written with any pre-existing "check" steps (e.g. `EnsureAPIServerIsUp`).
3. Adding the task with a new ID to [`pkg/mimo/const.go`](../../pkg/mimo/const.go) and `DEFAULT_MAINTENANCE_TASKS` in [`pkg/mimo/tasks/taskrunner.go`](../../pkg/mimo/tasks/taskrunner.go).

## New Step Functions

MIMO Step functions are similar to functions used in `pkg/cluster/install.go` but have additional information on the `Context` to prevent the explosion of struct members as seen in that package. Instead, the `GetTaskContext` function will return a `TaskContext` with various methods that can be used to retrieve information about the cluster, clients to perform actions in Azure, or Kubernetes clients to perform actions in the cluster.

Steps with similar logical domains should live in the same file/package. Currently, `pkg/mimo/steps/cluster/` is the only package, but functionality specific to the cluster's Azure resources may be better in a package called `pkg/mimo/steps/azure/` to make navigation easier.

Your base Action Step will look something like this:

```go
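func DoSomething(ctx context.Context) error {
	// Retrieve the TaskContext that MIMO attaches to the Context.
	tc, err := mimo.GetTaskContext(ctx)
	if err != nil {
		return mimo.TerminalError(err)
	}

	// Use tc here to retrieve cluster information or clients.
	_ = tc

	return nil
}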
```

Like `pkg/cluster/`, you can also implement `Condition`s which allow you to wait for some state. However, MIMO's design is such that it should not sit around for long periods of time waiting for things which should already be the case -- for example, the API server not being up should instead be a usual Action which returns one of either `mimo.TerminalError` or `mimo.TransientError`.

`TransientError`s will be retried, and do not indicate a permanent failure. This is a good fit for errors that are possibly because of timeouts, random momentary outages, or cosmic winds flipping all the bits on your NIC for a nanosecond. MIMO will retry a task (at least, a few times) whose steps return a `TransientError`.

`TerminalError`s are used when there is no likelihood of automatic recovery. For example, if an API server is healthy and returning data, but it says that some essential OpenShift object that we require is missing, it is unlikely that object will return after one or many retries in a short period of time. These failures ought to require manual intervention, either because they are unexpected or because they indicate that a cluster is unserviceable. When a `TerminalError` is returned, it will cause the Task to hard fail and MIMO will not retry it.
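
The distinction between the two error kinds can be illustrated with a small, self-contained sketch. It does not use the real `mimo` package: `stepError`, `transient`, `terminal`, `shouldRetry`, and `checkAPIServer` are hypothetical names, and treating unmarked errors as terminal is an assumption of this sketch.

```go
package main

import (
	"errors"
	"fmt"
)

// stepError is a hypothetical error wrapper; it is not the mimo package's type.
type stepError struct {
	err       error
	retryable bool
}

func (e *stepError) Error() string { return e.err.Error() }
func (e *stepError) Unwrap() error { return e.err }

// transient marks err as worth retrying later.
func transient(err error) error { return &stepError{err: err, retryable: true} }

// terminal marks err as a hard failure.
func terminal(err error) error { return &stepError{err: err, retryable: false} }

// shouldRetry reports whether err, or anything it wraps, was marked transient.
// Unmarked errors are treated as terminal in this sketch.
func shouldRetry(err error) bool {
	var se *stepError
	return errors.As(err, &se) && se.retryable
}

// checkAPIServer is a toy step: a timeout is transient, a missing object is terminal.
func checkAPIServer(timedOut, objectMissing bool) error {
	switch {
	case timedOut:
		return transient(errors.New("API server did not respond in time"))
	case objectMissing:
		return terminal(errors.New("required object is missing"))
	}
	return nil
}

func main() {
	// The marker survives further wrapping with %w.
	err := fmt.Errorf("step failed: %w", checkAPIServer(true, false))
	fmt.Println(err, "retry:", shouldRetry(err))

	err = checkAPIServer(false, true)
	fmt.Println(err, "retry:", shouldRetry(err))
}
```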

## Testing

MIMO provides a fake `TaskContext`, created by `test/mimo/tasks.NewFakeTestContext`. This fake takes a number of mandatory items, such as an inner `Context` for cancellation, an `env.Interface`, a `*logrus.Entry`, and a stand-in clock for testing timing. Any additional parts of the `TaskContext` that your code uses can be supplied through `WithXXX` functions passed at the end of the constructor call, such as `WithClientHelper` to add a `ClientHelper` that is accessible on the `TaskContext`.

Attempting to use additional parts of the `TaskContext` without providing them will cause a panic or an error to be returned, in both the fake and real `TaskContext`. This behaviour is intended to make it clearer when some dependency is required.
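
The `WithXXX` pattern and the "fail loudly when a dependency is missing" behaviour can be sketched in isolation as follows. None of these types are the real test helpers: `fakeTaskContext`, `option`, `withClientHelper`, `clientHelper`, and `staticHelper` are hypothetical stand-ins, and this sketch returns an error where the real fake may panic instead.

```go
package main

import (
	"errors"
	"fmt"
)

// clientHelper is a hypothetical stand-in for a dependency a Step may need.
type clientHelper interface{ Name() string }

// staticHelper is a trivial clientHelper used only for the demonstration.
type staticHelper string

func (s staticHelper) Name() string { return string(s) }

// fakeTaskContext is a hypothetical fake; only the option pattern matters here.
type fakeTaskContext struct {
	clientHelper clientHelper
}

// option configures an optional part of the fake.
type option func(*fakeTaskContext)

// withClientHelper mirrors the shape of a WithXXX function.
func withClientHelper(ch clientHelper) option {
	return func(tc *fakeTaskContext) { tc.clientHelper = ch }
}

// newFakeTaskContext applies the given options in order.
func newFakeTaskContext(opts ...option) *fakeTaskContext {
	tc := &fakeTaskContext{}
	for _, o := range opts {
		o(tc)
	}
	return tc
}

// ClientHelper returns an error when the dependency was never provided,
// which makes a missing dependency obvious in a test.
func (tc *fakeTaskContext) ClientHelper() (clientHelper, error) {
	if tc.clientHelper == nil {
		return nil, errors.New("no ClientHelper provided to the fake TaskContext")
	}
	return tc.clientHelper, nil
}

func main() {
	_, err := newFakeTaskContext().ClientHelper()
	fmt.Println("without option:", err)

	ch, err := newFakeTaskContext(withClientHelper(staticHelper("fake"))).ClientHelper()
	fmt.Println("with option:", ch.Name(), err)
}
```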

## Assembling a Task

Once you have your Steps, you can assemble them into a Task in [`pkg/mimo/tasks/`](../../pkg/mimo/tasks/). See existing Tasks for examples.
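
Conceptually, a Task is an ordered list of Steps that stops at the first failure. The sketch below shows that idea only; it does not use the step framework from `pkg/cluster/`, and `step`, `runTask`, and `demoTask` are hypothetical names.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// step is a hypothetical signature matching the shape of a Step function.
type step func(ctx context.Context) error

// runTask runs steps in order and stops at the first error or cancellation.
func runTask(ctx context.Context, steps ...step) error {
	for i, s := range steps {
		if err := ctx.Err(); err != nil {
			return err
		}
		if err := s(ctx); err != nil {
			return fmt.Errorf("step %d: %w", i, err)
		}
	}
	return nil
}

// demoTask combines a check step with an action step that fails, and
// records which steps actually ran.
func demoTask() ([]string, error) {
	var ran []string
	record := func(name string, err error) step {
		return func(context.Context) error {
			ran = append(ran, name)
			return err
		}
	}
	err := runTask(context.Background(),
		record("ensure API server is up", nil),
		record("rotate certificate", errors.New("boom")),
		record("verify", nil),
	)
	return ran, err
}

func main() {
	ran, err := demoTask()
	fmt.Println(ran, err)
}
```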

## Assumptions MIMO Makes Of Your Code

- Your Steps may be run more than once, either because they appear in a Task more than once or because a Task has been retried. Your Step must be resilient to being rerun after a partial run.
- Steps should fail fast and not sit around unless they have caused something to happen. Right now, Tasks only have a 60-minute timeout in total, so use it wisely.
- Steps use the `TaskContext` interface to get clients, and should not build clients themselves. If a Task requires a new client, it should be implemented in `TaskContext` to ensure that it can be tested the same way as other clients.
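
The first assumption, that a Step may be rerun, usually means checking the current state before changing it. A minimal sketch, using a hypothetical in-memory `fakeCluster` rather than real clients (`ensureFlag` and `demoRerun` are also made-up names):

```go
package main

import (
	"context"
	"fmt"
)

// fakeCluster is a hypothetical in-memory stand-in for cluster state.
type fakeCluster struct {
	flags  map[string]string
	writes int // counts how many times the state was actually changed
}

// ensureFlag returns a step that sets key to want only if it is not already
// set that way, so re-running it after an earlier run is harmless.
func ensureFlag(c *fakeCluster, key, want string) func(context.Context) error {
	return func(ctx context.Context) error {
		// Fail fast if the Task has already been cancelled or timed out.
		if err := ctx.Err(); err != nil {
			return err
		}
		if c.flags[key] == want {
			return nil // already in the desired state
		}
		c.flags[key] = want
		c.writes++
		return nil
	}
}

// demoRerun runs the same step twice and reports the number of writes made.
func demoRerun() (int, string, error) {
	c := &fakeCluster{flags: map[string]string{}}
	s := ensureFlag(c, "example.flag", "enabled")
	for i := 0; i < 2; i++ {
		if err := s(context.Background()); err != nil {
			return c.writes, "", err
		}
	}
	return c.writes, c.flags["example.flag"], nil
}

func main() {
	writes, value, err := demoRerun()
	fmt.Println(writes, value, err)
}
```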
2 changes: 1 addition & 1 deletion go.mod
@@ -79,6 +79,7 @@ require (
github.com/vincent-petithory/dataurl v1.0.0
go.uber.org/mock v0.4.0
golang.org/x/crypto v0.28.0
golang.org/x/exp v0.0.0-20240222234643-814bf88cf225
golang.org/x/net v0.30.0
golang.org/x/oauth2 v0.21.0
golang.org/x/sync v0.8.0
@@ -259,7 +260,6 @@ require (
go.opentelemetry.io/otel/metric v1.24.0 // indirect
go.opentelemetry.io/otel/trace v1.24.0 // indirect
go.starlark.net v0.0.0-20220328144851-d1966c6b9fcd // indirect
golang.org/x/exp v0.0.0-20240222234643-814bf88cf225 // indirect
golang.org/x/mod v0.17.0 // indirect
golang.org/x/sys v0.26.0 // indirect
golang.org/x/term v0.25.0 // indirect