Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

docs: add cron wf improvements proposal #11559

Merged
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
5 changes: 5 additions & 0 deletions .spelling
Original file line number Diff line number Diff line change
Expand Up @@ -231,3 +231,8 @@ nix.conf
LDFlags
dev
dependabot
CronWorkflow
CronWorkflows
maxFailures
maxSuccess
gitops
102 changes: 102 additions & 0 deletions docs/proposals/cron-wf-improvement-proposal.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,102 @@
# Proposal for Cron Workflows improvements

## Introduction

Currently, CronWorkflows are a great resource if we want to run recurring tasks to infinity. However, it is missing the ability to customize it, for example define how many times a workflow should run or how to handle multiple failures. I believe argo workflows would benefit of having more configuration options for cron workflows, to allow to change its behavior based on the result of its child’s success or failures. Below I present my thoughts on how we could improve them, but also some questions and concerns on how to properly do it.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would also note here how a comparable tool, Airflow, includes this functionality so it is a strong precedent to enable it for Argo Workflows as well: https://airflow.apache.org/docs/apache-airflow/stable/administration-and-deployment/scheduler.html#triggering-dag-with-future-date

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't mention that because this proposal was aiming to be something wider than just allowing to run a workflow once (in the future or not). However, I agree that it is one of the most important features that Argo workflows is lacking. I believe a lot of people would need it more frequently than configuring the behavior of recurrent success/fail. This proposal aimed to allow both use cases. However, if we come to the conclusion that it may be to complex to implement all this behavior, I would suggest at least implementing the trigger once feature.


## Proposal

This proposal discusses the viability of adding 2 more fields into the cron workflow configuration:

```yaml
RunStrategy:
maxSuccess:
maxFailures:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe maxFail for simplicity and align with "success vs. fail"

```

`maxSuccess` - defines how many child workflows must have success before suspending the workflow schedule

`maxFailures` - defines how many child workflows must fail before suspending the workflow scheduling. This may contain `Failed` workflows, `Errored` workflows or spec errors.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

very important to document this clearly in the docs when the feature is released.


For example, if we want to run a workflow just once, we could just set:

```yaml
RunStrategy:
maxSuccess: 1
```

This configuration will make sure the controller will keep scheduling workflows until one of them finishes with success.

As another example, if we want to stop scheduling workflows when they keep failing, we could configure the CronWorkflow with:

```yaml
RunStrategy:
maxFailures: 2
```

This config will stop scheduling workflows if fails twice.

## Total vs consecutive

One aspect that needs to be discussed is whether these configurations apply to the entire life of a cron Workflow or just in consecutive schedules. For example, if we configure a workflow to stop scheduling after 2 failures, I think it makes sense to have this applied when it fails twice consecutively. Otherwise, we can have 2 outages in different periods which will suspend the workflow. On the other hand, when configuring a workflow to run twice with success, it would make more sense to have it execute with success regardless of whether it is a consecutive success or not. If we have an outage after the first workflow succeeds, which translates into failed workflows, it should need to execute with success only once. So I think it would make sense to have:

- maxFailures - maximum number of **consecutive failures** before stopping the scheduling of a workflow
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would propose adding another field to toggle between consecutive or total failures. It could be a boolean with default as false.

Maybe: consecFailures or consecFail

This would make the default logic align for both maxSuccess and maxFailures which is simpler logic for a new user to understand. If they want consecutive failures counted, then they can set consecFailures to true.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we could have the run strategy to be configurable by success or failure. For example:

RunStrategy:
  onSuccess:
    maxSuccess: 3 
  onFail:
    maxFail:  3
    consecFail: false 

this would allow us to further upgrade the configurable behavior in the future without breaking the schema. For example, if we decide to ship the back-off in another PR, you could just upgrade it to:

RunStrategy:
  onSuccess:
    maxSucess: 3 
  onFail:
    maxFail:  3
    consecFail: true
    back-off:
      enabled: true
      factor: 2

However, I do agree with you, this may add a lot of complexity, and its best to keep it simple for now.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this suggestion of yours:

Maybe we could have the run strategy to be configurable by success or failure. this would allow us to further upgrade the configurable behavior in the future without breaking the schema. For example, if we decide to ship the back-off in another PR, you could just upgrade it

I'm up for this if you think it'll be simple enough to implement on the first attempt.


- maxSuccess - maximum number of workflows with success.

## How to store state

Since we need to control how many child workflows had success/failure we must store state. With this some questions arise:

- Should we just store it through the lifetime of the controller or should we store it to a database?

- Probably only makes sense if we can backup the state somewhere (like a BD). However, I don't have enough knowledge about workflow's architecture to tell how good of an idea this is.
- If a CronWorkflow gets re-applied, does it maintain or reset the number of success/failures?

- I guess it should reset since a configuration change should be seen as a new start.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with resetting state upon re-application of the CronWorkflow


## How to stop the workflow

Once the configured number of failures or successes is reached, it is necessary to stop the workflow scheduling.
I believe we have 3 options:

- Delete the workflow: In my opinion, this is the worst option and goes against gitops principles.
- Suspend it (set suspend=true): the workflow spec is changed to have the workflow suspended. I may be wrong but this conflicts with gitops as well.
- Stop scheduling it: The workflow spec is the same. The controller needs to check if the max number of runs was already attained and skip scheduling if it did.

Option 3 seems to be the only possibility. After reaching the max configured executions, the cron workflow would exist forever but never scheduled. Maybe we could add a new status field, like `Inactive` and have something the UI to show it?

## How to handle suspended workflows

One possible case that comes to mind is a long outage where all workflows are failing. For example, imagine a workflow that needs to download a file from some storage and for some reason that storage is down. Workflows will keep getting scheduled but they are going to fail. If they fail the number of configured `maxFailures`, the workflows gets stopped forever. Once the storage is back up, how can the user enable the workflow again?

- Manually re-create the workflow: could be an issue if the user has multiple cron workflows
- Instead of stopping the workflow scheduling, introduce a back-off period as suggested by [#7291](https://github.com/argoproj/argo-workflows/issues/7291). Or maybe allow both configurations.

I believe option 2 would allow the user to select if they want to stop scheduling or not. If they do, when cron workflows are wrongfully halted, they will need to manually start them again. If they don't, Argo will only introduce a back-off period between schedules to avoid rescheduling workflows that are just going to fail. Spec could look something like:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with this, but I would say that we should probably ship this feature in a separate PR than the above requirements you outlined. It feels like this would add additional complexity.


```yaml
RunStrategy:
maxSuccess:
maxFailures:
value: # this would be optional
back-off:
enabled: true
factor: 2
Comment on lines +78 to +85
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about something like the following so it's more extensible to other phases?

runStrategy:
  condition: "failure >= 3"
  backOff:
    factor: ...

```

With this configuration the user could configure 3 behaviors:

1. set `value` if they wanted to stop scheduling a workflow after a maximum number of consecutive failures.
2. set `value` and `back-off` if they wanted to stop scheduling a workflow after a maximum number of consecutive failures but with a back-off period between each failure
3. set `back-off` if they want a back-off period between each failure but they never want to stop the workflow scheduling.

## Wrap up

I believe this feature would enhance the cron workflows to allow more specific use cases that are commonly requested by the community, such as running a workflow only once. This proposal raises some concerns on how to properly implement it and I would like to know the maintainers/contributors opinion on these 4 topics, but also some other issues that I couldn't think of.

## Resources

- This discussion was prompted by [#10620](https://github.com/argoproj/argo-workflows/issues/10620)
- A first approach to this problem was discussed in [5659](https://github.com/argoproj/argo-workflows/issues/5659)
- A draft PR to implement the first approach [#5662](https://github.com/argoproj/argo-workflows/pull/5662)