Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

docs: add cron wf improvements proposal #11559

Merged

Conversation

eduardodbr
Copy link
Member

Motivation

Cron workflow improvements proposal as prompted by #10620

Copy link
Member

@terrytangyuan terrytangyuan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Signed-off-by: eduardodbr <[email protected]>
Copy link
Member

@caelan-io caelan-io left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @eduardodbr ! Looks real great. I left some suggestions.


One aspect that needs to be discussed is whether these configurations apply to the entire life of a cron Workflow or just in consecutive schedules. For example, if we configure a workflow to stop scheduling after 2 failures, I think it makes sense to have this applied when it fails twice consecutively. Otherwise, we can have 2 outages in different periods which will suspend the workflow. On the other hand, when configuring a workflow to run twice with success, it would make more sense to have it execute with success regardless of whether it is a consecutive success or not. If we have an outage after the first workflow succeeds, which translates into failed workflows, it should need to execute with success only once. So I think it would make sense to have:

- maxFailures - maximum number of **consecutive failures** before stopping the scheduling of a workflow
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would propose adding another field to toggle between consecutive or total failures. It could be a boolean with default as false.

Maybe: consecFailures or consecFail

This would make the default logic align for both maxSuccess and maxFailures which is simpler logic for a new user to understand. If they want consecutive failures counted, then they can set consecFailures to true.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we could have the run strategy to be configurable by success or failure. For example:

RunStrategy:
  onSuccess:
    maxSuccess: 3 
  onFail:
    maxFail:  3
    consecFail: false 

this would allow us to further upgrade the configurable behavior in the future without breaking the schema. For example, if we decide to ship the back-off in another PR, you could just upgrade it to:

RunStrategy:
  onSuccess:
    maxSucess: 3 
  onFail:
    maxFail:  3
    consecFail: true
    back-off:
      enabled: true
      factor: 2

However, I do agree with you, this may add a lot of complexity, and its best to keep it simple for now.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this suggestion of yours:

Maybe we could have the run strategy to be configurable by success or failure. this would allow us to further upgrade the configurable behavior in the future without breaking the schema. For example, if we decide to ship the back-off in another PR, you could just upgrade it

I'm up for this if you think it'll be simple enough to implement on the first attempt.

```
RunStrategy:
maxSuccess:
maxFailures:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe maxFail for simplicity and align with "success vs. fail"


`maxSuccess` - defines how many child workflows must have success before suspending the workflow schedule

`maxFailures` - defines how many child workflows must fail before suspending the workflow scheduling. This may contain `Failed` workflows, `Errored` workflows or spec errors.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

very important to document this clearly in the docs when the feature is released.

- Probably only makes sense if we can backup the state somewhere (like a BD). However, I don't have enough knowledge about workflow's architecture to tell how good of an idea this is.
- If a CronWorkflow gets re-applied, does it maintain or reset the number of success/failures?

- I guess it should reset since a configuration change should be seen as a new start.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with resetting state upon re-application of the CronWorkflow

- Manually re-create the workflow: could be an issue if the user has multiple cron workflows
- Instead of stopping the workflow scheduling, introduce a back-off period as suggested by [#7291](https://github.com/argoproj/argo-workflows/issues/7291). Or maybe allow both configurations.

I believe option 2 would allow the user to select if they want to stop scheduling or not. If they do, when cron workflows are wrongfully halted, they will need to manually start them again. If they don't, Argo will only introduce a back-off period between schedules to avoid rescheduling workflows that are just going to fail. Spec could look something like:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with this, but I would say that we should probably ship this feature in a separate PR than the above requirements you outlined. It feels like this would add additional complexity.


## Introduction

Currently, CronWorkflows are a great resource if we want to run recurring tasks to infinity. However, it is missing the ability to customize it, for example define how many times a workflow should run or how to handle multiple failures. I believe argo workflows would benefit of having more configuration options for cron workflows, to allow to change its behavior based on the result of its child’s success or failures. Below I present my thoughts on how we could improve them, but also some questions and concerns on how to properly do it.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would also note here how a comparable tool, Airflow, includes this functionality so it is a strong precedent to enable it for Argo Workflows as well: https://airflow.apache.org/docs/apache-airflow/stable/administration-and-deployment/scheduler.html#triggering-dag-with-future-date

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't mention that because this proposal was aiming to be something wider than just allowing to run a workflow once (in the future or not). However, I agree that it is one of the most important features that Argo workflows is lacking. I believe a lot of people would need it more frequently than configuring the behavior of recurrent success/fail. This proposal aimed to allow both use cases. However, if we come to the conclusion that it may be to complex to implement all this behavior, I would suggest at least implementing the trigger once feature.

@caelan-io
Copy link
Member

@terrytangyuan - what are your thoughts on moving forward with this?

@terrytangyuan
Copy link
Member

This is on my list of items to review.

Comment on lines +78 to +85
```yaml
RunStrategy:
maxSuccess:
maxFailures:
value: # this would be optional
back-off:
enabled: true
factor: 2
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about something like the following so it's more extensible to other phases?

runStrategy:
  condition: "failure >= 3"
  backOff:
    factor: ...

@eduardodbr
Copy link
Member Author

@caelan-io @terrytangyuan I have some time to come back to this and after your suggestions I suggest the following first approach:

  • New Spec field stopStrategy with a condition that is used to stop scheduling workflows. Controller stops scheduling workflows if the stopping condition is achieved:
stopStrategy:
   condition: "succeeded >= 1"
  • New status fields:
    • Succeeded - counts the number of workflows that completed in phase Succeeded
    • Failed - counts the number of workflows that failed to submit or completed in phase Error or Failed
    • Completed - whether the stopping condition was achieved or not (is it worth it to have it as an annotation?)
  • Use the completed status field instead of changing the cronworkflow with “suspend=true” to avoid changing the cronworkflow spec, which conflicts with gitops users
  • Keywords allowed for stopping conditions are failed and succeeded:
    - Failed - uses the status field failed
    - Succeeded - uses the status field succeeded
stopStrategy:
 condition: "failed >= 3"
stopStrategy:
 condition: "succeeded >= 1"
  • Updating the stopping condition after the workflow is completed, will continue the normal cron workflow execution.

Just wondering if instead of naming it stopStrategy we could use something more generic. However, I feel that @terrytangyuan's suggestion runStrategy can be easily missed if the user just wants to schedule the workflow for a X amount of times.

If it is ok with you I can open a PR with the above changes.

@terrytangyuan
Copy link
Member

I don't have a strong preference regarding the naming.

@caelan-io
Copy link
Member

That naming convention sounds good to me! It'll help to see it in the code, though. I think it's worth moving forward with the PR as you've described here, and we can revisit if we see any issues with it.

@terrytangyuan - it looks like we'll need your approval to merge this enhancement proposal.

@caelan-io
Copy link
Member

@terrytangyuan - can we merge since @eduardodbr has opened a PR that's under review already?

@terrytangyuan terrytangyuan enabled auto-merge (squash) December 6, 2023 13:38
@terrytangyuan
Copy link
Member

Sure

@terrytangyuan terrytangyuan merged commit 8859c92 into argoproj:main Dec 6, 2023
13 checks passed
@caelan-io
Copy link
Member

Thank you, Terry!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants