-
Notifications
You must be signed in to change notification settings - Fork 3.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
docs: add cron wf improvements proposal #11559
docs: add cron wf improvements proposal #11559
Conversation
Signed-off-by: eduardodbr <[email protected]>
Signed-off-by: eduardodbr <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @sarabala1979 @caelan-io @Kitanotori @wesleyscholl @tico24
Signed-off-by: eduardodbr <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @eduardodbr ! Looks real great. I left some suggestions.
|
||
One aspect that needs to be discussed is whether these configurations apply to the entire life of a cron Workflow or just in consecutive schedules. For example, if we configure a workflow to stop scheduling after 2 failures, I think it makes sense to have this applied when it fails twice consecutively. Otherwise, we can have 2 outages in different periods which will suspend the workflow. On the other hand, when configuring a workflow to run twice with success, it would make more sense to have it execute with success regardless of whether it is a consecutive success or not. If we have an outage after the first workflow succeeds, which translates into failed workflows, it should need to execute with success only once. So I think it would make sense to have: | ||
|
||
- maxFailures - maximum number of **consecutive failures** before stopping the scheduling of a workflow |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would propose adding another field to toggle between consecutive or total failures. It could be a boolean with default as false
.
Maybe: consecFailures
or consecFail
This would make the default logic align for both maxSuccess
and maxFailures
which is simpler logic for a new user to understand. If they want consecutive failures counted, then they can set consecFailures
to true
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we could have the run strategy to be configurable by success or failure. For example:
RunStrategy:
onSuccess:
maxSuccess: 3
onFail:
maxFail: 3
consecFail: false
this would allow us to further upgrade the configurable behavior in the future without breaking the schema. For example, if we decide to ship the back-off in another PR, you could just upgrade it to:
RunStrategy:
onSuccess:
maxSucess: 3
onFail:
maxFail: 3
consecFail: true
back-off:
enabled: true
factor: 2
However, I do agree with you, this may add a lot of complexity, and its best to keep it simple for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this suggestion of yours:
Maybe we could have the run strategy to be configurable by success or failure. this would allow us to further upgrade the configurable behavior in the future without breaking the schema. For example, if we decide to ship the back-off in another PR, you could just upgrade it
I'm up for this if you think it'll be simple enough to implement on the first attempt.
``` | ||
RunStrategy: | ||
maxSuccess: | ||
maxFailures: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe maxFail
for simplicity and align with "success vs. fail"
|
||
`maxSuccess` - defines how many child workflows must have success before suspending the workflow schedule | ||
|
||
`maxFailures` - defines how many child workflows must fail before suspending the workflow scheduling. This may contain `Failed` workflows, `Errored` workflows or spec errors. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
very important to document this clearly in the docs when the feature is released.
- Probably only makes sense if we can backup the state somewhere (like a BD). However, I don't have enough knowledge about workflow's architecture to tell how good of an idea this is. | ||
- If a CronWorkflow gets re-applied, does it maintain or reset the number of success/failures? | ||
|
||
- I guess it should reset since a configuration change should be seen as a new start. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with resetting state upon re-application of the CronWorkflow
- Manually re-create the workflow: could be an issue if the user has multiple cron workflows | ||
- Instead of stopping the workflow scheduling, introduce a back-off period as suggested by [#7291](https://github.com/argoproj/argo-workflows/issues/7291). Or maybe allow both configurations. | ||
|
||
I believe option 2 would allow the user to select if they want to stop scheduling or not. If they do, when cron workflows are wrongfully halted, they will need to manually start them again. If they don't, Argo will only introduce a back-off period between schedules to avoid rescheduling workflows that are just going to fail. Spec could look something like: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with this, but I would say that we should probably ship this feature in a separate PR than the above requirements you outlined. It feels like this would add additional complexity.
|
||
## Introduction | ||
|
||
Currently, CronWorkflows are a great resource if we want to run recurring tasks to infinity. However, it is missing the ability to customize it, for example define how many times a workflow should run or how to handle multiple failures. I believe argo workflows would benefit of having more configuration options for cron workflows, to allow to change its behavior based on the result of its child’s success or failures. Below I present my thoughts on how we could improve them, but also some questions and concerns on how to properly do it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would also note here how a comparable tool, Airflow, includes this functionality so it is a strong precedent to enable it for Argo Workflows as well: https://airflow.apache.org/docs/apache-airflow/stable/administration-and-deployment/scheduler.html#triggering-dag-with-future-date
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't mention that because this proposal was aiming to be something wider than just allowing to run a workflow once (in the future or not). However, I agree that it is one of the most important features that Argo workflows is lacking. I believe a lot of people would need it more frequently than configuring the behavior of recurrent success/fail. This proposal aimed to allow both use cases. However, if we come to the conclusion that it may be to complex to implement all this behavior, I would suggest at least implementing the trigger once feature.
@terrytangyuan - what are your thoughts on moving forward with this? |
This is on my list of items to review. |
```yaml | ||
RunStrategy: | ||
maxSuccess: | ||
maxFailures: | ||
value: # this would be optional | ||
back-off: | ||
enabled: true | ||
factor: 2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about something like the following so it's more extensible to other phases?
runStrategy:
condition: "failure >= 3"
backOff:
factor: ...
@caelan-io @terrytangyuan I have some time to come back to this and after your suggestions I suggest the following first approach:
Just wondering if instead of naming it If it is ok with you I can open a PR with the above changes. |
I don't have a strong preference regarding the naming. |
That naming convention sounds good to me! It'll help to see it in the code, though. I think it's worth moving forward with the PR as you've described here, and we can revisit if we see any issues with it. @terrytangyuan - it looks like we'll need your approval to merge this enhancement proposal. |
@terrytangyuan - can we merge since @eduardodbr has opened a PR that's under review already? |
Sure |
Thank you, Terry! |
Motivation
Cron workflow improvements proposal as prompted by #10620