-
Notifications
You must be signed in to change notification settings - Fork 14.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add fail stop feature for DAGs #29406
Conversation
c50b334
to
c824b83
Compare
3439824
to
b94119d
Compare
Hi everyone, I was wondering if I could get some feedback on the approach I used for adding this feature. Does the approach look good, or do you have any suggestions for changes? |
b94119d
to
90ef158
Compare
Hi everybody, it looks like all the tests pass now. I was wondering if I could get some feedback on the approach I used for adding this feature. Does the approach look good, or do you have any suggestions for changes? |
): | ||
continue | ||
if ti.state == TaskInstanceState.RUNNING: | ||
ti.error(session) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We shoudl add some logging telling that we are doing it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Otherwise it will be quite magical
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good! I have added a logging statement for when a running task is being force failed, and when a task is being set to the skipped state. Let me know if these log statements are good or if there are anything to change.
90ef158
to
71bbb2a
Compare
How should this interact with trigger rules? |
Hi everyone, does anyone have an opinion of how the fail fast dag should interact with the trigger rules? Should there be some tests for this? Or maybe documentation to outline how to use a fail fast DAG with trigger rules? |
I think it needs a bit of anylysis and design. It's hard to get the logic around it without viualising it, but @uranusjr is right that some thinking should be done around it. I guess we have two options:
How the "smart" should look like - hard to say. probably each rule should be analysed how it should behave and good strategy for each rule shoudl be worked out (if that's even possible) - trying to logically attempt to follow the rules and choose for example a strategy of failing currently run tasks and propagating it furhter. That would be rather complex and there are some non-obvious traps we can fall in (what happens with the tasks that succeed when one failed?). And what about the tasks that follow them? I think it very much depends on the definition of "fail fast". If we define it as "fail all the tasks if any of those tasks fail", then current solution is good. If we try to follow triggering rules, then well, it's complex and likely "fali fast" is hard to define in general ("fail fast all the currently eligible tasks and propagagate it across the DAG"). But maybe there is a middle-ground. Maybe we can only make fail-fast work if the only triggering rules we have are "all_success" (i.e. default). And if you add another triggering rule to a "fail-fast" DAG, this could cause an error. I think there is a big class of DAGs which fall into that category and those are precisely the DAGs where "fail fast" defined as "fail all the tasks if any of them fail" makes perfect sense. @uranusjr ? WDYT? I guess that would only require to add the error in case any non-default triggering rule is added for "fail-fast" dag (and properly documenting it). |
If we define “fail” on the DAG level as “a task will not triggered when all its upstreams finish” I think it can work logically; the question is whether this is expected by users (and we probably should invent a term other than “fail” to avoid confusion). But no matter how we decide on this, having a fail-fast flag (assuming we don’t use the “fail” term directly) would work the same if all tasks use the all-success rule, so implementing strictly only that scenario would be a good first step. But this still requires better terminology. |
Maybe "cancel" or "stop" ? Yes I think this is a useful cases in a number of scenarios especially when users are cost or timing (or both) conscious. There might be a number of cases, especially when you have dynamic cloud infrastructure, and long running tasks, where you know that failure of any task will generally make any other task results useless - so running them would be a waste of money (or time waiting for things to complete). |
Yeah both sounds right to me. |
|
That makes sense. I was thinking of making the following changes, let me know if this sounds good:
|
Sounds great! |
71bbb2a
to
7c08106
Compare
Hi everyone, I have added the following changes:
Let me know if there is anything else I should add or anything of concern. |
Static checks? |
29aa15f
to
db7406d
Compare
db7406d
to
0d7c90f
Compare
airflow/models/baseoperator.py
Outdated
if dag is not None and dag.fail_stop and trigger_rule != DEFAULT_TRIGGER_RULE: | ||
raise DagInvalidTriggerRule() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems awkward to me fail_stop
and DEFAULT_TRIGGER_RULE
are repeated when we check to raise the exception, and when the exception is rendered.
Would something like a check
function on DagInvalidTriggerRule
encapsulating the logic be a good idea, so the logic exists only in one class?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes sense. I have added a static check
method to the DagInvalidTriggerRule
class, which checks the necessary condition and throws an exception if needed. Let me know if this was what you had in mind or if there is a better possible implementation for this.
d137f8f
to
07afd88
Compare
airflow/exceptions.py
Outdated
@staticmethod | ||
def check(dag: DAG | None, trigger_rule: str): | ||
from airflow.models.abstractoperator import DEFAULT_TRIGGER_RULE | ||
|
||
if dag is not None and dag.fail_stop and trigger_rule != DEFAULT_TRIGGER_RULE: | ||
raise DagInvalidTriggerRule() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems simpler if this is a classmethod and use raise cls
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. I have updated the code to be a class method rather than a static method.
07afd88
to
b6373a3
Compare
Hi everyone, I was wondering if these changes look good, or if there is anything else I should add? |
Nice one :) |
Yeah it does seem that the current approach does allow for races between scheduler and task. What if tis are expanded after Separately, what if here we skip a TI and the scheduler sets it to queued? I'm not sure the frequency with which this kind of thing would manifest but the conditions aren't hard to imagine. |
The corresponding issue: #26854
I have added the
fail_fast
parameter for DAGs. When this is set, every running task within the DAG will be forcefully failed immediately and any task that hasn't already completed will be set toSKIPPED
.The way I approached this was, within the
handle_failure
method of theTaskInstance
class, if a task reaches a FAILED state and its DAG is infail_fast
mode, then it fails every task that is currently running within its DAG run.I have added unit tests and have also tested several different fail_fast DAGs. I have confirmed that when a task within these DAGs fails, the currently running tasks within them immediately fail and all other non-completed tasks get skipped, thus quickly stopping the DAG.
One suggestion of how to approach this was to modify
base_executor
to kill all tasks within a DAG once one task fails. The problem that I found within this approach was that, first of all there are several different executors which all store which tasks are running, queued, or completed differently, making it hard to make a unified function that kills all running tasks on all types of executors. More importantly, through my tests I have found that the executor often doesn't have full information of all the tasks in a dag run when one task fails. What this means is that, if a task fails, the executor may not yet have full information about all the tasks currently running within that DAG, and all tasks currently queued, and thus cannot properly fail the DAG.Let me know if there are flaws to my approach or if there are any changes that should be added.