You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
If the python_callable used by a ShortCircuitOperator does not properly return a boolean, but instead crashes, the retry and downstream behavior is not deterministic. Either:
the task properly retries, succeeds and downstream tasks are triggered
the task does not properly retry, downstream tasks are marked as upstream_failed
the task does retry, is not marked as success, but downstream tasks are triggered
What you think should happen instead?
In case the callable crashes, it should be treated as a failure, and retry logic should be properly executed until the task completes successfully.
How to reproduce
As the behavior is not deterministic, I've created 10 parallel flows to demonstrate the issue. A crash is simulated by exit(-1).
This DAG file should demonstrate the broken behavior:
Seems like the problem with zombie detection/resolution. In addition found interesting bug, on next DagRun when task failed it also might marks (with high probability) as failed even if it previously run successfully
@vlieven, there is an issue with using exit() to simulate a crash. I hope it will be fixed in #36986, but I'm not certain if this fix will resolve your original problem. Could you please check it?
Apache Airflow version
Other Airflow 2 version (please specify below)
If "Other Airflow 2 version" selected, which one?
2.7.3
What happened?
If the
python_callable
used by a ShortCircuitOperator does not properly return a boolean, but instead crashes, the retry and downstream behavior is not deterministic. Either:What you think should happen instead?
In case the callable crashes, it should be treated as a failure, and retry logic should be properly executed until the task completes successfully.
How to reproduce
As the behavior is not deterministic, I've created 10 parallel flows to demonstrate the issue. A crash is simulated by
exit(-1)
.This DAG file should demonstrate the broken behavior:
The resulting grid view is the following:
Operating System
Amazon Linux 2
Versions of Apache Airflow Providers
This is using only the built-in operators
Deployment
Other Docker-based deployment
Deployment details
Running on a Kubernetes cluster
Anything else?
No response
Are you willing to submit PR?
Code of Conduct
The text was updated successfully, but these errors were encountered: