I use Airflow to orchestrate AWS Batch jobs. Since AWS Batch does the heavy lifting, and to save resources on Airflow, I was using smart sensors (in 2.4.3). It looks like this:
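Something along these lines (a minimal sketch; the DAG id, job name, queue, and job definition below are placeholders, not the real ones):

```python
import pendulum
from airflow import DAG
from airflow.providers.amazon.aws.operators.batch import BatchOperator
from airflow.providers.amazon.aws.sensors.batch import BatchSensor

with DAG(
    dag_id="batch_fire_and_forget",  # placeholder
    start_date=pendulum.datetime(2023, 1, 1, tz="UTC"),
    schedule=None,
) as dag:
    # Submit the AWS Batch job and return immediately (fire and forget).
    submit = BatchOperator(
        task_id="submit_job",
        job_name="my-job",                   # placeholder
        job_definition="my-job-definition",  # placeholder
        job_queue="my-job-queue",            # placeholder
        wait_for_completion=False,  # only submit, don't poll for the result
        max_retries=0,              # submission only fails on validation errors
    )

    # A separate, lightweight task polls AWS Batch for completion.
    wait = BatchSensor(
        task_id="wait_for_job",
        job_id=submit.output,  # BatchOperator pushes the job id to XCom
        max_retries=5,         # the default
    )

    submit >> wait
```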
Please note that I set wait_for_completion=False on the BatchOperator so it only submits the job (fire and forget). This means I can also set max_retries=0, as submitting a job will only fail if there's an issue validating the job definition.
In the BatchSensor I set max_retries to 5, which is the default. When the BatchSensor pokes/polls for job completion and the job is being submitted, starting, or running, that doesn't count as a failed attempt.
I'm in the process of updating to 2.7.2, where smart sensors are no longer supported and I should use deferrable operators instead. So I set BatchSensor.deferrable to True:
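Roughly like this, keeping the same placeholder names as above:

```python
wait = BatchSensor(
    task_id="wait_for_job",
    job_id=submit.output,
    deferrable=True,  # hand the polling off to the triggerer
)
```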
I've noticed that the interpretation of max_retries for the BatchSensor has changed: it now treats a job that is still in the RUNNABLE, STARTING, or RUNNING state as a failed attempt:
[2024-01-31, 14:20:57 UTC] {waiter_with_logging.py:129} INFO - Batch job 035c18e0-936a-4719-bfa5-cdd372c12a25 not ready yet: ['RUNNABLE']
[2024-01-31, 14:21:02 UTC] {waiter_with_logging.py:129} INFO - Batch job 035c18e0-936a-4719-bfa5-cdd372c12a25 not ready yet: ['STARTING']
[2024-01-31, 14:21:07 UTC] {waiter_with_logging.py:129} INFO - Batch job 035c18e0-936a-4719-bfa5-cdd372c12a25 not ready yet: ['STARTING']
[2024-01-31, 14:21:12 UTC] {waiter_with_logging.py:129} INFO - Batch job 035c18e0-936a-4719-bfa5-cdd372c12a25 not ready yet: ['STARTING']
[2024-01-31, 14:21:17 UTC] {waiter_with_logging.py:129} INFO - Batch job 035c18e0-936a-4719-bfa5-cdd372c12a25 not ready yet: ['STARTING']
airflow.exceptions.AirflowException: Waiter error: max attempts reached
So the previous version would not count a RUNNABLE/STARTING/RUNNING job as a failed attempt; it only counted an attempt as failed if the underlying job failed or if there was a transient/transport failure when checking the job status.
The new deferrable version counts every poke where the job is not yet completed as a failure (even though the job hasn't failed). In light of this change of behaviour, should max_retries be set to 4200 and the poll_interval to 30 (from 5), as has been done for the BatchOperator?
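In the meantime, the workaround appears to be setting those values explicitly on the sensor (same placeholder names as above; note that on the sensor the interval parameter is poke_interval rather than the BatchOperator's poll_interval):

```python
wait = BatchSensor(
    task_id="wait_for_job",
    job_id=submit.output,
    deferrable=True,
    poke_interval=30,   # seconds between polls
    max_retries=4200,   # 4200 polls * 30s = 35 hours before giving up
)
```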
I created the fix here: #37234. Sorry @0x26res if you have already started to work on it, but the fix was pretty easy and the bug could be frustrating for users, so I went ahead.
Apache Airflow Provider(s)
amazon
Versions of Apache Airflow Providers
Apache Airflow version
2.7.2
Operating System
linux
Deployment
Amazon (AWS) MWAA
Deployment details
No response
What happened
No response
What you think should happen instead
No response
How to reproduce
Anything else
No response
Are you willing to submit PR?
Code of Conduct