
AWS BatchSensor max_retries should change with deferrable framework #37120

Closed
2 tasks done
0x26res opened this issue Jan 31, 2024 · 3 comments · Fixed by #37234
Labels
good first issue kind:bug This is a clearly a bug provider:amazon-aws AWS/Amazon - related issues

Comments

@0x26res
Contributor

0x26res commented Jan 31, 2024

Apache Airflow Provider(s)

amazon

Versions of Apache Airflow Providers

I use Airflow to orchestrate AWS Batch jobs. Since AWS Batch does the heavy lifting, and to save resources on Airflow, I was using smart sensors (in 2.4.3). It looks like this:

        with TaskGroup(group_id="job_abc") as group:
            job = BatchOperator(
                task_id="submit_job_abc",
                job_name="job_abc",
                max_retries=0,
                wait_for_completion=False,
            )
            BatchSensor(
                task_id="wait_for_job_abc",
                job_id=job.output,  # type: ignore
                mode="reschedule",
            )

Please note that I set wait_for_completion=False on the BatchOperator so it only submits the job (fire and forget). This means I can also set max_retries=0, since submitting a job only fails if there's an issue validating the job definition.

In the BatchSensor I leave max_retries at 5, which is the default. When the BatchSensor pokes/polls for job completion, a job that is being submitted, starting, or running does not count as a failed attempt.

I'm in the process of upgrading to 2.7.2, where smart sensors are no longer supported and deferrable operators should be used instead. So I set BatchSensor.deferrable to True:

        with TaskGroup(group_id="job_abc") as group:
            job = BatchOperator(
                task_id="submit_job_abc",
                job_name="job_abc",
                max_retries=0,
                wait_for_completion=False,
            )
            BatchSensor(
                task_id="wait_for_job_abc",
                job_id=job.output,  # type: ignore
                mode="reschedule",
                deferrable=True,
            )

I've noticed that the interpretation of max_retries for the BatchSensor has changed: it now treats a job in the RUNNABLE, STARTING or RUNNING state as a failed attempt:

[2024-01-31, 14:20:57 UTC] {waiter_with_logging.py:129} INFO - Batch job 035c18e0-936a-4719-bfa5-cdd372c12a25 not ready yet: ['RUNNABLE']
[2024-01-31, 14:21:02 UTC] {waiter_with_logging.py:129} INFO - Batch job 035c18e0-936a-4719-bfa5-cdd372c12a25 not ready yet: ['STARTING']
[2024-01-31, 14:21:07 UTC] {waiter_with_logging.py:129} INFO - Batch job 035c18e0-936a-4719-bfa5-cdd372c12a25 not ready yet: ['STARTING']
[2024-01-31, 14:21:12 UTC] {waiter_with_logging.py:129} INFO - Batch job 035c18e0-936a-4719-bfa5-cdd372c12a25 not ready yet: ['STARTING']
[2024-01-31, 14:21:17 UTC] {waiter_with_logging.py:129} INFO - Batch job 035c18e0-936a-4719-bfa5-cdd372c12a25 not ready yet: ['STARTING']
airflow.exceptions.AirflowException: Waiter error: max attempts reached

So the previous version did not treat RUNNABLE/STARTING/RUNNING jobs as failed attempts; it only counted an attempt as failed if the underlying job failed, or if there was a transient/transport failure when checking the job status.
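The two accounting schemes can be contrasted with a toy model (plain Python for illustration only, not Airflow's or botocore's actual waiter code): under the old scheme only genuine failures consume retries, while under the waiter-based scheme every non-terminal poll consumes an attempt.

```python
# Toy model of the two retry-accounting schemes (hypothetical, for
# illustration only -- not the real Airflow/boto3 implementation).

FAILURE_STATES = {"FAILED"}
SUCCESS_STATES = {"SUCCEEDED"}


def reschedule_sensor(job_states, max_retries):
    """Old scheme: only failures count against max_retries."""
    failures = 0
    for state in job_states:
        if state in SUCCESS_STATES:
            return "succeeded"
        if state in FAILURE_STATES:
            failures += 1
            if failures > max_retries:
                return "gave up"
    return "still waiting"


def waiter(job_states, max_attempts):
    """New scheme: every poll that isn't terminal uses up an attempt."""
    for attempt, state in enumerate(job_states, start=1):
        if state in SUCCESS_STATES:
            return "succeeded"
        if state in FAILURE_STATES:
            return "failed"
        if attempt >= max_attempts:
            return "max attempts reached"
    return "still waiting"


# A healthy job that simply takes a while to start:
states = ["RUNNABLE", "STARTING", "STARTING", "RUNNING", "RUNNING", "SUCCEEDED"]

print(reschedule_sensor(states, max_retries=5))  # -> succeeded
print(waiter(states, max_attempts=5))            # -> max attempts reached
```

With the same perfectly healthy job history, the waiter-style accounting exhausts its 5 attempts before the job even finishes starting, which matches the log excerpt above.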

The new deferrable version counts any poll where the job is not yet complete as a failure, even though the job itself hasn't failed. In light of this change of behaviour, should max_retries be set to 4200 and poll_interval to 30 (from 5), as was done for the BatchOperator?
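Assuming max_retries maps to the waiter's maximum attempts and poll_interval to the seconds between polls (the mapping is an assumption, not confirmed here), the suggested defaults change the total polling budget from under half a minute to well over a day:

```python
# Back-of-the-envelope polling budget: max attempts x seconds per poll.
# The parameter mapping (max_retries -> waiter max_attempts,
# poll_interval -> seconds between polls) is an assumption.

old_budget = 5 * 5        # current sensor defaults: 25 seconds
new_budget = 4200 * 30    # proposed BatchOperator-style defaults

print(old_budget)          # 25 seconds
print(new_budget)          # 126000 seconds
print(new_budget / 3600)   # 35.0 hours
```

A 25-second budget is clearly too small for a Batch job that can sit in RUNNABLE for minutes, whereas a 35-hour budget comfortably covers long-running jobs.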

Apache Airflow version

2.7.2

Operating System

linux

Deployment

Amazon (AWS) MWAA

Deployment details

No response

What happened

No response

What you think should happen instead

No response

How to reproduce

Anything else

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@0x26res 0x26res added area:providers kind:bug This is a clearly a bug needs-triage label for new issues that we didn't triage yet labels Jan 31, 2024
@cmarteepants cmarteepants added good first issue provider:amazon-aws AWS/Amazon - related issues and removed needs-triage label for new issues that we didn't triage yet area:providers labels Jan 31, 2024
@cmarteepants
Collaborator

Seems reasonable. @0x26res do you want to submit a PR for this?

@0x26res
Contributor Author

0x26res commented Feb 2, 2024

I'll give it a try once I find the time.

@vincbeck
Contributor

vincbeck commented Feb 7, 2024

I created the fix here: #37234. Sorry @0x26res if you have already started working on it, but the fix was pretty easy and the bug could be frustrating for users, so I went ahead.
