
AWS BatchSensor max_retries should change with deferrable framework #37120

Closed
2 tasks done
0x26res opened this issue Jan 31, 2024 · 3 comments · Fixed by #37234
Labels
good first issue kind:bug This is a clearly a bug provider:amazon-aws AWS/Amazon - related issues

Comments

@0x26res
Contributor

0x26res commented Jan 31, 2024

Apache Airflow Provider(s)

amazon

Versions of Apache Airflow Providers

I use Airflow to orchestrate AWS Batch jobs. Since AWS Batch does the heavy lifting, and to save resources on Airflow, I was using smart sensors (in 2.4.3). It looks like this:

        with TaskGroup(group_id="job_abc") as group:
            job = BatchOperator(
                task_id="submit_job_abc",
                job_name="job_abc",
                max_retries=0,
                wait_for_completion=False,
            )
            BatchSensor(
                task_id="wait_for_job_abc",
                job_id=job.output,  # type: ignore
                mode="reschedule",
            )

Please note that I set wait_for_completion=False on the BatchOperator so it only submits the job (fire and forget). This means I can also set max_retries=0, since submitting a job only fails if there's an issue validating the job definition.

In the BatchSensor I leave max_retries at 5, which is the default. When the BatchSensor pokes/polls for job completion, a job that is being submitted, starting, or running does not count as a failed attempt.

I'm in the process of upgrading to 2.7.2, where smart sensors are no longer supported and deferrable operators should be used instead. So I set BatchSensor.deferrable to True:

        with TaskGroup(group_id="job_abc") as group:
            job = BatchOperator(
                task_id="submit_job_abc",
                job_name="job_abc",
                max_retries=0,
                wait_for_completion=False,
            )
            BatchSensor(
                task_id="wait_for_job_abc",
                job_id=job.output,  # type: ignore
                mode="reschedule",
                deferrable=True,
            )

I've noticed that the interpretation of max_retries for the BatchSensor has changed: it now treats a job in the RUNNABLE, STARTING or RUNNING state as a failed attempt:

[2024-01-31, 14:20:57 UTC] {waiter_with_logging.py:129} INFO - Batch job 035c18e0-936a-4719-bfa5-cdd372c12a25 not ready yet: ['RUNNABLE']
[2024-01-31, 14:21:02 UTC] {waiter_with_logging.py:129} INFO - Batch job 035c18e0-936a-4719-bfa5-cdd372c12a25 not ready yet: ['STARTING']
[2024-01-31, 14:21:07 UTC] {waiter_with_logging.py:129} INFO - Batch job 035c18e0-936a-4719-bfa5-cdd372c12a25 not ready yet: ['STARTING']
[2024-01-31, 14:21:12 UTC] {waiter_with_logging.py:129} INFO - Batch job 035c18e0-936a-4719-bfa5-cdd372c12a25 not ready yet: ['STARTING']
[2024-01-31, 14:21:17 UTC] {waiter_with_logging.py:129} INFO - Batch job 035c18e0-936a-4719-bfa5-cdd372c12a25 not ready yet: ['STARTING']
airflow.exceptions.AirflowException: Waiter error: max attempts reached

So the previous version did not treat RUNNABLE/STARTING/RUNNING jobs as failed attempts; it only counted an attempt as failed if the underlying job failed, or if there was a transient/transport failure when checking the job status.
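The two accounting schemes can be contrasted with a toy model (plain Python for illustration only, not Airflow's or botocore's actual waiter code): under the old scheme only genuine failures consume retries, while under the waiter-based scheme every non-terminal poll consumes an attempt.

```python
# Toy model of the two retry-accounting schemes (hypothetical, for
# illustration only -- not the real Airflow/boto3 implementation).

FAILURE_STATES = {"FAILED"}
SUCCESS_STATES = {"SUCCEEDED"}


def reschedule_sensor(job_states, max_retries):
    """Old scheme: only failures count against max_retries."""
    failures = 0
    for state in job_states:
        if state in SUCCESS_STATES:
            return "succeeded"
        if state in FAILURE_STATES:
            failures += 1
            if failures > max_retries:
                return "gave up"
    return "still waiting"


def waiter(job_states, max_attempts):
    """New scheme: every poll that isn't terminal uses up an attempt."""
    for attempt, state in enumerate(job_states, start=1):
        if state in SUCCESS_STATES:
            return "succeeded"
        if state in FAILURE_STATES:
            return "failed"
        if attempt >= max_attempts:
            return "max attempts reached"
    return "still waiting"


# A healthy job that simply takes a while to start:
states = ["RUNNABLE", "STARTING", "STARTING", "RUNNING", "RUNNING", "SUCCEEDED"]

print(reschedule_sensor(states, max_retries=5))  # -> succeeded
print(waiter(states, max_attempts=5))            # -> max attempts reached
```

With the same perfectly healthy job history, the waiter-style accounting exhausts its 5 attempts before the job even finishes starting, which matches the log excerpt above.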

The new deferrable version counts any poll where the job is not yet complete as a failure, even though the job itself hasn't failed. In light of this change of behaviour, should max_retries be set to 4200 and poll_interval to 30 (from 5), as was done for the BatchOperator?
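Assuming max_retries maps to the waiter's maximum attempts and poll_interval to the seconds between polls (the mapping is an assumption, not confirmed here), the suggested defaults change the total polling budget from under half a minute to well over a day:

```python
# Back-of-the-envelope polling budget: max attempts x seconds per poll.
# The parameter mapping (max_retries -> waiter max_attempts,
# poll_interval -> seconds between polls) is an assumption.

old_budget = 5 * 5        # current sensor defaults: 25 seconds
new_budget = 4200 * 30    # proposed BatchOperator-style defaults

print(old_budget)          # 25 seconds
print(new_budget)          # 126000 seconds
print(new_budget / 3600)   # 35.0 hours
```

A 25-second budget is clearly too small for a Batch job that can sit in RUNNABLE for minutes, whereas a 35-hour budget comfortably covers long-running jobs.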

Apache Airflow version

2.7.2

Operating System

linux

Deployment

Amazon (AWS) MWAA

Deployment details

No response

What happened

No response

What you think should happen instead

No response

How to reproduce

Anything else

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@0x26res 0x26res added area:providers kind:bug This is a clearly a bug needs-triage label for new issues that we didn't triage yet labels Jan 31, 2024
@cmarteepants cmarteepants added good first issue provider:amazon-aws AWS/Amazon - related issues and removed needs-triage label for new issues that we didn't triage yet area:providers labels Jan 31, 2024
@cmarteepants
Collaborator

Seems reasonable. @0x26res do you want to submit a PR for this?

@0x26res
Contributor Author

0x26res commented Feb 2, 2024

I'll give it a try once I find the time.

@vincbeck
Contributor

vincbeck commented Feb 7, 2024

I created the fix here: #37234. Sorry @0x26res if you have already started working on it, but the fix was pretty easy and the bug could be frustrating for users, so I went ahead.
