Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deferred AWS BatchOperator appears to re-trigger task and error out #33016

Closed
2 tasks done
rishi-kulkarni opened this issue Aug 1, 2023 · 5 comments · Fixed by #33045
Closed
2 tasks done

Deferred AWS BatchOperator appears to re-trigger task and error out #33016

rishi-kulkarni opened this issue Aug 1, 2023 · 5 comments · Fixed by #33045
Assignees
Labels
area:providers good first issue kind:bug This is a clearly a bug provider:amazon AWS/Amazon - related issues

Comments

@rishi-kulkarni
Copy link
Contributor

rishi-kulkarni commented Aug 1, 2023

Apache Airflow version

2.6.3

What happened

I started a long-running job with the BatchOperator in deferrable mode. Traceback is below:

The strange behavior here is that just before erroring out, it looks like the BatchOperator resubmits the job. However, the Batch is not actually submitted (verified in the console) - however, this seems to break the trigger and it errors out immediately after.

*** Found local files:
***   * /home/airflow/airflow/logs/dag_id=foxy_salesforce/run_id=scheduled__2023-08-01T19:30:00+00:00/task_id=foxy_salesforce_batch_job/attempt=2.log
***   * /home/airflow/airflow/logs/dag_id=foxy_salesforce/run_id=scheduled__2023-08-01T19:30:00+00:00/task_id=foxy_salesforce_batch_job/attempt=2.log.trigger.17484.log
[2023-08-01, 20:35:37 UTC] {taskinstance.py:1103} INFO - Dependencies all met for dep_context=non-requeueable deps ti=<TaskInstance: foxy_salesforce.foxy_salesforce_batch_job scheduled__2023-08-01T19:30:00+00:00 [queued]>
[2023-08-01, 20:35:37 UTC] {taskinstance.py:1103} INFO - Dependencies all met for dep_context=requeueable deps ti=<TaskInstance: foxy_salesforce.foxy_salesforce_batch_job scheduled__2023-08-01T19:30:00+00:00 [queued]>
[2023-08-01, 20:35:37 UTC] {taskinstance.py:1308} INFO - Starting attempt 2 of 2
[2023-08-01, 20:35:37 UTC] {taskinstance.py:1327} INFO - Executing <Task(BatchOperator): foxy_salesforce_batch_job> on 2023-08-01 19:30:00+00:00
[2023-08-01, 20:35:37 UTC] {standard_task_runner.py:57} INFO - Started process 8834 to run task
[2023-08-01, 20:35:37 UTC] {standard_task_runner.py:84} INFO - Running: ['airflow', 'tasks', 'run', 'foxy_salesforce', 'foxy_salesforce_batch_job', 'scheduled__2023-08-01T19:30:00+00:00', '--job-id', '17555', '--raw', '--subdir', 'DAGS_FOLDER/foxy_salesforce.py', '--cfg-path', '/tmp/tmpr0isg35r']
[2023-08-01, 20:35:37 UTC] {standard_task_runner.py:85} INFO - Job 17555: Subtask foxy_salesforce_batch_job
[2023-08-01, 20:35:37 UTC] {task_command.py:410} INFO - Running <TaskInstance: foxy_salesforce.foxy_salesforce_batch_job scheduled__2023-08-01T19:30:00+00:00 [running]> on host ip-172-20-18-105.ec2.internal
[2023-08-01, 20:35:37 UTC] {taskinstance.py:1545} INFO - Exporting env vars: AIRFLOW_CTX_DAG_EMAIL='[email protected]' AIRFLOW_CTX_DAG_OWNER='nrobinson' AIRFLOW_CTX_DAG_ID='foxy_salesforce' AIRFLOW_CTX_TASK_ID='foxy_salesforce_batch_job' AIRFLOW_CTX_EXECUTION_DATE='2023-08-01T19:30:00+00:00' AIRFLOW_CTX_TRY_NUMBER='2' AIRFLOW_CTX_DAG_RUN_ID='scheduled__2023-08-01T19:30:00+00:00'
[2023-08-01, 20:35:37 UTC] {batch.py:255} INFO - Running AWS Batch job - job definition: foxy_dev_batch_job:1 - on queue foxy-queue
[2023-08-01, 20:35:37 UTC] {batch.py:262} INFO - AWS Batch job - container overrides: {'command': ['-tap', 'salesforce', '-ds', 'salesforce', '-to', '[email protected]']}
[2023-08-01, 20:35:37 UTC] {base.py:73} INFO - Using connection ID 'aws_prod_batch' for task execution.
[2023-08-01, 20:35:37 UTC] {credentials.py:1051} INFO - Found credentials from IAM Role: DSAirflowIAMStack-DSDaggerServerRoleBC9B4D69-BVJN2Q2JAYNS
[2023-08-01, 20:35:37 UTC] {batch.py:292} INFO - AWS Batch job (0c52c6e5-6b77-4e48-88d6-e98478fdcae2) started: {'ResponseMetadata': {'RequestId': '4510c251-622d-4882-aec1-50eef30d2b7d', 'HTTPStatusCode': 200, 'HTTPHeaders': {'date': 'Tue, 01 Aug 2023 20:35:37 GMT', 'content-type': 'application/json', 'content-length': '165', 'connection': 'keep-alive', 'x-amzn-requestid': '4510c251-622d-4882-aec1-50eef30d2b7d', 'access-control-allow-origin': '*', 'x-amz-apigw-id': 'I_3oCGPhoAMEa6Q=', 'access-control-expose-headers': 'X-amzn-errortype,X-amzn-requestid,X-amzn-errormessage,X-amzn-trace-id,X-amz-apigw-id,date', 'x-amzn-trace-id': 'Root=1-64c96c99-57cff0982ce9c97360d0fd02'}, 'RetryAttempts': 0}, 'jobArn': 'arn:aws:batch:us-east-1:806657589280:job/0c52c6e5-6b77-4e48-88d6-e98478fdcae2', 'jobName': 'foxy_salesforce', 'jobId': '0c52c6e5-6b77-4e48-88d6-e98478fdcae2'}
[2023-08-01, 20:35:37 UTC] {taskinstance.py:1415} INFO - Pausing task as DEFERRED. dag_id=foxy_salesforce, task_id=foxy_salesforce_batch_job, execution_date=20230801T193000, start_date=20230801T203537
[2023-08-01, 20:35:37 UTC] {local_task_job_runner.py:222} INFO - Task exited with return code 100 (task deferral)
[2023-08-01, 20:35:38 UTC] {base.py:73} INFO - Using connection ID 'aws_prod_batch' for task execution.
[2023-08-01, 20:35:38 UTC] {credentials.py:1051} INFO - Found credentials from IAM Role: DSAirflowIAMStack-DSDaggerServerRoleBC9B4D69-BVJN2Q2JAYNS
[2023-08-01, 20:35:39 UTC] {waiter_with_logging.py:129} INFO - Batch job 0c52c6e5-6b77-4e48-88d6-e98478fdcae2 not ready yet: ['STARTING']
[2023-08-01, 20:36:09 UTC] {waiter_with_logging.py:129} INFO - Batch job 0c52c6e5-6b77-4e48-88d6-e98478fdcae2 not ready yet: ['STARTING']
[2023-08-01, 20:36:39 UTC] {waiter_with_logging.py:129} INFO - Batch job 0c52c6e5-6b77-4e48-88d6-e98478fdcae2 not ready yet: ['RUNNING']
[2023-08-01, 20:37:09 UTC] {waiter_with_logging.py:129} INFO - Batch job 0c52c6e5-6b77-4e48-88d6-e98478fdcae2 not ready yet: ['RUNNING']
[2023-08-01, 20:37:39 UTC] {waiter_with_logging.py:129} INFO - Batch job 0c52c6e5-6b77-4e48-88d6-e98478fdcae2 not ready yet: ['RUNNING']
[2023-08-01, 20:38:09 UTC] {waiter_with_logging.py:129} INFO - Batch job 0c52c6e5-6b77-4e48-88d6-e98478fdcae2 not ready yet: ['RUNNING']
[2023-08-01, 20:38:39 UTC] {waiter_with_logging.py:129} INFO - Batch job 0c52c6e5-6b77-4e48-88d6-e98478fdcae2 not ready yet: ['RUNNING']
[2023-08-01, 20:39:09 UTC] {waiter_with_logging.py:129} INFO - Batch job 0c52c6e5-6b77-4e48-88d6-e98478fdcae2 not ready yet: ['RUNNING']
[2023-08-01, 20:39:39 UTC] {waiter_with_logging.py:129} INFO - Batch job 0c52c6e5-6b77-4e48-88d6-e98478fdcae2 not ready yet: ['RUNNING']
[2023-08-01, 20:40:09 UTC] {waiter_with_logging.py:129} INFO - Batch job 0c52c6e5-6b77-4e48-88d6-e98478fdcae2 not ready yet: ['RUNNING']
[2023-08-01, 20:40:12 UTC] {taskinstance.py:1103} INFO - Dependencies all met for dep_context=non-requeueable deps ti=<TaskInstance: foxy_salesforce.foxy_salesforce_batch_job scheduled__2023-08-01T19:30:00+00:00 [queued]>
[2023-08-01, 20:40:12 UTC] {taskinstance.py:1103} INFO - Dependencies all met for dep_context=requeueable deps ti=<TaskInstance: foxy_salesforce.foxy_salesforce_batch_job scheduled__2023-08-01T19:30:00+00:00 [queued]>
[2023-08-01, 20:40:12 UTC] {taskinstance.py:1306} INFO - Resuming after deferral
[2023-08-01, 20:40:12 UTC] {taskinstance.py:1327} INFO - Executing <Task(BatchOperator): foxy_salesforce_batch_job> on 2023-08-01 19:30:00+00:00
[2023-08-01, 20:40:12 UTC] {standard_task_runner.py:57} INFO - Started process 21621 to run task
[2023-08-01, 20:40:12 UTC] {standard_task_runner.py:84} INFO - Running: ['airflow', 'tasks', 'run', 'foxy_salesforce', 'foxy_salesforce_batch_job', 'scheduled__2023-08-01T19:30:00+00:00', '--job-id', '17561', '--raw', '--subdir', 'DAGS_FOLDER/foxy_salesforce.py', '--cfg-path', '/tmp/tmpyl2o5l2k']
[2023-08-01, 20:40:12 UTC] {standard_task_runner.py:85} INFO - Job 17561: Subtask foxy_salesforce_batch_job
[2023-08-01, 20:40:12 UTC] {task_command.py:410} INFO - Running <TaskInstance: foxy_salesforce.foxy_salesforce_batch_job scheduled__2023-08-01T19:30:00+00:00 [running]> on host ip-172-20-18-105.ec2.internal
[2023-08-01, 20:40:12 UTC] {taskinstance.py:1598} ERROR - Trigger failed:
Traceback (most recent call last):
  File "/home/airflow/dagger/venv/lib/python3.9/site-packages/airflow/providers/amazon/aws/utils/waiter_with_logging.py", line 122, in async_wait
    await waiter.wait(**args, WaiterConfig={"MaxAttempts": 1})
  File "/home/airflow/dagger/venv/lib/python3.9/site-packages/aiobotocore/waiter.py", line 49, in wait
    await AIOWaiter.wait(self, **kwargs)
  File "/home/airflow/dagger/venv/lib/python3.9/site-packages/aiobotocore/waiter.py", line 139, in wait
    raise WaiterError(
botocore.exceptions.WaiterError: Waiter batch_job_complete failed: Max attempts exceeded
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/home/airflow/dagger/venv/lib/python3.9/site-packages/airflow/jobs/triggerer_job_runner.py", line 537, in cleanup_finished_triggers
    result = details["task"].result()
  File "/home/airflow/dagger/venv/lib/python3.9/site-packages/airflow/jobs/triggerer_job_runner.py", line 615, in run_trigger
    async for event in trigger.run():
  File "/home/airflow/dagger/venv/lib/python3.9/site-packages/airflow/providers/amazon/aws/triggers/base.py", line 121, in run
    await async_wait(
  File "/home/airflow/dagger/venv/lib/python3.9/site-packages/airflow/providers/amazon/aws/utils/waiter_with_logging.py", line 131, in async_wait
    raise AirflowException("Waiter error: max attempts reached")
airflow.exceptions.AirflowException: Waiter error: max attempts reached
[2023-08-01, 20:40:12 UTC] {taskinstance.py:1824} ERROR - Task failed with exception
airflow.exceptions.TaskDeferralError: Trigger failure
[2023-08-01, 20:40:12 UTC] {taskinstance.py:1345} INFO - Marking task as FAILED. dag_id=foxy_salesforce, task_id=foxy_salesforce_batch_job, execution_date=20230801T193000, start_date=20230801T203537, end_date=20230801T204012
[2023-08-01, 20:40:12 UTC] {logging_mixin.py:150} WARNING - /home/airflow/dagger/venv/lib/python3.9/site-packages/airflow/utils/email.py:153 RemovedInAirflow3Warning: Fetching SMTP credentials from configuration variables will be deprecated in a future release. Please set credentials using a connection instead.
[2023-08-01, 20:40:12 UTC] {email.py:269} INFO - Email alerting: attempt 1
[2023-08-01, 20:40:12 UTC] {email.py:281} INFO - Sent an alert email to ['[email protected]']
[2023-08-01, 20:40:12 UTC] {standard_task_runner.py:104} ERROR - Failed to execute job 17561 for task foxy_salesforce_batch_job (Trigger failure; 21621)
[2023-08-01, 20:40:12 UTC] {local_task_job_runner.py:225} INFO - Task exited with return code 1
[2023-08-01, 20:40:12 UTC] {taskinstance.py:2653} INFO - 0 downstream tasks scheduled from follow-on schedule check

What you think should happen instead

No response

How to reproduce

Run a long-running Batch Job using the BatchOperator with deferrable=True

Operating System

AmazonLinux

Versions of Apache Airflow Providers

Version 8.4.0 of the AWS provider

Deployment

Virtualenv installation

Deployment details

No response

Anything else

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@rishi-kulkarni rishi-kulkarni added area:core kind:bug This is a clearly a bug needs-triage label for new issues that we didn't triage yet labels Aug 1, 2023
@rishi-kulkarni
Copy link
Contributor Author

I got suspicious that it errored out after precisely 10 checks, and I was able to track the bug down to a questionable default set here:

https://github.com/apache/airflow/blob/60da13d191dd345c4d73e0f685c598a4c28f9b4a/airflow/providers/amazon/aws/operators/batch.py#L226C55-L226C55

The docstring of the Batch operator suggests (to me, at least) that the default is to wait 4200 times. The BatchJobTrigger also has a default of 720, which seems more reasonable than 10. This is an easy workaround, but I'm happy to open a PR to set 10 to a higher default if desired.

@eladkal
Copy link
Contributor

eladkal commented Aug 2, 2023

cc @vandonr-amz

@eladkal eladkal added provider:amazon AWS/Amazon - related issues area:providers good first issue and removed area:core needs-triage label for new issues that we didn't triage yet labels Aug 2, 2023
@vandonr-amz
Copy link
Contributor

Yes indeed, 10 is too low.
That default value was introduced in #30865
I think the value of 10 comes from a confusion regarding those existing default values

MAX_RETRIES = 4200
STATUS_RETRIES = 10

It should be 4200 indeed, 10 is the number of retries used in get_job_description in case there are some rate limit errors.

@vandonr-amz
Copy link
Contributor

do you want to open a +1/-1 PR to fix that default value @rishi-kulkarni since you're the one who pinpointed the bug ?

@rishi-kulkarni
Copy link
Contributor Author

Yep, done: #33045

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:providers good first issue kind:bug This is a clearly a bug provider:amazon AWS/Amazon - related issues
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants