
GlueJobOperator stuck in running state, even when the job is completed on AWS, when Verbose=True #44694

Open

rawwar opened this issue Dec 5, 2024 · 7 comments
Labels
area:providers good first issue kind:bug This is a clearly a bug provider:amazon AWS/Amazon - related issues

Comments

@rawwar
Collaborator

rawwar commented Dec 5, 2024

Apache Airflow Provider(s)

amazon

Versions of Apache Airflow Providers

apache-airflow-providers-amazon==9.1.0

Apache Airflow version

2.10.3

Operating System

ubuntu-22.04

Deployment

Astronomer

Deployment details

No response

What happened

GlueJobOperator was stuck in the running state for a long time, while the actual Glue job on AWS took about a minute to complete. This only happens when verbose is set to True.

What you think should happen instead

The operator should not get stuck when verbose is set to True; it should finish as soon as the Glue job completes.

How to reproduce

I used the following DAG Code:

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.amazon.aws.operators.glue import GlueJobOperator


def _start():
    print("Hi")


def _end():
    print("Job end")


with DAG("dag_glue_script_python", catchup=False) as dag:
    start = PythonOperator(task_id="start", python_callable=_start)

    start_glue_job = GlueJobOperator(
        job_name='sleep2',
        task_id='run',
        aws_conn_id='aws_cre',
        create_job_kwargs={"NumberOfWorkers": 1, "WorkerType": "G.1X"},
        stop_job_run_on_kill=True,
        verbose=True,
        wait_for_completion=True,
        deferrable=True,
        job_poll_interval=15,
    )

    end = PythonOperator(task_id="end", python_callable=_end)

start >> start_glue_job >> end

Anything else

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@rawwar rawwar added kind:bug This is a clearly a bug area:providers needs-triage label for new issues that we didn't triage yet labels Dec 5, 2024
@dosubot dosubot bot added the provider:amazon AWS/Amazon - related issues label Dec 5, 2024
@rawwar
Collaborator Author

rawwar commented Dec 5, 2024

I modified the Glue hook's print_job_logs method (glue.py in amazon provider 9.1.0) and added a few print statements. Here's the updated method and the task logs:

def print_job_logs(
    self,
    job_name: str,
    run_id: str,
    continuation_tokens: LogContinuationTokens,
):
    """
    Print the latest job logs to the Airflow task log and update the continuation tokens.

    :param continuation_tokens: the tokens where to resume from when reading logs.
        The object gets updated with the new tokens by this method.
    """
    log_client = self.logs_hook.get_conn()
    paginator = log_client.get_paginator("filter_log_events")

    def display_logs_from(log_group: str, continuation_token: str | None) -> str | None:
        """Mutualize iteration over the 2 different log streams glue jobs write to."""
        print(f"display_logs_from start with log_group={log_group}, continuation_token={continuation_token}")
        fetched_logs = []
        next_token = continuation_token
        try:
            for response in paginator.paginate(
                logGroupName=log_group,
                logStreamNames=[run_id],
                PaginationConfig={"StartingToken": continuation_token},
            ):
                print("paginator response", response)
                fetched_logs.extend([event["message"] for event in response["events"]])
                # if the response is empty there is no nextToken in it
                next_token = response.get("nextToken") or next_token
                print("fetched_logs", fetched_logs)
                print("next_token", next_token)
        except ClientError as e:
            if e.response["Error"]["Code"] == "ResourceNotFoundException":
                # we land here when the log groups/streams don't exist yet
                self.log.warning(
                    "No new Glue driver logs so far.\n"
                    "If this persists, check the CloudWatch dashboard at: %r.",
                    f"https://{self.conn_region_name}.console.aws.amazon.com/cloudwatch/home",
                )
            else:
                print("error", e)
                raise
        print("finished paginator")
        if len(fetched_logs):
            # Add a tab to indent those logs and distinguish them from airflow logs.
            # Log lines returned already contain a newline character at the end.
            messages = "\t".join(fetched_logs)
            self.log.info("Glue Job Run %s Logs:\n\t%s", log_group, messages)
        else:
            self.log.info("No new log from the Glue Job in %s", log_group)
        return next_token

    log_group_prefix = self.conn.get_job_run(JobName=job_name, RunId=run_id)["JobRun"]["LogGroupName"]
    log_group_default = f"{log_group_prefix}/{DEFAULT_LOG_SUFFIX}"
    log_group_error = f"{log_group_prefix}/{ERROR_LOG_SUFFIX}"
    print(f"log_group_prefix={log_group_prefix}, log_group_default={log_group_default}, log_group_error={log_group_error}")
    # one would think that the error log group would contain only errors, but it actually contains
    # a lot of interesting logs too, so it's valuable to have both
    print("before display_logs_from")
    continuation_tokens.output_stream_continuation = display_logs_from(
        log_group_default, continuation_tokens.output_stream_continuation
    )
    print("After")
    continuation_tokens.error_stream_continuation = display_logs_from(
        log_group_error, continuation_tokens.error_stream_continuation
    )
    print("Done")
Task logs are attached
task logs.log

@rawwar
Collaborator Author

rawwar commented Dec 5, 2024

The issue seems to be that paginator.paginate kept going in a loop, fetching new tokens, until it ultimately failed.
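For anyone trying to reason about this offline: the symptom can be simulated without AWS. The FakePaginator below is a hypothetical stand-in for boto3's filter_log_events paginator that always returns an empty page with a fresh nextToken (which is what the task logs show); a loop that trusts the server to stop issuing tokens then never terminates on its own, so a simple guard such as capping consecutive empty pages breaks the cycle. This is a sketch, not the provider's actual code.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class FakePaginator:
    """Hypothetical stand-in for the filter_log_events paginator:
    every page is empty but carries a brand-new nextToken."""
    _counter: count = field(default_factory=count)

    def paginate(self, **kwargs):
        while True:  # mimics a paginator that never exhausts
            yield {"events": [], "nextToken": f"token-{next(self._counter)}"}


def fetch_logs_bounded(paginator, max_empty_pages=3):
    """Collect log messages, but bail out after a few consecutive
    empty pages instead of trusting the token stream to end."""
    fetched, empty_pages, next_token = [], 0, None
    for page in paginator.paginate():
        events = page["events"]
        fetched.extend(e["message"] for e in events)
        next_token = page.get("nextToken") or next_token
        empty_pages = 0 if events else empty_pages + 1
        if empty_pages >= max_empty_pages:
            break  # guard: without this, the loop above never ends
    return fetched, next_token


logs, token = fetch_logs_bounded(FakePaginator())
print(len(logs), token)  # 0 token-2
```

The guard threshold is arbitrary here; in the real hook it would need to coexist with the continuation-token bookkeeping so that genuinely slow log delivery isn't cut off.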

@eladkal
Contributor

eladkal commented Dec 5, 2024

Interesting timing. There was a fix about verbose for GlueJobTrigger #43622

@eladkal eladkal added good first issue and removed needs-triage label for new issues that we didn't triage yet labels Dec 5, 2024
@rawwar
Collaborator Author

rawwar commented Dec 5, 2024

Interesting timing. There was a fix about verbose for GlueJobTrigger #43622

@eladkal, I don't think the fix in #43622 is relevant here (or maybe you were just mentioning it). The issue happens in both deferrable and non-deferrable mode. Also, I noticed that this issue happens only with one of our customers' AWS accounts; I checked with my personal AWS account with the exact same permissions and I can't replicate it. This makes me think there's some edge case in boto3's paginate which we need to handle in the provider code.

@ferruzzi
Contributor

ferruzzi commented Jan 4, 2025

I can't seem to reproduce this on my laptop either, but would adding a call to get_job_state here do the trick? Something along the lines of:

if len(fetched_logs):
    # Add a tab to indent those logs and distinguish them from airflow logs.
    # Log lines returned already contain a newline character at the end.
    messages = "\t".join(fetched_logs)
    self.log.info("Glue Job Run %s Logs:\n\t%s", log_group, messages)
elif self.get_job_state(job_name, run_id) in ["FAILED", "TIMEOUT", "SUCCEEDED", "STOPPED"]:
    # no new logs and the job has terminated
    return
else:
    # no new logs but job isn't finished, print a "waiting..." message
    self.log.info("No new log from the Glue Job in %s", log_group)
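The branching logic of that suggestion can be exercised in isolation with stubs. Everything below is hypothetical scaffolding (report, the job_state argument, and the log_sink list are not the provider's real API); it only demonstrates the intended behavior: log fetched lines, stay silent once the job has reached a terminal state, otherwise emit a "waiting" message.

```python
# Terminal Glue job-run states, as listed in ferruzzi's suggestion above.
TERMINAL_STATES = {"FAILED", "TIMEOUT", "SUCCEEDED", "STOPPED"}


def report(fetched_logs, job_state, log_sink):
    """Sketch of the proposed branch: append fetched lines to the sink,
    return silently when the job is terminal, else note we're waiting."""
    if fetched_logs:
        log_sink.append("\t".join(fetched_logs))
    elif job_state in TERMINAL_STATES:
        return  # job finished and no fresh logs: nothing left to print
    else:
        log_sink.append("No new log from the Glue Job")


sink = []
report([], "RUNNING", sink)    # running, no logs -> "waiting" message
report([], "SUCCEEDED", sink)  # terminal, no logs -> silent
report(["line1\n"], "SUCCEEDED", sink)  # logs always get printed
print(sink)  # ['No new log from the Glue Job', 'line1\n']
```

As the later comment in this thread shows, this check alone does not fix the hang, because the paginator keeps handing out fresh tokens regardless of job state; the early return only suppresses the repeated "no new log" message.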

@rawwar
Collaborator Author

rawwar commented Jan 4, 2025

@ferruzzi , I'll give this a try and give you an update.

@rawwar
Collaborator Author

rawwar commented Jan 22, 2025

@ferruzzi, the job state check did not help. It's the same issue: it keeps fetching new tokens repeatedly. Task logs are attached:
logs.log

Projects
None yet
Development

No branches or pull requests

3 participants