Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OTEL Collector exception scheduler termination #42456

Closed
1 of 2 tasks
wolfier opened this issue Sep 24, 2024 · 3 comments · Fixed by #43558
Closed
1 of 2 tasks

OTEL Collector exception scheduler termination #42456

wolfier opened this issue Sep 24, 2024 · 3 comments · Fixed by #43558
Assignees
Labels
area:core area:Scheduler including HA (high availability) scheduler kind:bug This is a clearly a bug needs-triage label for new issues that we didn't triage yet

Comments

@wolfier
Copy link
Contributor

wolfier commented Sep 24, 2024

Apache Airflow version

Other Airflow 2 version (please specify below)

If "Other Airflow 2 version" selected, which one?

2.10.0

What happened?

An OTEL Collector exception caused the scheduler to terminate unexpectedly.

[2024-09-13T14:12:12.681+0000] {scheduler_job_runner.py:1001} ERROR - Exception when executing SchedulerJob._run_scheduler_loop
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/airflow/jobs/scheduler_job_runner.py", line 984, in _execute
    self._run_scheduler_loop()
  File "/usr/local/lib/python3.10/site-packages/airflow/jobs/scheduler_job_runner.py", line 1132, in _run_scheduler_loop
    num_finished_events += self._process_executor_events(
  File "/usr/local/lib/python3.10/site-packages/airflow/jobs/scheduler_job_runner.py", line 851, in _process_executor_events
    span.add_event(name="started", timestamp=datetime_to_nano(ti.start_date))
  File "/usr/local/lib/python3.10/site-packages/airflow/utils/dates.py", line 283, in datetime_to_nano
    return int(datetime.timestamp() * 1000000000)
AttributeError: 'NoneType' object has no attribute 'timestamp'

The exception was due to the processing of an executor event where the tied task instance start_date attribute did not have the expected type of datetime.

What you think should happen instead?

Instead of raising the exception and causing the schedule to terminate, the exception should be caught and an error log to be generated.

How to reproduce

Unknown.

Operating System

n/a

Versions of Apache Airflow Providers

No response

Deployment

Astronomer

Deployment details

No response

Anything else?

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@wolfier wolfier added area:core kind:bug This is a clearly a bug needs-triage label for new issues that we didn't triage yet labels Sep 24, 2024
@dosubot dosubot bot added the area:Scheduler including HA (high availability) scheduler label Sep 24, 2024
@jedcunningham
Copy link
Member

jedcunningham commented Sep 25, 2024

I wasn't able to reproduce this on main or 2.10.0, unfortunately.

edit: wait, might have overlooked something. Let me try again.

edit edit: I was able to reproduce this. I'll open a PR tomorrow.

@jedcunningham jedcunningham self-assigned this Sep 25, 2024
@nathadfield nathadfield removed the needs-triage label for new issues that we didn't triage yet label Sep 25, 2024
@pustoshilov-d
Copy link
Contributor

pustoshilov-d commented Oct 2, 2024

@wolfier @jedcunningham @nathadfield, hi!
I have the same issue on 2.10.4. I can't normally reproduce it, but have more info

[2024-10-02T15:09:12.515+0000] {executor_loader.py:254} INFO - Loaded executor: CeleryExecutor
[2024-10-02T15:09:12.565+0000] {scheduler_job_runner.py:935} INFO - Starting the scheduler
[2024-10-02T15:09:12.566+0000] {scheduler_job_runner.py:942} INFO - Processing each file at most -1 times
[2024-10-02T15:09:12.570+0000] {scheduler_job_runner.py:1847} INFO - Adopting or resetting orphaned tasks for active dag runs
[2024-10-02T15:09:12.675+0000] {otel_tracer.py:265} INFO - [OTLPSpanExporter] Connecting to OpenTelemetry Collector at http://1.1.1.1:24318/v1/traces
[2024-10-02T15:09:12.683+0000] {celery_executor.py:433} INFO - Adopted the following 1 tasks from a dead executor
        <TaskInstance: mgi__daily.trigger_mgi_bioinformatics manual__2024-10-01T22:02:42+03:00 [running]> in state SUCCESS
[2024-10-02T15:09:12.949+0000] {scheduler_job_runner.py:764} INFO - Received executor event with state success for task instance TaskInstanceKey(dag_id='mgi__daily', task_id='trigger_mgi_bioinformatics', run_id='manual__2024-10-01T22:02:42+03:00', try_number=1, map_index=-1)
[2024-10-02T15:09:12.960+0000] {scheduler_job_runner.py:801} INFO - TaskInstance Finished: dag_id=mgi__daily, task_id=trigger_mgi_bioinformatics, run_id=manual__2024-10-01T22:02:42+03:00, map_index=-1, run_start_date=2024-10-01 19:03:16.803553+00:00, run_end_date=None, run_duration=None, state=running, executor=CeleryExecutor(parallelism=1024), executor_state=success, try_number=1, max_tries=1, job_id=33823, pool=default_pool, queue=default, priority_weight=1, operator=TriggerDagRunOperator, queued_dttm=2024-10-01 19:03:15.226131+00:00, queued_by_job_id=41976, pid=47466
[2024-10-02T15:09:12.961+0000] {__init__.py:100} WARNING - Invalid type NoneType for attribute 'duration' value. Expected one of ['bool', 'str', 'bytes', 'int', 'float'] or a sequence of those types
[2024-10-02T15:09:12.966+0000] {scheduler_job_runner.py:1001} ERROR - Exception when executing SchedulerJob._run_scheduler_loop
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.12/site-packages/airflow/jobs/scheduler_job_runner.py", line 984, in _execute
    self._run_scheduler_loop()
  File "/home/airflow/.local/lib/python3.12/site-packages/airflow/jobs/scheduler_job_runner.py", line 1132, in _run_scheduler_loop
    num_finished_events += self._process_executor_events(
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.12/site-packages/airflow/jobs/scheduler_job_runner.py", line 852, in _process_executor_events
    span.add_event(name="ended", timestamp=datetime_to_nano(ti.end_date))
                                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.12/site-packages/airflow/utils/dates.py", line 283, in datetime_to_nano
    return int(datetime.timestamp() * 1000000000)
               ^^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'timestamp'

as I see from the logs it happens when executor_state==success and ti.state==running is passed there

if state in (TaskInstanceState.QUEUED, TaskInstanceState.RUNNING):
ti.external_executor_id = info
cls.logger().info("Setting external_id for %s to %s", ti, info)
continue

searching for the reasons of it in my case is an open question, I think something happens in dead scheduler/executor reincarnation's processes
if anyone has any ideas -- you're welcomed

but I guess this should be handled in the code because of failed ti.end_date=None there

span.add_event(name="ended", timestamp=datetime_to_nano(ti.end_date))

let me know if I could help more or write a PR

@pustoshilov-d
Copy link
Contributor

today caught it again with executor_state==success and ti.state==queued

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:core area:Scheduler including HA (high availability) scheduler kind:bug This is a clearly a bug needs-triage label for new issues that we didn't triage yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants