
Airflow Task/DAG fails if connection to OTEL collector fails (when otel integration is enabled) #34405

sa1 opened this issue Sep 15, 2023 · 5 comments
Labels
kind:bug This is clearly a bug telemetry Telemetry-related issues

Comments

@sa1
Contributor

sa1 commented Sep 15, 2023

Apache Airflow version

2.7.1

What happened

I enabled the experimental OTEL integration, and sometimes the connection to the OTEL collector fails. Such connection failures are expected and common. However, right now the task fails whenever the export fails, so an extra point of failure is added to every task and DAG. Sometimes the failure happens before the DAG has even started, so task-level retries can't help.

The only error message I see in this case is the connection failure.

urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='localhost', port=9999): Max retries exceeded with url: /v1/metrics (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7ff41c054430>: Failed to establish a new connection: [Errno 111] Connection refused'))

This is not printed to the Airflow UI, only to the worker logs, so it's not obvious why a task/DAG failed.

What you think should happen instead

In this situation, Airflow should print a warning and continue with the task.

When any other Python application is auto-instrumented with OTEL, the automatic instrumentation behaves in the desired way: it ignores connection failures and only prints a warning message.

Maybe this behaviour could be made configurable, but the desired default should be to ignore the exception.
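
For illustration only (this is not Airflow's actual code, and SafeOTLPMetricExporter is a made-up name), here is a minimal sketch of what "warn and continue" could look like at the exporter level, using the public opentelemetry-python OTLP HTTP metric exporter:

```python
# Hypothetical sketch only -- not Airflow's implementation.
# Wraps the OTLP metric exporter so that a collector outage is logged
# as a warning and the metrics batch is dropped, instead of raising.
import logging

from opentelemetry.exporter.otlp.proto.http.metric_exporter import OTLPMetricExporter
from opentelemetry.sdk.metrics.export import MetricExportResult

log = logging.getLogger(__name__)


class SafeOTLPMetricExporter(OTLPMetricExporter):
    """Never let a failed export propagate to the task that triggered it."""

    def export(self, metrics_data, timeout_millis=10_000, **kwargs):
        try:
            return super().export(metrics_data, timeout_millis=timeout_millis, **kwargs)
        except Exception:  # e.g. urllib3.exceptions.MaxRetryError when the collector is down
            log.warning("OTEL collector unreachable; dropping this metrics batch", exc_info=True)
            return MetricExportResult.FAILURE
```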

How to reproduce

Enable OTEL integration, and turn off the collector. Run any DAG/task and they will fail.
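
For example, a configuration like the following (assuming the [metrics] OTEL options documented for Airflow 2.7; port 9999 matches the traceback above, with nothing listening on it) is enough to trigger the failure:

```ini
[metrics]
otel_on = True
otel_host = localhost
# no collector is listening on this port
otel_port = 9999
```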

Operating System

Ubuntu 22.04.3 LTS

Versions of Apache Airflow Providers

apache-airflow-providers-amazon==8.6.0
apache-airflow-providers-celery==3.3.3
apache-airflow-providers-common-sql==1.7.1
apache-airflow-providers-ftp==3.5.1
apache-airflow-providers-http==4.5.1
apache-airflow-providers-imap==3.3.1
apache-airflow-providers-openlineage==1.0.2
apache-airflow-providers-postgres==5.6.0
apache-airflow-providers-redis==3.3.1
apache-airflow-providers-slack==8.0.0
apache-airflow-providers-snowflake==5.0.0
apache-airflow-providers-sqlite==3.4.3
apache-airflow-providers-ssh==3.7.2

Deployment

Other Docker-based deployment

Deployment details

Docker based custom deployment on ECS Fargate.
Separate fargate tasks for webserver, worker, scheduler and triggerer.
Otel collector is running as an agent in each task.

Anything else

The task fails every time the connection to the OTEL collector fails. Why the collector connection fails in the first place is the subject of a separate investigation; it may have to do with the size of the data/metrics being sent to the collector. But I believe those reasons are not very relevant to this bug.

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@sa1 sa1 added area:core kind:bug This is clearly a bug needs-triage label for new issues that we didn't triage yet labels Sep 15, 2023
@boring-cyborg

boring-cyborg bot commented Sep 15, 2023

Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise PR to address this issue please do so, no need to wait for approval.

@hussein-awala
Member

Yes, I think we should add some configuration on how to handle OTEL connection failure, but I don't know if we should treat it as a bug fix or a new feature.

cc: @ferruzzi @potiuk @ephraimbuddy

@Taragolis Taragolis added provider:openlineage AIP-53 telemetry Telemetry-related issues and removed needs-triage label for new issues that we didn't triage yet provider:openlineage AIP-53 area:core labels Sep 18, 2023
@ferruzzi
Contributor

Interesting, thanks for the issue. Personally, I feel we can treat this as a bugfix, especially if we are wrapping it in a config option. I can see the argument for making "log and move on" the default behavior, though; let's discuss it a bit and see what folks think, and I can sort out the solution once we have some idea of how to proceed.

I'll cross-post this to the mailing list and try to get some conversation going.

@utkarsharma2
Contributor

utkarsharma2 commented Sep 22, 2023

I too think we should treat it as a bug, mainly because Airflow can still function and process tasks/DAGs even without exporting telemetry data, so any hard dependency on the collector is artificial and should be avoided. I would be in favor of making it a configurable option with the default behavior of "log and move on".

@thesuperzapper
Contributor

@potiuk @kaxil @eladkal @ferruzzi I think this is a show-stopping issue for Open Telemetry integration in Airflow.

@potiuk said that he thinks this is expected behavior (see: #40286 (comment)), but I strongly disagree for the following reasons:

  1. This is not the behavior of StatsD integration. That is, StatsD being down does not cause all tasks across the cluster to fail.
  2. Whether the telemetry was sent successfully does not change the fact that my task may have succeeded in making some external change. For example, if my task was loading data into a table, I really don't want to run it twice just because OpenTelemetry was down and the task was therefore marked as "failed".

At the very least, we need to make this a config option, but I honestly think the default value should be "warn and continue" rather than "fail the task", as it's so dangerous in the current state.
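
For illustration, one possible shape for such a config gate (purely hypothetical; neither the otel_ignore_export_errors option nor this helper exists in Airflow today):

```python
# Hypothetical only: the option name and this helper are made up for illustration.
# Shows how a "warn and continue" default could be gated behind a config flag.
import logging

from airflow.configuration import conf

log = logging.getLogger(__name__)

# Hypothetical option; default True means "warn and continue".
IGNORE_EXPORT_ERRORS = conf.getboolean("metrics", "otel_ignore_export_errors", fallback=True)


def safe_export(export_fn, *args, **kwargs):
    """Call the real exporter; swallow connection errors unless the user opted out."""
    try:
        return export_fn(*args, **kwargs)
    except Exception:
        if IGNORE_EXPORT_ERRORS:
            log.warning("OTEL export failed; continuing without telemetry", exc_info=True)
            return None
        raise
```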
