Airflow DAG fails to run if dag_id + task_id is too long with OTEL integration enabled. #34416

Closed · sa1 opened this issue Sep 16, 2023 · 6 comments
Labels: kind:bug, telemetry

sa1 (Contributor) commented Sep 16, 2023

Apache Airflow version

2.7.1

What happened

An Airflow DAG fails to run if the dag_id and task_id combination is too long. The following exception is raised and logged in the worker logs:

Failed to execute task Invalid stat name: dev-cad.dag.datahub_config_deployment.viper_entrypoint.queued_duration. Please see https://opentelemetry.io/docs/reference/specification/metrics/api/#instrument-name-syntax.

There are no visible logs in the Airflow UI that would indicate the problem. The UI just shows the following as the logs for the failed task:

[2023-09-16, 05:35:16 UTC] {taskinstance.py:1157} INFO - Dependencies all met for dep_context=non-requeueable deps ti=<TaskInstance: datahub_config_deployment.viper_entrypoint scheduled__2023-09-16T05:30:15.926505+00:00 [queued]>
[2023-09-16, 05:35:16 UTC] {taskinstance.py:1157} INFO - Dependencies all met for dep_context=requeueable deps ti=<TaskInstance: datahub_config_deployment.viper_entrypoint scheduled__2023-09-16T05:30:15.926505+00:00 [queued]>
[2023-09-16, 05:35:16 UTC] {taskinstance.py:1359} INFO - Starting attempt 1 of 1

There's nothing more logged.

The metrics documentation claims that stat_name_handler can be used to rename stat names, which might work around this issue, but it seems the OTel integration doesn't use this handler; only the StatsD and Datadog integrations do.
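
For reference, a stat_name_handler is just a callable that receives the stat name and returns a replacement. A minimal sketch of one that shortens over-long names (the module path and hashing scheme here are purely illustrative) could look like this:

# my_company/metrics.py -- hypothetical module, referenced from airflow.cfg as
# [metrics] stat_name_handler = my_company.metrics.shorten_stat_name
import hashlib

OTEL_LIMIT = 63  # OTel instrument-name length limit

def shorten_stat_name(stat_name: str) -> str:
    # Return the name unchanged if it fits, otherwise replace the tail
    # with a short hash so the result stays within the limit.
    if len(stat_name) <= OTEL_LIMIT:
        return stat_name
    digest = hashlib.sha1(stat_name.encode()).hexdigest()[:8]
    return f"{stat_name[:OTEL_LIMIT - 9]}.{digest}"

As noted above, though, this only helps if the OTel writer actually routes names through the handler, which it currently does not.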

What you think should happen instead

The dag_id/task_id combination is obviously too long to be sent to OTel as a metric name (which has a maximum length of just 63 characters), but the DAG itself should not fail in this case.

There is a set of metrics that are exempted from the length check here, but it seems queued_duration is not among them, so the DAG run fails before even starting.

BACK_COMPAT_METRIC_NAME_PATTERNS: set[str] = {

It seems expensive to change many dag_ids to work around this issue, as changing a dag_id usually means renaming files and losing history as well.
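
To make the length concrete, the stat name from the worker log above is 70 characters, well past that 63-character limit:

>>> stat = "dev-cad.dag.datahub_config_deployment.viper_entrypoint.queued_duration"
>>> len(stat)  # OTel instrument names are limited to 63 characters
70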

How to reproduce

Enable the OTel integration with the prefix dev-cad, a dag_id of datahub_config_deployment, and a task_id of viper_entrypoint, then trigger a new DAG run. The first task fails, and subsequently the rest of the DAG fails as well.
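
For illustration, a minimal DAG along these lines (assuming OTel metrics are already enabled in airflow.cfg and the metric prefix is set to dev-cad; the operator choice is arbitrary) should reproduce the failure:

# Hypothetical minimal reproduction sketch.
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

with DAG(
    dag_id="datahub_config_deployment",
    start_date=datetime(2023, 9, 1),
    schedule=None,
    catchup=False,
) as dag:
    # The resulting stat name
    # "dev-cad.dag.datahub_config_deployment.viper_entrypoint.queued_duration"
    # is 70 characters, over the 63-character limit, so the task fails as soon
    # as it is queued.
    EmptyOperator(task_id="viper_entrypoint")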

Operating System

Ubuntu 22.04.3 LTS

Versions of Apache Airflow Providers

apache-airflow-providers-amazon==8.6.0
apache-airflow-providers-celery==3.3.3
apache-airflow-providers-common-sql==1.7.1
apache-airflow-providers-ftp==3.5.1
apache-airflow-providers-http==4.5.1
apache-airflow-providers-imap==3.3.1
apache-airflow-providers-openlineage==1.0.2
apache-airflow-providers-postgres==5.6.0
apache-airflow-providers-redis==3.3.1
apache-airflow-providers-slack==8.0.0
apache-airflow-providers-snowflake==5.0.0
apache-airflow-providers-sqlite==3.4.3
apache-airflow-providers-ssh==3.7.2

Deployment

Other Docker-based deployment

Deployment details

Docker-based custom deployment on ECS Fargate.
Separate Fargate tasks for the webserver, worker, scheduler, and triggerer.

Anything else

Along with #34405, these are issues where OTel exceptions lead to the failure of Airflow DAGs.

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

sa1 added the area:core, kind:bug, and needs-triage labels on Sep 16, 2023
sa1 (Contributor, Author) commented Sep 16, 2023

It seems that in my case the task_id of the first task in the DAG was enough to trigger this exception, so the entire DAG failed; in other cases, presumably only the affected task would fail.

hussein-awala added the telemetry label and removed the area:core and needs-triage labels on Sep 16, 2023
hussein-awala (Member) commented:

> The dag_id/task_id combination is obviously too long to be sent to OTel as a metric name (which has a max limit of just 63 characters), but the DAG itself should not fail in this case.

Let's keep discussing whether the DAG should fail after an OTel failure in #34405, to avoid discussing it twice.

For the metric name limit: we have the same limitation on K8s resource names, and we handle it by truncating the name and keeping only the first X characters; we can do the same thing with OTel metrics.
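
As a rough sketch of that suggestion (not the actual Airflow code), truncating instead of raising would amount to something like:

OTEL_NAME_MAX_LENGTH = 63  # OTel instrument-name length limit

def sanitize_otel_name(stat_name: str) -> str:
    # Keep only the first OTEL_NAME_MAX_LENGTH characters instead of failing the task.
    return stat_name[:OTEL_NAME_MAX_LENGTH]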

sa1 (Contributor, Author) commented Sep 16, 2023

I think truncation is already happening further down in the code, here:

f"Stat name {stat_name} matches exemption {matched_exemption} and "
f"will be truncated to {proposed_stat_name[:OTEL_NAME_MAX_LENGTH]}. "

But metrics that are not in the exemption list trigger the exception before reaching that point. As the comment says, we should be careful about introducing new exemptions, but I think that is the short-term solution required here.

# The following set contains existing metrics whose names are too long for
# OpenTelemetry and should be deprecated over time. This is implemented to
# ensure that any new metrics we introduce have names which meet the OTel
# standard while also allowing us time to deprecate the old names.
# NOTE: No new names should be added to this list. This list should
# only ever shorten over time as we deprecate these names.
BACK_COMPAT_METRIC_NAME_PATTERNS: set[str] = {
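
In rough terms, the flow being described here (simplified, not the actual validator code) is: names that fit pass through, over-long names matching an exemption pattern are truncated with a warning, and everything else raises the exception that fails the task:

import re
import warnings

OTEL_NAME_MAX_LENGTH = 63

# Illustrative placeholder pattern only; the real list lives in airflow.metrics.validators.
BACK_COMPAT_METRIC_NAME_PATTERNS: set[str] = {r"^some\.legacy\.metric\..*"}

class InvalidStatsNameException(Exception):
    pass

def check_otel_name(stat_name: str) -> str:
    if len(stat_name) <= OTEL_NAME_MAX_LENGTH:
        return stat_name
    for pattern in BACK_COMPAT_METRIC_NAME_PATTERNS:
        if re.match(pattern, stat_name):
            warnings.warn(f"Stat name {stat_name} matches exemption {pattern} and will be truncated.")
            return stat_name[:OTEL_NAME_MAX_LENGTH]
    raise InvalidStatsNameException(f"Invalid stat name: {stat_name}.")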

sa1 (Contributor, Author) commented Sep 21, 2023

I'll try implementing the temporary fix today during the contributors' workshop at the Airflow Summit.

ferruzzi (Contributor) commented:

Yeah, this is a known issue and the reason for that exemption list and truncation. We can't just rename all of the metrics because that would break backwards compatibility, and that's why many of them are emitted twice (once with everything embedded in the name and once with tags).

It looks like those three metrics you called out were added after the change and SHOULD have been implemented with tags instead (and therefore should not have been added to the exemption list... but we didn't catch that in time, so I guess it's the best answer).

The unit test only makes sure the exemption list isn't changed; it doesn't check for new metrics which might break. Maybe some kind of CI test would be wise to prevent future metrics from being added with both dag_id and task_id in their name. I don't know what that would look like, though... we'd maybe have to parse the raw text of the changes, looking for lines starting with Stats that include dag_id and task_id and do not match the exemption list patterns?
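
One very rough sketch of what such a check could look like (entirely hypothetical; it just scans a unified diff for added Stats calls whose metric name embeds both IDs):

import re
import sys

# Flag added lines that emit a metric with both dag_id and task_id in its name,
# e.g.  +    Stats.timing(f"dag.{dag_id}.{task_id}.queued_duration", ...)
SUSPECT = re.compile(r"^\+.*\bStats\.\w+\(\s*f?[\"'].*dag_id.*task_id")

def find_suspect_metrics(diff_text: str) -> list[str]:
    return [line for line in diff_text.splitlines() if SUSPECT.search(line)]

if __name__ == "__main__":
    hits = find_suspect_metrics(sys.stdin.read())
    if hits:
        print("Possible over-long metric names (dag_id and task_id embedded):")
        print("\n".join(hits))
        sys.exit(1)

In CI this could be fed the output of git diff against the base branch, and it could additionally skip lines that already match the existing exemption patterns.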

ferruzzi (Contributor) commented:

This was addressed in #34531; closing. If it is still an issue, feel free to reopen.
