Airflow DAG fails to run if dag_id + task_id is too long with OTEL integration enabled.
#34416
Comments
It seems that in my case, the task_id of the first task in the DAG is enough to trigger this exception, so the entire DAG failed, but presumably only a task would fail.
Let's keep discussing the DAG failure/no failure after an OTEL failure in #34405, to avoid discussing it twice. As for the metric name limit, we have the same limitation with K8s resources, and we fixed that by truncating the name to its first X characters. We can do the same thing with OTEL metrics.
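A minimal sketch of the truncation idea suggested above, not the actual Airflow code. The function name and constant are hypothetical; the 63-character limit is the one cited in this issue.

```python
# Limit cited in this issue for OTEL metric names.
OTEL_NAME_MAX_LENGTH = 63


def truncate_metric_name(name: str, max_length: int = OTEL_NAME_MAX_LENGTH) -> str:
    """Hypothetical helper: keep only the first max_length characters of a name."""
    # Slicing is a no-op for names that already fit.
    return name[:max_length]
```

One trade-off of this approach is that two distinct long names sharing the same first 63 characters would collapse into a single metric.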
I think truncation is already happening further down the code, over here: airflow/airflow/metrics/validators.py, lines 157 to 158 in 35699ac.
But metrics that are not in the exemption list trigger the exception before reaching that point. As the comment says, we should be careful about introducing new exemptions, but I think that's the short-term solution required here: airflow/airflow/metrics/validators.py, lines 51 to 57 in 35699ac.
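The ordering problem described in this comment can be sketched roughly as follows. This is an illustration of the control flow only, not the contents of validators.py: the exemption entry, the exception class, and the function name are all hypothetical.

```python
MAX_LENGTH = 63  # limit cited in this issue

# Hypothetical stand-in for the exemption list; the real entries are in validators.py.
EXEMPT_PREFIXES = ("example.exempt.",)


class InvalidMetricName(Exception):
    """Hypothetical error type used only in this sketch."""


def validated_name(name: str) -> str:
    """Reject over-long names unless exempt; exempt names are truncated instead."""
    if len(name) <= MAX_LENGTH:
        return name
    if not name.startswith(EXEMPT_PREFIXES):
        # Non-exempt names fail here and never reach the truncation step below.
        raise InvalidMetricName(f"{name!r} exceeds {MAX_LENGTH} characters")
    return name[:MAX_LENGTH]
```

Under this flow, a new metric with a long name fails outright unless someone remembers to add it to the exemption list, which matches the behaviour reported here.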
I'll try implementing the temporary fix today during the contributors' workshop at Airflow Summit.
Yeah, this is a known issue and the reason for that exemption list and truncation. We can't just rename all of the metrics because that would break back-compat, and that's why many of them are emitted twice (once with everything embedded in the name and once with tags). It looks like those three metrics you called out were added after the change and SHOULD have been implemented with tags instead (and therefore should not have been added to the exemption list... but we didn't catch that in time, so I guess it's the best answer). The unit test only makes sure the exemption list isn't changed; it doesn't check for new metrics which might break... maybe some kind of CI test would be wise, to prevent future new metrics from being added which have both
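To illustrate the "emitted twice" pattern mentioned in this comment, here is a rough sketch. The emit callable and the metric names (example_duration) are hypothetical and chosen only for illustration; they are not taken from the Airflow code base.

```python
from typing import Callable, Dict, Optional

# Hypothetical emitter signature: (name, value, tags) -> None
Emit = Callable[[str, float, Optional[Dict[str, str]]], None]


def emit_duration(emit: Emit, dag_id: str, task_id: str, value: float) -> None:
    """Emit one measurement in both the legacy and the tagged form."""
    # Legacy form: identifiers are embedded in the name, so its length grows with them.
    emit(f"dag.{dag_id}.{task_id}.example_duration", value, None)
    # Tagged form: the name has a fixed length; identifiers travel as tags.
    emit("task.example_duration", value, {"dag_id": dag_id, "task_id": task_id})
```

With the tagged form, the metric name stays the same length no matter how long the dag_id or task_id is, which is why it sidesteps the name limit.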
This was addressed in #34531; closing. If it is still an issue, feel free to reopen.
Apache Airflow version
2.7.1
What happened
Airflow DAG fails to run if the dag_id and task_id combination is too long. The following exception is raised and logged in worker logs:
There are no visible logs in the Airflow UI that would indicate the problem. The Airflow UI just shows this as logs for the failed task:
There's nothing more logged.
The metrics documentation claims that stat_name_handler can be used to rename stat names, which might work around this issue, but it seems the OTEL integration doesn't use this handler; only the statsd and datadog integrations do.
What you think should happen instead
The dag_id/task_id combination is obviously too long to be sent to OTEL as a metric name (which has a max limit of just 63 characters), but the DAG itself should not fail in this case.
A number of metrics are excluded from the length check here, but it seems queued_duration is not one of them, so the DAG run fails before even starting.
airflow/airflow/metrics/validators.py
Line 57 in 35699ac
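For a sense of the numbers, here is a small sketch using the prefix, dag_id, and task_id from the reproduction steps in this report. The name layout (prefix, then dag, dag_id, task_id, and queued_duration joined by dots) is an assumption about how the metric name is composed, not something confirmed in this thread.

```python
OTEL_NAME_MAX_LENGTH = 63  # limit cited in this issue

prefix = "dev-cad"
dag_id = "datahub_config_deployment"
task_id = "viper_entrypoint"

# Assumed layout: <prefix>.dag.<dag_id>.<task_id>.queued_duration
metric_name = f"{prefix}.dag.{dag_id}.{task_id}.queued_duration"

print(metric_name)
print(len(metric_name), len(metric_name) > OTEL_NAME_MAX_LENGTH)  # 70 True
```

Under that assumed layout, the fixed parts of the name take 29 characters with this prefix, leaving only 34 characters for dag_id and task_id combined.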
It seems expensive to change many dag_id values to work around this issue, as changing the dag_id usually means renaming files and losing history as well.
How to reproduce
Enable the OTEL integration with dev-cad as the prefix, datahub_config_deployment as the dag_id, and viper_entrypoint as the task_id, then trigger a new DAG run. The first task fails, and subsequently so does the rest of the DAG.
Operating System
Ubuntu 22.04.3 LTS
Versions of Apache Airflow Providers
apache-airflow-providers-amazon==8.6.0
apache-airflow-providers-celery==3.3.3
apache-airflow-providers-common-sql==1.7.1
apache-airflow-providers-ftp==3.5.1
apache-airflow-providers-http==4.5.1
apache-airflow-providers-imap==3.3.1
apache-airflow-providers-openlineage==1.0.2
apache-airflow-providers-postgres==5.6.0
apache-airflow-providers-redis==3.3.1
apache-airflow-providers-slack==8.0.0
apache-airflow-providers-snowflake==5.0.0
apache-airflow-providers-sqlite==3.4.3
apache-airflow-providers-ssh==3.7.2
Deployment
Other Docker-based deployment
Deployment details
Docker based custom deployment on ECS Fargate.
Separate fargate tasks for webserver, worker, scheduler and triggerer.
Anything else
Along with #34405, these are issues where OTEL exceptions are leading to the failure of airflow DAGs.
Are you willing to submit PR?
Code of Conduct