Celery worker tasks in queued status when airflow-redis-master restarted #24498

Closed
2 tasks done
anu251989 opened this issue Jun 16, 2022 · 5 comments
Labels
area:core duplicate Issue that is duplicated kind:bug This is a clearly a bug
@anu251989

Apache Airflow version

2.2.5

What happened

When the airflow-redis-master pod is deployed after the Celery worker pods, the worker pods cannot process any tasks until they are manually restarted. I reproduced this by killing the airflow-redis-master pod: the worker pods lose their connection to it and stop processing tasks until they are manually restarted.

In the logs we can see a missed heartbeat from another worker pod. We are facing this issue in version 2.2.5 and did not face it in version 1.10.12.

The missed heartbeat is the last message in the logs.

[2022-05-30 17:53:18,213: INFO/MainProcess] Connected to redis://:**@airflow-redis-master.auto1.svc.cluster.local:6379/1
[2022-05-30 17:53:18,228: INFO/MainProcess] mingle: searching for neighbors
[2022-05-30 17:53:19,239: INFO/MainProcess] mingle: all alone
[2022-05-30 17:53:24,246: INFO/MainProcess] missed heartbeat from celery@airflow-worker-0

We updated the configuration below, but it did not help.

AIRFLOW__CELERY_BROKER_TRANSPORT_OPTIONS__MAX_RETRIES=6
AIRFLOW__CELERY_BROKER_CONNECTION_TIMEOUT=60
AIRFLOW_CELERY_BROKER_HEARTBEAT=360
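
For reference, Airflow reads configuration overrides from environment variables named `AIRFLOW__{SECTION}__{KEY}`, with a double underscore on both sides of the section name. The sketch below is only an illustration: `airflow_env_var` and `matches_convention` are hypothetical helpers, not part of Airflow, and the check says nothing about whether a given section or key exists in a particular Airflow version. It only tests the three names above against the naming pattern.

```python
def airflow_env_var(section: str, key: str) -> str:
    """Build an override name in the AIRFLOW__{SECTION}__{KEY} form."""
    return "AIRFLOW__{}__{}".format(section.upper(), key.upper())


def matches_convention(name: str) -> bool:
    """Check the shape only: 'AIRFLOW', a section, and a key joined by '__'."""
    parts = name.split("__")
    return len(parts) == 3 and parts[0] == "AIRFLOW" and all(parts[1:])


# The three names from the report above, checked against the pattern.
reported_names = [
    "AIRFLOW__CELERY_BROKER_TRANSPORT_OPTIONS__MAX_RETRIES",
    "AIRFLOW__CELERY_BROKER_CONNECTION_TIMEOUT",
    "AIRFLOW_CELERY_BROKER_HEARTBEAT",
]
for name in reported_names:
    print(name, matches_convention(name))
# AIRFLOW__CELERY_BROKER_TRANSPORT_OPTIONS__MAX_RETRIES True
# AIRFLOW__CELERY_BROKER_CONNECTION_TIMEOUT False
# AIRFLOW_CELERY_BROKER_HEARTBEAT False
```

A name that does not match the pattern is unlikely to be picked up as an Airflow setting at all, so it may be worth ruling that out before concluding that a setting has no effect.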

broker_connection_timeout=

As a workaround, we configured a liveness check for the worker pods with the command below.
celery --app airflow.executors.celery_executor.app inspect ping
However, the pods are restarted only if the health check fails for all worker nodes. If the health check fails for just one worker, the liveness probe still considers that pod healthy, because it gets a response from another, healthy worker pod.
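
One possible way to make the probe answer only for the pod it runs in is to address the ping to a single node with the `-d` (`--destination`) option of `celery inspect`. The sketch below assumes that each worker registers under the default node name `celery@<pod hostname>`, as in the log line `celery@airflow-worker-0` above; `build_ping_command` and `local_worker_is_alive` are hypothetical helper names, and the 30-second timeout is an arbitrary choice.

```python
import socket
import subprocess
from typing import List


def build_ping_command(hostname: str) -> List[str]:
    """Build an `inspect ping` command addressed to one worker only."""
    return [
        "celery",
        "--app", "airflow.executors.celery_executor.app",
        "inspect", "ping",
        # -d/--destination limits the ping to the named node, so a reply
        # from a different, healthy worker cannot satisfy the probe.
        # Assumption: the worker uses the default name celery@<hostname>.
        "-d", "celery@{}".format(hostname),
    ]


def local_worker_is_alive(timeout_seconds: float = 30.0) -> bool:
    """Return True only if the ping command for this host exits with 0."""
    command = build_ping_command(socket.gethostname())
    try:
        result = subprocess.run(
            command,
            timeout=timeout_seconds,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=False,
        )
    except (subprocess.TimeoutExpired, FileNotFoundError):
        # No answer in time, or the celery executable is missing.
        return False
    return result.returncode == 0
```

A probe script could then exit with `0 if local_worker_is_alive() else 1`, so that a worker that has lost its broker connection fails its own liveness check regardless of the state of the other workers.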

What you think should happen instead

The worker pods should re-establish the connection to the airflow-redis-master node once the Redis pod is back up.
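
To make the expected behavior concrete, here is a minimal sketch of a reconnect loop with exponential backoff. It is not how Celery or Kombu implement reconnection; `reconnect_with_backoff` and the `connect` callable are hypothetical, and the default of 6 retries only mirrors the `MAX_RETRIES=6` value tried above.

```python
import time
from typing import Callable


def reconnect_with_backoff(
    connect: Callable[[], None],
    max_retries: int = 6,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call `connect` until it succeeds; return the number of attempts used.

    Re-raises the last ConnectionError once `max_retries` attempts have failed.
    """
    if max_retries < 1:
        raise ValueError("max_retries must be at least 1")
    delay = base_delay
    for attempt in range(1, max_retries + 1):
        try:
            connect()
            return attempt
        except ConnectionError:
            if attempt == max_retries:
                raise
            # Wait before the next attempt, doubling the delay up to max_delay.
            sleep(delay)
            delay = min(delay * 2, max_delay)
    raise AssertionError("unreachable")
```

With this shape, a worker that loses the broker keeps retrying while the Redis pod is rescheduled and resumes work on the first successful attempt, instead of staying idle until it is restarted by hand.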

How to reproduce

Delete the airflow-redis-master pod and monitor the worker logs. After some time, a missed heartbeat appears in the logs and the worker pods can no longer process any tasks.

Operating System

"Debian GNU/Linux 10 (buster)"

Versions of Apache Airflow Providers

apache-airflow-providers-amazon==3.2.0
apache-airflow-providers-celery==2.1.3
apache-airflow-providers-cncf-kubernetes==3.0.0
apache-airflow-providers-docker==2.5.2
apache-airflow-providers-elasticsearch==2.2.0
apache-airflow-providers-ftp==2.1.2
apache-airflow-providers-google==6.7.0
apache-airflow-providers-grpc==2.0.4
apache-airflow-providers-hashicorp==2.1.4
apache-airflow-providers-http==2.1.2
apache-airflow-providers-imap==2.2.3
apache-airflow-providers-microsoft-azure==3.7.2
apache-airflow-providers-mysql==2.2.3
apache-airflow-providers-odbc==2.0.4
apache-airflow-providers-postgres==4.1.0
apache-airflow-providers-redis==2.0.4
apache-airflow-providers-sendgrid==2.0.4
apache-airflow-providers-sftp==2.5.2
apache-airflow-providers-slack==4.2.3
apache-airflow-providers-sqlite==2.1.3
apache-airflow-providers-ssh==2.4.3

Deployment

Official Apache Airflow Helm Chart

Deployment details

https://github.com/apache/airflow

Anything else

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@anu251989 anu251989 added area:core kind:bug This is a clearly a bug labels Jun 16, 2022
@boring-cyborg

boring-cyborg bot commented Jun 16, 2022

Thanks for opening your first issue here! Be sure to follow the issue template!

@potiuk
Member

potiuk commented Jul 3, 2022

This seems like something we should address - I don't have much knowledge about Celery, but it looks like one that is worth looking at.

@potiuk potiuk added this to the Airflow 2.4.0 milestone Jul 3, 2022
@potiuk
Member

potiuk commented Jul 3, 2022

Not critical, but definitely one to look at in 2.4.0 (added it to the milestone).

@anu251989
Author

Not critical, but definitely one to look at in 2.4.0 (added it to the milestone).

Thanks for the reply.

@potiuk
Member

potiuk commented Jul 7, 2022

Duplicate of #24731 - closing.

@potiuk potiuk closed this as completed Jul 7, 2022
@potiuk potiuk added the duplicate Issue that is duplicated label Jul 7, 2022