Worker sometimes does not reconnect to redis/celery queue after crash #27032
Comments
Thanks for opening your first issue here! Be sure to follow the issue template!
Thanks for filing this issue @AutomationDev85! I see that you're willing to submit a PR, let me know if you would like this issue assigned to you 😄 As a follow-up: are you sure this is a bug in Airflow or is this actually a Celery issue?
Now that the new Helm Chart 1.7.0 has been released, does anybody know or expect whether the new liveness probe on the Celery worker will fix this problem implicitly? (#25561)
I think the only way to check is to try it (maybe you can try it @jens-scheffler-bosch - that would help us to close the issue).
We are on it. We have just deployed via Helm chart 1.7.0, but as this problem only appeared randomly, it is hard to predict whether it is resolved. I'd be okay to close with the (positive) assumption that it is fixed, and we can come back if we see another problem.
Maybe just keep it running for a while and let us know after a few days of running (depending on the previously observed frequency). If you do not see it after 2x the "average" observation time, we might assume it works :)
Closing. 2 weeks passed. @jens-scheffler-bosch - if you had any issue, you can comment here still.
Apache Airflow version
2.4.1
What happened
We are running an Airflow deployment and had the issue that the redis pod died and then some tasks got stuck in the queued state. Only after killing the worker pod were the tasks consumed by the worker again. I wanted to analyse this in more detail and saw that this behavior only occurs sometimes!
To me it looks like the worker sometimes does not detect that the connection to the redis pod broke:
What you think should happen instead
Expected behavior is that the worker reconnects to redis automatically and starts consuming queued tasks again.
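To illustrate the kind of behavior meant here, a minimal sketch of a reconnect loop with exponential backoff follows. It is not how Celery or Airflow implement reconnection; the function name `reconnect_with_backoff`, its parameters, and the `connect` callable are all hypothetical, and the built-in `ConnectionError` is used only as a stand-in for whatever error a real broker client raises.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def reconnect_with_backoff(
    connect: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `connect` until it succeeds or `max_attempts` is exhausted.

    Hypothetical helper for illustration only. `connect` is assumed to
    raise the built-in ConnectionError when the broker is unreachable.
    """
    last_error: Optional[BaseException] = None
    for attempt in range(max_attempts):
        try:
            return connect()
        except ConnectionError as exc:
            last_error = exc
            if attempt == max_attempts - 1:
                break  # no point in sleeping after the final attempt
            # Double the wait on every failure, capped at max_delay.
            sleep(min(max_delay, base_delay * 2 ** attempt))
    raise ConnectionError(
        f"gave up after {max_attempts} attempts"
    ) from last_error
```

The `sleep` parameter is injectable so the loop can be exercised without real waiting; a caller would pass a function that opens the broker connection as `connect`.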
How to reproduce
Operating System
AKSUbuntu-1804gen2
Versions of Apache Airflow Providers
No response
Deployment
Official Apache Airflow Helm Chart
Deployment details
Using an AKS cluster in Azure to host Airflow.
Anything else
No response
Are you willing to submit PR?
Code of Conduct