Celery worker enters a catatonic state after redis restart #26542
Comments
Hmmm, this feels like a Celery issue (I'm not saying we shouldn't look at fixing it, just surprised that Celery doesn't do it for us)
Yep! I linked the GitHub Discussion from the Celery GitHub. This is more just a record to keep track of this behaviour.
The new liveness probe on the Celery worker (#25561) doesn't fix the root issue but does put in safeguards to reset the worker when it becomes catatonic.
I am having exactly the same issue with my Airflow 2.5.1 |
Are you using the official chart that has the safeguards mentioned above?
Sorry, my Airflow was deployed on normal EC2 instances, not Kubernetes.
Yes. You need to run similar monitoring and restarting of Celery on your own. The liveness probe periodically checks whether a component is responding and restarts it when it is not. This is a standard feature of K8s, and since you chose to manage Airflow on your own, you should add similar monitoring yourself. You can take a look at the PR and take it as an inspiration when writing yours. Generally speaking it's a good practice to do this kind of monitoring and liveness probing, and if you choose to run your own deployment, you should make sure you do similar checks - this is standard practice with any kind of application. Our Helm chart provides all of this out-of-the-box and can serve as an inspiration for anyone who would like to continue using their own deployment.
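To make the above a bit more concrete for non-Kubernetes deployments, here is a minimal sketch of the kind of external check the chart's liveness probe performs. It is not the chart's exact probe: the `airflow-worker` systemd unit name is an assumption about your setup, and the Celery app import path (`airflow.executors.celery_executor.app`) is the Airflow 2.x location and differs in newer provider versions.

```python
#!/usr/bin/env python3
"""Sketch of a worker liveness check for a non-Kubernetes deployment.

Assumptions (adjust for your setup): the worker runs under systemd as
`airflow-worker`, and the Celery app lives at
`airflow.executors.celery_executor.app`.
"""
import socket
import subprocess
import sys

CELERY_APP = "airflow.executors.celery_executor.app"  # assumption: Airflow 2.x path
WORKER_UNIT = "airflow-worker"                         # assumption: your systemd unit


def worker_responds(timeout: int = 60) -> bool:
    """Ping this host's Celery worker; a hung (catatonic) worker will not reply."""
    cmd = [
        "celery", "--app", CELERY_APP,
        "inspect", "ping",
        "-d", f"celery@{socket.gethostname()}",
        "--timeout", str(timeout),
    ]
    return subprocess.run(cmd, capture_output=True).returncode == 0


if __name__ == "__main__":
    if not worker_responds():
        # The worker is not responding: restart it, mirroring what the
        # Kubernetes liveness probe does by killing the pod.
        subprocess.run(["systemctl", "restart", WORKER_UNIT], check=False)
        sys.exit(1)
```

Run it from cron or a systemd timer every few minutes, the same way the chart's probe runs on an interval.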
I actually don't think the livenessProbe solves the issue in all cases. It's rare, but I still see this issue cause workers to become useless when they lose connection with Redis.
We were able to improve the behavior with these settings:
In our repro (which was to force-kill the Redis pod) the Celery worker would get stuck waiting forever for a response when it tried to heartbeat itself to the failed Redis pod. I believe adding the socket timeout allows the connection to close in this scenario, which ultimately allows the container to be restarted. There's also a similar issue reported in Celery that may possibly be resolved in 5.4.0 (see comment). But I have not tested it.
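The exact settings referenced above did not survive the copy, so here is only a sketch of the kind of configuration being discussed: adding a socket timeout to the Redis broker transport options via a custom Celery config module. The `my_celery_config` name and the timeout values are illustrative; the option names are standard kombu/redis-py transport options, and pointing `[celery] celery_config_options` at such a dict is the usual Airflow override mechanism.

```python
# my_celery_config.py -- a sketch, not the exact settings from the comment above.
# Assumption: your deployment points [celery] celery_config_options at this dict,
# e.g. AIRFLOW__CELERY__CELERY_CONFIG_OPTIONS=my_celery_config.CELERY_CONFIG
from airflow.config_templates.default_celery import DEFAULT_CELERY_CONFIG

CELERY_CONFIG = {
    **DEFAULT_CELERY_CONFIG,
    "broker_transport_options": {
        **DEFAULT_CELERY_CONFIG.get("broker_transport_options", {}),
        # Fail the blocking call instead of hanging forever when the
        # Redis pod disappears mid-connection.
        "socket_timeout": 30,
        "socket_connect_timeout": 30,
        "socket_keepalive": True,
        "retry_on_timeout": True,
    },
}
```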
@dstandish
Should we turn this into some kind of best-practice doc @dstandish :D?
Amazing, very glad to hear it... let us hope it continues 🤞
We could set those in the Helm chart too?
Yeah. Good idea.
I'd wait with the doc changes to see if Celery 5.4.0 solves this problem at the root; then we can simply bump the version.
I think it depends on whether those are "generically applicable". A number of our users will still not use the latest Celery; Celery has quite a number of dependencies of its own, and it "adds" on top of the dependencies plain Airflow has. The problem with Celery is that - similarly to Airflow - the worker and Celery's dependencies share the same virtualenv the task runs in, so there might be plenty of reasons why our users would want to use an older Celery version (in older Airflow) or maybe even downgrade the Celery version used in future versions of Airflow. Celery is pretty special because it generally impacts everyone running any task with any provider combination, so I'd imagine it might be much harder to expect people to upgrade it to 5.4.0. So if we ALSO have a way to bring older Celery configurations to "reasonable defaults", it's worth considering explaining that in the documentation and even in our Helm chart, and I am even happy to do it myself @dstandish @eladkal, but I need to know whether those are defaults that are "ok" to be considered sane defaults.
And just to add a bit more context to the discussion: it's a known fact that a number of our users have trouble with the stability of Celery-based deployments. If we know a way to make them more stable, keeping it as a "secret sauce" makes little sense. Some people have also proposed other solutions to improve stability (which might or might not be related, and are far more invasive), and possibly the stability issues could be helped with better default configuration (in the docs and in the chart). I am particularly referring to #31829, which proposes disabling the "ack late" feature to improve the stability of Airflow Celery deployments, and which comes from the creator of the "user community" chart. We have - so far - generally ignored that one, mostly, I guess, because we did not want to spend time analysing all the consequences of disabling ack late. I personally have a feeling that it changes a lot of the promises about delivery and task handling in Celery and shifts the problem to another set of edge cases. But maybe, by making the defaults work "better" out-of-the-box, the reason for disabling ack late would generally go away, as better defaults might improve Celery deployment stability. That's just a thought and more context. The "ack late" case has bothered me for quite some time, but I do not have enough experience/knowledge in Celery to provide more than the intuition that maybe this change, applied properly, might solve the problems that lead others to try other, effectively worse, things to improve Celery's stability.
And the new PR for "ack_late" - to complete the picture: #37066
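For readers unfamiliar with the knob being debated, a sketch only (using the same `celery_config_options` override mechanism as the earlier sketch) of what "disabling ack late" would look like; whether flipping it is a good idea is exactly what the linked PRs discuss, and the claim about the shipped default is my reading of Airflow's default Celery config, not something settled in this thread.

```python
# ack_late_sketch.py -- illustration only, not a recommendation.
from airflow.config_templates.default_celery import DEFAULT_CELERY_CONFIG

CELERY_CONFIG = {
    **DEFAULT_CELERY_CONFIG,
    # Airflow's defaults enable task_acks_late: a task message is acknowledged
    # only after the task finishes, so a crashed worker's tasks get redelivered.
    # #31829 / #37066 propose acking early instead, trading that redelivery
    # guarantee for a different set of edge cases.
    "task_acks_late": False,
}
```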
I am OK with it.
Apache Airflow version
main (development)
What happened
The worker seems to be stuck in a catatonic state where queued task instance messages are not consumed from Redis.
Redis restarted while the worker remained as-is. The worker output logs indicating a loss of connection, but it was able to reconnect after Redis came back online.
What you think should happen instead
After Redis comes back online and the worker reconnects, the worker should consume the messages and execute the queued task instances.
How to reproduce
Operating System
N/A
Versions of Apache Airflow Providers
No response
Deployment
Astronomer
Deployment details
No response
Anything else
There was a GitHub Discussion earlier this year about this behaviour.
This didn't seem to be an issue on an earlier version of Celery (4.4.7). The currently installed version is celery==5.2.7 and I use Redis 6.2.6.
Are you willing to submit PR?
Code of Conduct