Celery Executor: After killing Redis or Airflow Worker Pod, queued Tasks not getting executed even after pod is up. #24731
Comments
Hi, I am interested in taking up this issue. Can I contribute to it? |
Assigned you |
@NaveenGokavarapu19 @potiuk, this issue is similar to the one I raised below. |
Can you please check, @vivek-zeta @NaveenGokavarapu19, whether 2.4.0rc1 solves it? I am closing it provisionally; if you test it and see that it is not fixed, we can always re-open it. |
@potiuk what specifically in 2.4.0rc1 addresses this issue? Is that the liveness probe suggested by @jedcunningham or something else? |
I am not 100% sure it is fixed, but it is likely addressed by the fix for #24498 (see above; it is referenced in the thread). Since @vivek-zeta and @NaveenGokavarapu19 experienced this in earlier versions, the easiest way to confirm is for them to try it. Without detailed logs we can never know for sure, and we can always re-open the issue if it is not fixed. |
@potiuk Hello Jarek. To solve this problem I have added one more service to each Airflow instance, called 'autoheal'. It restarts a Docker container when it becomes 'unhealthy'. I am ready to help you debug this problem and find a solution; just tell me what I can do for the Airflow developers. |
Thanks for the diagnosis, but I think you applied one of the "good" solutions, and there is not much we can or will do in Airflow for that. What you did is the right approach (one of them), not a workaround. This is expected: Airflow has no support for an active/active setup for Redis or Postgres and expects to talk to a single database server. There is no way for Airflow components to recover when an established connection's peer changes its IP address in such a way that Airflow does not even know the other party has moved. This is really a deployment issue; Airflow is not a "critical/real-time" service that should react to such changes and reconfigure its networking dynamically, and we have no intention of turning it into one. Developing such an "auto-healing" service is far more costly, and unless someone comes up with the idea, creates an Airflow Improvement Proposal and implements it, this is not something that is going to happen. There are many consequences and complexities to implementing such a service, and there is no need to do so for Airflow, because it is perfectly fine to restart and redeploy Airflow components from time to time - far easier and less costly for development and maintenance.

This task is put on the deployment. That is why, for example, our Helm chart has liveness probes and health checks: auto-healing in K8s is done exactly the way you did it - when a service becomes unhealthy, you restart it (a rough sketch of what such a worker liveness check boils down to is shown below this comment). This is a perfectly viable solution, especially when things like virtual IP changes happen infrequently. An even better solution for you would be to react to the IP-change event itself and restart the services immediately. This is the kind of thing that usually should and can be done at the deployment level - Airflow has no knowledge of such events and cannot react to them, but your deployment can, and should. That will help you recover much faster. Another option, if you want to avoid such restarts, is to avoid changing the virtual IP and use static IP addresses allocated to each component. Changing virtual IP addresses is not something that usually happens in an enterprise setup; it is safe to assume you can arrange for static addresses. Even if you have dynamically changing public IP addresses or node fail-overs, you can usually keep static private addresses and configure your deployment to use them. |
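For illustration only: a minimal Python sketch of what such a worker liveness check amounts to - pinging the worker over the Celery control channel, roughly what the Helm chart's `celery inspect ping` probe does. The import path and probe details vary between Airflow versions, so treat this as a hedged sketch rather than the chart's actual probe.

```python
# Hypothetical liveness check for a Celery worker (not the Helm chart's actual probe).
# Assumes the worker uses the default "celery@<hostname>" name and that the broker
# (Redis) is reachable from where this runs.
import socket

# In Airflow 2.x the Celery app lives here; newer versions moved it to the Celery
# provider package (airflow.providers.celery.executors.celery_executor).
from airflow.executors.celery_executor import app


def worker_is_alive(timeout: float = 10.0) -> bool:
    """Return True if the local Celery worker answers a control-channel ping."""
    hostname = f"celery@{socket.gethostname()}"
    replies = app.control.inspect(destination=[hostname], timeout=timeout).ping()
    # ping() returns e.g. {"celery@worker-0": {"ok": "pong"}}, or nothing on failure
    return bool(replies)


if __name__ == "__main__":
    raise SystemExit(0 if worker_is_alive() else 1)
```

Wired into a liveness probe or container healthcheck, a non-zero exit code marks the worker unhealthy so the deployment (or an 'autoheal'-style service) can restart it.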
Also, you can configure keep-alives on your connections to make such fail-overs faster. Postgres, Redis, PgBouncer - all of those have a way to configure keep-alives (look at the SQLAlchemy documentation, etc.), and you can usually tune them so that broken connections are detected faster, which means Airflow components will fail with "broken pipe"-style errors and restart much sooner (a hedged example for the metadata database is sketched below). |
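As an illustration (not from the thread): a minimal sketch of TCP keepalive settings for the metadata database connection, assuming Postgres with the psycopg2 driver. The module name keepalive_settings.py and the numbers are made up for this example; the sql_alchemy_connect_args option lives in the [database] section on newer Airflow versions and in [core] on older ones.

```python
# keepalive_settings.py -- hypothetical module that must be importable by Airflow
# (e.g. placed on PYTHONPATH). These are standard libpq/psycopg2 TCP keepalive
# parameters passed through SQLAlchemy's connect_args; tune the numbers so that a
# dead connection is detected faster than your fail-over window.
keepalive_kwargs = {
    "keepalives": 1,            # enable TCP keepalives on the DB connection
    "keepalives_idle": 30,      # seconds of idle time before the first probe
    "keepalives_interval": 10,  # seconds between unanswered probes
    "keepalives_count": 5,      # unanswered probes before the connection is dropped
}
```

Airflow can then be pointed at that dict, e.g. with AIRFLOW__DATABASE__SQL_ALCHEMY_CONNECT_ARGS=keepalive_settings.keepalive_kwargs. Redis and PgBouncer have their own, separate keepalive settings.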
@potiuk Thank you for such an extended comment, I see your point. I have one more question.
The airflow-worker can keep restarting endlessly, and each time this error appears. During such an endless automatic restart, the container with the worker never reaches the "unhealthy" state (because it dies immediately), so 'autoheal' does not understand that the worker should be restarted. Is it possible to fix this? |
No idea how your liveness probe works. But generally, any software that supervises another piece of running software (i.e. a deployment manager like Kubernetes) follows the usual sequence of events: signal the process to stop (SIGTERM), wait a grace period for it to shut down, and only if it is still running after that, kill it forcefully (SIGKILL).
Only AFTER that sequence, knowing that the component's process is down, should the "restart" happen - whether or not the .pid file has been deleted does not matter, because the process is not running any more (at worst it was SIGKILLed) and the .pid file contains the process id of the process that was running. So when the Airflow component starts the next time and the .pid file has not been deleted, it will check whether the process named in the .pid file is running; if it is not, it will delete the pid file and run. Only when the process named in the .pid file is still running will it refuse to start. This is general advice - it is how the .pid file approach works for any process, nothing Airflow-specific; hundreds (if not thousands) of other applications work this way, and generally all software running under some kind of supervisor should be managed this way (a minimal sketch of this stale-pid check is shown after this comment).

I have no idea how your docker-compose setup and your killing work or when the restart happens, but it should be done the same way, and you should configure docker-compose to behave exactly like this (this is what Kubernetes does, for example). You should look at the internals of docker-compose's behaviour when it restarts Airflow in such a case. I honestly don't know how to do it with docker-compose - maybe it is possible, maybe not, maybe it requires some tricks to make it work. Maybe you took it over completely with your scripts, but the general approach should be the algorithm above: never restart an Airflow component unless you are absolutely sure you killed the previous process and it is gone.

I personally think of docker-compose as a very poor deployment tool that lacks a lot of the features and stability that a "real" production deployment like Kubernetes has. In my opinion it lacks some of the automation and deployment features - precisely the kind you observe when you want to do some "real production stuff" with the software. Maybe it is because I do not know it well enough, maybe because it is hard, maybe because it is impossible, but I believe it is a very poor cousin of K8s when it comes to running "real/serious" production deployments. When you choose it, you take on the responsibility, as the deployment manager, of sometimes doing manual recovery where docker-compose will not do it for you. It is one of the responsibilities you take on your shoulders. And we as a community decided not to spend our time on making a "production-ready" docker-compose deployment, because we know this is not something we can give advice on; those who decide to go down this path have to solve these problems on their own, in the way that is best for them.

Contrary to that, with the Helm chart which we maintain - the chart and K8s combined - we are able to solve a lot of those problems (including liveness probes, restarts, etc.). It is much closer to something that runs "out of the box": once you have resources sorted out and available, a lot of the management is handled for you by the Helm/Kubernetes combo. I am afraid you made the choice to use docker-compose despite our warnings. We warned that the one we provide is not suitable for production (it is a quick-start), that it requires a lot of work to make it so, and that you need to become a docker-compose expert to solve these issues. You can also take a look here, where we explain what kind of skills you need to have. If you want to stick with docker-compose - good luck, you will have a lot of things like this. If you find some solutions, you can even contribute them back to our docs as "good practices".
But we will never turn it into "this is how you run a docker-compose deployment", as that is impossible to make into a general set of advice - at most it might become advice of the form "if you get into this trouble -> maybe this solution will work". |
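To make the stale-.pid check concrete, here is a minimal sketch of the general pattern (an illustration of the algorithm described above, not Airflow's internal implementation; the pid-file path is made up):

```python
# Minimal sketch of the generic stale-.pid-file check described above.
# Not Airflow's actual code; the pid-file path below is illustrative only.
import os


def pid_file_is_stale(pid_file: str) -> bool:
    """Return True when the .pid file does not point at a running process."""
    try:
        with open(pid_file) as f:
            pid = int(f.read().strip())
    except (FileNotFoundError, ValueError):
        return True   # no file, or unreadable content: nothing holds the lock
    try:
        os.kill(pid, 0)           # signal 0: existence check only, nothing is sent
    except ProcessLookupError:
        return True               # that process is gone, the leftover file is stale
    except PermissionError:
        return False              # a process with that pid exists (another user's)
    return False                  # process still running: do not start a second copy


# Supervisor-side usage: only restart the component once the old process is
# confirmed dead; a stale file can simply be removed first.
PID_FILE = "/opt/airflow/airflow-worker.pid"
if pid_file_is_stale(PID_FILE):
    try:
        os.remove(PID_FILE)
    except FileNotFoundError:
        pass
```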
BTW, I believe there is something very wrong with your restarting scenario and configuration in general - some mistakes or a misunderstanding of how the image entrypoint works.
I think there are several things you are doing wrong here, and they compound:
The .pid file should only contain '1' if your process is started as the "init" process - which means the container will be killed when your process is killed. When you use dumb-init, as we do by default in our image, dumb-init has process id 1; in your case, however, your airflow process itself always has process id 1, and that is the original root cause of the problem you have (see the sketch after this comment).
This is very much against the container philosophy. The .pid file should always be stored on ephemeral container storage, so that when your container is stopped, the .pid file is gone. Make sure you do not keep the .pid file in a shared volume, especially if you run your airflow command as the entrypoint. In general, if you restart whole containers rather than processes, the .pid file should NEVER be stored in a shared volume - it should always live on ephemeral container storage so that it gets deleted automatically when the whole container is killed. So I think you should really rethink the way the entrypoint works in your images, the way the .pid files get created and stored, and the way the restart process for a failed container works - it seems all three are custom-built by you, and they compound into the problem you experience. When you use the docker-compose approach, you need to understand how all of this works in concert, how these elements interact, and how to make it production-robust. It seems you have chosen a pretty hard path to walk; going the beaten Helm + Kubernetes path, without diverging too much from the approach we propose, would have solved most of it. |
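A tiny illustrative check for the PID-1 symptom described above (nothing Airflow-specific; just a way to confirm whether the airflow command is running as the container's init process):

```python
# If the airflow command is the container entrypoint with no init (dumb-init/tini)
# in front of it, it runs as PID 1, so any .pid file it writes will contain "1",
# and a file left over from a previous container can never be matched to a live process.
import os

if os.getpid() == 1:
    print("This process is the container's init; its .pid file would just say 1.")
else:
    print(f"Running under an init/supervisor; actual PID is {os.getpid()}.")
```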
Hi team, we have Airflow (version 2.5.3) with the Celery executor and a Redis queue. In one of our environments the Redis health check failed; after some time Redis started working again, but the Celery worker stopped working and was not processing anything, and documents got stuck in the queue. I don't see any logs in the Airflow worker; I can only see the message that the airflow-worker stopped working. Once I restarted the worker pod, all the queued documents started processing. Can someone provide a fix for this scenario? |
Apache Airflow version
2.2.2
What happened
We are using celery executor with Redis as broker.
We are using default settings for celery.
We are trying to test the case below.
Observations:
We tried killing the Redis pod when one task was in the queued state. We observed that the task which was in the queued state stayed in the queued state even after the Redis pod came back up.
From the Airflow UI we tried clearing the task and running it again, but it still got stuck in the queued state.
The task message was received at the celery worker, but the worker does not start executing the task.
Let us know if we can try changing any celery or airflow config to avoid this issue.
Also, what is the best practice to handle such cases?
Please help us avoid this case, as it is very critical if it happens in production.
What you think should happen instead
The task must not get stuck in the queued state; it should start executing.
How to reproduce
While a task is in the queued state, kill the Redis pod.
Operating System
k8s
Versions of Apache Airflow Providers
No response
Deployment
Official Helm chart
Deployment details
No response
Anything else
No response
Are you willing to submit PR?
Code of Conduct