
Celery Executor: After killing the Redis or Airflow worker pod, queued tasks are not executed even after the pod is up #24731

Closed
vivek-zeta opened this issue Jun 29, 2022 · 14 comments

vivek-zeta commented Jun 29, 2022

Apache Airflow version

2.2.2

What happened

We are using the Celery executor with Redis as the broker, with default Celery settings.
We are testing the case below.

  • What happens when the Redis pod or a worker pod goes down?

Observations:

  • We killed the Redis pod while one task was in the queued state.

  • The task that was in the queued state stayed queued even after the Redis pod came back up.

  • From the Airflow UI we tried clearing the task and running it again, but it still got stuck in the queued state.

  • The task message was received by the Celery worker, but the worker did not start executing the task.

  • Let us know if there is any Celery or Airflow configuration we can change to avoid this issue.

  • Also, what is the best practice for handling such cases?

  • Please help us avoid this situation, as it would be very critical if it happened in production.

What you think should happen instead

The task must not get stuck in the queued state; it should start executing.

How to reproduce

While a task is in the queued state, kill the Redis pod.

Operating System

k8s

Versions of Apache Airflow Providers

No response

Deployment

Official Helm chart

Deployment details

No response

Anything else

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@vivek-zeta vivek-zeta added area:core kind:bug This is a clearly a bug labels Jun 29, 2022
vivek-zeta (Author)

@potiuk @jedcunningham

NaveenGokavarapu19 commented Jun 29, 2022

Hi, I am interested in taking up this issue. Can I contribute to it?

potiuk (Member) commented Jul 4, 2022

Assigned you

anu251989

@NaveenGokavarapu19 @potiuk, this issue is similar to the one I raised below:
Celery worker tasks in queued status when airflow-redis-master restarted #24498

potiuk (Member) commented Sep 18, 2022

@vivek-zeta @NaveenGokavarapu19, can you please check whether 2.4.0rc1 solves it? I am closing it provisionally, unless you test it and find that it is not fixed. We can always re-open it in that case.

@potiuk potiuk closed this as completed Sep 18, 2022
@potiuk potiuk added the duplicate Issue that is duplicated label Sep 18, 2022
@potiuk potiuk modified the milestones: Airflow 2.4.1, Airflow 2.4.0 Sep 18, 2022
malthe (Contributor) commented Sep 24, 2022

@potiuk what specifically in 2.4.0rc1 addresses this issue? Is that the liveness probe suggested by @jedcunningham or something else?

potiuk (Member) commented Sep 25, 2022

I am not 100% sure it is fixed, but it is likely fixed by the fix for #24498 (see above; it is referenced in this thread).

Since @vivek-zeta and @NaveenGokavarapu19 experienced this in earlier versions, the easiest way to see whether it is fixed is for them to try it. In fact, we never know for sure without detailed logs. We can always re-open it if it turns out not to be fixed.

stepanof commented Nov 17, 2022

@potiuk Hello Jarek.
I'm using a custom Docker Airflow image based on apache/airflow:2.4.1-python3.8 and docker-compose for deploying Airflow.
Recently I built HA clusters for the Postgres database and Redis. Both are used by the Airflow cluster (1 webserver, 2 schedulers, 2 workers).
I have faced a problem in the scheduler and worker at the moment the virtual IP of the Redis or Postgres cluster moves to another node: tasks get stuck in 'queued' or 'scheduled' status.
I attach the logs of a worker that got stuck when the Redis master moved to another node.
airflow_worker_logs_err.txt
Restarting the airflow-worker solves the problem.

To solve this problem I have added one more service to each Airflow instance, called 'autoheal'. It restarts a Docker container when it becomes 'unhealthy'.
We are using it in production, but it is a workaround. I think the Airflow scheduler and worker should be able to react to such situations without any additional services.

I am ready to help you debug this problem and find a solution; just tell me what I can do for the Airflow developers.
To tell the truth, this error isn't always reproducible; sometimes the scheduler and worker work normally while the Postgres or Redis master moves to another node. 🤷‍♀️

potiuk (Member) commented Nov 17, 2022

Thanks for the diagnosis, but I think you applied one of the "good" solutions, and there is not much we can or will do in Airflow for that.

I think what you did is (one of) the right approaches, not a workaround. This is expected. Airflow has no support for an active/active setup for Redis or Postgres and expects to talk to a single database server only. There is no way for Airflow components to recover when there is an established connection and the IP address of the component they talk to changes in such a way that Airflow does not even know the other party has changed address. This is really a deployment issue; I don't think Airflow should take such changes into account.

Airflow is not a "critical/real-time" service that should react to and reconfigure its networking dynamically, and we have no intention of turning it into such a service. Developing such an "auto-healing" service is far more costly, and unless someone comes up with an idea, creates an Airflow Improvement Proposal, and implements such auto-healing, this is not something that is going to happen. There are many consequences and complexities to implementing such services, and there is no need to do so for Airflow, because it is perfectly fine to restart and redeploy Airflow components from time to time - far easier and less costly in development and maintenance.

That task is put on the deployment - that's why, for example, our Helm chart has liveness probes and health checks, and auto-healing in K8s is done exactly the way you did it: when a service becomes unhealthy, you restart it. This is a perfectly OK and perfectly viable solution - especially for things like virtual IP changes, which happen infrequently. A sketch of such a worker health probe is shown below.
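For concreteness, a worker health probe along these lines might look roughly like the sketch below. Assumptions: the `airflow.executors.celery_executor.app` import path matches Airflow 2.4-era code (the Celery app moved to the Celery provider in later versions), and the script's exit code would be wired into whatever health check or autoheal-style restarter you use.

```python
import socket
import sys

# Assumption: this import path is valid for Airflow ~2.4; adjust for your version.
from airflow.executors.celery_executor import app


def worker_is_alive(timeout: float = 5.0) -> bool:
    """Ping the Celery worker running on this host via Celery's control channel."""
    destination = f"celery@{socket.gethostname()}"
    replies = app.control.inspect(destination=[destination], timeout=timeout).ping()
    return bool(replies)  # None or {} means the worker did not answer in time


if __name__ == "__main__":
    # A non-zero exit code lets the health check mark the container unhealthy.
    sys.exit(0 if worker_is_alive() else 1)
```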

An even better solution for you would be to react to the IP-change event itself and restart the services immediately. This is the kind of thing that usually should and can be done at the deployment level - Airflow has no knowledge of such events and cannot react to them, but your deployment can. And should. This will help you recover much faster.

Another option - if you want to avoid such restarts - would be to avoid changing the virtual IP and use static IP addresses allocated to each component. Changing virtual IP addresses is usually not something that happens in an enterprise setup; it is safe to assume you can arrange for the IP addresses to be static. Even if you have dynamically changing public IP addresses or node fail-overs, you can usually have static private ones and configure your deployment to use them.

potiuk (Member) commented Nov 17, 2022

Also, you can configure keep-alives on your connections to make such a fail-over faster. Postgres, Redis, and PgBouncer all have ways to configure keep-alives (look at the SQLAlchemy documentation, etc.), and you can usually tune them so that broken connections are detected sooner, which means Airflow components will naturally restart on "broken pipe"-type errors much faster.
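For example, for the Postgres metadata DB connection you can pass TCP keepalive options to the driver. A minimal sketch, assuming your Airflow version supports pointing the `sql_alchemy_connect_args` option (in the `[database]` section) at a dict defined in a module on the PYTHONPATH - check the configuration reference for your version. The keyword names themselves are standard libpq/psycopg2 parameters:

```python
# Hypothetical placement: a module such as airflow_local_settings.py, referenced from airflow.cfg as
#   [database]
#   sql_alchemy_connect_args = airflow_local_settings.keepalive_connect_args
keepalive_connect_args = {
    "keepalives": 1,            # enable TCP keepalives on the metadata DB connection
    "keepalives_idle": 30,      # seconds of idle time before the first keepalive probe
    "keepalives_interval": 10,  # seconds between unanswered probes
    "keepalives_count": 5,      # unanswered probes before the connection is considered dead
}
```

With settings like these, a connection whose peer silently vanished (e.g. after a virtual IP move) is torn down within a couple of minutes instead of lingering until the OS-level TCP timeout.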

stepanof

@potiuk Thank you for such an extended comment; I see your point.

I have one more question.
Sometimes, while the virtual IP and Postgres endpoint are changing, the airflow-worker tries to restart by itself (not via the 'autoheal' service).
But it can't restart because of this error:

[2022-11-18 18:39:34 +0300] [42] [INFO] Starting gunicorn 20.1.0
[2022-11-18 18:39:34 +0300] [42] [INFO] Listening at: http://[::]:8793 (42)
[2022-11-18 18:39:34 +0300] [42] [INFO] Using worker: sync
[2022-11-18 18:39:35 +0300] [43] [INFO] Booting worker with pid: 43
[2022-11-18 18:39:35 +0300] [44] [INFO] Booting worker with pid: 44
[2022-11-18 18:39:35 +0300] [42] [INFO] Handling signal: term
ERROR: Pidfile (/opt/airflow/airflow-worker.pid) already exists.
Seems we're already running? (pid: 1)
[2022-11-18 18:39:35 +0300] [43] [INFO] Worker exiting (pid: 43)
[2022-11-18 18:39:35 +0300] [44] [INFO] Worker exiting (pid: 44)
[2022-11-18 18:39:35 +0300] [42] [INFO] Shutting down: Master

The airflow-worker can keep restarting endlessly, and each time there is this error.
A manual restart of the worker (docker-compose down && docker-compose up) fixes the problem (/opt/airflow/airflow-worker.pid gets deleted).

Why isn't '/opt/airflow/airflow-worker.pid' deleted during an automatic worker restart?

During such an endless automatic restart, the worker container never reaches the "unhealthy" state (because it dies immediately), so 'autoheal' doesn't realize the worker should be rebooted.

Is it possible to fix it?

potiuk (Member) commented Nov 21, 2022

No idea how your liveness probe works, but generally all software that manages other running software (i.e., a deployment tool like Kubernetes) follows the usual sequence of events:

  • check whether the software is running and responding to some kind of liveness probe (see how the liveness probe is defined in our Helm chart, for example)
  • when the liveness probe fails for some time (usually several times in a row), declare the component unhealthy and attempt to stop it
  • usually this happens via SIGTERM and other "soft" signals that allow the component to shut down and clean up in response (if the software is able to shut itself down cleanly, it will remove the .pid file and any other resources)
  • when that does not succeed, escalate the signal (SIGTERM -> SIGHUP -> SIGKILL), giving the process time to actually react and clean up. SIGKILL cannot be handled; it shuts the process down immediately, and some things (like the .pid file) remain as they were.

And only AFTER that sequence, knowing that the component's process is down, should the "restart" happen.
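For illustration, here is a minimal, generic sketch of that escalation in Python (not Airflow code and not tied to any particular supervisor; the function name and grace period are made up for the example):

```python
import os
import signal
import time


def stop_before_restart(pid: int, grace_seconds: float = 30.0) -> None:
    """Generic supervisor pattern: soft signal first, hard kill only as a last resort."""
    os.kill(pid, signal.SIGTERM)          # ask the process to shut down and clean up (.pid file etc.)
    deadline = time.monotonic() + grace_seconds
    while time.monotonic() < deadline:
        try:
            os.kill(pid, 0)               # signal 0 only checks whether the process still exists
        except ProcessLookupError:
            return                        # it exited cleanly within the grace period
        time.sleep(1)
    os.kill(pid, signal.SIGKILL)          # cannot be handled; leftovers such as the .pid file may remain
    # ...only after this is it safe to start the replacement process
```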

If this is fulfilled (the process is not running), it does not matter whether the .pid file was deleted, because the process is not running any more (at worst it was SIGKILLed) and the .pid file only contains the process id of the process that was running. So when the Airflow component starts the next time and the .pid file is still there, it checks whether the process named in the .pid file is running; if not, it deletes the .pid file and runs. Only when the process in the .pid file is still running does it refuse to start.
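Roughly, that startup check follows this pattern (a sketch of the general approach described above, not Airflow's actual implementation; the function name and default path are illustrative):

```python
import os


def can_start(pid_file: str = "/opt/airflow/airflow-worker.pid") -> bool:
    """A stale .pid file is harmless if the process it points to is gone."""
    try:
        with open(pid_file) as f:
            recorded_pid = int(f.read().strip())
    except (FileNotFoundError, ValueError):
        return True                  # no (readable) pid file -> nothing blocks a fresh start
    try:
        os.kill(recorded_pid, 0)     # existence check only; no signal is actually delivered
    except ProcessLookupError:
        os.remove(pid_file)          # previous owner is dead -> drop the stale file and start
        return True
    return False                     # owner still running -> refuse to start a second copy
```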

And this is general advice: this is how the .pid file approach works for any process - nothing Airflow-specific. There are hundreds (if not thousands) of other applications working this way, and generally all software running under some kind of supervisor should be managed this way.

I have no idea how your docker-compose setup and killing work, or when the restart happens, but it should be done the same way, and you should configure docker-compose to do exactly this (this is what Kubernetes does, for example). You should look at the internals of docker-compose's behaviour when it restarts Airflow in such a case. I honestly don't know how to do it with docker-compose - maybe it is possible, maybe not, maybe it requires some tricks to make it work. Maybe you took it over completely with your own scripts, but the general approach should be the algorithm above: never restart an Airflow component unless you are absolutely sure you killed the previous process and it is gone.

I personally think of docker-compose as a very poor deployment that lacks a lot of the features and stability that a "real" production deployment like Kubernetes has. In my opinion it lacks some of the automation and some of the deployment features - precisely the kind you observe - when you want to do "real production stuff" with the software. Maybe it is because I do not know it well, maybe because it is hard, maybe because it is impossible. But I believe it is a very poor cousin of K8s when it comes to running "real/serious" production deployments.

When you choose it, you take on the responsibility, as the deployment manager, of sometimes doing manual recovery where docker-compose will not do it for you. It's one of the responsibilities you take on your shoulders.

And we as a community decided not to spend our time on making a "production-ready" docker-compose deployment, because we know this is not something we can give advice on, and those who decide to go down this path have to solve these problems on their own, in the way that is best for them.

In contrast, with the Helm chart we maintain - the chart and K8s combined - we are able to solve a lot of those problems (including liveness probes, restarts, etc.). It is much closer to something that runs "out of the box": once you have resources sorted out and available, a lot of the management is handled for you by the Helm/Kubernetes combo.

I am afraid you made the choice to use docker-compose despite our warnings. We warned that the one we have is not suitable for production (it's a quick-start), that it requires a lot of work to make it so, and that you need to become a docker-compose expert to solve these problems.

Also, you can take a look here, where we explain what kind of skills you need to have:

https://airflow.apache.org/docs/apache-airflow/stable/installation/index.html#using-production-docker-images

If you want to stick with docker-compose - good luck; you will run into a lot of things like this. If you find some solutions, you can even contribute them back to our docs as "good practices". But we will never turn that into "this is how you run a docker-compose deployment", as it is impossible to make into a general set of advice - at most it might be something like "if you get into this trouble -> maybe this solution will work".

potiuk (Member) commented Nov 21, 2022

BTW, I believe there is something very wrong with your restart scenario and configuration in general - some mistakes or a misunderstanding of how the image entrypoint works.

ERROR: Pidfile (/opt/airflow/airflow-worker.pid) already exists.
Seems we're already running? (pid: 1)

I think there are some things you are doing wrong here, and they have compounded:

  1. It seems that you run Airflow as the init process in your container. This is possible, but you need to realise the consequences for signal propagation and do it properly. You might fall into many traps if you do it wrongly, so I recommend you read why the Airflow image uses dumb-init as its init process and what consequences that has (especially for Celery): https://airflow.apache.org/docs/docker-stack/entrypoint.html#signal-propagation

The .pid file should only contain '1' if your process is started as the "init" process - and that means the container will be killed when your process is killed. When you use dumb-init, as we do by default in our image, dumb-init has process id 1; but in your case your Airflow process will always have process id 1, and that is the original root cause of the problem you have.

  2. Then the problem is most likely that you write the .pid file to a shared volume, which makes the pid file survive the killing of the container. This is very, very wrong. If you rely on restarting the container and your process has PID = 1, you should never save the .pid file on a shared volume that can outlive the container, because you will get exactly the problem you have. Your Airflow worker will always start as the init process with PID = 1. So even if the previous process has been killed, merely restarting the container creates a new process with ID 1, so Airflow is really comparing the PID file created by the previous "1" process with itself (which also runs with PID = 1), and it will never start.

This is very much against the container philosophy. The .pid file should always be stored on ephemeral container storage, so that when your container is stopped, the .pid file is gone with it. Make sure you do not keep the .pid file on a shared volume, especially if you run your airflow command as the entrypoint.

In general, if you restart whole containers rather than processes, the .pid file should NEVER be stored on a shared volume - it should always live on the ephemeral container filesystem so that it is automatically deleted when the whole container is killed.
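To make the failure mode concrete: if the worker runs as PID 1 and the .pid file lives on a volume that outlives the container, the stale-pid check sketched in the previous comment can never succeed, because PID 1 always exists inside the freshly started container (an illustration under exactly those assumptions):

```python
import os

with open("/opt/airflow/airflow-worker.pid") as f:  # stale file left behind by the previous container
    recorded_pid = int(f.read().strip())            # contains 1 in this scenario

try:
    os.kill(recorded_pid, 0)                        # PID 1 is the new process itself, so this always succeeds
    print(f"Seems we're already running? (pid: {recorded_pid})")
except ProcessLookupError:
    print("Stale pid file - safe to start")         # unreachable while recorded_pid == 1
```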

So I think you should really rethink the way the entrypoint works in your images, the way the .pid files get created and stored, and the way the restart of a failed container works - it seems all three are custom-done by you, and they compound into the problem you experience. When you use the docker-compose approach, you need to understand how all of this works in concert, how those elements interact, and how to make it production-robust.

It seems you have chosen a pretty hard path to walk; going down the beaten Helm + Kubernetes path, without diverging too much from the approach we proposed, would have solved most of this.

Sathya-omnius

Hi Team,

We have Airflow (version 2.5.3) with the Celery executor and a Redis queue. In one of our environments the Redis health check failed; after some time Redis started working again, but the Celery worker stopped working and was not processing anything, and documents got stuck in the queue. I don't see any logs in the Airflow worker; I can see only the following message:
Connected
[2023-10-04 14:49:39,203: INFO/MainProcess] sync with celery@xxxx-pipelines-worker-79b7578f8b-qkmp4

The airflow-worker stopped working; once I restarted the worker pod, all the queued documents started processing. Can someone provide a fix for this scenario?
