
Celery worker enters a catatonic state after redis restart #26542

Open
wolfier opened this issue Sep 21, 2022 · 19 comments
Comments

@wolfier
Contributor

wolfier commented Sep 21, 2022

Apache Airflow version

main (development)

What happened

The worker seems to be stuck in a catatonic state where queued task instance messages are not consumed from Redis.

Redis restarted while the worker stayed up. The worker logged that it had lost the connection and reconnected once Redis came back online.

[2022-09-19 23:58:00,794: ERROR/MainProcess] consumer: Cannot connect to redis://:**@accurate-axis-9558-redis:6379/0: Error 111 connecting to accurate-axis-9558-redis:6379. Connection refused..
Trying again in 2.00 seconds... (1/100)

[2022-09-19 23:58:03,802: ERROR/MainProcess] consumer: Cannot connect to redis://:**@accurate-axis-9558-redis:6379/0: Error 111 connecting to accurate-axis-9558-redis:6379. Connection refused..
Trying again in 4.00 seconds... (2/100)

[2022-09-19 23:58:08,830: ERROR/MainProcess] consumer: Cannot connect to redis://:**@accurate-axis-9558-redis:6379/0: Error 111 connecting to accurate-axis-9558-redis:6379. Connection refused..
Trying again in 6.00 seconds... (3/100)

[2022-09-19 23:58:15,866: ERROR/MainProcess] consumer: Cannot connect to redis://:**@accurate-axis-9558-redis:6379/0: Error 111 connecting to accurate-axis-9558-redis:6379. Connection refused..
Trying again in 8.00 seconds... (4/100)

[2022-09-19 23:58:24,890: ERROR/MainProcess] consumer: Cannot connect to redis://:**@accurate-axis-9558-redis:6379/0: Error 111 connecting to accurate-axis-9558-redis:6379. Connection refused..
Trying again in 10.00 seconds... (5/100)

[2022-09-19 23:58:34,907: INFO/MainProcess] Connected to redis://:**@accurate-axis-9558-redis:6379/0
[2022-09-19 23:58:34,915: INFO/MainProcess] mingle: searching for neighbors
[2022-09-19 23:58:35,923: INFO/MainProcess] mingle: all alone

What you think should happen instead

After Redis comes back online and the worker has reconnected, the worker should consume the messages and execute the queued task instances.

How to reproduce

  1. Delete the existing Redis pod; the worker should then be unable to connect to Redis (see the sketch after this list)
  2. Redis restarts and the worker reconnects as expected
  3. The worker does not consume new messages (queued task instances)
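
For a Kubernetes-based deployment, step 1 can be forced with something like the commands below (a sketch; the pod names and namespace are placeholders for whatever your Redis and worker deployments use):

# Kill the Redis pod out from under the worker, then watch the worker logs for the
# reconnect and check whether newly queued task instances are ever picked up.
kubectl delete pod <redis-pod-name> -n <airflow-namespace>
kubectl logs -f <worker-pod-name> -n <airflow-namespace>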

Operating System

N/A

Versions of Apache Airflow Providers

No response

Deployment

Astronomer

Deployment details

No response

Anything else

There was a GitHub Discussion earlier this year about this behaviour.

This didn't seem to be an issue on an earlier version of Celery (4.4.7).

The currently installed version is celery==5.2.7, and Redis is at version 6.2.6.

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@wolfier wolfier added area:core kind:bug This is a clearly a bug labels Sep 21, 2022
@potiuk potiuk added affected_version:2.4 Issues Reported for 2.4 priority:medium Bug that should be fixed before next release but would not block a release labels Sep 21, 2022
@ashb
Member

ashb commented Sep 21, 2022

Hmmm, this feels like a Celery issue, as all airflow celery worker does is run the celery app.

(I'm not saying we shouldn't look at fixing it, just surprised that celery doesn't do it for us)

@wolfier
Contributor Author

wolfier commented Sep 22, 2022

Yep! I linked the GitHub Discussion from the Celery GitHub repo.

This is more just a record to keep track of this behaviour.

@wolfier
Contributor Author

wolfier commented Oct 16, 2022

The new liveness probe on the Celery Worker (#25561) doesn't fix the root issue but does put in safeguards to reset the worker when workers become catatonic.
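
(For reference, the health check behind such a probe boils down to pinging the worker over the Celery control channel, roughly like the sketch below; the exact --app path differs between Airflow versions, so treat it as an assumption.)

# A worker that has gone catatonic or lost its broker connection will not answer
# this ping before the timeout. Older Airflow versions expose the app at
# airflow.executors.celery_executor.app instead.
celery --app airflow.providers.celery.executors.celery_executor.app \
    inspect ping --destination celery@$(hostname) --timeout 10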

@Scorpion2115

I am having exactly the same issue with my Airflow 2.5.1

@potiuk
Member

potiuk commented Sep 4, 2023

I am having exactly the same issue with my Airflow 2.5.1

Are you using the official chart that has the safeguards mentioned in the comment quoted below?

The new liveness probe on the Celery Worker (#25561) doesn't fix the root issue but does put in safeguards to reset the worker when workers become catatonic.

@eladkal eladkal added area:providers provider:celery and removed priority:medium Bug that should be fixed before next release but would not block a release area:core affected_version:2.4 Issues Reported for 2.4 labels Sep 5, 2023
@Scorpion2115

I am having exactly the same issue with my Airflow 2.5.1

Are you using the official chart that has the safeguards mentioned in the comment quoted below?

The new liveness probe on the Celery Worker (#25561) doesn't fix the root issue but does put in safeguards to reset the worker when workers become catatonic.

Sorry, my Airflow is deployed on plain EC2 instances, not Kubernetes.
Is there any similar fix available?

@potiuk
Member

potiuk commented Sep 5, 2023

Yes. You need to run similar monitoring and restarting of Celery on your own. A liveness probe periodically checks whether a component is responding and restarts it when it is not. This is a standard feature of Kubernetes, and since you chose to manage Airflow on your own, you should add similar monitoring yourself. You can take a look at the PR and use it as inspiration when writing yours.

Generally speaking, this kind of monitoring and liveness probing is good practice, and if you choose to run your own deployment, you should make sure you do similar checks - this is standard practice for any kind of application.

Our Helm chart provides all of this out of the box and can serve as inspiration for anyone who would like to keep using their own deployment (you can run helm install --dry-run to see the generated Kubernetes resources; if you look at the liveness probes of the various components, you will see which CLI commands you can use).
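
(As an illustration only - not something Airflow ships for EC2 deployments - such a check could be as simple as a cron-driven script that pings the local worker and restarts it when the ping fails. The service name and app path below are assumptions to adapt to your setup.)

#!/usr/bin/env bash
# Hypothetical watchdog for a non-Kubernetes deployment: ping this worker over the
# broker and restart its (assumed) systemd unit if it does not respond in time.
WORKER="celery@$(hostname)"
if ! celery --app airflow.providers.celery.executors.celery_executor.app \
    inspect ping --destination "$WORKER" --timeout 10 >/dev/null 2>&1; then
  systemctl restart airflow-celery-worker   # adjust to however you run "airflow celery worker"
fi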

@RNHTTR
Contributor

RNHTTR commented Nov 17, 2023

Yes. You need to run similar monitoring and restarting of Celery on your own. A liveness probe periodically checks whether a component is responding and restarts it when it is not. This is a standard feature of Kubernetes, and since you chose to manage Airflow on your own, you should add similar monitoring yourself. You can take a look at the PR and use it as inspiration when writing yours.

Generally speaking, this kind of monitoring and liveness probing is good practice, and if you choose to run your own deployment, you should make sure you do similar checks - this is standard practice for any kind of application.

Our Helm chart provides all of this out of the box and can serve as inspiration for anyone who would like to keep using their own deployment (you can run helm install --dry-run to see the generated Kubernetes resources; if you look at the liveness probes of the various components, you will see which CLI commands you can use).

I actually don't think the livenessProbe solves the issue in all cases. It's rare, but I still see this issue cause workers to become useless when they lose their connection to Redis.

@dstandish
Contributor

We were able to improve the behavior with these settings:

AIRFLOW__CELERY_BROKER_TRANSPORT_OPTIONS__SOCKET_TIMEOUT=30
AIRFLOW__CELERY_BROKER_TRANSPORT_OPTIONS__SOCKET_CONNECT_TIMEOUT=5
AIRFLOW__CELERY_BROKER_TRANSPORT_OPTIONS__SOCKET_KEEPALIVE=True
AIRFLOW__CELERY_BROKER_TRANSPORT_OPTIONS__RETRY_ON_TIMEOUT=True

In our repro (which was to force kill the redis pod) the celery worker would get stuck waiting forever for a response when it tried to heartbeat itself to the failed redis pod. I believe adding the socket timeout allows the connection to close in this scenario, which ultimately allows the container to be restarted.

There's also a similar issue reported in celery that may possibly be resolved in 5.4.0 (see comment). But I have not tested it.
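
(For completeness: these environment variables map to Airflow's [celery_broker_transport_options] config section, which Airflow passes to Celery as broker_transport_options, so the same tuning can also be written in airflow.cfg - a sketch assuming that section name in your Airflow version.)

[celery_broker_transport_options]
socket_timeout = 30
socket_connect_timeout = 5
socket_keepalive = True
retry_on_timeout = True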

@kim-youngseop-developer

kim-youngseop-developer commented Jan 26, 2024

@dstandish
Dude, it's been about 30 minutes since applying your solution.
My Celery worker was throwing "connection reset by peer" errors and freezing very frequently,
but now the problem is gone.
SINCERELY APPRECIATE YOUR SOLUTION!!! 😱 👍

@potiuk
Member

potiuk commented Jan 26, 2024

@dstandish Dude, it's been about 30 minutes since applying your solution. My Celery worker was throwing "connection reset by peer" errors and freezing very frequently, but now the problem is gone. SINCERELY APPRECIATE YOUR SOLUTION!!! 😱 👍

Should we turn this into some kind of best-practice doc @dstandish :D?

@dstandish
Contributor

@dstandish Dude, it's been about 30 minutes since applying your solution. My Celery worker was throwing "connection reset by peer" errors and freezing very frequently, but now the problem is gone. SINCERELY APPRECIATE YOUR SOLUTION!!! 😱 👍

Amazing, very glad to hear it... let us hope it continues 🤞

@kaxil
Member

kaxil commented Jan 27, 2024

We were able to improve the behavior with these settings:

AIRFLOW__CELERY_BROKER_TRANSPORT_OPTIONS__SOCKET_TIMEOUT=30
AIRFLOW__CELERY_BROKER_TRANSPORT_OPTIONS__SOCKET_CONNECT_TIMEOUT=5
AIRFLOW__CELERY_BROKER_TRANSPORT_OPTIONS__SOCKET_KEEPALIVE=True
AIRFLOW__CELERY_BROKER_TRANSPORT_OPTIONS__RETRY_ON_TIMEOUT=True

In our repro (which was to force kill the redis pod) the celery worker would get stuck waiting forever for a response when it tried to heartbeat itself to the failed redis pod. I believe adding the socket timeout allows the connection to close in this scenario, which ultimately allows the container to be restarted.

There's also a similar issue reported in celery that may possibly be resolved in 5.4.0 (see comment). But I have not tested it.

We could set those in the Helm Chart too?

@potiuk
Member

potiuk commented Jan 27, 2024

We could set those in the Helm Chart too?

Yeah. Good idea.

@eladkal
Contributor

eladkal commented Jan 28, 2024

I'd hold off on the doc changes to see if Celery 5.4.0 solves this problem at the root - then we can simply bump the version.

@potiuk
Member

potiuk commented Jan 29, 2024

I'd hold off on the doc changes to see if Celery 5.4.0 solves this problem at the root - then we can simply bump the version.

I think it depends on whether those are "generically applicable".

I think a number of our users will still not use the latest Celery. Celery has quite a number of dependencies of its own, which it adds on top of the dependencies plain Airflow already has. The problem with Celery is that - similarly to Airflow itself - the worker and Celery's dependencies share the same virtualenv that the tasks run in, so there might be plenty of reasons why our users would want to use an older Celery version (on older Airflow) or maybe even downgrade the Celery version used in future versions of Airflow. Celery is pretty special because it generally impacts everyone running any task with any provider combination, so I'd imagine it might be much harder to expect people to upgrade it to 5.4.0.

So I think if we ALSO have a way to bring older Celery configurations to "reasonable defaults", it's worth explaining that in the documentation and even in our Helm chart, and I am happy to do it myself @dstandish @eladkal, but I need to know whether those are defaults that are OK to be considered sane defaults.

@potiuk
Member

potiuk commented Jan 29, 2024

And just to add a bit more context to the discussion: it's a known fact that a number of our users have trouble with the stability of Celery-based deployments. If we know a way to make them more stable, keeping it as a "secret sauce" makes little sense. Some people have also proposed other solutions to improve stability (which might or might not be related and which are far more invasive), and possibly those stability issues could be helped by better default configuration (in the docs and in the chart).

I am particularly referring to #31829, which proposes disabling the "ack late" feature to improve the stability of Airflow Celery deployments, and which comes from the creator of the "user community" chart. We have - so far - generally ignored that one, mostly, I guess, because we did not want to spend time analysing all the consequences of disabling ack late.

I personally have a feeling that it changes a lot of the promises about delivery and task handling in Celery and shifts the problem to another set of edge cases, but maybe, by making the defaults work better out of the box, the reason for disabling "ack late" might largely go away, as better defaults alone might improve Celery deployment stability?

That's just a thought and some more context. The "ack late" case has bothered me for quite some time, but I do not have enough experience/knowledge of Celery to offer more than an intuition that this change, applied properly, might solve the problems that lead others to try other, effectively worse, things to improve the stability of Celery.

@potiuk
Member

potiuk commented Jan 29, 2024

And the new PR for "ack_late" - to complete the picture: #37066

@eladkal
Contributor

eladkal commented Jan 30, 2024

So I think if we ALSO have a way to bring older Celery configurations to "reasonable defaults", it's worth explaining that in the documentation and even in our Helm chart, and I am happy to do it myself @dstandish @eladkal, but I need to know whether those are defaults that are OK to be considered sane defaults.

I am OK with it.
If we don't know the values yet, we can point from the docs to this issue (or to a GitHub discussion) where we explain the considerations and suggest a few options, while letting the community comment and participate in the discussion.
