
Celery worker enters a catatonic state after redis restart #26542

Open
wolfier opened this issue Sep 21, 2022 · 19 comments
Comments

@wolfier
Contributor

wolfier commented Sep 21, 2022

Apache Airflow version

main (development)

What happened

The worker seems to be stuck in a catatonic state where queued task instance messages are not consumed from Redis.

Redis restarted while the worker stayed up. The worker logged that it had lost the connection and reconnected once Redis came back online.

[2022-09-19 23:58:00,794: ERROR/MainProcess] consumer: Cannot connect to redis://:**@accurate-axis-9558-redis:6379/0: Error 111 connecting to accurate-axis-9558-redis:6379. Connection refused..
Trying again in 2.00 seconds... (1/100)

[2022-09-19 23:58:03,802: ERROR/MainProcess] consumer: Cannot connect to redis://:**@accurate-axis-9558-redis:6379/0: Error 111 connecting to accurate-axis-9558-redis:6379. Connection refused..
Trying again in 4.00 seconds... (2/100)

[2022-09-19 23:58:08,830: ERROR/MainProcess] consumer: Cannot connect to redis://:**@accurate-axis-9558-redis:6379/0: Error 111 connecting to accurate-axis-9558-redis:6379. Connection refused..
Trying again in 6.00 seconds... (3/100)

[2022-09-19 23:58:15,866: ERROR/MainProcess] consumer: Cannot connect to redis://:**@accurate-axis-9558-redis:6379/0: Error 111 connecting to accurate-axis-9558-redis:6379. Connection refused..
Trying again in 8.00 seconds... (4/100)

[2022-09-19 23:58:24,890: ERROR/MainProcess] consumer: Cannot connect to redis://:**@accurate-axis-9558-redis:6379/0: Error 111 connecting to accurate-axis-9558-redis:6379. Connection refused..
Trying again in 10.00 seconds... (5/100)

[2022-09-19 23:58:34,907: INFO/MainProcess] Connected to redis://:**@accurate-axis-9558-redis:6379/0
[2022-09-19 23:58:34,915: INFO/MainProcess] mingle: searching for neighbors
[2022-09-19 23:58:35,923: INFO/MainProcess] mingle: all alone

What you think should happen instead

After Redis comes back online and the worker has reconnected, the worker should consume the messages and execute the queued task instances.

How to reproduce

  1. Delete the existing Redis pod; the worker should then be unable to connect to Redis (see the sketch after this list)
  2. Redis restarts and the worker reconnects as expected
  3. The worker does not consume new messages (queued task instances)
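
For a Kubernetes-based deployment, step 1 can be forced with something like the commands below (a sketch; the pod names and namespace are placeholders for whatever your Redis and worker deployments use):

# Kill the Redis pod out from under the worker, then watch the worker logs for the
# reconnect and check whether newly queued task instances are ever picked up.
kubectl delete pod <redis-pod-name> -n <airflow-namespace>
kubectl logs -f <worker-pod-name> -n <airflow-namespace>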

Operating System

N/A

Versions of Apache Airflow Providers

No response

Deployment

Astronomer

Deployment details

No response

Anything else

There was a GitHub Discussion earlier this year about this behaviour.

This didn't seem to be an issue on an earlier version of Celery (4.4.7).

The currently installed version is celery==5.2.7, and Redis is at version 6.2.6.

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@wolfier wolfier added area:core kind:bug This is a clearly a bug labels Sep 21, 2022
@potiuk potiuk added affected_version:2.4 Issues Reported for 2.4 priority:medium Bug that should be fixed before next release but would not block a release labels Sep 21, 2022
@ashb
Member

ashb commented Sep 21, 2022

Hmmm, this feels like a Celery issue, as all airflow celery worker does is run the celery app.

(I'm not saying we shouldn't look at fixing it, just surprised that celery doesn't do it for us)

@wolfier
Contributor Author

wolfier commented Sep 22, 2022

Yep! I linked the GitHub Discussion from the Celery GitHub repo.

This is more just a record to keep track of this behaviour.

@wolfier
Contributor Author

wolfier commented Oct 16, 2022

The new liveness probe on the Celery Worker (#25561) doesn't fix the root issue but does put in safeguards to reset the worker when workers become catatonic.
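
(For reference, the health check behind such a probe boils down to pinging the worker over the Celery control channel, roughly like the sketch below; the exact --app path differs between Airflow versions, so treat it as an assumption.)

# A worker that has gone catatonic or lost its broker connection will not answer
# this ping before the timeout. Older Airflow versions expose the app at
# airflow.executors.celery_executor.app instead.
celery --app airflow.providers.celery.executors.celery_executor.app \
    inspect ping --destination celery@$(hostname) --timeout 10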

@Scorpion2115

I am having exactly the same issue with my Airflow 2.5.1

@potiuk
Member

potiuk commented Sep 4, 2023

I am having exactly the same issue with my Airflow 2.5.1

Are you using the official chart that has the safeguards mentioned in the comment quoted below?

The new liveness probe on the Celery Worker (#25561) doesn't fix the root issue but does put in safeguards to reset the worker when workers become catatonic.

@eladkal eladkal added area:providers provider:celery and removed priority:medium Bug that should be fixed before next release but would not block a release area:core affected_version:2.4 Issues Reported for 2.4 labels Sep 5, 2023
@Scorpion2115

I am having exactly the same issue with my Airflow 2.5.1

Are you using the official chart that has the safeguards mentioned in the comment quoted below?

The new liveness probe on the Celery Worker (#25561) doesn't fix the root issue but does put in safeguards to reset the worker when workers become catatonic.

Sorry, my Airflow is deployed on plain EC2 instances, not Kubernetes.
Is there any similar fix available?

@potiuk
Member

potiuk commented Sep 5, 2023

Yes. You need to run similar monitoring and restarting of Celery on your own. A liveness probe periodically checks whether a component is responding and restarts it when it is not. This is a standard feature of Kubernetes, and since you chose to manage Airflow on your own, you should add similar monitoring yourself. You can take a look at the PR and use it as inspiration when writing yours.

Generally speaking, this kind of monitoring and liveness probing is good practice, and if you choose to run your own deployment, you should make sure you do similar checks - this is standard practice for any kind of application.

Our Helm chart provides all of this out of the box and can serve as inspiration for anyone who would like to keep using their own deployment (you can run helm install --dry-run to see the generated Kubernetes resources; if you look at the liveness probes of the various components, you will see which CLI commands you can use).
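
(As an illustration only - not something Airflow ships for EC2 deployments - such a check could be as simple as a cron-driven script that pings the local worker and restarts it when the ping fails. The service name and app path below are assumptions to adapt to your setup.)

#!/usr/bin/env bash
# Hypothetical watchdog for a non-Kubernetes deployment: ping this worker over the
# broker and restart its (assumed) systemd unit if it does not respond in time.
WORKER="celery@$(hostname)"
if ! celery --app airflow.providers.celery.executors.celery_executor.app \
    inspect ping --destination "$WORKER" --timeout 10 >/dev/null 2>&1; then
  systemctl restart airflow-celery-worker   # adjust to however you run "airflow celery worker"
fi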

@RNHTTR
Contributor

RNHTTR commented Nov 17, 2023

Yes. You need to run similar monitoring and restarting of Celery on your own. A liveness probe periodically checks whether a component is responding and restarts it when it is not. This is a standard feature of Kubernetes, and since you chose to manage Airflow on your own, you should add similar monitoring yourself. You can take a look at the PR and use it as inspiration when writing yours.

Generally speaking, this kind of monitoring and liveness probing is good practice, and if you choose to run your own deployment, you should make sure you do similar checks - this is standard practice for any kind of application.

Our Helm chart provides all of this out of the box and can serve as inspiration for anyone who would like to keep using their own deployment (you can run helm install --dry-run to see the generated Kubernetes resources; if you look at the liveness probes of the various components, you will see which CLI commands you can use).

I actually don't think the livenessProbe solves the issue in all cases. It's rare, but I still see this issue cause workers to become useless when they lose their connection to Redis.

@dstandish
Contributor

We were able to improve the behavior with these settings:

AIRFLOW__CELERY_BROKER_TRANSPORT_OPTIONS__SOCKET_TIMEOUT=30
AIRFLOW__CELERY_BROKER_TRANSPORT_OPTIONS__SOCKET_CONNECT_TIMEOUT=5
AIRFLOW__CELERY_BROKER_TRANSPORT_OPTIONS__SOCKET_KEEPALIVE=True
AIRFLOW__CELERY_BROKER_TRANSPORT_OPTIONS__RETRY_ON_TIMEOUT=True

In our repro (which was to force kill the redis pod) the celery worker would get stuck waiting forever for a response when it tried to heartbeat itself to the failed redis pod. I believe adding the socket timeout allows the connection to close in this scenario, which ultimately allows the container to be restarted.

There's also a similar issue reported in celery that may possibly be resolved in 5.4.0 (see comment). But I have not tested it.
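
(For completeness: these environment variables map to Airflow's [celery_broker_transport_options] config section, which Airflow passes to Celery as broker_transport_options, so the same tuning can also be written in airflow.cfg - a sketch assuming that section name in your Airflow version.)

[celery_broker_transport_options]
socket_timeout = 30
socket_connect_timeout = 5
socket_keepalive = True
retry_on_timeout = True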

@kim-youngseop-developer

kim-youngseop-developer commented Jan 26, 2024

@dstandish
Dude, it's been about 30 minutes since applying your solution.
My Celery worker was throwing "connection reset by peer" errors and freezing very frequently,
but now the problem is gone.
SINCERELY APPRECIATE YOUR SOLUTION!!! 😱 👍

@potiuk
Member

potiuk commented Jan 26, 2024

@dstandish Dude, it's been about 30 minutes since applying your solution. My Celery worker was throwing "connection reset by peer" errors and freezing very frequently, but now the problem is gone. SINCERELY APPRECIATE YOUR SOLUTION!!! 😱 👍

Should we turn this into some kind of best-practice doc @dstandish :D?

@dstandish
Contributor

@dstandish Dude, it's been about 30 minutes since applying your solution. My Celery worker was throwing "connection reset by peer" errors and freezing very frequently, but now the problem is gone. SINCERELY APPRECIATE YOUR SOLUTION!!! 😱 👍

Amazing, very glad to hear it... let us hope it continues 🤞

@kaxil
Member

kaxil commented Jan 27, 2024

We were able to improve the behavior with these settings:

AIRFLOW__CELERY_BROKER_TRANSPORT_OPTIONS__SOCKET_TIMEOUT=30
AIRFLOW__CELERY_BROKER_TRANSPORT_OPTIONS__SOCKET_CONNECT_TIMEOUT=5
AIRFLOW__CELERY_BROKER_TRANSPORT_OPTIONS__SOCKET_KEEPALIVE=True
AIRFLOW__CELERY_BROKER_TRANSPORT_OPTIONS__RETRY_ON_TIMEOUT=True

In our repro (which was to force kill the redis pod) the celery worker would get stuck waiting forever for a response when it tried to heartbeat itself to the failed redis pod. I believe adding the socket timeout allows the connection to close in this scenario, which ultimately allows the container to be restarted.

There's also a similar issue reported in celery that may possibly be resolved in 5.4.0 (see comment). But I have not tested it.

We could set those in the Helm Chart too?

@potiuk
Member

potiuk commented Jan 27, 2024

We could set those in the Helm Chart too?

Yeah. Good idea.

@eladkal
Contributor

eladkal commented Jan 28, 2024

I'd hold off on the doc changes to see if Celery 5.4.0 solves this problem at the root - then we can simply bump the version.

@potiuk
Member

potiuk commented Jan 29, 2024

I'd hold off on the doc changes to see if Celery 5.4.0 solves this problem at the root - then we can simply bump the version.

I think it depends on whether those are "generically applicable".

I think a number of our users will still not use the latest Celery. Celery has quite a number of dependencies of its own, which it adds on top of the dependencies plain Airflow already has. The problem with Celery is that - similarly to Airflow itself - the worker and Celery's dependencies share the same virtualenv that the tasks run in, so there might be plenty of reasons why our users would want to use an older Celery version (on older Airflow) or maybe even downgrade the Celery version used in future versions of Airflow. Celery is pretty special because it generally impacts everyone running any task with any provider combination, so I'd imagine it might be much harder to expect people to upgrade it to 5.4.0.

So I think if we ALSO have a way to bring older Celery configurations to "reasonable defaults", it's worth explaining that in the documentation and even in our Helm chart, and I am happy to do it myself @dstandish @eladkal, but I need to know whether those are defaults that are OK to be considered sane defaults.

@potiuk
Member

potiuk commented Jan 29, 2024

And just to add a bit more context to the discussion: it's a known fact that a number of our users have trouble with the stability of Celery-based deployments. If we know a way to make them more stable, keeping it as a "secret sauce" makes little sense. Some people have also proposed other solutions to improve stability (which might or might not be related and which are far more invasive), and possibly those stability issues could be helped by better default configuration (in the docs and in the chart).

I am particularly referring to #31829, which proposes disabling the "ack late" feature to improve the stability of Airflow Celery deployments, and which comes from the creator of the "user community" chart. We have - so far - generally ignored that one, mostly, I guess, because we did not want to spend time analysing all the consequences of disabling ack late.

I personally have a feeling that it changes a lot of the promises about delivery and task handling in Celery and shifts the problem to another set of edge cases, but maybe, by making the defaults work better out of the box, the reason for disabling "ack late" might largely go away, as better defaults alone might improve Celery deployment stability?

That's just a thought and some more context. The "ack late" case has bothered me for quite some time, but I do not have enough experience/knowledge of Celery to offer more than an intuition that this change, applied properly, might solve the problems that lead others to try other, effectively worse, things to improve the stability of Celery.

@potiuk
Member

potiuk commented Jan 29, 2024

And the new PR for "ack_late" - to complete the picture: #37066

@eladkal
Contributor

eladkal commented Jan 30, 2024

So I think if we ALSO have a way to bring older Celery configurations to "reasonable defaults", it's worth explaining that in the documentation and even in our Helm chart, and I am happy to do it myself @dstandish @eladkal, but I need to know whether those are defaults that are OK to be considered sane defaults.

I am OK with it.
If we don't know the values yet, we can point from the docs to this issue (or to a GitHub discussion) where we explain the considerations and suggest a few options, while letting the community comment and participate in the discussion.
