Worker sometimes does not reconnect to redis/celery queue after crash #27032

Closed · 2 tasks done
AutomationDev85 opened this issue Oct 13, 2022 · 7 comments
Labels: affected_version:2.4, area:core, area:Scheduler, kind:bug

Comments

@AutomationDev85
Contributor

AutomationDev85 commented Oct 13, 2022

Apache Airflow version

2.4.1

What happened

We are running an Airflow deployment and hit the issue that the redis pod died and some tasks then got stuck in the queued state. Only after killing the worker pod were the tasks consumed by the worker again. I analysed this in more detail and saw that this behavior only occurs sometimes!

It looks to me like the worker sometimes does not detect that the connection to the redis pod broke:

  1. If I do not see any error in the log file, the worker does NOT reconnect once redis is back!
  2. If I see the following error in the worker log, it IS working and the worker automatically reconnects:
[2022-10-13 06:29:55,967: WARNING/MainProcess] consumer: Connection to broker lost. Trying to re-establish the connection...
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.8/site-packages/celery/worker/consumer/consumer.py", line 332, in start
    blueprint.start(self)
  File "/home/airflow/.local/lib/python3.8/site-packages/celery/bootsteps.py", line 116, in start
    step.start(parent)
  File "/home/airflow/.local/lib/python3.8/site-packages/celery/worker/consumer/consumer.py", line 628, in start
    c.loop(*c.loop_args())
  File "/home/airflow/.local/lib/python3.8/site-packages/celery/worker/loops.py", line 97, in asynloop
    next(loop)
  File "/home/airflow/.local/lib/python3.8/site-packages/kombu/asynchronous/hub.py", line 362, in create_loop
    cb(*cbargs)
  File "/home/airflow/.local/lib/python3.8/site-packages/kombu/transport/redis.py", line 1326, in on_readable
    self.cycle.on_readable(fileno)
  File "/home/airflow/.local/lib/python3.8/site-packages/kombu/transport/redis.py", line 562, in on_readable
    chan.handlers[type]()
  File "/home/airflow/.local/lib/python3.8/site-packages/kombu/transport/redis.py", line 906, in _receive
    ret.append(self._receive_one(c))
  File "/home/airflow/.local/lib/python3.8/site-packages/kombu/transport/redis.py", line 916, in _receive_one
    response = c.parse_response()
  File "/home/airflow/.local/lib/python3.8/site-packages/redis/client.py", line 3505, in parse_response
    response = self._execute(conn, conn.read_response)
  File "/home/airflow/.local/lib/python3.8/site-packages/redis/client.py", line 3479, in _execute
    return command(*args, **kwargs)
  File "/home/airflow/.local/lib/python3.8/site-packages/redis/connection.py", line 739, in read_response
    response = self._parser.read_response()
  File "/home/airflow/.local/lib/python3.8/site-packages/redis/connection.py", line 324, in read_response
    raw = self._buffer.readline()
  File "/home/airflow/.local/lib/python3.8/site-packages/redis/connection.py", line 256, in readline
    self._read_from_socket()
  File "/home/airflow/.local/lib/python3.8/site-packages/redis/connection.py", line 201, in _read_from_socket
    raise ConnectionError(SERVER_CLOSED_CONNECTION_ERROR)
redis.exceptions.ConnectionError: Connection closed by server.
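
One thing that might be worth experimenting with (a sketch, not a confirmed fix for this bug): tightening the Redis transport's socket and health-check settings so that a half-open connection is noticed sooner. The module name my_celery_config.py below is hypothetical, and whether kombu's Redis transport forwards each of these options to redis-py should be verified for your kombu/redis versions:

# my_celery_config.py - hypothetical module; it would be referenced via
# AIRFLOW__CELERY__CELERY_CONFIG_OPTIONS=my_celery_config.CELERY_CONFIG
from airflow.config_templates.default_celery import DEFAULT_CELERY_CONFIG

CELERY_CONFIG = {
    **DEFAULT_CELERY_CONFIG,
    # Assumption: kombu's Redis transport passes these through to redis-py,
    # so a dead connection raises an error instead of blocking silently.
    "broker_transport_options": {
        **DEFAULT_CELERY_CONFIG.get("broker_transport_options", {}),
        "socket_keepalive": True,
        "socket_timeout": 30,          # seconds
        "socket_connect_timeout": 10,  # seconds
        "health_check_interval": 30,   # seconds, redis-py >= 3.3
    },
    # Keep retrying the broker connection once it has been lost.
    "broker_connection_retry": True,
    "broker_connection_max_retries": None,  # retry forever
}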

What you think should happen instead

Expected behavior is that the worker reconnects to redis automatically and starts consuming queued tasks.

How to reproduce

  1. Run a DAG with 2 tasks in sequence (a minimal example is sketched below this list).
  2. Trigger the DAG and, while the first task is executing, force kill the redis POD (kubectl delete pod redis-0 -n ??? --grace-period=0 --force) to simulate a crashing POD.
  3. Check whether the worker reconnects automatically and executes the next task, or whether the task gets stuck in the queued state and the worker must be killed to fix it.
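
A minimal two-task DAG for step 1 could look like the following (DAG and task names are illustrative, not taken from the report):

# repro_redis_reconnect.py - illustrative DAG for the steps above.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="repro_redis_reconnect",
    start_date=datetime(2022, 10, 1),
    schedule=None,   # trigger manually
    catchup=False,
) as dag:
    # Long enough to kill the redis pod while this task is still running.
    first = BashOperator(task_id="first", bash_command="sleep 300")
    # If the worker reconnected, this task gets picked up; otherwise it
    # stays queued until the worker pod is restarted.
    second = BashOperator(task_id="second", bash_command="echo done")

    first >> second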

Operating System

AKSUbuntu-1804gen2

Versions of Apache Airflow Providers

No response

Deployment

Official Apache Airflow Helm Chart

Deployment details

Using an AKS cluster in Azure to host Airflow.

Anything else

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

AutomationDev85 added the area:core and kind:bug labels on Oct 13, 2022
@boring-cyborg

boring-cyborg bot commented Oct 13, 2022

Thanks for opening your first issue here! Be sure to follow the issue template!

o-nikolas added the area:Scheduler label on Oct 13, 2022
@o-nikolas
Contributor

Thanks for filing this issue @AutomationDev85! I see that you're willing to submit a PR, let me know if you would like this issue assigned to you 😄

As a follow-up: are you sure this is a bug in Airflow or is this actually a Celery issue?

eladkal added the affected_version:2.4 label on Oct 14, 2022
@jscheffl
Contributor

Now that the new Helm Chart 1.7.0 has been released, does anybody know or expect whether the new liveness probe on the Celery worker will fix this problem implicitly? (#25561)
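
For context on what that probe checks: if it behaves like a Celery control-channel ping (an assumption here, not stated in this thread), a quick manual check of whether a worker still answers could look like the sketch below - the helper name is made up, and the Airflow Celery app is assumed to live at airflow.executors.celery_executor.app (as in Airflow 2.4):

# check_worker_ping.py - hypothetical helper, not part of the Helm chart.
from airflow.executors.celery_executor import app

# Ask all workers to answer a control-channel ping within 10 seconds.
replies = app.control.ping(timeout=10.0)
print(replies)  # e.g. [{'celery@worker-0': {'ok': 'pong'}}]

if not replies:
    raise SystemExit("no worker replied - a ping-based liveness probe would fail here")

Whether a worker that is wedged on a dead Redis connection still answers such a ping is exactly the open question, so the probe may or may not cover the case described above.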

@potiuk
Member

potiuk commented Oct 23, 2022

Now that the new Helm Chart 1.7.0 has been released, does anybody know or expect whether the new liveness probe on the Celery worker will fix this problem implicitly? (#25561)

I think the only way to check is to try it (maybe you can try it @jens-scheffler-bosch - that would help us to close the issue).

@jscheffl
Contributor

Now that the new Helm Chart 1.7.0 has been released, does anybody know or expect whether the new liveness probe on the Celery worker will fix this problem implicitly? (#25561)

I think the only way to check is to try it (maybe you can try it @jens-scheffler-bosch - that would help us to close the issue).

We are on it - we just deployed via Helm chart 1.7.0 - but as this problem only appeared randomly, it is hard to predict whether it is resolved. I'd be okay with closing under the (positive) assumption that it is fixed, and we may come back if we see another problem.

@potiuk
Member

potiuk commented Oct 26, 2022

Maybe just keep it running for a while and let us know after ~ a few days of running (depending on the previously observed frequency) - if you do not see it after 2x the 'average' observation time, we might assume it works :)

@potiuk
Member

potiuk commented Nov 7, 2022

Closing. 2 weeks have passed. @jens-scheffler-bosch - if you hit any issue, you can still comment here.

potiuk closed this as completed on Nov 7, 2022