Worker sometimes does not reconnect to redis/celery queue after crash #27032

Closed · 2 tasks done
AutomationDev85 opened this issue Oct 13, 2022 · 7 comments
Labels: affected_version:2.4, area:core, area:Scheduler, kind:bug

Comments

@AutomationDev85
Contributor

AutomationDev85 commented Oct 13, 2022

Apache Airflow version

2.4.1

What happened

We are running an Airflow deployment and hit the issue that the redis pod died and some tasks then got stuck in the queued state. Only after killing the worker pod were the tasks consumed by the worker again. I analysed this in more detail and saw that this behavior only occurs sometimes!

It looks to me like the worker sometimes does not detect that the connection to the redis pod broke:

  1. If I do not see any error in the log file, the worker does NOT reconnect once redis is back!
  2. If I see the following error in the worker log, it IS working and the worker automatically reconnects:
[2022-10-13 06:29:55,967: WARNING/MainProcess] consumer: Connection to broker lost. Trying to re-establish the connection...
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.8/site-packages/celery/worker/consumer/consumer.py", line 332, in start
    blueprint.start(self)
  File "/home/airflow/.local/lib/python3.8/site-packages/celery/bootsteps.py", line 116, in start
    step.start(parent)
  File "/home/airflow/.local/lib/python3.8/site-packages/celery/worker/consumer/consumer.py", line 628, in start
    c.loop(*c.loop_args())
  File "/home/airflow/.local/lib/python3.8/site-packages/celery/worker/loops.py", line 97, in asynloop
    next(loop)
  File "/home/airflow/.local/lib/python3.8/site-packages/kombu/asynchronous/hub.py", line 362, in create_loop
    cb(*cbargs)
  File "/home/airflow/.local/lib/python3.8/site-packages/kombu/transport/redis.py", line 1326, in on_readable
    self.cycle.on_readable(fileno)
  File "/home/airflow/.local/lib/python3.8/site-packages/kombu/transport/redis.py", line 562, in on_readable
    chan.handlers[type]()
  File "/home/airflow/.local/lib/python3.8/site-packages/kombu/transport/redis.py", line 906, in _receive
    ret.append(self._receive_one(c))
  File "/home/airflow/.local/lib/python3.8/site-packages/kombu/transport/redis.py", line 916, in _receive_one
    response = c.parse_response()
  File "/home/airflow/.local/lib/python3.8/site-packages/redis/client.py", line 3505, in parse_response
    response = self._execute(conn, conn.read_response)
  File "/home/airflow/.local/lib/python3.8/site-packages/redis/client.py", line 3479, in _execute
    return command(*args, **kwargs)
  File "/home/airflow/.local/lib/python3.8/site-packages/redis/connection.py", line 739, in read_response
    response = self._parser.read_response()
  File "/home/airflow/.local/lib/python3.8/site-packages/redis/connection.py", line 324, in read_response
    raw = self._buffer.readline()
  File "/home/airflow/.local/lib/python3.8/site-packages/redis/connection.py", line 256, in readline
    self._read_from_socket()
  File "/home/airflow/.local/lib/python3.8/site-packages/redis/connection.py", line 201, in _read_from_socket
    raise ConnectionError(SERVER_CLOSED_CONNECTION_ERROR)
redis.exceptions.ConnectionError: Connection closed by server.
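
One thing that might be worth experimenting with (a sketch, not a confirmed fix for this bug): tightening the Redis transport's socket and health-check settings so that a half-open connection is noticed sooner. The module name my_celery_config.py below is hypothetical, and whether kombu's Redis transport forwards each of these options to redis-py should be verified for your kombu/redis versions:

# my_celery_config.py - hypothetical module; it would be referenced via
# AIRFLOW__CELERY__CELERY_CONFIG_OPTIONS=my_celery_config.CELERY_CONFIG
from airflow.config_templates.default_celery import DEFAULT_CELERY_CONFIG

CELERY_CONFIG = {
    **DEFAULT_CELERY_CONFIG,
    # Assumption: kombu's Redis transport passes these through to redis-py,
    # so a dead connection raises an error instead of blocking silently.
    "broker_transport_options": {
        **DEFAULT_CELERY_CONFIG.get("broker_transport_options", {}),
        "socket_keepalive": True,
        "socket_timeout": 30,          # seconds
        "socket_connect_timeout": 10,  # seconds
        "health_check_interval": 30,   # seconds, redis-py >= 3.3
    },
    # Keep retrying the broker connection once it has been lost.
    "broker_connection_retry": True,
    "broker_connection_max_retries": None,  # retry forever
}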

What you think should happen instead

Expected behavior is that the worker reconnects to redis automatically and starts consuming queued tasks.

How to reproduce

  1. Run a DAG with 2 tasks in sequence (a minimal example is sketched below this list).
  2. Trigger the DAG and, while the first task is executing, force kill the redis POD (kubectl delete pod redis-0 -n ??? --grace-period=0 --force) to simulate a crashing POD.
  3. Check whether the worker reconnects automatically and executes the next task, or whether the task gets stuck in the queued state and the worker must be killed to fix it.
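
A minimal two-task DAG for step 1 could look like the following (DAG and task names are illustrative, not taken from the report):

# repro_redis_reconnect.py - illustrative DAG for the steps above.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="repro_redis_reconnect",
    start_date=datetime(2022, 10, 1),
    schedule=None,   # trigger manually
    catchup=False,
) as dag:
    # Long enough to kill the redis pod while this task is still running.
    first = BashOperator(task_id="first", bash_command="sleep 300")
    # If the worker reconnected, this task gets picked up; otherwise it
    # stays queued until the worker pod is restarted.
    second = BashOperator(task_id="second", bash_command="echo done")

    first >> second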

Operating System

AKSUbuntu-1804gen2

Versions of Apache Airflow Providers

No response

Deployment

Official Apache Airflow Helm Chart

Deployment details

Using an AKS cluster in Azure to host Airflow.

Anything else

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

AutomationDev85 added the area:core and kind:bug labels on Oct 13, 2022
@boring-cyborg

boring-cyborg bot commented Oct 13, 2022

Thanks for opening your first issue here! Be sure to follow the issue template!

o-nikolas added the area:Scheduler label on Oct 13, 2022
@o-nikolas
Contributor

Thanks for filing this issue @AutomationDev85! I see that you're willing to submit a PR, let me know if you would like this issue assigned to you 😄

As a follow-up: are you sure this is a bug in Airflow or is this actually a Celery issue?

eladkal added the affected_version:2.4 label on Oct 14, 2022
@jscheffl
Contributor

Now that the new Helm Chart 1.7.0 has been released, does anybody know or expect whether the new liveness probe on the Celery worker will fix this problem implicitly? (#25561)
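
For context on what that probe checks: if it behaves like a Celery control-channel ping (an assumption here, not stated in this thread), a quick manual check of whether a worker still answers could look like the sketch below - the helper name is made up, and the Airflow Celery app is assumed to live at airflow.executors.celery_executor.app (as in Airflow 2.4):

# check_worker_ping.py - hypothetical helper, not part of the Helm chart.
from airflow.executors.celery_executor import app

# Ask all workers to answer a control-channel ping within 10 seconds.
replies = app.control.ping(timeout=10.0)
print(replies)  # e.g. [{'celery@worker-0': {'ok': 'pong'}}]

if not replies:
    raise SystemExit("no worker replied - a ping-based liveness probe would fail here")

Whether a worker that is wedged on a dead Redis connection still answers such a ping is exactly the open question, so the probe may or may not cover the case described above.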

@potiuk
Member

potiuk commented Oct 23, 2022

Now that the new Helm Chart 1.7.0 has been released, does anybody know or expect whether the new liveness probe on the Celery worker will fix this problem implicitly? (#25561)

I think the only way to check is to try it (maybe you can try it @jens-scheffler-bosch - that would help us to close the issue).

@jscheffl
Contributor

Now that the new Helm Chart 1.7.0 has been released, does anybody know or expect whether the new liveness probe on the Celery worker will fix this problem implicitly? (#25561)

I think the only way to check is to try it (maybe you can try it @jens-scheffler-bosch - that would help us to close the issue).

We are on it - we just deployed via Helm chart 1.7.0 - but as this problem only appeared randomly, it is hard to predict whether it is resolved. I'd be okay with closing under the (positive) assumption that it is fixed, and we may come back if we see another problem.

@potiuk
Member

potiuk commented Oct 26, 2022

Maybe just keep it running for a while and let us know after ~ a few days of running (depending on the previously observed frequency) - if you do not see it after 2x the 'average' observation time, we might assume it works :)

@potiuk
Member

potiuk commented Nov 7, 2022

Closing. 2 weeks have passed. @jens-scheffler-bosch - if you hit any issue, you can still comment here.

potiuk closed this as completed on Nov 7, 2022