-
Thanks for opening your first issue here! Be sure to follow the issue template!
-
Is this still happening in the latest Airflow (2.4.2)?
-
Unfortunately this only occurs at scale so it's not something we can quickly check in a test environment. We will upgrade Airflow again in January. However, the code involved is the same between 2.2.5 and 2.4.2 so we don't expect any difference.
-
Hi, we are seeing this deadlock error at Slack at scale, and it is causing disruptions. Is there any fix that's being worked on? cc: @potiuk
Details: Airflow version 2.2.5, Celery, 2 schedulers with row-level locking enabled, MySQL 8 metadata DB
-
The easiest way to check this (without losing too much of others' time on investigation) is to upgrade to the latest version of Airflow and see whether the problem is fixed there, @ashwinshankar77. If it happens to you frequently, this is by far the best contribution you can make to the project, and a way to give back to the free project you use, because we can then confirm either that the issue has already been fixed or that it is still there. We've implemented many fixes to various parts of Airflow since 2.2.5 that are likely to fix it. And even if we find a fix, the only way you will be able to get it is to upgrade, because we only release fixes in the latest line of Airflow (so even if we find the problem now, the fix will come in 2.5.1 and you will have to upgrade anyway).
-
Sounds good. We do have plans to upgrade next month 👍
-
Cool. Let us know what you find out. And in case you find it happening, providing logs, including server-side logs from the DB containing lock logs, would be really helpful. No guarantees, but it might bring us closer to diagnosing and fixing it.
-
Closing as no task on this issue at this point.
-
We have upgraded to Airflow 2.5.3 and verified this issue is still present. This issue occurs intermittently and is not easy to repro -- as reported in the issue it sometimes does not occur for weeks. I see Airflow 2.6.0 has been released, but upgrading Airflow is a lot of work and there is no reason to expect this is fixed on main, so I think it's fair to say this is an ongoing issue.
-
@dstaple Based on your comment on a different deadlock issue, I've converted this issue into a discussion so as to continue the conversation. It'll be really difficult for anyone to investigate without reliable reproduction steps, though, which is why I assume this issue was closed.
Can you expand on this?
-
@RNHTTR Sure. Yes, like the other deadlock issue it is intermittent, which makes it difficult to repro, let alone debug and fix.
Regarding locks being taken out in systematic order: if you define a total ordering on locks and demand that any time a single process takes out multiple locks, it acquires them according to this order, then deadlocks are impossible. To make it concrete, suppose we always ensure DagRun is locked before TaskInstance. If process 1 (e.g. the scheduler) locks DagRun and wants to take out a lock on TaskInstance, this is fine. If process 2 (e.g. the task runner) locks TaskInstance and also wants to take out a lock on DagRun, this is forbidden, because it would cause a deadlock if it happened at the same time as process 1. Instead, the task runner code should be modified to make sure the locks are applied according to the order we defined.
Also, in #25312 @ashb found that in the specific example we were looking at back then, the DagRun table didn't even need to be locked! So in that case, instead of ensuring the locks are taken out in a specific order, you just take out one of the locks (the one you actually need).
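To make the DagRun/TaskInstance argument above concrete, here is a minimal sketch that checks whether a set of lock acquisition orders can deadlock. It is not Airflow code: the function name `has_lock_order_cycle` and the process and lock names are hypothetical, and it assumes every lock is held until the process finishes (as row locks are held until the transaction ends).

```python
from typing import Dict, List, Set


def has_lock_order_cycle(acquisitions: Dict[str, List[str]]) -> bool:
    """Return True if the given lock orders allow a deadlock.

    `acquisitions` maps a (hypothetical) process name to the ordered list of
    locks it takes. An edge A -> B is recorded whenever some process holds A
    while requesting B; a cycle in that graph means a deadlock is possible.
    """
    graph: Dict[str, Set[str]] = {}
    for locks in acquisitions.values():
        for i, held in enumerate(locks):
            for wanted in locks[i + 1:]:
                graph.setdefault(held, set()).add(wanted)
                graph.setdefault(wanted, set())

    # Depth-first search with three colours to detect a back edge (cycle).
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {node: WHITE for node in graph}

    def visit(node: str) -> bool:
        colour[node] = GREY
        for nxt in graph[node]:
            if colour[nxt] == GREY:
                return True
            if colour[nxt] == WHITE and visit(nxt):
                return True
        colour[node] = BLACK
        return False

    return any(colour[n] == WHITE and visit(n) for n in graph)


# Opposite orders, as in the scheduler / task runner example: deadlock possible.
conflicting = {
    "scheduler": ["dag_run", "task_instance"],
    "task_runner": ["task_instance", "dag_run"],
}
# Both processes follow the same total order: no cycle, so no deadlock.
consistent = {
    "scheduler": ["dag_run", "task_instance"],
    "task_runner": ["dag_run", "task_instance"],
}
```

With these inputs, `has_lock_order_cycle(conflicting)` is true and `has_lock_order_cycle(consistent)` is false, which is the total-ordering argument in miniature.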
-
Hi @dstaple, are you still having this issue, or did an upgrade fix it?
-
I also encountered the same problem. It can happen when multiple dag_runs are scheduled at the same time, but it does not happen every time. My version is 2.3.4.
-
Hello, we also face this issue in 2.8.4. With 40 DAGs and > 10k tasks, it happens at least 5 times a week.
-
@dstaple can you please check if you have enabled the 'mini-scheduler'? https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#config-scheduler-schedule-after-task-execution
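For anyone wanting to check the option linked above, the following is a small sketch of one place to look. It relies on Airflow's `AIRFLOW__{SECTION}__{KEY}` environment-variable convention for overriding options; the helper names `airflow_env_var` and `mini_scheduler_override` are made up for this example and are not part of Airflow.

```python
import os


def airflow_env_var(section: str, key: str) -> str:
    """Build the environment-variable name that overrides a config option."""
    return f"AIRFLOW__{section.upper()}__{key.upper()}"


# Option from the linked docs: [scheduler] schedule_after_task_execution
MINI_SCHEDULER_VAR = airflow_env_var("scheduler", "schedule_after_task_execution")


def mini_scheduler_override(environ=os.environ):
    """Return the raw override value, or None if the environment sets none."""
    return environ.get(MINI_SCHEDULER_VAR)
```

A `None` result only means there is no environment override; the value in `airflow.cfg`, or the default (`True`), then applies.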
-
Apache Airflow version
Both Airflow 2.2.5 and 2.5.3; presumably other versions.
What happened
The Airflow scheduler occasionally exits with psycopg2.errors.DeadlockDetected, and several running tasks fail with SIGTERMs.
One of the queries involved in the deadlock originates from the scheduler and is of the form:
This can be seen in both the scheduler and database logs, and originates here:
Here line 909 corresponds to version 2.2.5.
The other query involved in the deadlock is shown in the database logs as follows:
From reading the Airflow source code, this seems likely to originate from airflow/models/dagrun.py in the function update_state().
Database logs look as follows:
The stack trace in the Airflow scheduler looks as follows:
What you think should happen instead
If locks on multiple tables are needed, they should be taken out in a systematic order to make deadlocks impossible. Alternatively, we may be locking more than we need to, similar to the situation in taskinstance.py prior to #25312.
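As an illustration of taking locks in a systematic order, here is a minimal in-process sketch using Python threads. It is only an analogy for database locks, not Airflow's implementation: the `LOCK_RANK` table and the helpers `acquire_in_order`, `worker`, and `run_demo` are hypothetical. In a database the equivalent would be issuing the locking statements in the same table order in every code path.

```python
import threading
from contextlib import ExitStack

# Hypothetical global order: a lower rank is always locked first.
LOCK_RANK = {"dag_run": 0, "task_instance": 1}
LOCKS = {name: threading.Lock() for name in LOCK_RANK}


def acquire_in_order(stack, names):
    """Acquire the named locks sorted by rank, whatever order was requested."""
    ordered = sorted(set(names), key=LOCK_RANK.__getitem__)
    for name in ordered:
        # ExitStack releases the locks in reverse order when it exits.
        stack.enter_context(LOCKS[name])
    return ordered


def worker(requested, log):
    with ExitStack() as stack:
        log.append(acquire_in_order(stack, requested))


def run_demo():
    """Run two workers that request the same locks in opposite orders."""
    log = []
    threads = [
        threading.Thread(target=worker, args=(["dag_run", "task_instance"], log)),
        threading.Thread(target=worker, args=(["task_instance", "dag_run"], log)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join(timeout=5)
    return log
```

Both workers end up acquiring `dag_run` before `task_instance`, so neither can hold one lock while waiting for the other in the opposite order.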
How to reproduce
The problem is not easily reproducible. It occurs approximately once every two weeks when operating at scale (50-150 DAGs, several of which have hundreds of tasks).
Operating System
CentOS 7
Versions of Apache Airflow Providers
Used these providers to repro with Airflow 2.2.5:
Used these providers to repro with Airflow 2.5.3:
Deployment
Virtualenv installation
Deployment details
Anything else
This issue is distinct from the previous deadlock issue reported in #23361 and fixed by #25312.
Are you willing to submit PR?
Code of Conduct