This issue was moved to a discussion.
You can continue the conversation there. Go to discussion →
Deadlock when updating DagRun last_scheduling_decision and TaskInstance state=scheduled #27473
Comments
Thanks for opening your first issue here! Be sure to follow the issue template!
Is this still happening in the latest Airflow (2.4.2)?
Unfortunately this only occurs at scale, so it's not something we can quickly check in a test environment. We will upgrade Airflow again in January. However, the code involved is the same between 2.2.5 and 2.4.2, so we don't expect any difference.
Hi, we are seeing this deadlock error at Slack at scale, causing disruptions. Is there any fix that's being worked on? cc: @potiuk Details: Airflow version: 2.2.5, Celery, 2 schedulers with row-level locking enabled, MySQL 8 metadata DB
The easiest way you can check it (and without losing too much time of others trying to investigate) is to upgrade to the latest version of Airflow and see if the problem is fixed there, @ashwinshankar77. If it happens for you frequently, this is by far the best contribution you can make to the project, and a way to give back to the free project you get, because we can confirm either that the issue has already been fixed or that it is still there. We've implemented many fixes since 2.2.5 in various parts of Airflow that are likely to fix it. And even if we find a fix, the only way you will be able to get it is to upgrade, because we only release fixes in the latest line of Airflow (so even if we find the problem now, the fix will come in 2.5.1 and you will have to upgrade anyway).
Sounds good. We do have plans to upgrade next month 👍
Cool. Let us know what you find out. And in case you find it happening, providing logs, including server-side logs from the DB containing lock logs, would be really helpful. No guarantees, but it might bring us closer to diagnosing and fixing it.
Closing, as there is no actionable task on this issue at this point.
We have upgraded to Airflow 2.5.3 and verified this issue is still present. This issue occurs intermittently and is not easy to repro -- as reported in the issue, it sometimes does not occur for weeks. I see Airflow 2.6.0 has been released, but upgrading Airflow is a lot of work and there is no reason to expect this is fixed on main, so I think it's fair to say this is an ongoing issue.
Apache Airflow version
Both Airflow 2.2.5 and 2.5.3; presumably other versions.
What happened
The Airflow scheduler occasionally exits with psycopg2.errors.DeadlockDetected, and several running tasks fail with SIGTERMs.
One of the queries involved in the deadlock originates from the scheduler and is of the form
This can be seen in both the scheduler and database logs, and originates here:
Here line 909 corresponds to version 2.2.5.
The other query involved in the deadlock is shown in the database logs as follows:
From reading the Airflow source code, this seems likely to originate from airflow/models/dagrun.py in the function update_state().
Database logs look as follows:
The stack trace in the Airflow scheduler looks as follows:
What you think should happen instead
If locks on multiple tables are needed, they should be taken out in a systematic order to make deadlocks impossible. Alternatively, we may be locking more than we need to, similar to the situation in taskinstance.py prior to #25312.
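To illustrate what "systematic order" means here, below is a minimal sketch using plain Python threading locks as stand-ins for database row locks. It is not Airflow code: the lock names, `acquire_in_order`, and `run_workers` are hypothetical and exist only for this example. Two workers ask for the same pair of locks in opposite order, which is the classic deadlock setup, but because every caller actually acquires them in one global (sorted) order, neither can end up holding one lock while waiting on the other.

```python
import threading
from contextlib import contextmanager

# Hypothetical stand-ins for row locks on the two tables named in this issue.
LOCKS = {
    "dag_run": threading.Lock(),
    "task_instance": threading.Lock(),
}


@contextmanager
def acquire_in_order(*names):
    """Acquire the named locks in one global (sorted) order; release in reverse."""
    ordered = sorted(set(names))
    held = []
    try:
        for name in ordered:
            LOCKS[name].acquire()
            held.append(name)
        yield ordered
    finally:
        for name in reversed(held):
            LOCKS[name].release()


def run_workers(rounds=200):
    """Run two workers that request the same locks in opposite order."""
    counter = {"n": 0}

    def worker(first, second):
        for _ in range(rounds):
            # The requested order differs per worker, but the acquisition
            # order is always the sorted one, so no lock cycle can form.
            with acquire_in_order(first, second):
                counter["n"] += 1

    threads = [
        threading.Thread(target=worker, args=("dag_run", "task_instance")),
        threading.Thread(target=worker, args=("task_instance", "dag_run")),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join(timeout=10)
    # Second value is True only if a worker is still blocked after the timeout.
    return counter["n"], any(t.is_alive() for t in threads)
```

In database terms, the analogous rule would be that every code path that needs locks on both tables takes them in the same table order (and, within a table, in a stable row order). Whether that is practical in the scheduler depends on the actual queries involved.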
How to reproduce
The problem is not easily reproducible. It occurs approximately once every two weeks when operating at scale (50-150 DAGs, several of which have hundreds of tasks).
Operating System
CentOS 7
Versions of Apache Airflow Providers
Used these providers to repro with Airflow 2.2.5:
Used these providers to repro with Airflow 2.5.3:
Deployment
Virtualenv installation
Deployment details
Anything else
This issue is distinct from the previous deadlock issue reported in #23361 and fixed by #25312.
Are you willing to submit PR?
Code of Conduct