Using SLAs causes DagFileProcessorManager timeouts and prevents deleted dags from being recreated #15596
Comments
Couple of thoughts on this:
No, I think that is fair -- if the file exists on the system, it is likely there for a reason. If a user does not want to add that DAG, they should just remove the dag file.
It didn't use to be possible to delete a dag while the file still existed (which led to problems with multi-dag files).
OK, sounds like a fix is needed then. @argibbs, are you interested in submitting a fix for this? If so, we can get it into the next release (either 2.0.1 or 2.2.0).
Hi there, I'd be up for making a change, but I don't know what the right solution is (as my lengthy screed above hopefully made clear). I could just comment out the call entirely, but that would break SLAs. Doing more than that would require a bit more guidance. If someone's able to look at it and tell me "ah right, the callback should only be sent every tenth time" or "the callback should only be sent under the following additional conditions: ..." then I'm happy to pick it up and can probably get it out within the next milestone or two. If you want me to look at it and take ownership of working out how SLAs are supposed to work, well, that's going to take a while. I'd still be game, but you'd need to adjust your expectations appropriately :)
This should be fixed by #25147
#25147 improves things slightly, but add enough SLAs to your system and you hit another issue. Re-opening while I work on a second MR.
Apache Airflow version: 2.0.1 and 2.0.2
Kubernetes version (if you are using kubernetes) (use `kubectl version`): N/A
Environment: Celery executors, Redis + Postgres
What happens:
In 2.0.0, if you delete a dag from the GUI while the `.py` file is still present, the dag is re-added within a few seconds (albeit with no history, etc.). Upon attempting to upgrade to 2.0.1 we found that after deleting a dag it would take tens of minutes (or more!) to come back, and its reappearance was seemingly random (i.e. restarting schedulers / GUIs did not help). It did not seem to matter which dag it was.
The problem still exists in 2.0.2.
What you expected to happen:
Deleting a dag should result in that dag being re-added in short order if the `.py` file is still present.
Likely cause:
I've tracked it back to an issue with SLA callbacks. I strongly suspect the fix for Issue #14050 was inadvertently responsible, since that was in the 2.0.1 release. In a nutshell, it appears the dag_processor_manager gets into a state where on every single pass it takes so long to process SLA checks for one of the dag files that the entire processor times out and is killed. As a result, some of the dag files (that are queued behind the poison pill file) never get processed and thus we don't reinstate the deleted dag unless the system gets quiet and the SLA checks clear down.
To reproduce in my setup, I created a clean airflow instance. The only materially important config setting I use is `AIRFLOW__SCHEDULER__PARSING_PROCESSES=1`, which helps keep things deterministic. I then started adding in dag files from the production system until I found a file that caused the problem. Most of our dags do not have SLAs, but this one did. After adding it, I started seeing lines like this in `dag_processor_manager.log` (file names have been changed to keep things simple):
Additionally, the stats contained lines like:
(i.e. 3 minutes to process a single file!)
Of note, the parse time of the affected file got longer on each pass until the processor was killed. Increasing `AIRFLOW__CORE__DAG_FILE_PROCESSOR_TIMEOUT` to e.g. 300 did nothing to help; it simply bought a few more iterations of the parse loop before it blew up.
Browsing the log file for `scheduler/2021-04-29/problematic.py.log` I could see the following:
Log file entries in 2.0.2
Two important points from the above logs:
Likely location of the problem:
This is where I start to run out of steam. I believe the culprit is this line: https://github.com/apache/airflow/blob/2.0.2/airflow/jobs/scheduler_job.py#L1813
It seems to me that the above creates a feedback loop: each time you send a dag callback to the processor, you include a free SLA callback as well, hence the steadily growing SLA processing log messages / behaviour I observed. As noted above, this method call was present in 2.0.0, but until Issue #14050 was fixed the SLAs were ignored, so the problem only kicked in from 2.0.1 onwards.
Unfortunately, my airflow-fu is not good enough for me to suggest a fix beyond the Gordian solution of removing the line completely (!); in particular, it's not clear to me how / where SLAs should be checked. Should the dag_processor_manager be doing them? Should it be another component (I mean, naively, I would have thought it should be the workers, so that SLA checks can scale with the rest of your system)? How should the checks be enqueued? I dunno enough to give a good answer. 🤷
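To make the suspected mechanism concrete, here is a toy model of the loop as I read it. All names and numbers below are made up for illustration; this is not the real scheduler code.

```python
# Toy model of the suspected feedback loop. All names and numbers are made up;
# this is not real Airflow code, just an illustration of the reported symptom.

PROCESSOR_TIMEOUT = 50     # cf. AIRFLOW__CORE__DAG_FILE_PROCESSOR_TIMEOUT (seconds)
SLA_CHECK_COST = 1         # pretend each queued SLA callback adds 1s of parse time
NORMAL_PARSE_TIME = 1      # a cheap parse for a file with no SLA callbacks queued

# problematic.py has tasks with SLAs; deleted_dag.py sits behind it in the
# file queue and is the dag we deleted from the UI and want re-created.
file_queue = ["problematic.py", "deleted_dag.py"]
pending_sla_callbacks = 0

for scheduler_pass in range(1, 101):
    # Every scheduling loop sends a dag callback for problematic.py and,
    # since the #14050 fix, a "free" SLA callback along with it.
    pending_sla_callbacks += 1

    for dag_file in file_queue:
        parse_time = (pending_sla_callbacks * SLA_CHECK_COST
                      if dag_file == "problematic.py" else NORMAL_PARSE_TIME)
        if parse_time > PROCESSOR_TIMEOUT:
            # The manager kills the processor; anything queued behind the
            # poison-pill file never gets parsed, so deleted_dag.py is
            # never re-created.
            print(f"pass {scheduler_pass}: killed while parsing {dag_file} "
                  f"({parse_time}s > {PROCESSOR_TIMEOUT}s timeout)")
            break
        print(f"pass {scheduler_pass}: parsed {dag_file} in {parse_time}s")
```

In this toy version the per-file SLA work only ever grows, which matches the steadily increasing parse times I saw before each kill.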
How to reproduce it:
In our production system, it would blow up every time, immediately. Reliably reproducing in a clean system depends on how fast your test system is; the trick appears to be getting the scan of the dag file to take long enough that the SLA checks start to snowball. The dag below did it for me; if your machine seems to be staying on top of processing the dags, try increasing the number of tasks in a single dag (or buy a slower computer!)
Simple dag that causes the problem
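In outline it is just a frequently scheduled dag with an SLA on every task and enough tasks to make the parse slow; a sketch along these lines (illustrative only, not the exact file) is:

```python
# Illustrative sketch only -- a dag of the shape described above, not the
# verbatim attachment: a frequent schedule, an SLA on every task, and enough
# tasks that parsing plus the queued SLA checks takes a noticeable time.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {
    "owner": "airflow",
    "start_date": datetime(2021, 1, 1),
    "sla": timedelta(minutes=1),        # every task inherits this SLA
}

with DAG(
    dag_id="sla_example",
    default_args=default_args,
    schedule_interval="* * * * *",      # run every minute so SLA checks pile up
    catchup=False,
) as dag:
    tasks = [
        BashOperator(task_id=f"task_{i}", bash_command="sleep 1")
        for i in range(100)             # bump this up if your machine keeps up
    ]
```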
To reproduce:
Check the `scheduler/[date]/sla_example.py.log` file (assuming you called the above `sla_example.py`, of course).
Anything else we need to know: