Using SLAs causes DagFileProcessorManager timeouts and prevents deleted dags from being recreated #15596

Closed
argibbs opened this issue Apr 29, 2021 · 9 comments · Fixed by #20683 or #30076
Comments

@argibbs
Contributor

argibbs commented Apr 29, 2021

Apache Airflow version: 2.0.1 and 2.0.2

Kubernetes version (if you are using kubernetes) (use kubectl version): N/A

Environment: Celery executors, Redis + Postgres

  • Cloud provider or hardware configuration: Running inside docker
  • OS (e.g. from /etc/os-release): Centos (inside Docker)

What happens:

In 2.0.0, if you delete a dag from the GUI while the .py file is still present, the dag is re-added within a few seconds (albeit with no history etc.). When we attempted to upgrade to 2.0.1, we found that after deleting a dag it would take tens of minutes (or more!) for it to come back, and its reappearance seemed essentially random (i.e. restarting schedulers / GUIs did not help).

It did not seem to matter which dag it was.

The problem still exists in 2.0.2.

What you expected to happen:

Deleting a dag should result in that dag being re-added in short order if the .py file is still present.

Likely cause:

I've tracked it back to an issue with SLA callbacks. I strongly suspect the fix for Issue #14050 was inadvertently responsible, since that was in the 2.0.1 release. In a nutshell, it appears the dag_processor_manager gets into a state where on every single pass it takes so long to process SLA checks for one of the dag files that the entire processor times out and is killed. As a result, some of the dag files (that are queued behind the poison pill file) never get processed and thus we don't reinstate the deleted dag unless the system gets quiet and the SLA checks clear down.

To reproduce in my setup, I created a clean airflow instance. The only materially important config setting I use is AIRFLOW__SCHEDULER__PARSING_PROCESSES=1 which helps keep things deterministic.
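
For reference, that maps to the following in airflow.cfg (the environment variable form is how I set it under docker):

[scheduler]
parsing_processes = 1

# or, equivalently, as an environment variable:
AIRFLOW__SCHEDULER__PARSING_PROCESSES=1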

I then started adding in dag files from the production system until I found a file that caused the problem. Most of our dags do not have SLAs, but this one did. After adding it, I started seeing lines like this in dag_processor_manager.log (file names have been changed to keep things simple):

[2021-04-29 16:27:19,259] {dag_processing.py:1129} ERROR - Processor for /home/airflow/dags/problematic.py with PID 309 started at 2021-04-29T16:24:19.073027+00:00 has timed out, killing it.

Additionally, the stats contained lines like:

File Path                                                            PID  Runtime      # DAGs    # Errors  Last Runtime    Last Run
-----------------------------------------------------------------  -----  ---------  --------  ----------  --------------  -------------------
/home/airflow/dags/problematic.py                                    309  167.29s           8           0  158.78s         2021-04-29T16:24:19

(i.e. 3 minutes to process a single file!)

Of note, the parse time of the affected file got longer on each pass until the processor was killed. Increasing AIRFLOW__CORE__DAG_FILE_PROCESSOR_TIMEOUT to e.g. 300 did nothing to help; it simply bought a few more iterations of the parse loop before it blew up.

Browsing the log file for scheduler/2021-04-29/problematic.py.log I could see the following:

Log file entries in 2.0.2
[2021-04-29 16:06:44,633] {scheduler_job.py:629} INFO - Processing file /home/airflow/dags/problematic.py for tasks to queue
[2021-04-29 16:06:44,634] {logging_mixin.py:104} INFO - [2021-04-29 16:06:44,634] {dagbag.py:451} INFO - Filling up the DagBag from /home/airflow/dags/problematic
[2021-04-29 16:06:45,001] {scheduler_job.py:639} INFO - DAG(s) dict_keys(['PARQUET-BASIC-DATA-PIPELINE-YESTERDAY-S1-weekends', 'PARQUET-BASIC-DATA-PIPELINE-YESTERDAY-APPEND-S1-weekends', 'PARQUET-BASIC-DATA-PIPELINE-TODAY-S1-weekends', 'PARQUET-BASIC-DATA-PIPELINE-TODAY-APPEND-S1-weekends', 'PARQUET-BASIC-DATA-PIPELINE-TODAY-S2-weekends', 'PARQUET-BASIC-DATA-PIPELINE-YESTERDAY-S2-weekends', 'PARQUET-BASIC-DATA-PIPELINE-TODAY-S3-weekends', 'PARQUET-BASIC-DATA-PIPELINE-YESTERDAY-S3-weekends']) retrieved from /home/airflow/dags/problematic.py
[2021-04-29 16:06:45,001] {scheduler_job.py:396} INFO - Running SLA Checks for PARQUET-BASIC-DATA-PIPELINE-YESTERDAY-APPEND-S1-weekends
[2021-04-29 16:06:46,398] {scheduler_job.py:396} INFO - Running SLA Checks for PARQUET-BASIC-DATA-PIPELINE-YESTERDAY-APPEND-S1-weekends
[2021-04-29 16:06:47,615] {scheduler_job.py:396} INFO - Running SLA Checks for PARQUET-BASIC-DATA-PIPELINE-YESTERDAY-APPEND-S1-weekends
[2021-04-29 16:06:48,852] {scheduler_job.py:396} INFO - Running SLA Checks for PARQUET-BASIC-DATA-PIPELINE-TODAY-APPEND-S1-weekends
[2021-04-29 16:06:49,411] {scheduler_job.py:396} INFO - Running SLA Checks for PARQUET-BASIC-DATA-PIPELINE-TODAY-APPEND-S1-weekends
[2021-04-29 16:06:50,156] {scheduler_job.py:396} INFO - Running SLA Checks for PARQUET-BASIC-DATA-PIPELINE-TODAY-APPEND-S1-weekends
[2021-04-29 16:06:50,845] {scheduler_job.py:396} INFO - Running SLA Checks for PARQUET-BASIC-DATA-PIPELINE-YESTERDAY-APPEND-SP500_Index_1-weekends
[2021-04-29 16:06:52,164] {scheduler_job.py:396} INFO - Running SLA Checks for PARQUET-BASIC-DATA-PIPELINE-YESTERDAY-APPEND-S1-weekends
[2021-04-29 16:06:53,474] {scheduler_job.py:396} INFO - Running SLA Checks for PARQUET-BASIC-DATA-PIPELINE-YESTERDAY-APPEND-S1-weekends
[2021-04-29 16:06:54,731] {scheduler_job.py:396} INFO - Running SLA Checks for PARQUET-BASIC-DATA-PIPELINE-TODAY-APPEND-SP500_Index_1-weekends
[2021-04-29 16:06:55,345] {scheduler_job.py:396} INFO - Running SLA Checks for PARQUET-BASIC-DATA-PIPELINE-TODAY-APPEND-S1-weekends
[2021-04-29 16:06:55,920] {scheduler_job.py:396} INFO - Running SLA Checks for PARQUET-BASIC-DATA-PIPELINE-TODAY-APPEND-S1-weekends

and so on for 100+ more lines like this... 

Two important points from the above logs:

  1. We seem to be running checks on the same dags multiple times
  2. The number of checks grows on each pass (i.e. the number of log lines beginning "Running SLA Checks..." increases on each pass until the processor manager is restarted, and then it begins afresh)

Likely location of the problem:

This is where I start to run out of steam. I believe the culprit is this line: https://github.com/apache/airflow/blob/2.0.2/airflow/jobs/scheduler_job.py#L1813

It seems to me that the above leads to a feedback loop where each time you send a dag callback to the processor you include a free SLA callback as well, hence the steadily growing number of "Running SLA Checks" log messages I observed. As noted above, this method call existed in 2.0.0, but until Issue #14050 was fixed the SLAs were ignored, so the problem only kicked in from 2.0.1 onwards.
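
To illustrate why this snowballs, here's a deliberately simplified toy model (none of these names are Airflow's; it just mimics the queueing behaviour I believe is happening): while a slow parse of the file is in progress, the scheduler keeps sending callbacks for it, each bundling an SLA check per dag in the file, and every queued check then gets executed on the next parse, which makes that parse slower still.

Toy model of the suspected feedback loop
# Not Airflow code -- a crude simulation of the queueing behaviour described above.
# Each scheduler pass enqueues one SLA callback per dag in the file; every queued
# callback is processed on the next parse of that file. All numbers are illustrative.

DAGS_IN_FILE = 8          # the problematic file defines 8 dags
SLA_CHECK_COST = 0.15     # seconds per SLA check
SCHEDULER_PERIOD = 1.0    # seconds between scheduler passes that send callbacks
TIMEOUT = 180.0           # roughly dag_file_processor_timeout
BASE_PARSE_TIME = 1.0     # time to parse the file with no callbacks queued

queued_sla_checks = 0
for parse_pass in range(1, 50):
    # Everything queued so far gets executed during this parse.
    this_parse = BASE_PARSE_TIME + queued_sla_checks * SLA_CHECK_COST
    print(f"pass {parse_pass:2d}: {queued_sla_checks:4d} queued SLA checks, {this_parse:6.1f}s to parse")
    if this_parse > TIMEOUT:
        print("processor killed by timeout; the queue only keeps growing")
        break
    # While this parse was running, the scheduler kept sending dag callbacks,
    # each with a free SLA callback covering every dag in the file.
    queued_sla_checks += int(this_parse / SCHEDULER_PERIOD) * DAGS_IN_FILE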

Unfortunately, my airflow-fu is not good enough for me to suggest a fix beyond the Gordian solution of removing the line completely (!); in particular, it's not clear to me how / where SLAs should be checked. Should the dag_processor_manager be doing them? Should it be another component? (Naively, I would have thought it should be the workers, so that SLA checks can scale with the rest of your system.) How should the checks be enqueued? I don't know enough to give a good answer. 🤷

How to reproduce it:

In our production system, it would blow up every time, immediately. Reliably reproducing it in a clean system depends on how fast your test machine is; the trick appears to be getting the scan of the dag file to take long enough that the SLA checks start to snowball. The dag below did it for me; if your machine seems to be staying on top of processing the dags, try increasing the number of tasks in a single dag (or buy a slower computer!).

Simple dag that causes the problem
import datetime as dt
import pendulum

from airflow import DAG
from airflow.operators.bash import BashOperator


def create_graph(dag):
    prev_task = None
    for i in range(10):
        next_task = BashOperator(
            task_id=f'simple_task_{i}',
            bash_command="echo SLA issue",
            dag=dag)

        if prev_task:
            prev_task >> next_task

        prev_task = next_task


def create_dag(name: str) -> DAG:
    tz_to_use = pendulum.timezone('UTC')

    default_args = {
        'owner': '[email protected]',
        'start_date': dt.datetime(2018, 11, 13, tzinfo=tz_to_use),
        'email': ['[email protected]'],
        'email_on_failure': False,
        'email_on_retry': False,
        'sla': dt.timedelta(hours=13),
    }

    dag = DAG(name,
              catchup=False,
              default_args=default_args,
              max_active_runs=10,
              schedule_interval="* * * * *")

    create_graph(dag)

    return dag

for i in range(100):
    name = f"sla_dag_{i}"
    globals()[name] = create_dag(name)

To reproduce:

  1. Configure an empty airflow instance, such that it only has one parsing process (as per the config above).
  2. Add the file above into the install. The file simply creates 100 near-trivial dags. On my system, airflow can't stay ahead, and is basically permanently busy processing the backlog. Your cpu may have more hamsters, in which case you'll need to up the number of tasks and/or dags.
  3. Locate and tail the scheduler/[date]/sla_example.py.log file (assuming you called the above sla_example.py, of course)
  4. This is the non-deterministic part. On my system, within a few minutes, the processor manager is taking noticeably longer to process the file and you should be able to see lots of SLA log messages like my example above ☝️. Like all good exponential growth it takes many iterations to go from 1 second to 1.5 seconds to 2 seconds, but not very long at all to go from 10 seconds to 30 to 💥
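
If it helps, a quick way to quantify step 4 is to count the "Running SLA Checks" lines per minute in that log; the count should climb steadily between processor restarts. A minimal helper (the log path matches my setup; adjust it to your logs directory):

# Count "Running SLA Checks" log lines per minute in the per-file processor log.
# The path below is specific to my setup -- point it at your own
# scheduler/[date]/sla_example.py.log.
from collections import Counter
from pathlib import Path

LOG_PATH = Path("/home/airflow/logs/scheduler/2021-04-29/sla_example.py.log")

per_minute = Counter()
for line in LOG_PATH.read_text().splitlines():
    if "Running SLA Checks" in line:
        # Lines start "[YYYY-MM-DD HH:MM:SS,..."; take the timestamp up to the minute.
        per_minute[line[1:17]] += 1

for minute, count in sorted(per_minute.items()):
    print(minute, count)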

Anything else we need to know:

  1. I'm working around this for now by simply removing the SLAs from the dag. This solves the problem since the SLA callbacks are then dropped. But SLAs are a great feature, and I'd like them back (please!).
  2. Thanks for making airflow and thanks for making it this far down the report!
@argibbs argibbs added the kind:bug This is a clearly a bug label Apr 29, 2021
@boring-cyborg

boring-cyborg bot commented Apr 29, 2021

Thanks for opening your first issue here! Be sure to follow the issue template!

@kaxil kaxil added area:core area:Scheduler including HA (high availability) scheduler labels Apr 29, 2021
@kaxil kaxil added this to the Airflow 2.1 milestone Apr 29, 2021
@vikramkoka vikramkoka added the affected_version:2.0 Issues Reported for 2.0 label May 3, 2021
@jhtimmins jhtimmins modified the milestones: Airflow 2.1, Airflow 2.1.1 May 11, 2021
@jhtimmins
Contributor

@kaxil @ashb Should DAGs deleted via the UI actually be re-added if the file still exists? That seems to defeat the purpose of deleting.

@argibbs
Contributor Author

argibbs commented May 26, 2021

@kaxil @ashb Should DAGs deleted via the UI actually be re-added if the file still exists? That seems to defeat the purpose of deleting.

Couple of thoughts on this:

  1. Re-adding the DAG is useful; I consider it functionally equivalent to selecting all existing dag runs and deleting them. If I want the dag gone, I'll remove the underlying file first. Changing this would be a major departure.
  2. The fact that the dag is not re-added is (slightly) beside the point. The problem is that the SLAs trigger a feedback loop that prevents the dag processor manager from iterating over all the files in finite time. This means that once this problem is present, adding a new dag is also not picked up until / unless the SLA feedback clears down.

@kaxil
Member

kaxil commented May 26, 2021

@kaxil @ashb Should DAGs deleted via the UI actually be re-added if the file still exists? That seems to defeat the purpose of deleting.

No, I think that is fair -- if the file exists on the system, it is likely there for a reason. If a user does not want that DAG to be added, they should just remove the dag file.

@kaxil kaxil added the priority:medium Bug that should be fixed before next release but would not block a release label May 26, 2021
@ashb
Member

ashb commented May 26, 2021

It didn't use to be possible to delete a dag while the file still existed.

(Which led to problems with multi-dag files)

@jhtimmins
Contributor

OK, sounds like a fix is needed then. @argibbs, are you interested in submitting a fix for this? If so, we can get it into the next release (either 2.0.1 or 2.2.0).

@argibbs
Contributor Author

argibbs commented Jun 18, 2021

Hi there,

I'd be up for making a change, but I don't know what the right solution is (as my lengthy screed above hopefully made clear). I could just comment out the call entirely, but that would break SLAs.

Doing more than that would require a bit more guidance. If someone's able to look at it and tell me "ah right, the callback should only be sent every tenth time" or "the callback should only be sent under the following additional conditions: ..." then I'm happy to pick it up and can probably get it out within the next milestone or two.

If you want me to look at it, and take ownership of working out how SLAs are supposed to work, well, that's going to take a while. I'd still be game, but you'd need to adjust your expectations appropriately :)

@argibbs
Contributor Author

argibbs commented Jul 22, 2022

This should be fixed by #25147

@argibbs argibbs closed this as completed Jul 22, 2022
@argibbs
Contributor Author

argibbs commented Jul 25, 2022

#25147 improves things slightly, but add enough SLAs to your system and you hit another issue.

Re-opening while I work on a second MR.
