
Constant "The scheduler does not appear to be running" warning on the UI following 2.6.0 upgrade #31200

Closed
arjunanan6 opened this issue May 11, 2023 · 24 comments · Fixed by #31277
Labels
affected_version:2.6, area:core, kind:bug

@arjunanan6
Contributor

Apache Airflow version

2.6.0

What happened

Ever since we upgraded from Airflow 2.5.2 to 2.6.0, we have intermittently seen a warning stating "The scheduler does not appear to be running".

The warning goes away simply by refreshing the page, which is consistent with our finding that the scheduler has not been down at any point. By calling the /health endpoint repeatedly, we can get it to report an "unhealthy" status:

These are just approx. 6 seconds apart:

{"metadatabase": {"status": "healthy"}, "scheduler": {"latest_scheduler_heartbeat": "2023-05-11T07:42:36.857007+00:00", "status": "healthy"}}

{"metadatabase": {"status": "healthy"}, "scheduler": {"latest_scheduler_heartbeat": "2023-05-11T07:42:42.409344+00:00", "status": "unhealthy"}}

This causes no operational issues, but it is misleading for end-users. What could be causing this?

What you think should happen instead

The warning should not be shown unless the last heartbeat was at least 30 sec earlier (default config).

How to reproduce

There are no concrete steps to reproduce it, but the warning appears in the UI after a few seconds of browsing around, or by refreshing the /health endpoint repeatedly.
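
For example, a minimal polling sketch (the webserver URL is a placeholder, and depending on your setup the /health endpoint may require authentication):

import time
import requests  # assumes the requests package is installed

BASE_URL = "http://localhost:8080"  # placeholder: your webserver URL

# Poll /health every few seconds and print only the scheduler part,
# to catch the intermittent "unhealthy" responses described above.
for _ in range(20):
    body = requests.get(f"{BASE_URL}/health", timeout=10).json()
    print(body["scheduler"])
    time.sleep(5)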

Operating System

Debian GNU/Linux 11

Versions of Apache Airflow Providers

apache-airflow-providers-amazon==8.0.0
apache-airflow-providers-celery==3.1.0
apache-airflow-providers-cncf-kubernetes==6.1.0
apache-airflow-providers-common-sql==1.4.0
apache-airflow-providers-docker==3.6.0
apache-airflow-providers-elasticsearch==4.4.0
apache-airflow-providers-ftp==3.3.1
apache-airflow-providers-google==10.0.0
apache-airflow-providers-grpc==3.1.0
apache-airflow-providers-hashicorp==3.3.1
apache-airflow-providers-http==4.3.0
apache-airflow-providers-imap==3.1.1
apache-airflow-providers-microsoft-azure==6.0.0
apache-airflow-providers-microsoft-mssql==3.3.2
apache-airflow-providers-microsoft-psrp==2.2.0
apache-airflow-providers-microsoft-winrm==3.0.0
apache-airflow-providers-mysql==5.0.0
apache-airflow-providers-odbc==3.2.1
apache-airflow-providers-oracle==3.0.0
apache-airflow-providers-postgres==5.4.0
apache-airflow-providers-redis==3.1.0
apache-airflow-providers-sendgrid==3.1.0
apache-airflow-providers-sftp==4.2.4
apache-airflow-providers-slack==7.2.0
apache-airflow-providers-snowflake==4.0.5
apache-airflow-providers-sqlite==3.3.2
apache-airflow-providers-ssh==3.6.0

Deployment

Official Apache Airflow Helm Chart

Deployment details

Deployed on AKS with helm

Anything else

Nothing more than what is in the description above.

Are you willing to submit PR?

  • Yes I am willing to submit a PR!


@arjunanan6 added the area:core, kind:bug, and needs-triage labels on May 11, 2023
@PhoenixChamb

We are facing the same problem after upgrading Airflow from 2.5.3 to 2.6.0

@ephraimbuddy removed the needs-triage label on May 11, 2023
@ephraimbuddy added this to the Airflow 2.6.2 milestone on May 11, 2023
@potiuk
Member

potiuk commented May 11, 2023

@arjunanan6 and @PhoenixChamb could you please increase https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#job-heartbeat-sec to, say, 7 seconds (if you currently have the default of 5), then to 10 if that does not help, and see whether that decreases the probability of this happening? Please report back here.

That would help us see whether simply increasing the default is a good idea (because of increased resource usage in 2.6) or whether we have some other problem.

@arjunanan6
Contributor Author

@potiuk On it, I will report back shortly.

@arjunanan6
Contributor Author

@potiuk I'm afraid that did not really help the situation. I need more refreshes to see the warning, but it still appears 'incorrectly'.

@ptran32

ptran32 commented May 11, 2023

We also faced this issue after upgrading from Airflow 2.5.3 to 2.6.0: the scheduler runs for 2 minutes, then restarts.

Scheduler logs: https://gist.github.com/ptran32/56703c86a854ec80bdae1d5195c182e7

We also noticed an error from the pod's liveness probe:

Liveness probe failed: Traceback (most recent call last): File "<string>", line 2, in <module> ModuleNotFoundError: No module named 'airflow.jobs.scheduler_job'

@arjunanan6
Contributor Author

As @ptran32 noted, we are also seeing unnecessary scheduler restarts because the liveness probe failed. Interestingly, this was not a problem at first; it started late last night, several hours after the upgrade.

But I really cannot see anything in the scheduler logs that would cause the probe to fail. As a temporary measure, I am increasing the following values:

  • scheduler.livenessProbe.timeoutSeconds to 30
  • scheduler.livenessProbe.failureThreshold to 10

Hopefully, this results in no/fewer restarts.

@ptran32

ptran32 commented May 12, 2023

Hi,

if you are using Helm to deploy Airflow, version 3.6.0 will cause the scheduler to restart in a loop, not because of the timeout (at least in my case) but because the liveness check is not compatible with the new release.

Scheduler logs after upgrade to 3.6.0

Liveness probe failed: Traceback (most recent call last): File "<string>", line 2, in <module> ModuleNotFoundError: No module named 'airflow.jobs.scheduler_job'

We fixed it by adjusting the liveness check command in the helm chart:

      livenessProbe:
          exec:
            command:
            - python
            - -Wignore
            - -c
            - |
              from typing import List
              from airflow.jobs.scheduler_job_runner import SchedulerJobRunner

I hope it helps.

@rcheatham-q

I can confirm that the default liveness probe for the scheduler is failing. The helm chart v1.9.0 defines the liveness probe as follows:

CONNECTION_CHECK_MAX_COUNT=0 AIRFLOW__LOGGING__LOGGING_LEVEL=ERROR exec /entrypoint airflow jobs check --job-type SchedulerJob --local

However, executing that command on a live and functioning scheduler pod fails (I omitted the exec command so I could check the exit code):

airflow@airflow-scheduler-866db5c895-qt5pq:/opt/airflow$ CONNECTION_CHECK_MAX_COUNT=0 AIRFLOW__LOGGING__LOGGING_LEVEL=ERROR /entrypoint airflow jobs check --job-type SchedulerJob --local
No alive jobs found.
airflow@airflow-scheduler-866db5c895-qt5pq:/opt/airflow$ echo $?
1
airflow@airflow-scheduler-866db5c895-qt5pq:/opt/airflow$ 

I can confirm that the liveness check listed by @ptran32 does function; however, it does not seem to directly detect whether the scheduler job is running.

I also tested the example airflow jobs check commands listed in the CLI docs and both functioned the same.

It appears that the scheduler health checks in general are broken in version 2.6.0.

@ptran32

ptran32 commented May 12, 2023

@rcheatham-q that might help as a workaround for now: airflow-helm/charts#738

@eladkal eladkal added the affected_version:2.6 Issues Reported for 2.6 label May 12, 2023
@potiuk
Member

potiuk commented May 13, 2023

@rcheatham-q that might help as a workaround for now: airflow-helm/charts#738

This is not the problem. That workaround is for the non-community-managed Helm chart, which did not catch up with the new scheduler probe, and even the workaround there is very wrong. I doubt our issue is caused by the liveness check not working (it works in general). This is a different issue, and I am looking into it now.

@potiuk
Member

potiuk commented May 13, 2023

I would love to get to the bottom of this. Could anyone having the problem (with the Airflow community chart, https://airflow.apache.org/docs/helm-chart/stable/index.html) run the following on their frequently failing scheduler:

  1. Add AIRFLOW__LOGGING__LOGGING_LEVEL=DEBUG to your scheduler environment and restart it - make sure the log contains debug information
  2. exec into the scheduler container while it is running (/entrypoint bash should be the right command)
  3. run the date command and save its output
  4. run AIRFLOW__LOGGING__LOGGING_LEVEL=DEBUG airflow jobs check --job-type SchedulerJob --local several times while Airflow is running and save the output somewhere. Note down how long the commands take to run.
  5. if you notice that it switches from "Found one alive job." to "No alive jobs found.", do as follows:
  6. run the date command again and save its output
  7. run airflow db shell
  8. run SELECT * from job; and save its output
  9. repeat a few times and see if it changes (the last time seen) - save the outputs
  10. when your scheduler gets killed because of the liveness checks, find the scheduler logs from before that failure and save them to share
  11. create some way (gists?) to share all the dumped information (please annotate and comment on where each log is from)
  12. please also add all the version/cluster information you have

And ping me.

@ptran32 @rcheatham-q @arjunanan6 - can I count on your help here?

@potiuk
Member

potiuk commented May 13, 2023

@PhoenixChamb - you as well ^^ Can I count on your help too?

@thesuperzapper
Contributor

thesuperzapper commented May 13, 2023

@potiuk this issue is caused by the scheduler_health_check_threshold config being ignored since Airflow 2.6.0: SchedulerJobRunner.is_alive() is no longer called, only Job.is_alive(), which does not consider that config. Since PR #30302, the SQL query used by the /health endpoint returns a Job rather than a SchedulerJobRunner instance.

This means that job_heartbeat_sec is used instead, and since it has a default of 5 seconds (compared with the scheduler_health_check_threshold default of 30 seconds), I expect we will see many users reporting that their schedulers are unhealthy.

Also, I think Airflow's internal check of scheduler health is NOT affected by this bug, as adopt_or_reset_orphaned_tasks() appears to use scheduler_health_check_threshold correctly.
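
To illustrate the difference, a rough sketch of the two checks (simplified approximations of the logic described above, not the exact Airflow source):

from datetime import datetime, timezone

def job_is_alive(latest_heartbeat: datetime, heartrate: float, grace_multiplier: float = 2.1) -> bool:
    # Job.is_alive(): heartrate comes from job_heartbeat_sec (default 5 s), so the
    # scheduler gets flagged unhealthy roughly 5 * 2.1 = 10.5 s after the last heartbeat.
    age = (datetime.now(timezone.utc) - latest_heartbeat).total_seconds()
    return age < heartrate * grace_multiplier

def scheduler_job_runner_is_alive(latest_heartbeat: datetime, threshold: float = 30.0) -> bool:
    # SchedulerJobRunner.is_alive(): uses scheduler_health_check_threshold (default 30 s),
    # but since #30302 the /health endpoint no longer reaches this code path.
    age = (datetime.now(timezone.utc) - latest_heartbeat).total_seconds()
    return age < threshold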

@thesuperzapper
Contributor

thesuperzapper commented May 13, 2023

After thinking about it, I think the easiest way to fix the issue I described in #31200 (comment) is to update Job.is_alive() to specifically check whether self.job_type is "SchedulerJob" and, in that case, use the same logic as SchedulerJobRunner.is_alive().

We may also want to remove the SchedulerJobRunner.is_alive() method altogether, because I am pretty sure it's not used anymore, and will cause confusion.
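
A minimal sketch of that proposal (hypothetical; the merged fix in #31277 may differ in the details):

from airflow.configuration import conf

def health_check_threshold(job_type: str, heartrate: float) -> float:
    # Hypothetical helper: pick the threshold based on the job type before
    # comparing it with the heartbeat age in Job.is_alive().
    if job_type == "SchedulerJob":
        # The scheduler keeps its dedicated, more lenient setting (default 30 s).
        return conf.getint("scheduler", "scheduler_health_check_threshold")
    # Every other job type keeps the generic grace period.
    return heartrate * 2.1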

@thesuperzapper
Contributor

For anyone watching who wants a workaround for Airflow 2.6.0, you can simply set AIRFLOW__SCHEDULER__JOB_HEARTBEAT_SEC to 30, or whatever you had AIRFLOW__SCHEDULER__SCHEDULER_HEALTH_CHECK_THRESHOLD set to.
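
To verify that the override is actually picked up, a quick sanity check you could run inside the scheduler container (assuming the env var is set by your deployment, not in code):

from airflow.configuration import conf

print(conf.getint("scheduler", "job_heartbeat_sec"))                 # expect 30 with the workaround
print(conf.getint("scheduler", "scheduler_health_check_threshold"))  # default is 30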

@rcheatham-q

For anyone watching who wants a workaround for Airflow 2.6.0, you can simply set AIRFLOW__SCHEDULER__JOB_HEARTBEAT_SEC to 30, or whatever you had AIRFLOW__SCHEDULER__SCHEDULER_HEALTH_CHECK_THRESHOLD set to.

I can confirm that this workaround removes the warning from the UI and the updated liveness probe in the linked PR also works. Thanks @thesuperzapper for the workarounds.

@potiuk given what @thesuperzapper has described, do you still need the debugging data you mentioned? I'm happy to provide it, but don't want to go through the effort if it is no longer required.

@potiuk
Member

potiuk commented May 14, 2023

Thanks @thesuperzapper!

This is a fantastic find! Thanks for the analysis, it is really helpful. And absolutely, @rcheatham-q and others: there is no need for the debugging data anymore.

Indeed, there was a special case for the scheduler job where, instead of the 2.1 grace multiplier, we used the configuration value; after the refactoring, both the configuration value and is_alive in SchedulerJobRunner went unused.

I am applying a fix now that will make it into rc3 of 2.6.1; it will make both the /health endpoint and the airflow jobs check command use the configuration variable for the check.

Still, I think (just a comment to @thesuperzapper) you should change your Helm chart to use airflow jobs check as of 2.6.1 - see my comment here: airflow-helm/charts#738 (comment).

As of 2.6 we are far more precise about what is and what is not the public interface of Airflow, and things will break without warning (even in patch-level releases) if you interface with Airflow outside of it: specifically when relying on the database structure, specific queries, and "random" Airflow code executed from outside. You've been warned.

While there might be occasional bugs like this one where we missed a case during refactoring, those will be quickly fixed (this one in 2.6.1) and the interface will remain forward-compatible (the aim of airflow jobs check is to provide the right check; for example, if we get rid of the Job table in the future, which might happen without warning in any release, it will continue to work).

@potiuk
Member

potiuk commented May 14, 2023

Fix in #31277

potiuk added a commit to potiuk/airflow that referenced this issue May 14, 2023
The change apache#30302 split Job from JobRunner, but it missed the fact
that SchedulerJob had a special case when checking the threshold:
instead of using the standard grace multiplier, it used whatever
has been defined in `scheduler_health_check_threshold`. The
`is_alive` method in SchedulerJobRunner has remained unused, and
the default 2.1 grace multiplier has been used for both the /health
endpoint and `airflow jobs check`.

This PR brings the exception for SchedulerJob back and clarifies
in the documentation that the same threshold is also used for
`airflow jobs check`.

Fixes: apache#31200
potiuk added a commit that referenced this issue May 15, 2023
@arjunanan6
Contributor Author

@potiuk I pushed the suggested change from @thesuperzapper about an hour ago and everything looks fine now. No more warnings in the UI or scheduler restarts. Thanks a lot!

@potiuk
Member

potiuk commented May 15, 2023

Yes. Thanks @arjunanan6. We also just merged #31277 with the fix for 2.6.1 (rc3 will be up shortly for voting/testing), so if you want to take it for a spin with 2.6.1 and revert to your original settings/configuration, that would be perfect.

ephraimbuddy pushed a commit that referenced this issue May 15, 2023
(cherry picked from commit f366d95)
@thesuperzapper
Contributor

Still, I think (just a comment to @thesuperzapper) you should change your Helm chart to use airflow jobs check as of 2.6.1 - see my comment here: airflow-helm/charts#738 (comment).

As of 2.6 we are far more precise about what is and what is not the public interface of Airflow, and things will break without warning (even in patch-level releases) if you interface with Airflow outside of it: specifically when relying on the database structure, specific queries, and "random" Airflow code executed from outside. You've been warned.

While there might be occasional bugs like this one where we missed a case during refactoring, those will be quickly fixed (this one in 2.6.1) and the interface will remain forward-compatible (the aim of airflow jobs check is to provide the right check; for example, if we get rid of the Job table in the future, which might happen without warning in any release, it will continue to work).

@potiuk the problem is that I'm doing a slightly more complex check than just whether there is a SchedulerJob running.

There is an optional secondary check that ensures tasks are actually being scheduled (in the form of new LocalTaskJobs appearing), but this check only starts once at least one scheduler has been running for a minimum period (to prevent deadlocks where the probe restarts the scheduler before it gets a chance to schedule something).

The reason we need this check is that older versions of Airflow will sometimes deadlock and stop scheduling new tasks while still heartbeating, so they look healthy to the normal probe.

I'm not sure that this is possible to achieve with the current airflow jobs check.

@potiuk
Member

potiuk commented May 15, 2023

I'm not sure that this is possible to achieve with the current airflow jobs check.

For sure you can do it with the Stable REST API.
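
For illustration, a rough sketch of such a secondary check against the stable REST API (the base URL, basic-auth credentials, and the 10-minute window are assumptions; adjust to your deployment):

import requests
from datetime import datetime, timedelta, timezone

BASE = "http://airflow-webserver:8080/api/v1"  # placeholder base URL
AUTH = ("user", "password")                    # assumes basic auth is enabled

# Primary check: the scheduler heartbeat as reported by /health.
health = requests.get(f"{BASE}/health", auth=AUTH, timeout=10).json()
scheduler_ok = health["scheduler"]["status"] == "healthy"

# Secondary check: have any task instances started recently?
# (an approximation of "new LocalTaskJobs are appearing")
since = (datetime.now(timezone.utc) - timedelta(minutes=10)).isoformat()
resp = requests.get(
    f"{BASE}/dags/~/dagRuns/~/taskInstances",
    params={"start_date_gte": since, "limit": 1},
    auth=AUTH,
    timeout=10,
).json()
tasks_recently_started = resp.get("total_entries", 0) > 0

print(scheduler_ok, tasks_recently_started)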

Or maybe it is a great idea to contribute such a check to Airflow. This is the only way to make sure that anything you rely on is not going to change in the next version. If there are no unit tests in Airflow covering it and making sure regressions are caught, there is no way we can maintain backwards compatibility of the code. If you have not made the effort of adding such tests to Airflow, we are not able to promise absolutely anything.

This is one of the reasons the bug crept into 2.6.0: it had no test to prevent regressions. Now it has one: https://github.com/apache/airflow/pull/31277/files#diff-9499319fda165fa31190eba1879d7ecf71871ce3536da0a8505ba529378095c7R151

It would be absolutely unreasonable to expect that things will not change; code is a living thing. What the Airflow team can do is promise to make every effort to keep the things that are part of the public interface backwards compatible.

And this is what the public interface of Airflow document is about. Look there. If you find something that is not there, don't rely on it. Even more so if you find that something is specifically excluded (like direct DB access): don't use it. Figure out a different way (the stable REST API is a good idea) and stick to it. There is no argument about it anyway, unless you bring the discussion to the dev list and convince people to change their decision. You have to change your ways or things will break. You have been warned (for the third time).

And just to illustrate a bit more what you are asking for: I know this is largely exaggerated, and what you want is not as extreme, but the line of thinking here is very similar to this one:

[image]

With Airflow explicitly defining what is and what is not the public interface, you are basically asking for this ^^

@Atif8Ted

Atif8Ted commented Jun 28, 2023

I am still facing the issue: the health endpoint still randomly shows the scheduler as unhealthy. We are using 2.6.2.
How do we debug this?

@potiuk
Member

potiuk commented Jun 28, 2023

As usual: look at the logs and your monitoring systems. Quite often there are issues with the deployment and resources that cause this, including lack of memory, CPU, disk, or I/O limitations.

Look at the logs and analyse whether there are any warnings or information indicating that the scheduler goes down, or errors and warnings that would indicate abnormal behaviour. Generally, if you see any warnings or errors you should make sure to address them; usually you can find the reasons by reading the messages and applying what they say, or you can use the Airflow docs, Google, or search the issues and discussions here to find out whether others have similar problems. Usually you will find answers, other people's logs, and some fix suggestions if others had a similar issue.

Nothing really special; it is just the usual way you run open-source software, where we prepare documentation and have forums where people share their problems and solutions. The usual open-source community.

Look at your monitoring. Depending on the deployment (as with any other software), you should have some way to monitor memory, CPU, I/O, and other resources, and they might show anomalies that cause instabilities which will not be visible in Airflow logs (for example because you do not have enough resources to run Airflow and it gets killed externally). Again, this is nothing special for Airflow; it is the standard way any application would be managed and monitored in your deployment, so you can apply the techniques you usually apply to your other apps.

Managing your deployment is an important responsibility of people like you (Deployment Managers); we just release the Airflow software, but it is the Deployment Managers who need to manage, monitor, and tune Airflow, following the documentation we release together with the software. It also varies across the deployment the Deployment Manager chose; for example, a lot of the monitoring and tuning the Deployment Manager would otherwise have to do is handled by a managed service, if you choose to run one rather than deploy on your own. Generally speaking, this page https://airflow.apache.org/docs/apache-airflow/stable/installation/index.html describes which skills, how much effort, and which parts of the deployment are expected from a deployment manager like you, depending on the deployment choice you make.

If you are unsure and just learning Airflow, you can play with the allocation of those resources if you have no monitoring in place: increasing memory, raising CPU limits, etc. This is a valid technique. Airflow has many knobs to turn and many options you can configure, and it also runs your own code, which might change Airflow's resource expectations, so it is very reasonable to try-and-see rather than "foresee" what kind of resources you might need. In process control theory, this is a good approach when the system has many variables: you go in the loop "guess what can be changed -> change -> observe -> see the impact -> loop back". That is process control with a feedback loop.

Finally, you can also fine-tune the scheduler. Depending on how your Airflow is deployed, and which database, filesystem, architecture, and executor you chose, you have many knobs to turn in the configuration (again, it is the Deployment Manager's job to fine-tune it to the right configuration). This page https://airflow.apache.org/docs/apache-airflow/stable/administration-and-deployment/scheduler.html#fine-tuning-your-scheduler-performance has a more detailed explanation of the knobs you can turn, what effects they have, and which parts of the system you should diagnose and observe in order to make your decisions.

I hope this helps.
