
Constant "The scheduler does not appear to be running" warning on the UI following 2.6.0 upgrade #31200

Closed
arjunanan6 opened this issue May 11, 2023 · 24 comments · Fixed by #31277
Labels
affected_version:2.6, area:core, kind:bug

@arjunanan6
Contributor

Apache Airflow version

2.6.0

What happened

Ever since we upgraded from Airflow 2.5.2 to 2.6.0, we have intermittently seen a warning stating "The scheduler does not appear to be running".

The warning goes away simply by refreshing the page, which is consistent with our finding that the scheduler has not been down at any point. By calling the /health endpoint repeatedly, we can get it to report an "unhealthy" status:

These are just approx. 6 seconds apart:

{"metadatabase": {"status": "healthy"}, "scheduler": {"latest_scheduler_heartbeat": "2023-05-11T07:42:36.857007+00:00", "status": "healthy"}}

{"metadatabase": {"status": "healthy"}, "scheduler": {"latest_scheduler_heartbeat": "2023-05-11T07:42:42.409344+00:00", "status": "unhealthy"}}

This causes no operational issues, but it is misleading for end-users. What could be causing this?

What you think should happen instead

The warning should not be shown unless the last heartbeat was at least 30 sec earlier (default config).

How to reproduce

There are no concrete steps to reproduce it, but the warning appears in the UI after a few seconds of browsing around, or by refreshing the /health endpoint repeatedly.
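
For example, a minimal polling sketch (the webserver URL is a placeholder, and depending on your setup the /health endpoint may require authentication):

import time
import requests  # assumes the requests package is installed

BASE_URL = "http://localhost:8080"  # placeholder: your webserver URL

# Poll /health every few seconds and print only the scheduler part,
# to catch the intermittent "unhealthy" responses described above.
for _ in range(20):
    body = requests.get(f"{BASE_URL}/health", timeout=10).json()
    print(body["scheduler"])
    time.sleep(5)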

Operating System

Debian GNU/Linux 11

Versions of Apache Airflow Providers

apache-airflow-providers-amazon==8.0.0
apache-airflow-providers-celery==3.1.0
apache-airflow-providers-cncf-kubernetes==6.1.0
apache-airflow-providers-common-sql==1.4.0
apache-airflow-providers-docker==3.6.0
apache-airflow-providers-elasticsearch==4.4.0
apache-airflow-providers-ftp==3.3.1
apache-airflow-providers-google==10.0.0
apache-airflow-providers-grpc==3.1.0
apache-airflow-providers-hashicorp==3.3.1
apache-airflow-providers-http==4.3.0
apache-airflow-providers-imap==3.1.1
apache-airflow-providers-microsoft-azure==6.0.0
apache-airflow-providers-microsoft-mssql==3.3.2
apache-airflow-providers-microsoft-psrp==2.2.0
apache-airflow-providers-microsoft-winrm==3.0.0
apache-airflow-providers-mysql==5.0.0
apache-airflow-providers-odbc==3.2.1
apache-airflow-providers-oracle==3.0.0
apache-airflow-providers-postgres==5.4.0
apache-airflow-providers-redis==3.1.0
apache-airflow-providers-sendgrid==3.1.0
apache-airflow-providers-sftp==4.2.4
apache-airflow-providers-slack==7.2.0
apache-airflow-providers-snowflake==4.0.5
apache-airflow-providers-sqlite==3.3.2
apache-airflow-providers-ssh==3.6.0

Deployment

Official Apache Airflow Helm Chart

Deployment details

Deployed on AKS with helm

Anything else

Nothing more than what is in the description above.

Are you willing to submit PR?

  • Yes I am willing to submit a PR!


@arjunanan6 added the area:core, kind:bug, and needs-triage labels on May 11, 2023
@PhoenixChamb

We are facing the same problem after upgrading Airflow from 2.5.3 to 2.6.0

@ephraimbuddy removed the needs-triage label on May 11, 2023
@ephraimbuddy added this to the Airflow 2.6.2 milestone on May 11, 2023
@potiuk
Member

potiuk commented May 11, 2023

@arjunanan6 and @PhoenixChamb could you please increase https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#job-heartbeat-sec to, say, 7 seconds (if you currently have the default of 5), then to 10 if that does not help, and see whether that decreases the probability of this happening? Please report back here.

That would help us see whether simply increasing the default is a good idea (because of increased resource usage in 2.6) or whether we have some other problem.

@arjunanan6
Contributor Author

@potiuk On it, I will report back shortly.

@arjunanan6
Contributor Author

@potiuk I'm afraid that did not really help the situation. I need more refreshes to see the warning, but it still appears 'incorrectly'.

@ptran32

ptran32 commented May 11, 2023

We also faced this issue after upgrading from Airflow 2.5.3 to 2.6.0: the scheduler runs for 2 minutes, then restarts.

Scheduler logs: https://gist.github.com/ptran32/56703c86a854ec80bdae1d5195c182e7

We also noticed an error from the pod's liveness probe:

Liveness probe failed: Traceback (most recent call last): File "<string>", line 2, in <module> ModuleNotFoundError: No module named 'airflow.jobs.scheduler_job'

@arjunanan6
Contributor Author

As @ptran32 noted, we are also seeing unnecessary scheduler restarts because the liveness probe failed. Interestingly, this was not a problem at first; it started late last night, several hours after the upgrade.

But I really cannot see anything in the scheduler logs that would cause the probe to fail. As a temporary measure, I am increasing the following values:

  • scheduler.livenessProbe.timeoutSeconds to 30
  • scheduler.livenessProbe.failureThreshold to 10

Hopefully, this results in no/fewer restarts.

@ptran32

ptran32 commented May 12, 2023

Hi,

if you are using Helm to deploy Airflow, version 3.6.0 will cause the scheduler to restart in a loop, not because of the timeout (at least in my case) but because the liveness check is not compatible with the new release.

Scheduler logs after upgrade to 3.6.0

Liveness probe failed: Traceback (most recent call last): File "<string>", line 2, in <module> ModuleNotFoundError: No module named 'airflow.jobs.scheduler_job'

We fixed it by adjusting the liveness check command in the helm chart:

      livenessProbe:
          exec:
            command:
            - python
            - -Wignore
            - -c
            - |
              from typing import List
              from airflow.jobs.scheduler_job_runner import SchedulerJobRunner

I hope it helps.

@rcheatham-q

I can confirm that the default liveness probe for the scheduler is failing. The helm chart v1.9.0 defines the liveness probe as follows:

CONNECTION_CHECK_MAX_COUNT=0 AIRFLOW__LOGGING__LOGGING_LEVEL=ERROR exec /entrypoint airflow jobs check --job-type SchedulerJob --local

However, executing that command on a live and functioning scheduler pod fails (I omitted the exec command so I could check the exit code):

airflow@airflow-scheduler-866db5c895-qt5pq:/opt/airflow$ CONNECTION_CHECK_MAX_COUNT=0 AIRFLOW__LOGGING__LOGGING_LEVEL=ERROR /entrypoint airflow jobs check --job-type SchedulerJob --local
No alive jobs found.
airflow@airflow-scheduler-866db5c895-qt5pq:/opt/airflow$ echo $?
1
airflow@airflow-scheduler-866db5c895-qt5pq:/opt/airflow$ 

I can confirm that the liveness check listed by @ptran32 does function; however, it does not seem to directly detect whether the scheduler job is running.

I also tested the example airflow jobs check commands listed in the CLI docs and both functioned the same.

It appears that the scheduler health checks in general are broken in version 2.6.0.

@ptran32

ptran32 commented May 12, 2023

@rcheatham-q that might help as a workaround for now: airflow-helm/charts#738

@eladkal eladkal added the affected_version:2.6 Issues Reported for 2.6 label May 12, 2023
@potiuk
Member

potiuk commented May 13, 2023

@rcheatham-q that might help as a workaround for now: airflow-helm/charts#738

This is not the problem. That workaround is for the non-community-managed Helm chart, which did not catch up with the new scheduler probe, and even the workaround there is very wrong. I doubt our issue is caused by the liveness check not working (it works in general). This is a different issue, and I am looking into it now.

@potiuk
Member

potiuk commented May 13, 2023

I would love to get to the bottom of this. Could anyone having the problem (with the Airflow community chart, https://airflow.apache.org/docs/helm-chart/stable/index.html) run the following on their frequently failing scheduler:

  1. Add AIRFLOW__LOGGING__LOGGING_LEVEL=DEBUG to your scheduler environment and restart it - make sure the log contains debug information
  2. exec into the scheduler container while it is running (/entrypoint bash should be the right command)
  3. run the date command and save its output
  4. run AIRFLOW__LOGGING__LOGGING_LEVEL=DEBUG airflow jobs check --job-type SchedulerJob --local several times while Airflow is running and save the output somewhere. Note down how long the commands take to run.
  5. if you notice that it switches from "Found one alive job." to "No alive jobs found.", do as follows:
  6. run the date command again and save its output
  7. run airflow db shell
  8. run SELECT * from job; and save its output
  9. repeat a few times and see if it changes (the last time seen) - save the outputs
  10. when your scheduler gets killed because of the liveness checks, find the scheduler logs from before that failure and save them to share
  11. create some way (gists?) to share all the dumped information (please annotate and comment on where each log is from)
  12. please also add all the version/cluster information you have

And ping me.

@ptran32 @rcheatham-q @arjunanan6 - can I count on your help here?

@potiuk
Member

potiuk commented May 13, 2023

@PhoenixChamb - you as well ^^ Can I count on your help too?

@thesuperzapper
Contributor

thesuperzapper commented May 13, 2023

@potiuk this issue is caused by the scheduler_health_check_threshold config being ignored since Airflow 2.6.0: SchedulerJobRunner.is_alive() is no longer called, only Job.is_alive(), which does not consider that config. Since PR #30302, the SQL query used by the /health endpoint returns a Job rather than a SchedulerJobRunner instance.

This means that job_heartbeat_sec is used instead, and since it has a default of 5 seconds (compared with the scheduler_health_check_threshold default of 30 seconds), I expect we will see many users reporting that their schedulers are unhealthy.

Also, I think Airflow's internal check of scheduler health is NOT affected by this bug, as adopt_or_reset_orphaned_tasks() appears to use scheduler_health_check_threshold correctly.
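
To illustrate the difference, a rough sketch of the two checks (simplified approximations of the logic described above, not the exact Airflow source):

from datetime import datetime, timezone

def job_is_alive(latest_heartbeat: datetime, heartrate: float, grace_multiplier: float = 2.1) -> bool:
    # Job.is_alive(): heartrate comes from job_heartbeat_sec (default 5 s), so the
    # scheduler gets flagged unhealthy roughly 5 * 2.1 = 10.5 s after the last heartbeat.
    age = (datetime.now(timezone.utc) - latest_heartbeat).total_seconds()
    return age < heartrate * grace_multiplier

def scheduler_job_runner_is_alive(latest_heartbeat: datetime, threshold: float = 30.0) -> bool:
    # SchedulerJobRunner.is_alive(): uses scheduler_health_check_threshold (default 30 s),
    # but since #30302 the /health endpoint no longer reaches this code path.
    age = (datetime.now(timezone.utc) - latest_heartbeat).total_seconds()
    return age < threshold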

@thesuperzapper
Contributor

thesuperzapper commented May 13, 2023

After thinking about it, I think the easiest way to fix the issue I described in #31200 (comment) is to update Job.is_alive() to specifically check whether self.job_type is "SchedulerJob" and, in that case, use the same logic as SchedulerJobRunner.is_alive().

We may also want to remove the SchedulerJobRunner.is_alive() method altogether, because I am pretty sure it's not used anymore, and will cause confusion.
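
A minimal sketch of that proposal (hypothetical; the merged fix in #31277 may differ in the details):

from airflow.configuration import conf

def health_check_threshold(job_type: str, heartrate: float) -> float:
    # Hypothetical helper: pick the threshold based on the job type before
    # comparing it with the heartbeat age in Job.is_alive().
    if job_type == "SchedulerJob":
        # The scheduler keeps its dedicated, more lenient setting (default 30 s).
        return conf.getint("scheduler", "scheduler_health_check_threshold")
    # Every other job type keeps the generic grace period.
    return heartrate * 2.1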

@thesuperzapper
Contributor

For anyone watching who wants a workaround for Airflow 2.6.0, you can simply set AIRFLOW__SCHEDULER__JOB_HEARTBEAT_SEC to 30, or whatever you had AIRFLOW__SCHEDULER__SCHEDULER_HEALTH_CHECK_THRESHOLD set to.
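
To verify that the override is actually picked up, a quick sanity check you could run inside the scheduler container (assuming the env var is set by your deployment, not in code):

from airflow.configuration import conf

print(conf.getint("scheduler", "job_heartbeat_sec"))                 # expect 30 with the workaround
print(conf.getint("scheduler", "scheduler_health_check_threshold"))  # default is 30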

@rcheatham-q

For anyone watching who wants a workaround for Airflow 2.6.0, you can simply set AIRFLOW__SCHEDULER__JOB_HEARTBEAT_SEC to 30, or whatever you had AIRFLOW__SCHEDULER__SCHEDULER_HEALTH_CHECK_THRESHOLD set to.

I can confirm that this workaround removes the warning from the UI and the updated liveness probe in the linked PR also works. Thanks @thesuperzapper for the workarounds.

@potiuk given what @thesuperzapper has described, do you still need the debugging data you mentioned? I'm happy to provide it, but don't want to go through the effort if it is no longer required.

@potiuk
Member

potiuk commented May 14, 2023

Thanks @thesuperzapper!

This is a fantastic find! Thanks for the analysis, it is really helpful. And absolutely, @rcheatham-q and others: there is no need for the debugging data anymore.

Indeed, there was a special case for the scheduler job where, instead of the 2.1 grace multiplier, we used the configuration value; after the refactoring, both the configuration value and is_alive in SchedulerJobRunner went unused.

I am applying a fix now that will make it into rc3 of 2.6.1; it will make both the /health endpoint and the airflow jobs check command use the configuration variable for the check.

Still, I think (just a comment to @thesuperzapper) you should change your Helm chart to use airflow jobs check as of 2.6.1 - see my comment here: airflow-helm/charts#738 (comment).

As of 2.6 we are far more precise about what is and what is not the public interface of Airflow, and things will break without warning (even in patch-level releases) if you interface with Airflow outside of it: specifically when relying on the database structure, specific queries, and "random" Airflow code executed from outside. You've been warned.

While there might be occasional bugs like this one where we missed a case during refactoring, those will be quickly fixed (this one in 2.6.1) and the interface will remain forward-compatible (the aim of airflow jobs check is to provide the right check; for example, if we get rid of the Job table in the future, which might happen without warning in any release, it will continue to work).

@potiuk
Member

potiuk commented May 14, 2023

Fix in #31277

potiuk added a commit to potiuk/airflow that referenced this issue May 14, 2023
The change apache#30302 split Job from JobRunner, but it missed the fact
that SchedulerJob had a special case when checking the threshold:
instead of using the standard grace multiplier, it used whatever
has been defined in `scheduler_health_check_threshold`. The
`is_alive` method in SchedulerJobRunner has remained unused, and
the default 2.1 grace multiplier has been used for both the /health
endpoint and `airflow jobs check`.

This PR brings the exception for SchedulerJob back and clarifies
in the documentation that the same threshold is also used for
`airflow jobs check`.

Fixes: apache#31200
potiuk added a commit that referenced this issue May 15, 2023
@arjunanan6
Contributor Author

@potiuk I pushed the suggested change from @thesuperzapper about an hour ago and everything looks fine now. No more warnings in the UI or scheduler restarts. Thanks a lot!

@potiuk
Member

potiuk commented May 15, 2023

Yes. Thanks @arjunanan6. We also just merged #31277 with the fix for 2.6.1 (rc3 will be up shortly for voting/testing), so if you want to take it for a spin with 2.6.1 and revert to your original settings/configuration, that would be perfect.

ephraimbuddy pushed a commit that referenced this issue May 15, 2023
(cherry picked from commit f366d95)
@thesuperzapper
Contributor

Still, I think (just a comment to @thesuperzapper) you should change your Helm chart to use airflow jobs check as of 2.6.1 - see my comment here: airflow-helm/charts#738 (comment).

As of 2.6 we are far more precise about what is and what is not the public interface of Airflow, and things will break without warning (even in patch-level releases) if you interface with Airflow outside of it: specifically when relying on the database structure, specific queries, and "random" Airflow code executed from outside. You've been warned.

While there might be occasional bugs like this one where we missed a case during refactoring, those will be quickly fixed (this one in 2.6.1) and the interface will remain forward-compatible (the aim of airflow jobs check is to provide the right check; for example, if we get rid of the Job table in the future, which might happen without warning in any release, it will continue to work).

@potiuk the problem is that I'm doing a slightly more complex check than just whether there is a SchedulerJob running.

There is an optional secondary check that ensures tasks are actually being scheduled (in the form of new LocalTaskJobs appearing), but this check only starts once at least one scheduler has been running for a minimum period (to prevent deadlocks where the probe restarts the scheduler before it gets a chance to schedule something).

The reason we need this check is that older versions of Airflow will sometimes deadlock and stop scheduling new tasks while still heartbeating, so they look healthy to the normal probe.

I'm not sure that this is possible to achieve with the current airflow jobs check.

@potiuk
Member

potiuk commented May 15, 2023

I'm not sure that this is possible to achieve with the current airflow jobs check.

For sure you can do it with the Stable REST API.
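
For illustration, a rough sketch of such a secondary check against the stable REST API (the base URL, basic-auth credentials, and the 10-minute window are assumptions; adjust to your deployment):

import requests
from datetime import datetime, timedelta, timezone

BASE = "http://airflow-webserver:8080/api/v1"  # placeholder base URL
AUTH = ("user", "password")                    # assumes basic auth is enabled

# Primary check: the scheduler heartbeat as reported by /health.
health = requests.get(f"{BASE}/health", auth=AUTH, timeout=10).json()
scheduler_ok = health["scheduler"]["status"] == "healthy"

# Secondary check: have any task instances started recently?
# (an approximation of "new LocalTaskJobs are appearing")
since = (datetime.now(timezone.utc) - timedelta(minutes=10)).isoformat()
resp = requests.get(
    f"{BASE}/dags/~/dagRuns/~/taskInstances",
    params={"start_date_gte": since, "limit": 1},
    auth=AUTH,
    timeout=10,
).json()
tasks_recently_started = resp.get("total_entries", 0) > 0

print(scheduler_ok, tasks_recently_started)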

Or maybe it is a great idea to contribute such a check to Airflow. This is the only way to make sure that anything you rely on is not going to change in the next version. If there are no unit tests in Airflow covering it and making sure regressions are caught, there is no way we can maintain backwards compatibility of the code. If you have not made the effort of adding such tests to Airflow, we are not able to promise absolutely anything.

This is one of the reasons the bug crept into 2.6.0: it had no test to prevent regressions. Now it has one: https://github.com/apache/airflow/pull/31277/files#diff-9499319fda165fa31190eba1879d7ecf71871ce3536da0a8505ba529378095c7R151

It would be absolutely unreasonable to expect that things will not change; code is a living thing. What the Airflow team can do is promise to make every effort to keep the things that are part of the public interface backwards compatible.

And this is what the public interface of Airflow document is about. Look there. If you find something that is not there, don't rely on it. Even more so if you find that something is specifically excluded (like direct DB access): don't use it. Figure out a different way (the stable REST API is a good idea) and stick to it. There is no argument about it anyway, unless you bring the discussion to the dev list and convince people to change their decision. You have to change your ways or things will break. You have been warned (for the third time).

And just to illustrate a bit more what you are asking for: I know this is largely exaggerated, and what you want is not as extreme, but the line of thinking here is very similar to this one:

[image]

With Airflow explicitly defining what is and what is not the public interface, you are basically asking for this ^^

@Atif8Ted

Atif8Ted commented Jun 28, 2023

I am still facing the issue: the health endpoint still randomly shows the scheduler as unhealthy. We are using 2.6.2.
How do we debug this?

@potiuk
Member

potiuk commented Jun 28, 2023

As usual: look at the logs and your monitoring systems. Quite often there are issues with the deployment and resources that cause this, including lack of memory, CPU, disk, or I/O limitations.

Look at the logs and analyse whether there are any warnings or information indicating that the scheduler goes down, or errors and warnings that would indicate abnormal behaviour. Generally, if you see any warnings or errors you should make sure to address them; usually you can find the reasons by reading the messages and applying what they say, or you can use the Airflow docs, Google, or search the issues and discussions here to find out whether others have similar problems. Usually you will find answers, other people's logs, and some fix suggestions if others had a similar issue.

Nothing really special; it is just the usual way you run open-source software, where we prepare documentation and have forums where people share their problems and solutions. The usual open-source community.

Look at your monitoring. Depending on the deployment (as with any other software), you should have some way to monitor memory, CPU, I/O, and other resources, and they might show anomalies that cause instabilities which will not be visible in Airflow logs (for example because you do not have enough resources to run Airflow and it gets killed externally). Again, this is nothing special for Airflow; it is the standard way any application would be managed and monitored in your deployment, so you can apply the techniques you usually apply to your other apps.

Managing your deployment is an important responsibility of people like you (Deployment Managers); we just release the Airflow software, but it is the Deployment Managers who need to manage, monitor, and tune Airflow, following the documentation we release together with the software. It also varies across the deployment the Deployment Manager chose; for example, a lot of the monitoring and tuning the Deployment Manager would otherwise have to do is handled by a managed service, if you choose to run one rather than deploy on your own. Generally speaking, this page https://airflow.apache.org/docs/apache-airflow/stable/installation/index.html describes which skills, how much effort, and which parts of the deployment are expected from a deployment manager like you, depending on the deployment choice you make.

If you are unsure and just learning Airflow, you can play with the allocation of those resources if you have no monitoring in place: increasing memory, raising CPU limits, etc. This is a valid technique. Airflow has many knobs to turn and many options you can configure, and it also runs your own code, which might change Airflow's resource expectations, so it is very reasonable to try-and-see rather than "foresee" what kind of resources you might need. In process control theory, this is a good approach when the system has many variables: you go in the loop "guess what can be changed -> change -> observe -> see the impact -> loop back". That is process control with a feedback loop.

Finally, you can also fine-tune the scheduler. Depending on how your Airflow is deployed, and which database, filesystem, architecture, and executor you chose, you have many knobs to turn in the configuration (again, it is the Deployment Manager's job to fine-tune it to the right configuration). This page https://airflow.apache.org/docs/apache-airflow/stable/administration-and-deployment/scheduler.html#fine-tuning-your-scheduler-performance has a more detailed explanation of the knobs you can turn, what effects they have, and which parts of the system you should diagnose and observe in order to make your decisions.

I hope this helps.
