Constant "The scheduler does not appear to be running" warning on the UI following 2.6.0 upgrade #31200
Comments
We are facing the same problem after upgrading Airflow from 2.5.3 to 2.6.0 |
@arjunanan6 and @PhoenixChamb could you please increase https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#job-heartbeat-sec to, say, 7 s (assuming you have the default of 5), then to 10 if that does not help, and report back here whether that decreases the probability of the warning? That would help us see whether simply increasing the default is the right move because of increased resource usage in 2.6, or whether we have some other problem. |
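For anyone following along, a sketch of that change, assuming you configure Airflow through environment variables rather than editing airflow.cfg directly:

```bash
# [scheduler] job_heartbeat_sec expressed as an environment variable;
# try 7 first, then 10 if the warning keeps appearing.
export AIRFLOW__SCHEDULER__JOB_HEARTBEAT_SEC=7
```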
@potiuk On it, I will report back shortly. |
@potiuk I'm afraid that did not really help. It now takes more refreshes before the warning shows up, but it still appears incorrectly. |
We also faced this issue after upgrading from Airflow 2.5.3 to 2.6.0: the scheduler runs for about 2 minutes and then restarts. Scheduler logs: https://gist.github.com/ptran32/56703c86a854ec80bdae1d5195c182e7 We also noticed an error on the liveness probe of the pod:
|
As @ptran32 noted, we are also seeing unnecessary scheduler restarts because a liveness probe failed. Interestingly, this was not a problem right after the upgrade, but something that started late last night - several hours after the upgrade. I really cannot see anything in the scheduler logs that would cause the probe to fail. As a temporary solution, I am trying out increasing the following values:
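(The exact values were not preserved in this copy of the thread; as an illustrative sketch only, assuming the official chart's `scheduler.livenessProbe.*` settings, the override could look like this:)

```bash
# Illustrative helm override of the scheduler liveness probe timings;
# these numbers are examples, not the ones used by the commenter.
helm upgrade airflow apache-airflow/airflow --reuse-values \
  --set scheduler.livenessProbe.initialDelaySeconds=30 \
  --set scheduler.livenessProbe.periodSeconds=60 \
  --set scheduler.livenessProbe.timeoutSeconds=20 \
  --set scheduler.livenessProbe.failureThreshold=10
```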
Hopefully, this results in no/fewer restarts. |
Hi, if you are using Helm to deploy Airflow, the 2.6.0 version will cause the scheduler to restart in a loop - not because of the timeout (at least in my case), but because the liveness check is not compatible with the new release. Scheduler logs after upgrade to 2.6.0:
We fixed it by adjusting the liveness check command in the helm chart:
I hope it helps. |
I can confirm that the default liveness probe for the scheduler is failing. The helm chart v1.9.0 defines the liveness probe as follows:
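(The original snippet was not preserved here; from memory, the chart's default probe execs roughly the following command inside the scheduler container - treat the exact flags as an approximation rather than a verbatim quote of chart 1.9.0:)

```bash
# Approximate reconstruction of the default scheduler liveness probe command:
# it asks the metadata DB whether a SchedulerJob on this host has a recent heartbeat.
CONNECTION_CHECK_MAX_COUNT=0 AIRFLOW__LOGGING__LOGGING_LEVEL=ERROR \
  exec /entrypoint airflow jobs check --job-type SchedulerJob --local
```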
However, executing that command on a live and functioning scheduler pod fails (I omitted the
I can confirm that the liveness check listed by @ptran32 does function; however, it does not seem to directly detect whether the scheduler job is running. I also tested the example. It appears that the scheduler health checks in general are broken in version 2.6.0. |
@rcheatham-q that might help as a workaround for now: airflow-helm/charts#738 |
This is not the problem. That workaround is in the Helm chart that is not managed by the Airflow community, which did not catch up with the new scheduler probe - and even the workaround there is very wrong. I doubt this is caused by the liveness check not working (it works in general); this is a different issue, and I am looking into it now. |
I would love to get to the bottom of it. Can anyone having the problem (with the Airflow Community chart https://airflow.apache.org/docs/helm-chart/stable/index.html ) run the following on their frequently failing scheduler:
And ping me. @ptran32 @rcheatham-q @arjunanan6 - can I count on your help here? |
@PhoenixChamb - you as well ^^ Can I count on your help too? |
@potiuk this issue is because the scheduler-specific health threshold is no longer applied: the check now falls back to the default 2.1 grace multiplier on `job_heartbeat_sec` instead of using `scheduler_health_check_threshold`. This means that the UI and the /health endpoint report the scheduler as unhealthy whenever the last heartbeat is older than about 10.5 seconds (2.1 × the default `job_heartbeat_sec` of 5), rather than the 30 seconds users expect. Also, I think airflow's internal checking for the health of the scheduler is NOT affected by this bug. |
After thinking about it, I think the easiest way to fix the issue I described in #31200 (comment) is to update the job liveness check so that scheduler jobs use `scheduler_health_check_threshold` again. We may also want to remove the now-unused `is_alive` method on `SchedulerJobRunner`. |
For anyone watching who wants a workaround for Airflow 2.6.0, you can simply set `[scheduler] job_heartbeat_sec` to a larger value, since until the fix lands the effective scheduler health threshold is 2.1 × `job_heartbeat_sec` rather than `scheduler_health_check_threshold`. |
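A minimal sketch of that workaround, assuming environment-variable configuration (the exact value is a judgment call based on the 2.1 multiplier described above):

```bash
# Workaround sketch for 2.6.0: raise job_heartbeat_sec so that the effective
# threshold (2.1 * job_heartbeat_sec) is back near the old 30 s window.
export AIRFLOW__SCHEDULER__JOB_HEARTBEAT_SEC=15   # 2.1 * 15 = 31.5 s
```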
I can confirm that this workaround removes the warning from the UI and the updated liveness probe in the linked PR also works. Thanks @thesuperzapper for the workarounds. @potiuk given what @thesuperzapper has described, do you still need the debugging data you mentioned? I'm happy to provide it, but don't want to go through the effort if it is no longer required. |
Thanks @thesuperzapper! This is a fantastic find, and thanks for the analysis - really helpful. And absolutely, @rcheatham-q and others - no need for the extra debugging data. Indeed, there was a special case for the scheduler job where, instead of the 2.1 grace multiplier, we overwrote it with the `scheduler_health_check_threshold` configuration value - and after the refactoring both that configuration value and the scheduler-specific `is_alive` method ended up unused. I am applying a fix now that will make it into rc3 of 2.6.1, and it will fix both the /health endpoint and `airflow jobs check`. Still, I think (just a comment, @thesuperzapper) you should change your Helm chart to use only Airflow's public interface for its health checks. As of 2.6 we are far more precise about what is and what is not the public interface of Airflow - and things will break without warning (even in patch-level releases) if you base your integration with Airflow on anything else: specifically, relying on database structure, specific queries, and "random" Airflow code executed from outside. You've been warned. While there might be occasional bugs like this one where we missed a case during refactoring, those will be quickly fixed (this one in 2.6.1) and they will remain forward-compatible. |
Fix in #31277 |
The change apache#30302 split Job from JobRunner, but it missed the fact that SchedulerJob had a special case when checking the threshold - instead of using the standard grace multiplier, it used whatever was defined in `scheduler_health_check_threshold`. The `is_alive` method in SchedulerJobRunner has remained unused, and the default 2.1 grace multiplier has been used for both the /health endpoint and `airflow jobs check`. This PR brings the exception for SchedulerJob back and clarifies in the documentation that the same threshold is also used for `airflow jobs check`. Fixes: apache#31200
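For anyone verifying the behaviour before and after the fix, the CLI side mentioned in the commit message can be exercised by hand; running it inside the scheduler container is an assumption, and the hostname filter may need adjusting to match your setup:

```bash
# Check the metadata DB for a live SchedulerJob heartbeat from this host.
# Exits 0 if a recent heartbeat is found, non-zero otherwise.
airflow jobs check --job-type SchedulerJob --hostname "$(hostname)"
```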
@potiuk I pushed the suggested change from @thesuperzapper about an hour ago and everything looks fine now. No more warnings on the UI, or scheduler restarts. Thanks a lot! |
Yes. Thanks @arjunanan6 - we also just merged #31277 with the fix for 2.6.1 (rc3 will be up shortly for voting/testing), so if you want to take it for a spin with 2.6.1 and revert back to the original settings/configuration, that would be perfect. |
The change #30302 split Job from JobRunner, but it missed the fact that SchedulerJob had a special case when checking the threshold - instead of using the standard grace multiplier, it used whatever was defined in `scheduler_health_check_threshold`. The `is_alive` method in SchedulerJobRunner has remained unused, and the default 2.1 grace multiplier has been used for both the /health endpoint and `airflow jobs check`. This PR brings the exception for SchedulerJob back and clarifies in the documentation that the same threshold is also used for `airflow jobs check`. Fixes: #31200 (cherry picked from commit f366d95)
@potiuk the problem is that I'm doing a slightly more complex check than just whether there is a recent scheduler heartbeat. There is an optional secondary check that ensures tasks are actually being scheduled (in the form of new task instances being created). The reason we need this check is that older versions of airflow will sometimes deadlock and stop scheduling new tasks while still heartbeating, so they look healthy to the normal probe. I'm not sure that this is possible to achieve with the current built-in checks. |
For sure you can do it with the Stable REST API. Or maybe this is a great idea to contribute such a check to Airflow. That is the only way to make sure that anything you rely on is not going to change in the next version. If you do not have unit tests in Airflow covering it and making sure that regression tests pass, there is no way we can maintain backwards compatibility of the code. If you have not made that effort (adding tests in Airflow that can guarantee the behaviour), we are not able to promise absolutely anything. This is one of the reasons why the bug crept into 2.6.0 - it had no test to prevent regressions - and now it has https://github.com/apache/airflow/pull/31277/files#diff-9499319fda165fa31190eba1879d7ecf71871ce3536da0a8505ba529378095c7R151

It would be absolutely unreasonable to expect that things will not change - code is a living thing. What the Airflow team can do is promise to make every effort to keep the things that are part of the public interface backwards compatible. That is what the public interface of Airflow document is about. Look there. If you find something that is not there, don't rely on it. Even more so if you find that something is specifically excluded (like direct DB access) - definitely don't use it. Figure out a different way (the Stable REST API is a good idea) and stick to it. There is no argument about it anyway, until you bring the discussion to the devlist and convince people to change their decision. You have to change your ways or things will break. You have been warned (for the third time).

And just to illustrate what you are asking for: I know it's largely exaggerated, and what you want is not as extreme as this, but the line of thinking you have here is very similar to this one. With Airflow explicitly defining what is the public interface and what is not, you are basically asking for this ^^ |
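As a sketch of what a REST-API-based check could look like (the `/api/v1/health` endpoint reports the metadatabase and scheduler status; the host, port, and use of python3 for JSON parsing are assumptions about the environment):

```bash
# Probe sketch: succeed only if the stable REST API reports the scheduler healthy.
curl --fail --silent http://localhost:8080/api/v1/health \
  | python3 -c 'import json,sys; sys.exit(0 if json.load(sys.stdin)["scheduler"]["status"] == "healthy" else 1)'
```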
I am still facing the issue. The health endpoint still shows the scheduler as unhealthy. |
As usual - look at the logs and at your monitoring systems. Quite often there are issues with the deployment and resources that cause this - including lack of memory, CPU, disk, or I/O limitations. Read the logs and check whether there are any warnings or errors that could indicate the scheduler going down or otherwise behaving abnormally. Generally, if you see warnings or errors you should address them - usually you can find the reason by reading the messages and applying what they say, or you can use the Airflow docs, a search engine, or the issues and discussions here to find out whether others have had similar problems. Usually you will find answers, other people's logs, and fix suggestions if someone has hit the same issue. Nothing really special - just the usual way of running open-source software, where we prepare documentation and have forums where people share their problems and solutions.

Look at your monitoring. Depending on the deployment (as with any other software) you should have some way to monitor memory, CPU, I/O and other resources, and they might show anomalies that cause instabilities which will not be visible in the Airflow logs (for example, Airflow not having enough resources and being killed externally). Again, this is nothing special to Airflow - it is the standard way any application should be managed and monitored in your deployment.

If you are unsure and just learning Airflow, you can also play with the allocation of those resources if you have no monitoring in place - increasing memory, raising CPU limits, and so on. This is a valid technique: Airflow has many knobs to turn and many options you can configure, and it also runs your own code, which can change Airflow's resource needs, so trying and observing is often more practical than trying to foresee what resources you might need. In process control theory this is a good approach when the system has many variables: you go in a loop of "guess what can be changed -> change -> observe -> see the impact -> loop back". That is process control with a feedback loop.

Finally, you can also fine-tune the scheduler. Depending on how your Airflow is deployed - which database, which filesystem, what architecture and executor you chose - you have many knobs to turn in the configuration (again, it is the Deployment Manager's job to fine-tune it to the right configuration). This page https://airflow.apache.org/docs/apache-airflow/stable/administration-and-deployment/scheduler.html#fine-tuning-your-scheduler-performance has a more detailed explanation of the knobs you can turn, what effects they have, and which parts of the system you should diagnose and observe in order to make your decisions. I hope it will help. |
Apache Airflow version
2.6.0
What happened
Ever since we upgraded from Airflow 2.5.2 to 2.6.0, we have intermittently seen a warning stating "The scheduler does not appear to be running".
This warning goes away simply by refreshing the page, which is consistent with our finding that the scheduler has not been down at all, at any point. By calling the /health endpoint repeatedly, we can get it to show an "unhealthy" status:
These are just approx. 6 seconds apart:
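(The captured responses were not preserved here; a quick way to reproduce the observation is to poll the webserver's /health endpoint in a loop - the host and port below are assumptions for a port-forwarded or local webserver:)

```bash
# Print the time and the scheduler status reported by /health once per second.
while true; do
  printf '%s  ' "$(date +%T)"
  curl --silent http://localhost:8080/health \
    | python3 -c 'import json,sys; print(json.load(sys.stdin)["scheduler"]["status"])'
  sleep 1
done
```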
This causes no operational issues, but it is misleading for end-users. What could be causing this?
What you think should happen instead
The warning should not be shown unless the last heartbeat was at least 30 seconds ago (the default `scheduler_health_check_threshold`).
How to reproduce
There are no concrete steps to reproduce it, but the warning appears in the UI after a few seconds of browsing around, or by refreshing the /health endpoint repeatedly.
Operating System
Debian GNU/Linux 11
Versions of Apache Airflow Providers
apache-airflow-providers-amazon==8.0.0
apache-airflow-providers-celery==3.1.0
apache-airflow-providers-cncf-kubernetes==6.1.0
apache-airflow-providers-common-sql==1.4.0
apache-airflow-providers-docker==3.6.0
apache-airflow-providers-elasticsearch==4.4.0
apache-airflow-providers-ftp==3.3.1
apache-airflow-providers-google==10.0.0
apache-airflow-providers-grpc==3.1.0
apache-airflow-providers-hashicorp==3.3.1
apache-airflow-providers-http==4.3.0
apache-airflow-providers-imap==3.1.1
apache-airflow-providers-microsoft-azure==6.0.0
apache-airflow-providers-microsoft-mssql==3.3.2
apache-airflow-providers-microsoft-psrp==2.2.0
apache-airflow-providers-microsoft-winrm==3.0.0
apache-airflow-providers-mysql==5.0.0
apache-airflow-providers-odbc==3.2.1
apache-airflow-providers-oracle==3.0.0
apache-airflow-providers-postgres==5.4.0
apache-airflow-providers-redis==3.1.0
apache-airflow-providers-sendgrid==3.1.0
apache-airflow-providers-sftp==4.2.4
apache-airflow-providers-slack==7.2.0
apache-airflow-providers-snowflake==4.0.5
apache-airflow-providers-sqlite==3.3.2
apache-airflow-providers-ssh==3.6.0
Deployment
Official Apache Airflow Helm Chart
Deployment details
Deployed on AKS with helm
Anything else
None more than in the description above.
Are you willing to submit PR?
Code of Conduct