-
Notifications
You must be signed in to change notification settings - Fork 14.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix calculation of health check threshold for SchedulerJob #31277
Conversation
@@ -264,27 +264,6 @@ def _debug_dump(self, signum: int, frame: FrameType | None) -> None: | |||
self.job.executor.debug_dump() | |||
self.log.info("-" * 80) | |||
|
|||
def is_alive(self, grace_multiplier: float | None = None) -> bool: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not used any more - I missed during the refactor that this "special" case for SchedulerJob is not called any more.
if self.job_type == "SchedulerJob": | ||
health_check_threshold: int = conf.getint("scheduler", "scheduler_health_check_threshold") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need to hard-code this, we can create an internal method on this base class that returns health_check_threshold
and override that on SchedulerJob
class
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no SchedulerJob
class any more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That was the gist of the change in #30302 and #30255 - to get rid of the Polymorphism we had and keep one "Job" which is not a "BaseJob" any more. We can (as next step and I might do it) move the threshold calculation to SchedulerJobRunner, but this one is the easiest and safest fix to put specific exception in the "is_alive" method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW. There is also a follow-up from @jedcunningham to move the Job to models #31194 (where it actually belongs) and we could do it as part of that move - the whole decoupling actually made it possible to do such a move.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is tiny a change in behaviour.
Previously if you called SchedulerJob.is_alive(2.2)
it would use the multiplier instead of the configures threshold.
This is probably okay to ignore the difference but should be called out in a news fragment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But yeah - might be actually good to spell it out that you should use airflow jobs check
now instead of SchedulerJob model as we know poeple DID use the queries (In Matthew's chart for one). Let me add it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've addded description. I will be weakly available this afternoon to follow-up, but I am perfectly fine with refining the message I explained in the newsfragment if someone would like to describe it better (feel free to add changes and merge without me doing so :) )
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ashb - I think it makes little sense to mention the multiplier as a change - it's better to say (if you've used it in the past - don't do it at all).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc: @kaxil -> I also added another fixup, where I went a bit further - i.e. I implemented in the way that SchedulerJobRunner is overriding the threshold, so that it is not "hard-coded" (so what I discussed above) - I think it might be a bit nicer in the sense it is now not "hard-coded" in Job and moved to the runner- Job is now not aware about Scheduler Job at all. Maybe that is a better idea?
(and I removed the grace_multiplier altogether as parameter - converted it to a constant).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I moved it back - I realized that we cannot do it - > the check should be based on the job_type not set externally by the runner - the way it is used by job check
is that it retrives it from the DB - and runner is not involved. So really the "is_alive" check must apply different threshold based on the type, not set by the runner.
The change #apache#30302 split Job from JobRunner, but it missed the fact that SchedulerJob had a special case of checking the threshold - instead of using the standard grace multiplier, it used whatever has been defined in the `scheduler_helth_check_threshold`. The `is_alive` method in SchedulerJobRunner has remained unused, and default 2.1 grace multiplier has been used for both /health endpoint and `airflow jobs check`. This PR brings the exception for SchedulerJob back and clarifies that the same treshold is also used for airflow jobs check in the documentation. Fixes: apache#31200
37cf65e
to
7add1fc
Compare
The change ##30302 split Job from JobRunner, but it missed the fact that SchedulerJob had a special case of checking the threshold - instead of using the standard grace multiplier, it used whatever has been defined in the `scheduler_helth_check_threshold`. The `is_alive` method in SchedulerJobRunner has remained unused, and default 2.1 grace multiplier has been used for both /health endpoint and `airflow jobs check`. This PR brings the exception for SchedulerJob back and clarifies that the same treshold is also used for airflow jobs check in the documentation. Fixes: #31200 (cherry picked from commit f366d95)
The change ##30302 split Job from JobRunner, but it missed the fact that SchedulerJob had a special case of checking the threshold - instead of using the standard grace multiplier, it used whatever has been defined in the
scheduler_helth_check_threshold
. Theis_alive
method in SchedulerJobRunner has remained unused, and default 2.1 grace multiplier has been used for both /health endpoint andairflow jobs check
.This PR brings the exception for SchedulerJob back and clarifies that the same treshold is also used for airflow jobs check in the documentation.
Fixes: #31200
^ Add meaningful description above
Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named
{pr_number}.significant.rst
or{issue_number}.significant.rst
, in newsfragments.