Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix calculation of health check threshold for SchedulerJob #31277

Merged
merged 1 commit into from
May 15, 2023

Conversation

potiuk
Copy link
Member

@potiuk potiuk commented May 14, 2023

The change ##30302 split Job from JobRunner, but it missed the fact that SchedulerJob had a special case of checking the threshold - instead of using the standard grace multiplier, it used whatever has been defined in the scheduler_helth_check_threshold. The is_alive method in SchedulerJobRunner has remained unused, and default 2.1 grace multiplier has been used for both /health endpoint and airflow jobs check.

This PR brings the exception for SchedulerJob back and clarifies that the same treshold is also used for airflow jobs check in the documentation.

Fixes: #31200


^ Add meaningful description above

Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in newsfragments.

@potiuk potiuk requested review from kaxil, ashb and XD-DENG as code owners May 14, 2023 11:13
@boring-cyborg boring-cyborg bot added the area:Scheduler including HA (high availability) scheduler label May 14, 2023
@@ -264,27 +264,6 @@ def _debug_dump(self, signum: int, frame: FrameType | None) -> None:
self.job.executor.debug_dump()
self.log.info("-" * 80)

def is_alive(self, grace_multiplier: float | None = None) -> bool:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not used any more - I missed during the refactor that this "special" case for SchedulerJob is not called any more.

Comment on lines +133 to +134
if self.job_type == "SchedulerJob":
health_check_threshold: int = conf.getint("scheduler", "scheduler_health_check_threshold")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need to hard-code this, we can create an internal method on this base class that returns health_check_threshold and override that on SchedulerJob class

Copy link
Member Author

@potiuk potiuk May 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no SchedulerJob class any more.

Copy link
Member Author

@potiuk potiuk May 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was the gist of the change in #30302 and #30255 - to get rid of the Polymorphism we had and keep one "Job" which is not a "BaseJob" any more. We can (as next step and I might do it) move the threshold calculation to SchedulerJobRunner, but this one is the easiest and safest fix to put specific exception in the "is_alive" method.

Copy link
Member Author

@potiuk potiuk May 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW. There is also a follow-up from @jedcunningham to move the Job to models #31194 (where it actually belongs) and we could do it as part of that move - the whole decoupling actually made it possible to do such a move.

Copy link
Member

@ashb ashb May 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is tiny a change in behaviour.

Previously if you called SchedulerJob.is_alive(2.2) it would use the multiplier instead of the configures threshold.

This is probably okay to ignore the difference but should be called out in a news fragment?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But yeah - might be actually good to spell it out that you should use airflow jobs check now instead of SchedulerJob model as we know poeple DID use the queries (In Matthew's chart for one). Let me add it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've addded description. I will be weakly available this afternoon to follow-up, but I am perfectly fine with refining the message I explained in the newsfragment if someone would like to describe it better (feel free to add changes and merge without me doing so :) )

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ashb - I think it makes little sense to mention the multiplier as a change - it's better to say (if you've used it in the past - don't do it at all).

Copy link
Member Author

@potiuk potiuk May 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc: @kaxil -> I also added another fixup, where I went a bit further - i.e. I implemented in the way that SchedulerJobRunner is overriding the threshold, so that it is not "hard-coded" (so what I discussed above) - I think it might be a bit nicer in the sense it is now not "hard-coded" in Job and moved to the runner- Job is now not aware about Scheduler Job at all. Maybe that is a better idea?

(and I removed the grace_multiplier altogether as parameter - converted it to a constant).

Copy link
Member Author

@potiuk potiuk May 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved it back - I realized that we cannot do it - > the check should be based on the job_type not set externally by the runner - the way it is used by job check is that it retrives it from the DB - and runner is not involved. So really the "is_alive" check must apply different threshold based on the type, not set by the runner.

The change #apache#30302 split Job from JobRunner, but it missed the fact
that SchedulerJob had a special case of checking the threshold -
instead of using the standard grace multiplier, it used whatever
has been defined in the `scheduler_helth_check_threshold`. The
`is_alive` method in SchedulerJobRunner has remained unused, and
default 2.1 grace multiplier has been used for both /health
endpoint and `airflow jobs check`.

This PR brings the exception for SchedulerJob back and clarifies
that the same treshold is also used for airflow jobs check in
the documentation.

Fixes: apache#31200
@potiuk potiuk force-pushed the fix-scheduler-health-rate-check branch 3 times, most recently from 37cf65e to 7add1fc Compare May 14, 2023 21:12
@potiuk potiuk merged commit f366d95 into apache:main May 15, 2023
@potiuk potiuk deleted the fix-scheduler-health-rate-check branch May 15, 2023 08:31
@ephraimbuddy ephraimbuddy added this to the Airflow 2.6.1 milestone May 15, 2023
@ephraimbuddy ephraimbuddy added the type:bug-fix Changelog: Bug Fixes label May 15, 2023
ephraimbuddy pushed a commit that referenced this pull request May 15, 2023
The change ##30302 split Job from JobRunner, but it missed the fact
that SchedulerJob had a special case of checking the threshold -
instead of using the standard grace multiplier, it used whatever
has been defined in the `scheduler_helth_check_threshold`. The
`is_alive` method in SchedulerJobRunner has remained unused, and
default 2.1 grace multiplier has been used for both /health
endpoint and `airflow jobs check`.

This PR brings the exception for SchedulerJob back and clarifies
that the same treshold is also used for airflow jobs check in
the documentation.

Fixes: #31200
(cherry picked from commit f366d95)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:Scheduler including HA (high availability) scheduler type:bug-fix Changelog: Bug Fixes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Constant "The scheduler does not appear to be running" warning on the UI following 2.6.0 upgrade
4 participants