Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Decouple "job runner" from BaseJob ORM model #30255

Merged
merged 2 commits into from
Apr 10, 2023

Conversation

potiuk
Copy link
Member

@potiuk potiuk commented Mar 23, 2023

Originally BaseJob ORM model was extended and Polymorphism has been used to tie different execution logic to different job types. This has proven to be difficult to handle during AIP-44 implementation (internal API) because LocalTaskJob, DagProcessorJob and TriggererJob are all going to not use the ORM BaseJob model, but they should use BaseJobPydantic instead. In order to make it possible, we introduce a new type of object BaseJobRunner and make BaseJob use the runners instead.

This way, the BaseJobRunners are used for the logic of each of the job, where single, non-polimorphic BaseJob is used to keep the records in the database - as a follow up it will allow to completely decouple the job database operations and move it to internal_api component when db-lesss mode is enabled.

Closes: #30294


^ Add meaningful description above

Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in newsfragments.

@boring-cyborg boring-cyborg bot added area:API Airflow's REST/HTTP API area:CLI area:Scheduler including HA (high availability) scheduler area:webserver Webserver related Issues labels Mar 23, 2023
@potiuk potiuk requested review from Taragolis and mhenc March 23, 2023 16:17
@potiuk
Copy link
Member Author

potiuk commented Mar 23, 2023

cc: @vincbeck

@potiuk potiuk added the AIP-44 Airflow Internal API label Mar 23, 2023
@@ -1,225 +0,0 @@
#
Copy link
Member Author

@potiuk potiuk Mar 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to be old and unused/unusable test class.

@potiuk potiuk requested a review from uranusjr March 23, 2023 16:40
@potiuk
Copy link
Member Author

potiuk commented Mar 23, 2023

Hello Everyone, I've been working on it on-off for quite a while as I attempted to make things work for the AIP-44. I actually attempted to do in a few different ways and I did not like the previous ones. Ths one I think is at the sweet-spot of solving the (a little) intertwined BaseJob dependencies with the need of decoupling of the Task running from the ORM models..

A little context on that one.

LocalTaskJob as we have it curently implemented inherits from BaseJob (same as other jobs). It is in fact a polimorphic dependency - all of the jobs are stored in the same 'BaseJob' table. This is (and always has been) a little problematic - because the ''Job" objects inherit from the ORM object and there is an assumption that they are DB-related, on the other had they also had the "running" logic implemented in _execute method of the same ORM-derived objects. That made it rather dificult to split out the logic from database - for TriggererJob, LocalTaskJob and DagFileProcessorJob especially.

I attempted to do it in various ways but I had the goals:

a) the resulting architecture wil decouple from the ORM object from the logic (so that we could have serialized Pydantic objects introduced in #29776 used instead (so basically we should be able to pass and use BaseJob and BaseJobPydantic around)

b) it shoud touch as little logic change as possible (basically shuffling around objects and calling different objects was most of the changes I wanted to do) - so that it will be easy to review and reason about.

c) the resulting architecture will make sense

Result

I think I finally achieved all three goals. Summarizing of what has been done here:

  • I renamed the *Jobs (LocalTaskJob, SchedulerJob etc. to JobRunners which are NOT ORM objects - the whole logic is kept in there. I am just passing the job (BaseJob or BaseJobPydantic) as property of those JobRunners, and I changed all the calls to the internal job fields ot call that job property.

  • where needed (heartbeat mostly) I extracted the logic of updating the job so that it will save changed heartbeat to database even if JobBasePydantic is used

  • we do not have any longer polimorphic BaseJob - BaseJob is just regular ORM class that keeps records for all the runners - uses the same job "types" as before so it is backwards-compatible - but simpler. All the places where polimorphism was used have been updated.

  • I think the resulting architecture is quite a bit easier to reason about as it effectively allows to decouple DB operations to manage state and actual job done by a running task.

Current state

It looks like a huge change but if you look closly most of the changes are changes in tests to adapt to the new object hierarchy. So I hope the review will not be that difficult.

I still have a few (heartbeat) tests not passing and I am working on those (likely something missing in heartbeat processing) - but other than those, I think everything else is in place.

Future

Now - this is not YET AIP-44-compatible change. This refactor is just a basic decoupling.

We will need to implement several other follow up changes after this one is merged:

  • separate DB/non-DB operations and turn the BaseJob's run() into a standalone method that will be possible to run with/without DB
  • (possibly) move the JobRunners to a separate package (or rename modules) to not confuse them with BaseJob - they are quite a different beast
  • decorate the DB methods with @internal_api calls to make them AIP-44 ready

Follow ups

I am not sure of that but those changes also make it possible to do something else. Namely they allow us to limit for how long connections are opened from running tasks. Previously we kept them open all the time when the task was running and that was kinda strange as in most circumstances we only needed it to do some initial setup, heartbeat and save job state when complete. I think this change will enable something else (but that's something to see when the other changes are completed - we could optimize that away and (mimicking what Internal API will be doing) we could only get the session/connections established for short times by the running task. I hope we can get there.

Looking forward to comments and feedback.

BTW. It's I think impossible to split this PR into smaller ones :(

BTW2. DON'T be scared about the size of hte change. It's not as big as it seems - it just needed a lot of changes in tests.

The "code" changes:

+355 -219 (21 files)

The test changes:

+1142 -1032 (28 files)

So it is not that big.

@potiuk potiuk requested a review from o-nikolas March 23, 2023 16:51
.gitignore Outdated Show resolved Hide resolved
Copy link
Contributor

@vincbeck vincbeck left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice decoupling! I really like that. That would make things easier for AIP-44. I only have 2 nits, the decoupling + the code make sense to me.

PS: +1 on the "it is easier to review than it seems"

airflow/jobs/base_job.py Outdated Show resolved Hide resolved
airflow/jobs/job_runner.py Outdated Show resolved Hide resolved
@potiuk potiuk force-pushed the decouple_execute_from_job branch from 5ba89fb to 242c9ef Compare March 23, 2023 19:43
airflow/cli/commands/standalone_command.py Show resolved Hide resolved
airflow/cli/commands/standalone_command.py Outdated Show resolved Hide resolved
airflow/jobs/local_task_job.py Outdated Show resolved Hide resolved
airflow/task/task_runner/base_task_runner.py Outdated Show resolved Hide resolved
potiuk added a commit to potiuk/airflow that referenced this pull request Apr 11, 2023
As a follow-up after decoupling of the job logic from the BaseJob
ORM object (apache#30255), the `run` method of BaseJob should also be
decoupled from it (allowing BaseJobPydantic to be passed) as well
as split into three steps, in order to allow db-less mode.

The "prepare" and "complete" steps of the `run` method are modifying
BaseJob ORM-mapped object, so they should be called over the
internal-api from LocalTask, DafFileProcessor and Triggerer running
in db-less mode. The "execute" method however does not need the
database however and should be run locally.

This is not yet full AIP-44 conversion, this is a prerequisite to do
so - and AIP-44 conversion will be done as a follow-up after this one.

However we added a mermaid diagram showing the job lifecycle with and
without Internal API to make it easier to reason about it

Closes: apache#30295
potiuk added a commit that referenced this pull request Apr 11, 2023
…#30308)

* Separate and split run job method into prepare/execute/complete steps

As a follow-up after decoupling of the job logic from the BaseJob
ORM object (#30255), the `run` method of BaseJob should also be
decoupled from it (allowing BaseJobPydantic to be passed) as well
as split into three steps, in order to allow db-less mode.

The "prepare" and "complete" steps of the `run` method are modifying
BaseJob ORM-mapped object, so they should be called over the
internal-api from LocalTask, DafFileProcessor and Triggerer running
in db-less mode. The "execute" method however does not need the
database however and should be run locally.

This is not yet full AIP-44 conversion, this is a prerequisite to do
so - and AIP-44 conversion will be done as a follow-up after this one.

However we added a mermaid diagram showing the job lifecycle with and
without Internal API to make it easier to reason about it

Closes: #30295

Co-authored-by: Jed Cunningham <[email protected]>
potiuk added a commit to potiuk/airflow that referenced this pull request Apr 11, 2023
This is the final step of decoupling of the job runner from ORM
based BaseJob. After this change, finally we rich the state that
the BaseJob is just a state of the Job being run, but all
the logic is kept in separate "JobRunner" entity which just
keeps the reference to the job. Also it makes sure that
job in each runner is defined as appropriate for each job type:

* SchedulerJobRunner, BackfillJobRunner can only use BaseJob
* DagProcessorJobRunner, TriggererJobRunner and especially the
  LocalTaskJobRunner can keep both BaseJob and it's Pydantic
  BaseJobPydantic representation - for AIP-44 usage.

The highlights of this change:

* Job does not have job_runner reference any more
* Job is a mandatory parameter when creating each JobRunner
* run_job method takes as parameter the job (i.e. where the state
  of the job is called) and executor_callable - i.e. the method
  to run when the job gets executed
* heartbeat callback is also passed a generic callable in order
  to execute the post-heartbeat operation of each of the job
  type
* there is no more need to specify job_type when you create
  BaseJob, the job gets its type by a simply creating a runner
  with the job

This is the final stage of refactoring that was split into
reviewable stages: apache#30255 -> apache#30302 -> apache#30308 -> this PR.

Closes: apache#30325
potiuk added a commit that referenced this pull request Apr 11, 2023
This is the final step of decoupling of the job runner from ORM
based BaseJob. After this change, finally we rich the state that
the BaseJob is just a state of the Job being run, but all
the logic is kept in separate "JobRunner" entity which just
keeps the reference to the job. Also it makes sure that
job in each runner is defined as appropriate for each job type:

* SchedulerJobRunner, BackfillJobRunner can only use BaseJob
* DagProcessorJobRunner, TriggererJobRunner and especially the
  LocalTaskJobRunner can keep both BaseJob and it's Pydantic
  BaseJobPydantic representation - for AIP-44 usage.

The highlights of this change:

* Job does not have job_runner reference any more
* Job is a mandatory parameter when creating each JobRunner
* run_job method takes as parameter the job (i.e. where the state
  of the job is called) and executor_callable - i.e. the method
  to run when the job gets executed
* heartbeat callback is also passed a generic callable in order
  to execute the post-heartbeat operation of each of the job
  type
* there is no more need to specify job_type when you create
  BaseJob, the job gets its type by a simply creating a runner
  with the job

This is the final stage of refactoring that was split into
reviewable stages: #30255 -> #30302 -> #30308 -> this PR.

Closes: #30325
potiuk added a commit to potiuk/airflow that referenced this pull request Mar 8, 2024
Sinc apache#30255 scheduler heartrate has not been properly calculated.
We missed the check for SchedulerJob type and setting heartrate
value from `scheduler_health_check_threshold`.

This PR fixes it.

Fix: apache#37971
potiuk added a commit that referenced this pull request Mar 8, 2024
Sinc #30255 scheduler heartrate has not been properly calculated.
We missed the check for SchedulerJob type and setting heartrate
value from `scheduler_health_check_threshold`.

This PR fixes it.

Fix: #37971
jedcunningham pushed a commit that referenced this pull request Mar 18, 2024
Sinc #30255 scheduler heartrate has not been properly calculated.
We missed the check for SchedulerJob type and setting heartrate
value from `scheduler_health_check_threshold`.

This PR fixes it.

Fix: #37971
(cherry picked from commit 01e40ab)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
AIP-44 Airflow Internal API area:API Airflow's REST/HTTP API area:CLI area:Scheduler including HA (high availability) scheduler area:webserver Webserver related Issues changelog:skip Changes that should be skipped from the changelog (CI, tests, etc..)
Projects
No open projects
Status: Done
Development

Successfully merging this pull request may close these issues.

AIP-44 decouple BaseJob ORM from Job logic
7 participants