Store callbacks in database if standalone_dag_processor config is True. #21731
Conversation
airflow/jobs/scheduler_job.py
Outdated
@@ -858,7 +865,7 @@ def _do_scheduling(self, session) -> int:
                    self.log.error("DAG '%s' not found in serialized_dag table", dag_run.dag_id)
                    continue

-                self._send_dag_callbacks_to_processor(dag, callback_to_run)
+                callbacks_to_send.append((dag, callback_to_run))
I have a concern about this part - is it critical to send the callbacks at this moment?
Would it be a big problem if we moved it to the bottom - outside of the commit-guard?
Hmm, this may change the behaviour in some cases. Judging from the expunge_all() call below, I suspect the callbacks are allowed to update the database before the critical section below. Moving the callbacks to after that section means they no longer have this ability.
Alternatively, we could end the previous guard just before this point, send the callbacks to the DB, and then start another guard below.
I don't know how important it is to keep a single guard for all of lines 843-900.
To be honest I think that guard could be stopped here and restarted later (or maybe it is not even needed later, @ashb?). I think the main reason for the "prohibit_commit" guard was to make sure the DagRun rows are locked, but after the commit above they are already freed anyway. From this place on, I think the guard is only there to make sure that none of the code performs an accidental commit (it does not protect the DagRun rows, which are already freed at this point). So conceptually I see no problem with running the callbacks outside of the guard and having a separate guard for the critical section.
That would make quite a lot of sense actually.
Yeah, makes sense Jarek:

    with prohibit_commit(session) as guard:
        ...
        guard.commit()

    # Send the callbacks after we commit to ensure the context is up to date when it gets run
    for dag_run, callback_to_run in callback_tuples:
        dag = self.dagbag.get_dag(dag_run.dag_id, session=session)
        if not dag:
            self.log.error("DAG '%s' not found in serialized_dag table", dag_run.dag_id)
            continue
        self._send_dag_callbacks_to_processor(dag, callback_to_run)

    with prohibit_commit(session) as guard:
        ...
Thank you!
        if hasattr(ti, 'run_as_user'):
            self._run_as_user = ti.run_as_user
        self._pool: str = ti.pool
        self._run_as_user = run_as_user
Ah nice :)
        return NotImplemented

    @classmethod
    def from_ti(cls, ti: TaskInstance):
Good idea.
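For readers skimming the diff, here is a rough, self-contained sketch of the idea behind from_ti: a lightweight snapshot of a TaskInstance built via a classmethod so a callback request can carry it to the DagProcessor. The class name and field list below are illustrative assumptions, not the PR's exact SimpleTaskInstance.

```python
# Illustrative sketch only - the class and its fields are assumptions,
# not the exact SimpleTaskInstance implementation from this PR.
from dataclasses import dataclass
from typing import Optional

from airflow.models.taskinstance import TaskInstance


@dataclass
class SimpleTaskInstanceSketch:
    dag_id: str
    task_id: str
    run_id: str
    try_number: int
    state: Optional[str]
    pool: str
    run_as_user: Optional[str]

    @classmethod
    def from_ti(cls, ti: TaskInstance) -> "SimpleTaskInstanceSketch":
        # Copy only plain attributes so the snapshot can be serialized and
        # stored alongside a callback request.
        return cls(
            dag_id=ti.dag_id,
            task_id=ti.task_id,
            run_id=ti.run_id,
            try_number=ti.try_number,
            state=ti.state,
            pool=ti.pool,
            run_as_user=ti.run_as_user if hasattr(ti, "run_as_user") else None,
        )
```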
Hmm. I have a few comments - less important, but I think the "parallelism" query for fetching callbacks is important to discuss. I might be missing something though, so happy to discuss it.
airflow/dag_processing/manager.py
Outdated
            # Nothing to do if callbacks are not stored in the database
            return
        self.log.debug("Fetching callbacks from the database.")
        query = (
It would be good to add the "commit guard" here. The code is very simple now and we probably would not make a mistake, but this will prevent future mistakes where someone accidentally adds a commit that releases the lock.
Added guard.
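For context, here is a minimal sketch of the pattern agreed on here: fetching stored callbacks inside a commit guard so an accidental commit cannot release the row locks early. The DbCallbackRequest model, its get_callback_request() helper, and the query shape are assumptions based on this conversation, not the PR's exact code.

```python
# Sketch only: fetch a batch of stored callbacks under a commit guard.
# DbCallbackRequest and get_callback_request() are assumed names.
from airflow.models.db_callback_request import DbCallbackRequest  # assumed import path
from airflow.utils.session import create_session
from airflow.utils.sqlalchemy import prohibit_commit, with_row_locks


def fetch_callbacks(max_callbacks: int = 20):
    callbacks = []
    with create_session() as session:
        # Inside the guard, any commit other than guard.commit() raises,
        # so the row locks cannot be released by mistake.
        with prohibit_commit(session) as guard:
            query = session.query(DbCallbackRequest).limit(max_callbacks)
            rows = with_row_locks(query, of=DbCallbackRequest, session=session, skip_locked=True)
            for row in rows:
                callbacks.append(row.get_callback_request())
                session.delete(row)  # delete so each callback is executed only once
            guard.commit()
    return callbacks
```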
Two more small things from my side.
I am cool with that! Good job @mhenc!
Some tests are failing though :)
Co-authored-by: Ephraim Anierobi <[email protected]>
Hi @mhenc, I have not tested this completely, but I ran into some issues, though I'm not sure whether they are valid or not. Let me explain them -
If there is anything that I need to change, please let me know.
@mhenc - any comments on that?
@jyotsa09 I just tested it and the callbacks seem to be executed - however, instead of "print" I wrote to a file, as I am not sure whether stdout from callbacks is redirected to the logs.
Note that callbacks are removed from the DB by the DagProcessor when fetching them.
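To make that manual test concrete, here is a sketch of a DAG whose success callback appends to a file instead of printing; the file path and DAG details are made up for the example.

```python
# Sketch: verify a callback ran by writing a marker to a file,
# since stdout from callbacks may not show up in the logs.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator


def on_success(context):
    # Append a marker line we can check after the DagRun finishes.
    with open("/tmp/callback_check.txt", "a") as marker:
        marker.write(f"success callback ran for {context['dag_run'].run_id}\n")


with DAG(
    dag_id="callback_smoke_test",
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,
    on_success_callback=on_success,
) as dag:
    BashOperator(task_id="noop", bash_command="true")
```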
FYI - I disabled the standalone DAG processor and it fixed my issues, but while it was enabled my scheduler would sometimes die. Here are some logs where it died almost immediately after I deployed Airflow:
A pull request was already created (I discussed this on Slack), so I am not sure whether an issue should be raised or not: #24366
Hello, I'm also using separate scheduler and dag-processor processes in version 2.3.2 and my scheduler is dying with:
This is likely a known issue - it comes from the separated dag-processor and it should get fixed in 2.3.3 with #24366. The proposed fix is very simple (literally one line of code), so you might want to simply apply it to your installation; you will also get it when we release 2.3.3 (but you should monitor the changes to #24366 and see if we find a better/more complete fix before merging it).
I added this to my image to make it work:
This change introduces a new config option:
[scheduler] standalone_dag_processor
- if set to True, all callbacks are stored in the dedicated callback_request table in the database. They are then read from there by the DagProcessor and executed.
If the option is False, callbacks are still sent to the DagProcessor using a Pipe.
This change will allow us to run the DagProcessor in a separate process, independently from the SchedulerJob.
Part of AIP-43
https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-43+DAG+Processor+separation
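For illustration, here is a minimal sketch of how code can branch on the new option through Airflow's configuration API; the section and key come from the description above, while the dispatch helper and its behaviour are simplified assumptions.

```python
# Sketch: branch on the [scheduler] standalone_dag_processor option.
from airflow.configuration import conf


def dispatch_callback(callback_request) -> None:
    if conf.getboolean("scheduler", "standalone_dag_processor", fallback=False):
        # Standalone mode: the callback would be persisted in the
        # callback_request table for the DagProcessor to pick up and execute.
        print(f"would store in DB: {callback_request}")
    else:
        # Default mode: the callback would be sent to the DagProcessor over the Pipe.
        print(f"would send over the pipe: {callback_request}")
```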