Deferrable operator tasks do not call on_kill method when failed or restarted
#36090
Comments
Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise a PR to address this issue, please do so; no need to wait for approval.
Related: #19929. It seems this was reported earlier but got closed as a stale issue. We would also find this feature useful to cancel remote jobs tracked by the triggerer on clear/fail.
Yes, it probably should be changed. The problem is that this would have to bring the task out of the deferred state first to run its on_kill method, and that would require someone to implement the logic for it.
Thanks @potiuk.
1. Not that I am aware of. But you are asking a random maintainer who might not know everything. In OSS everything is in the open, so if it is not mentioned here, there is no other place where someone working on it would be recorded.
We're working with a Slurm deferrable operator and are really waiting for this feature 🙏
Just a kind reminder: if you and your company badly want a feature, the most certain way to get it is to spend engineering effort and contribute it back. Other than that, the only thing to do is to 🙏 that someone will do it. Paying someone to contribute such a feature also works. This is how open-source projects work.
Assigning it to @sunank200 as he already created a PR, to avoid others working on it.
I guess this issue needs a more general solution across the core triggerer; the linked PR is specific to a particular operator. One of the solutions we are trying to implement is to have something like on_cancel executed on asyncio.CancelledError, but a defined interface to implement for cancellation, like the run method, would help users. cc: @andrewgodwin
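To make that idea concrete, here is a minimal sketch of what such a defined cancellation interface could look like. The on_cancel hook and the CancellableTrigger class are hypothetical names used only for illustration; they are not an existing Airflow API, just one possible shape of the interface this comment suggests.

from __future__ import annotations

import asyncio
from typing import Any, AsyncIterator

from airflow.triggers.base import BaseTrigger, TriggerEvent


class CancellableTrigger(BaseTrigger):
    """Hypothetical trigger with a dedicated cancellation hook (illustration only)."""

    def serialize(self) -> tuple[str, dict[str, Any]]:
        return ("example_module.CancellableTrigger", {})

    async def run(self) -> AsyncIterator[TriggerEvent]:
        try:
            # Poll the external job here until it finishes, then emit an event.
            yield TriggerEvent({"status": "success"})
        except asyncio.CancelledError:
            # Today each provider has to catch the cancellation itself; the idea above
            # is a well-defined hook that the triggerer would call instead.
            await self.on_cancel()
            raise

    async def on_cancel(self) -> None:
        """Hypothetical hook: cancel the remote job tracked by this trigger."""
        # e.g. call the external system's cancel API here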
I just found out that there is already a cleanup method as part of the interface: https://airflow.apache.org/docs/apache-airflow/stable/_api/airflow/triggers/base/index.html#airflow.triggers.base.BaseTrigger.cleanup. The problem is that cleanup is also called during triggerer restarts (e.g. as part of a deployment), where we do not want to clean up (such as deleting remote jobs), because the trigger would then start tracking invalid jobs. So I thought of this change: a marker attribute is set only when the trigger is cancelled as part of the to_cancel loop, and only then is my custom cleanup function called. This helps with executing cleanup when tasks are cleared or marked success/failure. I tried using a contextvar, but for some reason the update from triggerer_job_runner is not propagated to the trigger. (In Python 3.9 the msg parameter was added to asyncio's Task.cancel.)

class CustomTrigger(BaseTrigger):
    async def cleanup(self):
        """
        cleanup is called when the trigger is cancelled by the triggerer, which can happen as part of the
        to_cancel loop and also when the triggerer exits/restarts. We want to execute our custom cleanup
        only when it is part of the cancel loop, because the triggerer could be restarted during a
        deployment or marked as unhealthy, and in those cases we must not delete jobs tracked upstream
        (otherwise the triggerer would start tracking invalid jobs).
        If the trigger is cancelled from the to_cancel loop (i.e. it is no longer present in the database),
        _cancelled_from_job_runner is set to True and the custom cleanup is executed,
        e.g. for a cleared task instance whose trigger was cancelled.
        """
        cancelled_from_runner = getattr(self, "_cancelled_from_job_runner", False)
        if cancelled_from_runner:
            self.custom_cleanup()
        await super().cleanup()

diff --git a/airflow/jobs/triggerer_job_runner.py b/airflow/jobs/triggerer_job_runner.py
index bb151b32cc..e80a910008 100644
--- a/airflow/jobs/triggerer_job_runner.py
+++ b/airflow/jobs/triggerer_job_runner.py
@@ -497,6 +497,7 @@ class TriggerRunner(threading.Thread, LoggingMixin):
"name": f"{ti.dag_id}/{ti.run_id}/{ti.task_id}/{ti.map_index}/{ti.try_number} "
f"(ID {trigger_id})",
"events": 0,
+ "trigger": trigger_instance
}
else:
self.log.warning("Trigger %s had insertion attempted twice", trigger_id)
@@ -512,6 +513,11 @@ class TriggerRunner(threading.Thread, LoggingMixin):
trigger_id = self.to_cancel.popleft()
if trigger_id in self.triggers:
# We only delete if it did not exit already
+ # These are tasks cancelled by triggerer since they are not found in the database
+ # E.g. task instance cleared from UI. _cancelled_from_job_runner is set so that
+ # our cleanup is executed only when needed and not during triggerer process shutdown,
+ # where cancel is also called (which calls cleanup) but this attribute is not set.
+ self.triggers[trigger_id]["trigger"]._cancelled_from_job_runner = True
self.triggers[trigger_id]["task"].cancel()
await asyncio.sleep(0)
If I were feeling super-hacky in my own fork, I'd be tweaking
WARNING

The proposed solution of capturing asyncio.CancelledError has an important caveat: these PRs will result in the external job being canceled if the triggerer itself is restarted (or crashes), not just when users set the state of a deferred task to "success", "failed", or "clear".

Also note, Airflow will be unaware that the external job has been canceled, and will reschedule the deferred operator on another triggerer instance (which could cause all kinds of strange behaviour). It makes more sense to find a way for Airflow itself to run on_kill.

However, as I am sure vendors will want their deferred operators to work correctly (when users set deferred tasks to "clear", "success" or "failed"), here is a possible workaround (which needs testing).

Possible Workaround

We can still capture asyncio.CancelledError in the trigger, but only cancel the external job when it is safe to do so.

EDIT: the original solution did not handle the case that the task was "cleared", causing the task to be rescheduled and deferred again. Fixing this requires us to distinguish whether the TaskInstance for the run is the same one that created the deferred operator which is being canceled. Luckily we can do this by comparing the job_id.

Here is a basic trigger which pretends to run an external job. It shows whether it has "canceled" the job by writing to a file under /tmp/testing/on_kill_deferred/.

import asyncio
import os
from datetime import timedelta
from typing import Any, AsyncIterator, Dict, Optional
import pendulum
from airflow import DAG
from airflow.exceptions import AirflowException
from airflow.models import BaseOperator
from airflow.models.taskinstance import TaskInstance
from airflow.settings import Session
from airflow.triggers.base import BaseTrigger, TriggerEvent
from airflow.utils import timezone
from airflow.utils.context import Context
from airflow.utils.dates import days_ago
from airflow.utils.session import provide_session
from airflow.utils.state import TaskInstanceState
import datetime
# define a trigger that sleeps until a given datetime, and then sends an event
# and "handles" `asyncio.CancelledError` by writing to a file
class DateTimeTriggerWithCancel(BaseTrigger):
def __init__(
self,
dag_id: str,
task_id: str,
run_id: str,
map_index: int,
job_id: int,
statement_name: str,
moment: datetime.datetime,
):
super().__init__()
self.dag_id = dag_id
self.task_id = task_id
self.run_id = run_id
self.map_index = map_index
self.job_id = job_id
self.statement_name = statement_name
# set and validate the moment
if not isinstance(moment, datetime.datetime):
raise TypeError(
f"Expected 'datetime.datetime' type for moment. Got '{type(moment)}'"
)
elif moment.tzinfo is None:
raise ValueError("You cannot pass naive datetime")
else:
self.moment: pendulum.DateTime = timezone.convert_to_utc(moment)
def serialize(self) -> tuple[str, dict[str, Any]]:
#
# TODO: for Airflow 2.6.0+ you can get the `TaskInstance` from `self.task_instance` which removes
# the need to store `dag_id`, `task_id`, `run_id`, `map_index`, and `job_id` in the trigger
# serialization. However, you STILL NEED TO QUERY for the latest TaskInstance state.
#
return (
"test_on_kill_deferred.DateTimeTriggerWithCancel",
{
"dag_id": self.dag_id,
"task_id": self.task_id,
"run_id": self.run_id,
"map_index": self.map_index,
"job_id": self.job_id,
"statement_name": self.statement_name,
"moment": self.moment,
},
)
@provide_session
def get_task_instance(self, session: Session) -> TaskInstance:
query = session.query(TaskInstance).filter(
TaskInstance.dag_id == self.dag_id,
TaskInstance.task_id == self.task_id,
TaskInstance.run_id == self.run_id,
TaskInstance.map_index == self.map_index,
)
task_instance = query.one_or_none()
if task_instance is None:
raise AirflowException(
"TaskInstance with dag_id: %s, task_id: %s, run_id: %s and map_index: %s is not found",
self.dag_id,
self.task_id,
self.run_id,
self.map_index,
)
return task_instance
def safe_to_cancel(self) -> bool:
"""
Whether it is safe to cancel the external job which is being executed by this trigger.
This is to avoid the case that `asyncio.CancelledError` is called because the trigger
itself is stopped. Because in those cases, we should NOT cancel the external job.
"""
# Database query is needed to get the latest state of the task instance.
task_instance = self.get_task_instance()
# If the current job_id is different from when the trigger was created,
# then we should cancel the external job we are waiting on because the task has been
# cleared and a new job has been created.
if int(task_instance.job_id) != int(self.job_id):
return True
# If the task is not in a deferred state, then something else has happened to the task
# since we were deferred (e.g. a manual state change), so we should cancel the external
# job we are waiting on.
return task_instance.state != TaskInstanceState.DEFERRED
async def run(self) -> AsyncIterator[TriggerEvent]:
self.log.info("trigger starting")
try:
# Sleep a second at a time
while self.moment > pendulum.instance(timezone.utcnow()):
self.log.info("sleeping 1 second...")
await asyncio.sleep(1)
# Send our single event and then we're done
self.log.info("yielding event with payload %r", self.moment)
yield TriggerEvent(
{
"statement_name": self.statement_name,
"status": "success",
"moment": self.moment,
}
)
except asyncio.CancelledError:
self.log.info("asyncio.CancelledError was called")
if self.statement_name:
if self.safe_to_cancel():
self.log.warning("Cancelling query: %s", self.statement_name)
# Cancel the query (mock by writing to a file)
output_folder = (
f"/tmp/testing/on_kill_deferred/{self.dag_id}/{self.task_id}"
)
os.makedirs(output_folder, exist_ok=True)
with open(f"{output_folder}/log_trigger.txt", "a") as f:
f.write(
f"asyncio.CancelledError was called: {self.statement_name}\n"
)
yield TriggerEvent({"status": "cancelled"})
else:
self.log.warning("Triggerer probably stopped, not cancelling query")
else:
self.log.error("self.statement_name is None")
except Exception as e:
self.log.exception("Exception occurred while checking for query completion")
yield TriggerEvent({"status": "error", "message": str(e)})
# an operator that sleeps for a given number of seconds using a deferred trigger
class TestDeferredOperator(BaseOperator):
statement_name: Optional[str]
wait_seconds: int
moment: Optional[datetime.datetime]
def __init__(self, wait_seconds: int = 120, **kwargs):
super().__init__(**kwargs)
self.wait_seconds = wait_seconds
self.statement_name = None
self.moment = None
def execute(self, context: Context) -> None:
self.statement_name = (
f"airflow"
f"::{self.dag.dag_id}"
f"::{self.task_id}"
f"::{pendulum.now(timezone.utc).isoformat()}"
)
self.moment = pendulum.instance(timezone.utcnow()).add(
seconds=self.wait_seconds
)
self.defer(
trigger=DateTimeTriggerWithCancel(
dag_id=self.dag.dag_id,
task_id=self.task_id,
run_id=context["run_id"],
map_index=context["task_instance"].map_index,
job_id=context["task_instance"].job_id,
statement_name=self.statement_name,
moment=self.moment,
),
method_name="execute_complete",
)
def execute_complete(
self,
context: Context,
event: Optional[Dict[str, Any]] = None,
) -> None:
if event is None:
raise AirflowException("Trigger event is None")
if event["status"] == "error":
msg = f"context: {context}, error message: {event['message']}"
raise AirflowException(msg)
if event["status"] == "cancelled":
self.log.info(f"external job was cancelled: {self.statement_name}")
return
self.log.info("%s completed successfully.", self.task_id)
def on_kill(self):
output_folder = (
f"/tmp/testing/on_kill_deferred/{self.dag.dag_id}/{self.task_id}"
)
os.makedirs(output_folder, exist_ok=True)
with open(f"{output_folder}/log_operator.txt", "a") as f:
f.write(f"on_kill was called: {self.statement_name}\n")
with DAG(
dag_id="test_on_kill_deferred",
schedule_interval="0 0 * * *",
start_date=days_ago(1),
dagrun_timeout=timedelta(minutes=60),
) as dag:
# task 1
task_1 = TestDeferredOperator(
task_id="task_1",
wait_seconds=60,
    )
For those watching, I have done a little more testing with the workaround proposed in #36090 (comment), and at least in my tests, it works correctly when users manually change the state of a deferred task. That is, the external job is cancelled as expected. However, my preference is still finding a way to make on_kill itself run for deferred tasks.
Same here. I have created a PR for BigQuery's InsertJobOperator based on the suggestions; please review #39442. I have tested it and it works correctly. cc: @thesuperzapper Currently I have used the workaround; I will work on the general solution across the core triggerer after these PRs are done.
@sunank200 also note that using the self.task_instance attribute on the trigger only works on Airflow 2.6.0+. Upstream this should be fine, as all providers currently pin a recent enough minimum Airflow version. But for those wanting to backport this fix into Airflow 2.5.3 (and older), they will need to use something like what I have done in #36090 (comment) and add the get_task_instance helper (serializing dag_id, task_id, run_id, and map_index into the trigger).
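For reference, here is a rough sketch of how the get_task_instance helper from the example above might be simplified on Airflow 2.6.0+, where (per the TODO in that example) the triggerer populates self.task_instance, so the trigger no longer needs to serialize dag_id, task_id, run_id, and map_index itself. The latest state still has to be re-queried from the database; this is shown as a standalone method body and should be treated as an untested sketch.

from airflow.exceptions import AirflowException
from airflow.models.taskinstance import TaskInstance
from airflow.utils.session import provide_session


@provide_session
def get_task_instance(self, session=None) -> TaskInstance:
    # Identify the task instance from the TaskInstance object attached to the trigger
    # by the triggerer (Airflow 2.6.0+), instead of serialized constructor arguments.
    task_instance = (
        session.query(TaskInstance)
        .filter(
            TaskInstance.dag_id == self.task_instance.dag_id,
            TaskInstance.task_id == self.task_instance.task_id,
            TaskInstance.run_id == self.task_instance.run_id,
            TaskInstance.map_index == self.task_instance.map_index,
        )
        .one_or_none()
    )
    if task_instance is None:
        raise AirflowException("TaskInstance for this trigger was not found in the database")
    return task_instance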
@sunank200 we will also need to update all the other providers that set an … This search might help us find them:
@sunank200 @akaul my original solution proposed in #36090 (comment) had a critical part missing, which means that the solution you implemented (in the BigQuery and DataProc operators) needs to be updated, along with the unmerged update to the Databricks operator (PR #39373).

The problem was that we would not correctly cancel the external job if the task was CLEARED, rather than being set explicitly to SUCCESS or FAILED. This is because if the task is cleared, the new job will likely end up DEFERRED again before the exception handler on the triggerer runs.

I found a solution, which is to update the safe_to_cancel check so that it also compares the job_id of the latest TaskInstance with the job_id stored when the trigger was created.

For example, here is the updated safe_to_cancel method:

def safe_to_cancel(self) -> bool:
"""
Whether it is safe to cancel the external job which is being executed by this trigger.
This is to avoid the case that `asyncio.CancelledError` is called because the trigger
itself is stopped. Because in those cases, we should NOT cancel the external job.
"""
# Database query is needed to get the latest state of the task instance.
task_instance = self.get_task_instance()
# If the current job_id is different from when the trigger was created,
# then we should cancel the external job we are waiting on because the task has been
# cleared and a new job has been created.
if int(task_instance.job_id) != int(self.job_id):
return True
# If the task is not in a deferred state, then something else has happened to the task
# since we were deferred (e.g. a manual state change), so we should cancel the external
# job we are waiting on.
    return task_instance.state != TaskInstanceState.DEFERRED

So people can do more testing, I have updated my reference example in #36090 (comment).

NOTE: my reference is designed to work on all versions of Airflow with deferrable operators (e.g. 2.4.0+), but can be simplified if we require 2.6.0+ like you have in the upstream providers; see the "TODO" for more context.

@potiuk we might want to consider implementing my workaround as a default method on the BaseTrigger class. We also need to update all other operators that currently define their own cancellation handling for deferred jobs.
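To illustrate the suggestion of turning this into a shared default, here is a hedged sketch of how the check could be packaged as a reusable mixin. The SafeToCancelMixin name and its attributes are hypothetical, not an existing Airflow API; it is just the same job_id/state logic factored out so individual triggers could reuse it from their asyncio.CancelledError handlers.

from typing import Optional

from airflow.models.taskinstance import TaskInstance
from airflow.utils.session import provide_session
from airflow.utils.state import TaskInstanceState


class SafeToCancelMixin:
    """Hypothetical mixin providing the job_id/state check described above."""

    # Triggers using this mixin are assumed to store these identifiers at defer time.
    dag_id: str
    task_id: str
    run_id: str
    map_index: int
    job_id: int

    @provide_session
    def _current_task_instance(self, session=None) -> Optional[TaskInstance]:
        return (
            session.query(TaskInstance)
            .filter(
                TaskInstance.dag_id == self.dag_id,
                TaskInstance.task_id == self.task_id,
                TaskInstance.run_id == self.run_id,
                TaskInstance.map_index == self.map_index,
            )
            .one_or_none()
        )

    def safe_to_cancel(self) -> bool:
        task_instance = self._current_task_instance()
        if task_instance is None:
            # The task instance is gone entirely, so nothing is waiting on the job.
            return True
        # A different job_id means the task was cleared and re-deferred by a new run.
        if int(task_instance.job_id) != int(self.job_id):
            return True
        # Any non-deferred state means the user changed the task state manually.
        return task_instance.state != TaskInstanceState.DEFERRED

A provider trigger would then inherit from both BaseTrigger and this mixin, and call self.safe_to_cancel() inside its except asyncio.CancelledError block, as in the reference example above.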
@thesuperzapper I tried clearing the task for |
@sunank200 it's a race condition: it will only fail to cancel the external job if, after clearing, the task becomes deferred again before the exception handler on the triggerer runs. It matters most for jobs which are very quick to start (like submitting a database query to the Redshift Data API), but you should definitely fix it in your operator too.
Probably worth implementing it at the "airflow" level as you mentioned. Are you willing to submit a PR for that, @thesuperzapper? Did I understand correctly?
@potiuk I can definitely help with reviewing PRs that implement the workaround above in existing deferred operators that use on_kill.

In terms of implementing some kind of solution at the Airflow level, I'm happy to discuss possible solutions, but since we have no idea what the implementation would be yet, I can't commit to anything. I guess there are two approaches to a generic solution:
As usual with OSS, we need someone to take the lead on that. So yeah, I see you would love to discuss the solution, but it needs someone to take the lead and create a PR for it.
@sunank200 reported that he is working on a generic solution.
Apologies for the late reply, but I have been working on this PR: #40084. I will be able to address this issue after the above PR is completed.
@sunank200 do you know if we are going to make it in time for 2.10? If not, let us know so that we can remove it from the milestone.
Given that I am working on another priority task, it seems difficult to make it for 2.10. I will be working on this after #40084.
@sunank200 which approach are we planning to take on this issue?
I think we should do both, starting with the workaround, as we should be able to get that in for 2.10.1. It's very important we do something soon, because users don't expect this behavior, and it can be quite detrimental to your data pipelines if external jobs/resources keep running when you manually mark tasks as failed/success or clear them.
@kaxil @ashb I think this issue is critical to fix before Airflow 3. Without a fix, deferrable operators that trigger external jobs are not safe to use. Users expect that setting a deferred task to "success", "failed", or "cleared" will cancel the external job, which is not currently the case. We need to go through all the providers which contain deferrable operators and have on_kill implemented.
@sunank200 is occupied with other developments for Airflow 3. Would you, or anyone from the community, be up for spending some engineering effort on this, @thesuperzapper?
@msumit is also interested in this afaik. It would be great if someone in the community is willing to take this on.
TLDR: I coded the fix for the same problem in a more generic way, which should solve this for all operators. I'll try to raise a diff by the EOW.
@msumit whatever your solution is, make sure that it meets the following requirements:
Sorry for the bummer, folks: I don't think the solution I was thinking of is generic enough, but here are 2 different approaches I could think of:
In short, we need an owner here who could either find a generic way or take the lead in updating all trigger classes to handle cancellation.
@msumit we already have a generic workaround based on my code in #36090 (comment).

It's not perfect: it doesn't handle the case that the triggerer is force killed (power loss) at the exact same time that the user manually changes the state of the task, but that's a very rare situation (and also applies to non-deferred tasks). It relies on catching the asyncio.CancelledError raised when the trigger is cancelled, and then checking whether it is safe to cancel the external job.

The maintainers of the providers really need to implement this workaround for all deferrable operators that track external jobs.

Airflow 3 Proposal:

My proposal for a unified clean-up solution is to introduce a new TaskInstance state. When an operator defines a "cleanup handler", the task would enter this state when it terminates for any reason, including success, failure, and clearing. The "cleanup handler" is guaranteed to run at least once, and would be passed the following information to help it decide what to do:
The question then becomes where to run the cleanup handler code. I think it makes sense to run it in the existing triggerers, as we can reuse the existing task serialization behavior, and know that it will run at least once.
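For discussion purposes, here is a speculative sketch of what such a "cleanup handler" might look like on the operator side. Every name here (on_cleanup, CleanupContext, and the fields it carries) is a hypothetical illustration rather than a released Airflow API, and the exact information passed to the handler is precisely what the proposal would need to pin down.

from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class CleanupContext:
    """Illustrative container for the information a cleanup handler might receive."""

    final_state: str  # e.g. "success", "failed", or the state reached after clearing
    trigger_kwargs: Optional[Dict[str, Any]] = None  # serialized kwargs of the last deferral


class MyDeferrableOperator:
    """Sketch of an operator opting in to the proposed cleanup behaviour."""

    def on_cleanup(self, context: CleanupContext) -> None:
        # Under the proposal, this runs at least once on a triggerer whenever the task
        # terminates for any reason: success, failure, or clearing.
        if context.trigger_kwargs:
            statement_name = context.trigger_kwargs.get("statement_name")
            if statement_name:
                # Cancel the external job identified by statement_name here.
                pass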
Apache Airflow version
Other Airflow 2 version (please specify below)
What happened
When marking a task that uses a deferrable operator as failed or cleared, the on_kill method is not called and the remote job is never stopped.
Task log output:
Operating System
Debian GNU
Versions of Apache Airflow Providers
apache-airflow==2.6.2
apache-airflow-providers-google==10.9.0
Deployment
Official Apache Airflow Helm Chart
Deployment details
No response
Anything else
No response
Are you willing to submit PR?
Code of Conduct