Handle SIGTERM received by prefect-agent gracefully #8270
@madkinsz is that sufficient? Any additional info needed from my side? Any code links you could share that might be relevant to this issue?
Small clarification: when a k8s Pod running FastAPI goes into state Terminating (SIGTERM sent to the process), k8s stops routing requests to the Pod, and FastAPI finishes handling ongoing requests. At work, the Pod is usually gone after about 30 seconds. I think a similar mechanic would be beneficial here: no more new flow runs consumed from the queue upon receiving SIGTERM, but without forwarding the SIGTERM to the running subprocesses, so they can finish.
I agree that SIGTERM should stop the agent from checking for more work. We probably want to forward it to any child processes though; they need to know that they will shut down and have an opportunity to exit gracefully. Most users' flows do not run in less than 60s. We could consider sending it on a configurable delay, i.e. after 20s the agent forwards the signal, but I think that should be considered after the initial change.
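A rough sketch of that combined behaviour in plain Python (the polling loop and child-process bookkeeping are made up for illustration; this is not Prefect's actual agent code):

```python
import os
import signal
import time

shutting_down = False
child_pids: list[int] = []  # PIDs of flow-run subprocesses (illustrative)


def handle_sigterm(signum, frame):
    """Stop polling for new work and forward SIGTERM to running flow runs."""
    global shutting_down
    shutting_down = True
    for pid in child_pids:
        try:
            os.kill(pid, signal.SIGTERM)  # give the flow run a chance to exit cleanly
        except ProcessLookupError:
            pass  # child already exited


signal.signal(signal.SIGTERM, handle_sigterm)

while not shutting_down:
    # stand-in for "poll the work queue and launch flow-run subprocesses"
    time.sleep(1)

for pid in child_pids:
    # stand-in for "wait until the remaining flow runs have finished"
    try:
        os.waitpid(pid, 0)
    except ChildProcessError:
        pass
```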
@madkinsz are there additional people you could tag here, especially regarding the design? I'm also keen to contribute once there is consensus concerning the approach :)
cc @cicdw / @desertaxle / @anticorrelator: worth thinking ahead about how this fits with workers and cancellation.
Is this maybe already the case currently?
Could you give me a pointer on how/where you would like this to be implemented? I traced it back to https://github.com/PrefectHQ/prefect/blob/2.7.12/src/prefect/agent.py#L476, but I'm not sure what anyio's behaviour is when it gets the SIGTERM (and this line is being awaited while the subprocess is running).
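For reference, anyio does expose a way to catch SIGTERM explicitly inside the event loop via `anyio.open_signal_receiver`. A minimal sketch of how that could hook into a polling loop (the agent-specific parts are illustrative, not Prefect's actual code):

```python
import signal

import anyio


async def watch_for_sigterm(stop: anyio.Event) -> None:
    # open_signal_receiver yields signal numbers as they arrive;
    # it has to be used from the main thread of the main task.
    with anyio.open_signal_receiver(signal.SIGTERM) as signals:
        async for _signum in signals:
            stop.set()  # tell the polling loop to stop picking up new flow runs
            return


async def main() -> None:
    stop = anyio.Event()
    async with anyio.create_task_group() as tg:
        tg.start_soon(watch_for_sigterm, stop)
        while not stop.is_set():
            await anyio.sleep(1)  # stand-in for "fetch and submit ready flow runs"
        # stand-in for "wait for running flow-run subprocesses to finish"


anyio.run(main)
```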
If this sounds like a plan for a first iteration, could you give me a hint how to create such a …
Hi 👋 We found out that the agent actually shuts down gracefully when receiving a SIGINT. So I think this ticket can be closed by simply forwarding SIGTERM to SIGINT! I've opened ddelange#17. When the agent receives SIGINT, it stops dequeueing new FlowRuns and runs until the subprocesses finish, which is exactly what's desired. When the k8s grace period is not enough and k8s sends a SIGKILL, the server will lose the heartbeat and mark the FlowRuns as crashed; these can then be retried per user config. In the logs below, you can observe this behaviour (the agent stops gracefully once the FlowRun completes, almost two minutes after receiving SIGINT):
I would re-open that PR with PrefectHQ as base org once #7948 merges. Any eyes until then would be appreciated!
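The core of the idea, sketched (not necessarily the exact diff in that PR):

```python
import signal

# Treat SIGTERM like SIGINT: Python's default SIGINT handler raises
# KeyboardInterrupt, which the agent already handles by shutting down gracefully
# (stop dequeueing new flow runs, let running subprocesses finish).
signal.signal(signal.SIGTERM, signal.default_int_handler)
```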
This issue is stale because it has been open 30 days with no activity. To keep this issue open remove stale label or comment. |
This issue was closed because it has been stale for 14 days with no activity. If this issue is important or you have more to add feel free to re-open it. |
@madkinsz can you reopen? |
This issue is about allowing flow runs some time to finish when prefect-agent receives a SIGTERM signal from user/docker/kubernetes, specifically when deployed in combination with a horizontal pod autoscaler.

The use case of a k8s user wanting to (auto)scale up their data pipelines using prefect:

- a `prefect-agent` Deployment behind a HPA (like is currently already possible in the orion server helm chart)
- `terminationGracePeriodSeconds` of the agent set to e.g. ~2x the average duration of a flow run in its queue
- on scale-down, k8s sends a SIGTERM and waits `terminationGracePeriodSeconds` to send a SIGKILL to get rid of the Pod. this guy explains it better
- after `terminationGracePeriodSeconds`, the main gets a SIGKILL (like when k8s kills the main PID due to OOM).

E.g. FastAPI works like this: it will allow ongoing requests to finish and will gracefully close the loop when a SIGTERM is received. k8s counts on this behaviour in the scale-down mechanic of the Deployment/StatefulSet/ReplicaSet.
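The kubelet's side of that termination sequence, mimicked in a toy Python snippet (the command and the 30s grace period are made-up placeholders):

```python
import signal
import subprocess

# toy stand-in for what the kubelet does when terminating a Pod
proc = subprocess.Popen(["python", "my_flow_run.py"])  # placeholder command
proc.send_signal(signal.SIGTERM)   # ask for a graceful shutdown
try:
    proc.wait(timeout=30)          # terminationGracePeriodSeconds analogue
except subprocess.TimeoutExpired:
    proc.kill()                    # SIGKILL: the process gets no chance to clean up
```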
There is also still the issue of k8s SIGKILLing subprocesses to keep the container from going OOM, which I think is relevant for this particular issue (and the CRASH detection) as well: #7948 (comment)
Originally posted by @ddelange in #7800 (comment)