
Handle SIGTERM received by prefect-agent gracefully #8270

Closed
ddelange opened this issue Jan 26, 2023 · 12 comments · Fixed by #8691
Labels
needs:design Blocked by a need for an implementation outline

Comments

@ddelange
Contributor

This issue is about giving flow runs some time to finish when prefect-agent receives a SIGTERM from a user, Docker, or Kubernetes, specifically when the agent is deployed in combination with a horizontal pod autoscaler.

The use case is a k8s user wanting to (auto)scale up their data pipelines using prefect:

  • the easiest approach would be to run the prefect-agent Deployment behind an HPA (as is currently already possible in the orion server helm chart)
    • different worker Deployments for different queues, depending on average memory usage, CPU-bound vs IO-bound workloads, etc.
  • set the terminationGracePeriodSeconds of the agent to e.g. ~2x the average duration of a flow run in its queue
    • when a Pod is scaled down in Kubernetes, k8s sends a SIGTERM to the container(s) main PID(s), then waits terminationGracePeriodSeconds before sending a SIGKILL to get rid of the Pod. this guy explains it better
  • when the main agent PID receives SIGTERM, allow the flow runs to finish gracefully
  • if somehow a flow run is not done after terminationGracePeriodSeconds, the main process gets a SIGKILL (like when k8s kills the main PID due to OOM)
    • in the case of, for instance, the AMQP protocol, an orphaned/zombie task would be marked as open again in the queue and another worker would pick it up
    • marking the run as crashed is the other alternative; the user would then configure a limited number of retries for undefined crashes (lost connection or whatever). if there was an actual Python exception, that's a different case and the user might not want to retry

FastAPI, for example, works like this: it allows ongoing requests to finish and gracefully closes the event loop when a SIGTERM is received. k8s counts on this behaviour in the scale-down mechanics of a Deployment/StatefulSet/ReplicaSet.
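For illustration, here is a minimal standalone sketch of that pattern (plain asyncio, not Prefect code, and the "work queue" is just a dummy task): on SIGTERM the loop stops picking up new work and the process exits only once in-flight work has finished.

```python
import asyncio
import signal

# Minimal sketch (plain asyncio, not Prefect code) of the graceful-shutdown
# pattern described above: on SIGTERM, stop picking up new work and exit
# only once in-flight work has finished.

async def fake_flow_run(n: int) -> None:
    await asyncio.sleep(5)  # stand-in for a real flow run
    print(f"flow run {n} finished")

async def main() -> None:
    draining = asyncio.Event()
    asyncio.get_running_loop().add_signal_handler(signal.SIGTERM, draining.set)

    in_flight: set = set()
    n = 0
    while not draining.is_set():
        # "poll the work queue" -- here we just start a dummy run each cycle
        task = asyncio.create_task(fake_flow_run(n))
        in_flight.add(task)
        task.add_done_callback(in_flight.discard)
        n += 1
        await asyncio.sleep(10)  # poll interval

    # SIGTERM received: submit nothing new, wait for running work to finish.
    await asyncio.gather(*in_flight)

if __name__ == "__main__":
    asyncio.run(main())
```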

There is also still the issue of k8s SIGKILLing subprocesses to keep the container from going OOM, which I think is relevant for this particular issue (and for the crash detection) as well:

On another note (agent pod, not orion server): we just watched an agent's subprocess (a flow run) get OOM-SIGKILLed by Kubernetes. Interestingly, k8s only kills the subprocess (ref), while the main agent process (configured to sequentially handle only 1 flow run at a time) happily carries on and picks up the next task.

This probably doesn't warrant its own issue, but I'm making sure you're aware of this k8s behaviour.

<frozen runpy>:128: RuntimeWarning: 'prefect.engine' found in sys.modules after import of package 'prefect', but prior to execution of 'prefect.engine'; this may result in unpredictable behaviour
19:44:53.133 | INFO | Flow run 'phi2-kelis-m' - Created task run 'parse_crawl-259eac6e-0' for task 'parse_crawl'
19:44:53.136 | INFO | Flow run 'phi2-kelis-m' - Executing 'parse_crawl-259eac6e-0' immediately...
19:45:15.900 | ERROR | prefect.infrastructure.process - Process 'process-0-2-0rc4' exited with status code: -9; This indicates that the process exited due to a SIGKILL signal. Typically, this is caused by high memory usage causing the operating system to terminate the process.
19:45:15.900 | INFO | prefect.agent - Reporting flow run 'fd0e2945-6994-4eee-87c4-57ed659f2bf4' as crashed due to non-zero status code.

#7948 (comment)

Originally posted by @ddelange in #7800 (comment)
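As an aside on the "-9" in the log above: a child killed by a signal surfaces to its parent as a negative return code equal to the signal number, which is what the crash report keys off. A tiny demonstration with the plain standard library (nothing Prefect-specific):

```python
import signal
import subprocess

# A child killed by a signal surfaces to its parent as a negative return
# code equal to the signal number -- hence the "-9" in the agent log above.
child = subprocess.Popen(["sleep", "60"])
child.send_signal(signal.SIGKILL)  # stand-in for the kernel OOM killer
child.wait()
assert child.returncode == -signal.SIGKILL  # i.e. -9
print(f"child exited with {child.returncode} (killed by SIGKILL)")
```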

@ddelange
Contributor Author

@madkinsz is that sufficient? Any additional info needed from my side? Any code links you could share that might be relevant to this issue?

@ddelange
Contributor Author

Small clarification: when a k8s Pod running FastAPI goes into the Terminating state (SIGTERM sent to the process), k8s stops routing requests to the Pod, and FastAPI finishes handling the ongoing requests. At work, the Pod is usually gone after ~30 seconds; we left terminationGracePeriodSeconds at the default of 60. So in practice, k8s never needs to send a SIGKILL unless something is wrong.

I think a similar mechanic would be beneficial here: upon receiving SIGTERM, stop consuming new flow runs from the queue, but do not forward the SIGTERM to the running subprocesses, so they can finish.
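A rough standard-library sketch of that mechanic (not how the agent currently spawns runs): the child is started in its own session so group-directed signals (e.g. Ctrl+C in a terminal) don't reach it, and the parent's SIGTERM handler only stops the parent from taking new work.

```python
import signal
import subprocess

# Rough sketch of the mechanic described above (not how the agent
# currently spawns runs): the child runs in its own session, and the
# parent's SIGTERM handler deliberately leaves it alone.
child = subprocess.Popen(
    ["python", "-c", "import time; time.sleep(120)"],  # stand-in flow run
    start_new_session=True,
)

def handle_sigterm(signum, frame):
    # Deliberately do NOT signal `child`: the in-flight run should finish
    # within terminationGracePeriodSeconds.
    print("SIGTERM received: draining, no new flow runs will be consumed")

signal.signal(signal.SIGTERM, handle_sigterm)
child.wait()  # the agent would keep serving its existing runs here
```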

@zanieb
Contributor

zanieb commented Jan 31, 2023

I agree that SIGTERM should stop the agent from checking for more work.

We probably want to forward it to any child processes though; they need to know that they will shut down and have an opportunity to exit gracefully. Most users' flows do not run in less than 60s. We could consider sending it on a configurable delay, i.e. after 20s the agent would forward the signal, but I think that should be considered after the initial change.
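If it helps the design discussion, here is a sketch of the "forward on a configurable delay" idea; the 20s grace window and the names here are illustrative, not Prefect settings.

```python
import asyncio
import signal

# Illustrative sketch of "forward SIGTERM to the child after a delay";
# the 20-second grace window is made up, not a Prefect setting.
FORWARD_DELAY_SECONDS = 20

async def main() -> None:
    child = await asyncio.create_subprocess_exec(
        "python", "-c", "import time; time.sleep(120)"  # stand-in flow run
    )

    async def forward_sigterm_after_delay() -> None:
        await asyncio.sleep(FORWARD_DELAY_SECONDS)
        if child.returncode is None:  # still running after the grace window
            child.send_signal(signal.SIGTERM)

    def on_sigterm() -> None:
        # Stop polling for new work here, then schedule the delayed forward.
        asyncio.get_running_loop().create_task(forward_sigterm_after_delay())

    asyncio.get_running_loop().add_signal_handler(signal.SIGTERM, on_sigterm)
    await child.wait()

asyncio.run(main())
```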

@zanieb added the needs:design and priority:medium labels Jan 31, 2023
@ddelange
Contributor Author

ddelange commented Feb 2, 2023

@madkinsz are there additional people you could tag here, especially regarding the design? I'm also keen to contribute once there is consensus concerning the approach :)

@zanieb
Contributor

zanieb commented Feb 2, 2023

cc @cicdw / @desertaxle / @anticorrelator: worth thinking ahead about how this fits with workers and cancellation.

@ddelange
Contributor Author

ddelange commented Feb 9, 2023

I agree that SIGTERM should stop the agent from checking for more work.

Is this maybe already the case currently?

We probably want to forward it to any children processes though, they need to know that they will shutdown and have an opportunity to exit gracefully.

Could you give me a pointer on how/where you would like this to be implemented? I traced it back to https://github.com/PrefectHQ/prefect/blob/2.7.12/src/prefect/agent.py#L476, but I'm not sure what anyio's behaviour is when it gets the SIGTERM while this line is being awaited (i.e. while the subprocess is running).
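In case it helps: anyio exposes anyio.open_signal_receiver, which yields received signal numbers as an async iterator. A sketch of how a SIGTERM watcher could sit next to the agent's main loop (the polling loop below is only a placeholder, not agent code):

```python
import signal

import anyio

# Sketch of a SIGTERM watcher using anyio.open_signal_receiver; the
# polling loop below is a placeholder for the agent's real loop.

async def main() -> None:
    async with anyio.create_task_group() as tg:

        async def watch_for_sigterm() -> None:
            with anyio.open_signal_receiver(signal.SIGTERM) as signals:
                async for signum in signals:
                    print(f"received signal {signum}: stop submitting new flow runs")
                    tg.cancel_scope.cancel()  # or flip a 'draining' flag instead
                    return

        tg.start_soon(watch_for_sigterm)

        while True:  # placeholder for the agent's polling loop
            await anyio.sleep(5)

anyio.run(main)
```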

@ddelange
Contributor Author

ddelange commented Feb 10, 2023

  • we could set up an anyio signal handler for SIGTERM (ref) in the top-level prefect agent command (ref)
    • it would need to get all FlowRun objects that this OrionAgent is currently responsible for
      • it could call OrionAgent.cancel_run to mark them in the db and 'forward' the signal (ref)
        • which uses Process.kill, which has a grace period (ref)

If this sounds like a plan for a first iteration, could you give me a hint on how to create such an OrionAgent.get_current_runs(self) -> List[FlowRun] that uses self.client to get all the FlowRuns (PIDs in the case of Process infrastructure) that need killing on this machine?
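Something like the following purely hypothetical bookkeeping is what I have in mind (none of these names exist in Prefect): record each infrastructure handle at submission time and drop it when the run exits, so that on SIGTERM the agent knows which runs still need cancelling.

```python
# Purely hypothetical sketch (none of these names exist in Prefect): the
# agent-side bookkeeping needed to know which runs it currently owns.
from typing import Any, Dict, List, Tuple
from uuid import UUID


class CurrentRunTracker:
    def __init__(self) -> None:
        # flow_run_id -> infrastructure handle (e.g. a Process with a pid)
        self._current: Dict[UUID, Any] = {}

    def record_submission(self, flow_run_id: UUID, infrastructure: Any) -> None:
        self._current[flow_run_id] = infrastructure

    def record_exit(self, flow_run_id: UUID) -> None:
        self._current.pop(flow_run_id, None)

    def get_current_runs(self) -> List[Tuple[UUID, Any]]:
        """Everything that would need cancelling on SIGTERM."""
        return list(self._current.items())
```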

@ddelange
Contributor Author

Hi 👋

We found out that the agent actually shuts down gracefully when receiving a SIGINT.

So I think this ticket can be closed by simply translating SIGTERM into SIGINT!

I've opened ddelange#17

When the agent receives SIGINT, it stops dequeuing new FlowRuns and keeps running until the subprocesses finish, which is exactly what's desired. When the k8s grace period is not enough and k8s sends a SIGKILL, the server will lose the heartbeat and mark the FlowRuns as crashed; these can then be retried per user config.
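The mechanism boils down to something like the following (a simplified sketch, not necessarily the exact code in the linked PR):

```python
import os
import signal

# Simplified sketch of the SIGTERM -> SIGINT translation described above
# (not necessarily the exact code in the linked PR).
def sigterm_to_sigint(signum, frame):
    # Re-deliver the signal to ourselves as SIGINT, which the agent
    # already handles by draining gracefully.
    os.kill(os.getpid(), signal.SIGINT)

signal.signal(signal.SIGTERM, sigterm_to_sigint)
```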

In the logs below, you can observe this behaviour (the agent stops gracefully once the FlowRun completes, almost two minutes after receiving SIGINT):

Agent started! Looking for work from queue(s) that start with the prefix: ['']...
13:52:31.475 | INFO    | prefect.agent - Matched new work queues: default
13:52:48.393 | INFO    | prefect.agent - Submitting flow run 'e02634f3-8532-4b0c-9100-af7dc1720411'
13:52:48.477 | INFO    | prefect.infrastructure.process - Opening process 'futuristic-poodle'...
13:52:48.494 | INFO    | prefect.agent - Completed submission of flow run 'e02634f3-8532-4b0c-9100-af7dc1720411'
13:52:50.600 | INFO    | Flow run 'futuristic-poodle' - Downloading flow code from storage at '/home/coder/code/prefect'
13:52:51.046 | INFO    | Flow run 'futuristic-poodle' - flow started (sleep for 120 seconds)
13:53:00.635 | INFO    | prefect.agent - 

Sending SIGINT to the Prefect agent (PID 49819)...
13:54:51.115 | INFO    | Flow run 'futuristic-poodle' - flow completed
13:54:51.194 | INFO    | Flow run 'futuristic-poodle' - Finished in state Completed()
13:54:51.697 | INFO    | prefect.infrastructure.process - Process 'futuristic-poodle' exited cleanly.
Agent stopped!

@ddelange
Contributor Author

I'll re-open that PR with PrefectHQ as the base org once #7948 merges. Any eyes on it until then would be appreciated!

cc @madkinsz @bunchesofdonald

@github-actions
Contributor

This issue is stale because it has been open 30 days with no activity. To keep this issue open remove stale label or comment.

@github-actions
Contributor

This issue was closed because it has been stale for 14 days with no activity. If this issue is important or you have more to add feel free to re-open it.

@github-actions github-actions bot closed this as not planned Apr 13, 2023
@ddelange
Contributor Author

@madkinsz can you reopen?
