
Handle SIGTERM received by prefect-agent gracefully #8270

Closed
ddelange opened this issue Jan 26, 2023 · 12 comments · Fixed by #8691
Labels
needs:design Blocked by a need for an implementation outline

Comments

@ddelange
Contributor

This issue is about giving flow runs some time to finish when prefect-agent receives a SIGTERM from a user, Docker, or Kubernetes, specifically when the agent is deployed in combination with a horizontal pod autoscaler.

The use case is a k8s user wanting to (auto)scale up their data pipelines using prefect:

  • the easiest approach would be to run the prefect-agent Deployment behind an HPA (as is currently already possible in the orion server helm chart)
    • different worker Deployments for different queues, depending on average memory usage, CPU-bound vs IO-bound workloads, etc.
  • set the terminationGracePeriodSeconds of the agent to e.g. ~2x the average duration of a flow run in its queue
    • when a Pod is scaled down in Kubernetes, k8s sends a SIGTERM to the container(s) main PID(s), then waits terminationGracePeriodSeconds before sending a SIGKILL to get rid of the Pod. this guy explains it better
  • when the main agent PID receives SIGTERM, allow the flow runs to finish gracefully
  • if somehow a flow run is not done after terminationGracePeriodSeconds, the main process gets a SIGKILL (like when k8s kills the main PID due to OOM)
    • in the case of, for instance, the AMQP protocol, an orphaned/zombie task would be marked as open again in the queue and another worker would pick it up
    • marking the run as crashed is the other alternative; the user would then configure a limited number of retries for undefined crashes (lost connection or whatever). if there was an actual Python exception, that's a different case and the user might not want to retry

FastAPI, for example, works like this: it allows ongoing requests to finish and gracefully closes the event loop when a SIGTERM is received. k8s counts on this behaviour in the scale-down mechanics of a Deployment/StatefulSet/ReplicaSet.
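For illustration, here is a minimal standalone sketch of that pattern (plain asyncio, not Prefect code, and the "work queue" is just a dummy task): on SIGTERM the loop stops picking up new work and the process exits only once in-flight work has finished.

```python
import asyncio
import signal

# Minimal sketch (plain asyncio, not Prefect code) of the graceful-shutdown
# pattern described above: on SIGTERM, stop picking up new work and exit
# only once in-flight work has finished.

async def fake_flow_run(n: int) -> None:
    await asyncio.sleep(5)  # stand-in for a real flow run
    print(f"flow run {n} finished")

async def main() -> None:
    draining = asyncio.Event()
    asyncio.get_running_loop().add_signal_handler(signal.SIGTERM, draining.set)

    in_flight: set = set()
    n = 0
    while not draining.is_set():
        # "poll the work queue" -- here we just start a dummy run each cycle
        task = asyncio.create_task(fake_flow_run(n))
        in_flight.add(task)
        task.add_done_callback(in_flight.discard)
        n += 1
        await asyncio.sleep(10)  # poll interval

    # SIGTERM received: submit nothing new, wait for running work to finish.
    await asyncio.gather(*in_flight)

if __name__ == "__main__":
    asyncio.run(main())
```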

There is also still the issue of k8s SIGKILLing subprocesses to keep the container from going OOM, which I think is relevant for this particular issue (and for the crash detection) as well:

On another note (agent pod, not orion server): we just watched an agent's subprocess (a flow run) get OOM-SIGKILLed by Kubernetes. Interestingly, k8s only kills the subprocess (ref), while the main agent process (configured to sequentially handle only 1 flow run at a time) happily carries on and picks up the next task.

This probably doesn't warrant its own issue, but I'm making sure you're aware of this k8s behaviour.

<frozen runpy>:128: RuntimeWarning: 'prefect.engine' found in sys.modules after import of package 'prefect', but prior to execution of 'prefect.engine'; this may result in unpredictable behaviour
19:44:53.133 | INFO | Flow run 'phi2-kelis-m' - Created task run 'parse_crawl-259eac6e-0' for task 'parse_crawl'
19:44:53.136 | INFO | Flow run 'phi2-kelis-m' - Executing 'parse_crawl-259eac6e-0' immediately...
19:45:15.900 | ERROR | prefect.infrastructure.process - Process 'process-0-2-0rc4' exited with status code: -9; This indicates that the process exited due to a SIGKILL signal. Typically, this is caused by high memory usage causing the operating system to terminate the process.
19:45:15.900 | INFO | prefect.agent - Reporting flow run 'fd0e2945-6994-4eee-87c4-57ed659f2bf4' as crashed due to non-zero status code.

#7948 (comment)

Originally posted by @ddelange in #7800 (comment)
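As an aside on the "-9" in the log above: a child killed by a signal surfaces to its parent as a negative return code equal to the signal number, which is what the crash report keys off. A tiny demonstration with the plain standard library (nothing Prefect-specific):

```python
import signal
import subprocess

# A child killed by a signal surfaces to its parent as a negative return
# code equal to the signal number -- hence the "-9" in the agent log above.
child = subprocess.Popen(["sleep", "60"])
child.send_signal(signal.SIGKILL)  # stand-in for the kernel OOM killer
child.wait()
assert child.returncode == -signal.SIGKILL  # i.e. -9
print(f"child exited with {child.returncode} (killed by SIGKILL)")
```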

@ddelange
Contributor Author

@madkinsz is that sufficient? Any additional info needed from my side? Any code links you could share that might be relevant to this issue?

@ddelange
Contributor Author

Small clarification: when a k8s Pod running FastAPI goes into the Terminating state (SIGTERM sent to the process), k8s stops routing requests to the Pod, and FastAPI finishes handling the ongoing requests. At work, the Pod is usually gone after ~30 seconds; we left terminationGracePeriodSeconds at the default of 60. So in practice, k8s never needs to send a SIGKILL unless something is wrong.

I think a similar mechanic would be beneficial here: upon receiving SIGTERM, stop consuming new flow runs from the queue, but do not forward the SIGTERM to the running subprocesses, so they can finish.
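A rough standard-library sketch of that mechanic (not how the agent currently spawns runs): the child is started in its own session so group-directed signals (e.g. Ctrl+C in a terminal) don't reach it, and the parent's SIGTERM handler only stops the parent from taking new work.

```python
import signal
import subprocess

# Rough sketch of the mechanic described above (not how the agent
# currently spawns runs): the child runs in its own session, and the
# parent's SIGTERM handler deliberately leaves it alone.
child = subprocess.Popen(
    ["python", "-c", "import time; time.sleep(120)"],  # stand-in flow run
    start_new_session=True,
)

def handle_sigterm(signum, frame):
    # Deliberately do NOT signal `child`: the in-flight run should finish
    # within terminationGracePeriodSeconds.
    print("SIGTERM received: draining, no new flow runs will be consumed")

signal.signal(signal.SIGTERM, handle_sigterm)
child.wait()  # the agent would keep serving its existing runs here
```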

@zanieb
Contributor

zanieb commented Jan 31, 2023

I agree that SIGTERM should stop the agent from checking for more work.

We probably want to forward it to any child processes though; they need to know that they will shut down and have an opportunity to exit gracefully. Most users' flows do not run in less than 60s. We could consider sending it on a configurable delay, i.e. after 20s the agent would forward the signal, but I think that should be considered after the initial change.
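If it helps the design discussion, here is a sketch of the "forward on a configurable delay" idea; the 20s grace window and the names here are illustrative, not Prefect settings.

```python
import asyncio
import signal

# Illustrative sketch of "forward SIGTERM to the child after a delay";
# the 20-second grace window is made up, not a Prefect setting.
FORWARD_DELAY_SECONDS = 20

async def main() -> None:
    child = await asyncio.create_subprocess_exec(
        "python", "-c", "import time; time.sleep(120)"  # stand-in flow run
    )

    async def forward_sigterm_after_delay() -> None:
        await asyncio.sleep(FORWARD_DELAY_SECONDS)
        if child.returncode is None:  # still running after the grace window
            child.send_signal(signal.SIGTERM)

    def on_sigterm() -> None:
        # Stop polling for new work here, then schedule the delayed forward.
        asyncio.get_running_loop().create_task(forward_sigterm_after_delay())

    asyncio.get_running_loop().add_signal_handler(signal.SIGTERM, on_sigterm)
    await child.wait()

asyncio.run(main())
```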

@zanieb added the needs:design and priority:medium labels Jan 31, 2023
@ddelange
Contributor Author

ddelange commented Feb 2, 2023

@madkinsz are there additional people you could tag here, especially regarding the design? I'm also keen to contribute once there is consensus concerning the approach :)

@zanieb
Contributor

zanieb commented Feb 2, 2023

cc @cicdw / @desertaxle / @anticorrelator: worth thinking ahead about how this fits with workers and cancellation.

@ddelange
Contributor Author

ddelange commented Feb 9, 2023

I agree that SIGTERM should stop the agent from checking for more work.

Is this maybe already the case currently?

We probably want to forward it to any children processes though, they need to know that they will shutdown and have an opportunity to exit gracefully.

Could you give me a pointer on how/where you would like this to be implemented? I traced it back to https://github.com/PrefectHQ/prefect/blob/2.7.12/src/prefect/agent.py#L476, but I'm not sure what anyio's behaviour is when it gets the SIGTERM while this line is being awaited (i.e. while the subprocess is running).
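In case it helps: anyio exposes anyio.open_signal_receiver, which yields received signal numbers as an async iterator. A sketch of how a SIGTERM watcher could sit next to the agent's main loop (the polling loop below is only a placeholder, not agent code):

```python
import signal

import anyio

# Sketch of a SIGTERM watcher using anyio.open_signal_receiver; the
# polling loop below is a placeholder for the agent's real loop.

async def main() -> None:
    async with anyio.create_task_group() as tg:

        async def watch_for_sigterm() -> None:
            with anyio.open_signal_receiver(signal.SIGTERM) as signals:
                async for signum in signals:
                    print(f"received signal {signum}: stop submitting new flow runs")
                    tg.cancel_scope.cancel()  # or flip a 'draining' flag instead
                    return

        tg.start_soon(watch_for_sigterm)

        while True:  # placeholder for the agent's polling loop
            await anyio.sleep(5)

anyio.run(main)
```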

@ddelange
Contributor Author

ddelange commented Feb 10, 2023

  • we could set up an anyio signal handler for SIGTERM (ref) in the top-level prefect agent command (ref)
    • it would need to get all FlowRun objects that this OrionAgent is currently responsible for
      • it could call OrionAgent.cancel_run to mark them in the db and 'forward' the signal (ref)
        • which uses Process.kill, which has a grace period (ref)

If this sounds like a plan for a first iteration, could you give me a hint on how to create such an OrionAgent.get_current_runs(self) -> List[FlowRun] that uses self.client to get all the FlowRuns (PIDs in the case of Process infrastructure) that need killing on this machine?
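Something like the following purely hypothetical bookkeeping is what I have in mind (none of these names exist in Prefect): record each infrastructure handle at submission time and drop it when the run exits, so that on SIGTERM the agent knows which runs still need cancelling.

```python
# Purely hypothetical sketch (none of these names exist in Prefect): the
# agent-side bookkeeping needed to know which runs it currently owns.
from typing import Any, Dict, List, Tuple
from uuid import UUID


class CurrentRunTracker:
    def __init__(self) -> None:
        # flow_run_id -> infrastructure handle (e.g. a Process with a pid)
        self._current: Dict[UUID, Any] = {}

    def record_submission(self, flow_run_id: UUID, infrastructure: Any) -> None:
        self._current[flow_run_id] = infrastructure

    def record_exit(self, flow_run_id: UUID) -> None:
        self._current.pop(flow_run_id, None)

    def get_current_runs(self) -> List[Tuple[UUID, Any]]:
        """Everything that would need cancelling on SIGTERM."""
        return list(self._current.items())
```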

@ddelange
Contributor Author

Hi 👋

We found out that the agent actually shuts down gracefully when receiving a SIGINT.

So I think this ticket can be closed by simply translating SIGTERM into SIGINT!

I've opened ddelange#17

When the agent receives SIGINT, it stops dequeuing new FlowRuns and keeps running until the subprocesses finish, which is exactly what's desired. When the k8s grace period is not enough and k8s sends a SIGKILL, the server will lose the heartbeat and mark the FlowRuns as crashed; these can then be retried per user config.
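The mechanism boils down to something like the following (a simplified sketch, not necessarily the exact code in the linked PR):

```python
import os
import signal

# Simplified sketch of the SIGTERM -> SIGINT translation described above
# (not necessarily the exact code in the linked PR).
def sigterm_to_sigint(signum, frame):
    # Re-deliver the signal to ourselves as SIGINT, which the agent
    # already handles by draining gracefully.
    os.kill(os.getpid(), signal.SIGINT)

signal.signal(signal.SIGTERM, sigterm_to_sigint)
```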

In the logs below, you can observe this behaviour (the agent stops gracefully once the FlowRun completes, almost two minutes after receiving SIGINT):

Agent started! Looking for work from queue(s) that start with the prefix: ['']...
13:52:31.475 | INFO    | prefect.agent - Matched new work queues: default
13:52:48.393 | INFO    | prefect.agent - Submitting flow run 'e02634f3-8532-4b0c-9100-af7dc1720411'
13:52:48.477 | INFO    | prefect.infrastructure.process - Opening process 'futuristic-poodle'...
13:52:48.494 | INFO    | prefect.agent - Completed submission of flow run 'e02634f3-8532-4b0c-9100-af7dc1720411'
13:52:50.600 | INFO    | Flow run 'futuristic-poodle' - Downloading flow code from storage at '/home/coder/code/prefect'
13:52:51.046 | INFO    | Flow run 'futuristic-poodle' - flow started (sleep for 120 seconds)
13:53:00.635 | INFO    | prefect.agent - 

Sending SIGINT to the Prefect agent (PID 49819)...
13:54:51.115 | INFO    | Flow run 'futuristic-poodle' - flow completed
13:54:51.194 | INFO    | Flow run 'futuristic-poodle' - Finished in state Completed()
13:54:51.697 | INFO    | prefect.infrastructure.process - Process 'futuristic-poodle' exited cleanly.
Agent stopped!

@ddelange
Contributor Author

I'll re-open that PR with PrefectHQ as the base org once #7948 merges. Any eyes on it until then would be appreciated!

cc @madkinsz @bunchesofdonald

@github-actions
Contributor

This issue is stale because it has been open 30 days with no activity. To keep this issue open remove stale label or comment.

@github-actions
Contributor

This issue was closed because it has been stale for 14 days with no activity. If this issue is important or you have more to add feel free to re-open it.

@github-actions github-actions bot closed this as not planned Apr 13, 2023
@ddelange
Contributor Author

@madkinsz can you reopen?
