
Flow run that receives SIGTERM does not go through CRASH detection #7800

Closed
3 tasks done
zanieb opened this issue Dec 7, 2022 · 5 comments · Fixed by #8126
Labels: enhancement (An improvement of an existing feature)

Comments

@zanieb
Contributor

zanieb commented Dec 7, 2022

First check

  • I added a descriptive title to this issue.
  • I used the GitHub search to find a similar request and didn't find it.
  • I searched the Prefect documentation for this feature.

Prefect Version

2.x

Describe the current behavior

Given a flow

import os
import time
from prefect import flow
from prefect.context import get_run_context

@flow
def foo():
    print(get_run_context().flow_run.id)
    print(os.getpid())
    time.sleep(1000)

A kill signal

kill -TERM 36467 

Results in immediate exit of the run

10:35:02.131 | INFO    | prefect.engine - Created flow run 'brilliant-chupacabra' for flow 'foo'
7472a4eb-2162-4933-b833-8609d9081a10
36467
zsh: terminated  python example.py
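The root of the problem is that the default SIGTERM disposition terminates the Python process before any of the engine's teardown (and therefore crash reporting) can run. Below is a minimal, Prefect-agnostic sketch of the general pattern; TerminationSignal and the print calls are illustrative placeholders, not Prefect APIs (the actual change landed in #8126):

import os
import signal
import sys
import time

class TerminationSignal(Exception):
    """Illustrative only: lets SIGTERM surface as a catchable exception."""

def handle_sigterm(signum, frame):
    # Convert the signal into an exception so except/finally blocks
    # (and therefore crash reporting) can run before the process exits.
    raise TerminationSignal(f"received signal {signum}")

signal.signal(signal.SIGTERM, handle_sigterm)

print(os.getpid())
try:
    time.sleep(1000)  # stands in for the body of foo()
except TerminationSignal:
    print("reporting the run as CRASHED")  # placeholder for crash handling
    sys.exit(1)

With a handler like this in place, the run exits through the same code path as any other failure instead of being silently terminated.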

Describe the proposed behavior

This flow should go through our crash handling.

To ensure cancelled flows are not reported as crashed, we should add a rule for transitions to CRASHED that checks the parent state for CANCELLED and, if found, sets CANCELLED instead.
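In plain Python (not the actual Orion orchestration-rule API; state names mirror Prefect's state types), the proposed rule amounts to something like:

from enum import Enum

class StateType(Enum):
    CANCELLED = "CANCELLED"
    CRASHED = "CRASHED"
    # other state types elided

def resolve_terminal_state(parent_state: StateType, proposed_state: StateType) -> StateType:
    # If the parent/current state is already CANCELLED, keep CANCELLED
    # rather than letting crash detection overwrite it with CRASHED.
    if proposed_state is StateType.CRASHED and parent_state is StateType.CANCELLED:
        return StateType.CANCELLED
    return proposed_state

In the real engine this would live in a server-side orchestration rule; the function above only illustrates the precedence between the two terminal states.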

Example Use

Additional context

No response

@zanieb added the enhancement (An improvement of an existing feature) and status:triage labels, and removed the status:triage label, on Dec 7, 2022
@zanieb
Contributor Author

zanieb commented Dec 7, 2022

Should help with #7732

@ddelange
Contributor

Hi @madkinsz 👋

I'll leave some thoughts here. Maybe it helps, maybe I can help? Either way, here we go:

The use case is a k8s user wanting to (auto)scale up their data pipelines using Prefect:

  • The easiest approach would be to run the prefect-agent Deployment behind an HPA (as is already possible in the Orion server Helm chart).
    • Different worker Deployments for different queues, depending on average memory usage, CPU-bound vs. IO-bound workloads, etc.
  • Set the terminationGracePeriodSeconds of the agent to e.g. ~2x the average duration of a flow run in its queue.
    • When a Pod is scaled down in Kubernetes, k8s sends a SIGTERM to the container(s)' main PID(s), then waits terminationGracePeriodSeconds before sending a SIGKILL to get rid of the Pod.
  • When the main agent PID receives SIGTERM, allow in-flight flow runs to finish gracefully (see the sketch below).
  • If a flow run is somehow still not done after terminationGracePeriodSeconds, the main PID gets a SIGKILL (as when k8s kills the main PID due to OOM).
    • With the AMQP protocol, for instance, an orphaned/zombie task would be marked as open again in the queue and another worker would pick it up.
    • Marking the run as crashed is the other alternative; the user would then configure a limited number of retries for undefined crashes (lost connection, etc.). If there was an actual Python exception, that's a different case and the user might not want to retry.

FastAPI, for example, works like this: it allows ongoing requests to finish and gracefully closes the event loop when a SIGTERM is received. k8s counts on this behaviour in the scale-down mechanics of a Deployment/StatefulSet/ReplicaSet.
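Concretely, the graceful-drain behaviour described above might look roughly like the sketch below; get_next_flow_run and run_flow are hypothetical stand-ins for the agent's internals, not Prefect APIs:

import signal
import threading

stop_requested = threading.Event()

def handle_sigterm(signum, frame):
    # Stop picking up new flow runs; runs already in flight may finish.
    stop_requested.set()

signal.signal(signal.SIGTERM, handle_sigterm)

def get_next_flow_run():
    """Hypothetical stand-in: poll the work queue for the next scheduled run."""
    return None

def run_flow(flow_run):
    """Hypothetical stand-in: execute the flow run to completion."""

def agent_loop():
    while not stop_requested.is_set():
        flow_run = get_next_flow_run()
        if flow_run is None:
            stop_requested.wait(timeout=5)  # idle poll interval
            continue
        run_flow(flow_run)
    # Returning here lets the process exit before terminationGracePeriodSeconds
    # elapses, so k8s never needs to send SIGKILL.

agent_loop()

If the loop has not exited by the time terminationGracePeriodSeconds elapses, k8s sends SIGKILL, which cannot be caught; that is the case where re-queueing or marking the run as crashed matters.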

There is also still the issue of k8s SIGKILLing subprocesses to keep the container from going OOM, which I think is relevant to this particular issue (and to CRASH detection) as well:

On another note (agent pod, not Orion server): we just watched an agent's subprocess (a flow run) get OOM-SIGKILLed by Kubernetes. Interestingly, k8s only kills the subprocess (ref), and the main agent process (configured to handle only one flow run at a time, sequentially) carries on happily and takes the next task.

This probably doesn't warrant its own issue, but I want to make sure you're aware of this k8s behaviour.

<frozen runpy>:128: RuntimeWarning: 'prefect.engine' found in sys.modules after import of package 'prefect', but prior to execution of 'prefect.engine'; this may result in unpredictable behaviour
19:44:53.133 | INFO | Flow run 'phi2-kelis-m' - Created task run 'parse_crawl-259eac6e-0' for task 'parse_crawl'
19:44:53.136 | INFO | Flow run 'phi2-kelis-m' - Executing 'parse_crawl-259eac6e-0' immediately...
19:45:15.900 | ERROR | prefect.infrastructure.process - Process 'process-0-2-0rc4' exited with status code: -9; This indicates that the process exited due to a SIGKILL signal. Typically, this is caused by high memory usage causing the operating system to terminate the process.
19:45:15.900 | INFO | prefect.agent - Reporting flow run 'fd0e2945-6994-4eee-87c4-57ed659f2bf4' as crashed due to non-zero status code.

#7948 (comment)
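As an aside on the log above: a supervising process can tell a SIGKILLed child apart from an ordinary failure by its negative return code. A sketch (the child command here is a placeholder that kills itself, standing in for an OOM-killed flow run):

import signal
import subprocess

# Stand-in child process that gets SIGKILLed, as the k8s OOM killer would do.
proc = subprocess.run(
    ["python", "-c", "import os, signal; os.kill(os.getpid(), signal.SIGKILL)"]
)

if proc.returncode < 0:
    # A negative return code means the child was killed by a signal;
    # -9 corresponds to SIGKILL, which the child itself can never catch
    # or report, so the parent has to report the crash.
    sig = signal.Signals(-proc.returncode)
    print(f"child killed by {sig.name}; reporting the run as crashed")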

@ddelange
Contributor

Hi @chrisguidry @anticorrelator @madkinsz 👋

Taking the liberty of pinging you three, ref #8126 :)

Could you have a quick look at my comment above and let me know if this is something that's covered, or whether it warrants a separate issue?

Many thanks!

@ddelange
Contributor

Is this handled by #8127? That is, could we start running prefect-agent behind an HPA as described above?

@zanieb
Contributor Author

zanieb commented Jan 26, 2023

We only added handling for graceful exit of flow runs, not of agents. It sounds a bit like a separate issue is merited.
