
Flow run that receives SIGTERM does not go through CRASH detection #7800

Closed
3 tasks done
zanieb opened this issue Dec 7, 2022 · 5 comments · Fixed by #8126
Labels: enhancement (An improvement of an existing feature)

Comments

@zanieb
Contributor

zanieb commented Dec 7, 2022

First check

  • I added a descriptive title to this issue.
  • I used the GitHub search to find a similar request and didn't find it.
  • I searched the Prefect documentation for this feature.

Prefect Version

2.x

Describe the current behavior

Given a flow

import os
import time
from prefect import flow
from prefect.context import get_run_context

@flow
def foo():
    print(get_run_context().flow_run.id)
    print(os.getpid())
    time.sleep(1000)

A kill signal

kill -TERM 36467 

Results in immediate exit of the run

10:35:02.131 | INFO    | prefect.engine - Created flow run 'brilliant-chupacabra' for flow 'foo'
7472a4eb-2162-4933-b833-8609d9081a10
36467
zsh: terminated  python example.py
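The root of the problem is that the default SIGTERM disposition terminates the Python process before any of the engine's teardown (and therefore crash reporting) can run. Below is a minimal, Prefect-agnostic sketch of the general pattern; TerminationSignal and the print calls are illustrative placeholders, not Prefect APIs (the actual change landed in #8126):

import os
import signal
import sys
import time

class TerminationSignal(Exception):
    """Illustrative only: lets SIGTERM surface as a catchable exception."""

def handle_sigterm(signum, frame):
    # Convert the signal into an exception so except/finally blocks
    # (and therefore crash reporting) can run before the process exits.
    raise TerminationSignal(f"received signal {signum}")

signal.signal(signal.SIGTERM, handle_sigterm)

print(os.getpid())
try:
    time.sleep(1000)  # stands in for the body of foo()
except TerminationSignal:
    print("reporting the run as CRASHED")  # placeholder for crash handling
    sys.exit(1)

With a handler like this in place, the run exits through the same code path as any other failure instead of being silently terminated.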

Describe the proposed behavior

This flow should go through our crash handling.

To ensure cancelled flows are not reported as crashed, we should add a rule for transitions to CRASHED that checks the parent state for CANCELLED and, if found, sets CANCELLED instead.
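In plain Python (not the actual Orion orchestration-rule API; state names mirror Prefect's state types), the proposed rule amounts to something like:

from enum import Enum

class StateType(Enum):
    CANCELLED = "CANCELLED"
    CRASHED = "CRASHED"
    # other state types elided

def resolve_terminal_state(parent_state: StateType, proposed_state: StateType) -> StateType:
    # If the parent/current state is already CANCELLED, keep CANCELLED
    # rather than letting crash detection overwrite it with CRASHED.
    if proposed_state is StateType.CRASHED and parent_state is StateType.CANCELLED:
        return StateType.CANCELLED
    return proposed_state

In the real engine this would live in a server-side orchestration rule; the function above only illustrates the precedence between the two terminal states.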

Example Use

Additional context

No response

@zanieb added the enhancement (An improvement of an existing feature) and status:triage labels, and removed the status:triage label, on Dec 7, 2022
@zanieb
Contributor Author

zanieb commented Dec 7, 2022

Should help with #7732

@ddelange
Contributor

Hi @madkinsz 👋

I'll leave some thoughts here. Maybe it helps, maybe I can help? Either way, here we go:

The use case is a k8s user wanting to (auto)scale up their data pipelines using Prefect:

  • The easiest approach would be to run the prefect-agent Deployment behind an HPA (as is already possible in the Orion server Helm chart).
    • Different worker Deployments for different queues, depending on average memory usage, CPU-bound vs. IO-bound workloads, etc.
  • Set the terminationGracePeriodSeconds of the agent to e.g. ~2x the average duration of a flow run in its queue.
    • When a Pod is scaled down in Kubernetes, k8s sends a SIGTERM to the container(s)' main PID(s), then waits terminationGracePeriodSeconds before sending a SIGKILL to get rid of the Pod.
  • When the main agent PID receives SIGTERM, allow in-flight flow runs to finish gracefully (see the sketch below).
  • If a flow run is somehow still not done after terminationGracePeriodSeconds, the main PID gets a SIGKILL (as when k8s kills the main PID due to OOM).
    • With the AMQP protocol, for instance, an orphaned/zombie task would be marked as open again in the queue and another worker would pick it up.
    • Marking the run as crashed is the other alternative; the user would then configure a limited number of retries for undefined crashes (lost connection, etc.). If there was an actual Python exception, that's a different case and the user might not want to retry.

FastAPI, for example, works like this: it allows ongoing requests to finish and gracefully closes the event loop when a SIGTERM is received. k8s counts on this behaviour in the scale-down mechanics of a Deployment/StatefulSet/ReplicaSet.
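Concretely, the graceful-drain behaviour described above might look roughly like the sketch below; get_next_flow_run and run_flow are hypothetical stand-ins for the agent's internals, not Prefect APIs:

import signal
import threading

stop_requested = threading.Event()

def handle_sigterm(signum, frame):
    # Stop picking up new flow runs; runs already in flight may finish.
    stop_requested.set()

signal.signal(signal.SIGTERM, handle_sigterm)

def get_next_flow_run():
    """Hypothetical stand-in: poll the work queue for the next scheduled run."""
    return None

def run_flow(flow_run):
    """Hypothetical stand-in: execute the flow run to completion."""

def agent_loop():
    while not stop_requested.is_set():
        flow_run = get_next_flow_run()
        if flow_run is None:
            stop_requested.wait(timeout=5)  # idle poll interval
            continue
        run_flow(flow_run)
    # Returning here lets the process exit before terminationGracePeriodSeconds
    # elapses, so k8s never needs to send SIGKILL.

agent_loop()

If the loop has not exited by the time terminationGracePeriodSeconds elapses, k8s sends SIGKILL, which cannot be caught; that is the case where re-queueing or marking the run as crashed matters.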

There is also still the issue of k8s SIGKILLing subprocesses to keep the container from going OOM, which I think is relevant to this particular issue (and to CRASH detection) as well:

On another note (agent pod, not Orion server): we just watched an agent's subprocess (a flow run) get OOM-SIGKILLed by Kubernetes. Interestingly, k8s only kills the subprocess (ref), and the main agent process (configured to handle only one flow run at a time, sequentially) carries on happily and takes the next task.

This probably doesn't warrant its own issue, but I want to make sure you're aware of this k8s behaviour.

<frozen runpy>:128: RuntimeWarning: 'prefect.engine' found in sys.modules after import of package 'prefect', but prior to execution of 'prefect.engine'; this may result in unpredictable behaviour
19:44:53.133 | INFO | Flow run 'phi2-kelis-m' - Created task run 'parse_crawl-259eac6e-0' for task 'parse_crawl'
19:44:53.136 | INFO | Flow run 'phi2-kelis-m' - Executing 'parse_crawl-259eac6e-0' immediately...
19:45:15.900 | ERROR | prefect.infrastructure.process - Process 'process-0-2-0rc4' exited with status code: -9; This indicates that the process exited due to a SIGKILL signal. Typically, this is caused by high memory usage causing the operating system to terminate the process.
19:45:15.900 | INFO | prefect.agent - Reporting flow run 'fd0e2945-6994-4eee-87c4-57ed659f2bf4' as crashed due to non-zero status code.

#7948 (comment)
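As an aside on the log above: a supervising process can tell a SIGKILLed child apart from an ordinary failure by its negative return code. A sketch (the child command here is a placeholder that kills itself, standing in for an OOM-killed flow run):

import signal
import subprocess

# Stand-in child process that gets SIGKILLed, as the k8s OOM killer would do.
proc = subprocess.run(
    ["python", "-c", "import os, signal; os.kill(os.getpid(), signal.SIGKILL)"]
)

if proc.returncode < 0:
    # A negative return code means the child was killed by a signal;
    # -9 corresponds to SIGKILL, which the child itself can never catch
    # or report, so the parent has to report the crash.
    sig = signal.Signals(-proc.returncode)
    print(f"child killed by {sig.name}; reporting the run as crashed")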

@ddelange
Contributor

Hi @chrisguidry @anticorrelator @madkinsz 👋

Taking the liberty of pinging you three, ref #8126 :)

Could you have a quick look at my comment above and let me know if this is something that's covered, or whether it warrants a separate issue?

Many thanks!

@ddelange
Contributor

Is this handled by #8127? That is, could we start running prefect-agent behind an HPA as described above?

@zanieb
Contributor Author

zanieb commented Jan 26, 2023

We only added handling for graceful exit of flow runs, not of agents. It sounds a bit like a separate issue is merited.
