Subprocess Killed Task is Not Marked as Failed #6024

Open
ColeMurray opened this issue Jul 19, 2022 · 14 comments
Labels
bug (Something isn't working), needs:mre (Needs minimal reproduction)

Comments

@ColeMurray
Contributor

Description

When Prefect's subprocess is killed (in this case for exceeding container memory limits), Prefect does not mark the task as failed.

Prefect appears to be aware the task has failed, but fails to mark it:

23:43:46.427 | ERROR | prefect.flow_runner.subprocess - Subprocess for flow run '4bc2d0cc-6d84-4ef1-9434-b0703e72cd67' exited with bad code: -9

In this case, I would expect Prefect to mark the task as failed; otherwise it is left stuck in pending.

Reproduction / Example

It is difficult to give a repro. The issue is encountered when the container memory limit is exceeded. Issuing a SIGKILL to the subprocess while it is executing should reproduce the issue.
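The SIGKILL part of that suggestion can be sketched at the OS level, without Prefect involved. This is only an illustration under stated assumptions: the sleeping child process is a stand-in for a flow run subprocess, and the script assumes a POSIX system, where Python's `subprocess` module reports a child terminated by signal N as return code -N.

```python
import signal
import subprocess
import sys

# Stand-in for a flow run subprocess (assumption): a child that would
# otherwise sleep for a minute.
proc = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])

# Imitate an out-of-memory kill. SIGKILL cannot be caught or handled,
# so the child gets no chance to clean up or report anything.
proc.send_signal(signal.SIGKILL)
returncode = proc.wait()

# On POSIX, a negative return code -N means "terminated by signal N".
print(returncode)
```

On Linux, where SIGKILL is signal 9, this prints -9, the same value as the "bad code" in the log line above.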

@ddelange
Contributor

ddelange commented Mar 1, 2023

I can confirm this issue: when an agent gets SIGKILL'ed (or one of its subprocesses receives SIGKILL), the agent has no chance to update the db.

@madkinsz can the server mark such flows as failed when it loses the heartbeat from an agent?
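For illustration, the kind of server-side check being asked about could look roughly like the sketch below. Everything in it is hypothetical: `find_lost_runs`, the run IDs, and the 90-second timeout are made up for the example and are not Prefect APIs or settings.

```python
from datetime import datetime, timedelta, timezone


def find_lost_runs(last_heartbeats, now, timeout=timedelta(seconds=90)):
    """Return IDs of runs whose latest heartbeat is older than `timeout`.

    `last_heartbeats` maps a run ID to the time its last heartbeat arrived.
    Hypothetical helper, not part of Prefect.
    """
    return sorted(
        run_id for run_id, seen in last_heartbeats.items() if now - seen > timeout
    )


now = datetime(2023, 3, 1, 12, 0, 0, tzinfo=timezone.utc)
heartbeats = {
    "run-a": now - timedelta(seconds=10),  # still reporting
    "run-b": now - timedelta(minutes=5),   # silent, e.g. its agent was SIGKILL'ed
}

lost = find_lost_runs(heartbeats, now)
print(lost)  # candidates to be marked as failed
```

The point of the design is that the check runs outside the killed process, so it does not depend on the agent getting a chance to report anything.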

This is particularly common in a Kubernetes environment, as k8s reserves the right to SIGKILL any subprocess without warning to prevent the main process from going OOM. I also reported this here: #7948 (comment)

Related to but different from #8270

@github-actions
Contributor

This issue is stale because it has been open 30 days with no activity. To keep this issue open remove stale label or comment.

@ddelange
Contributor

ping

@github-actions
Contributor

This issue is stale because it has been open 30 days with no activity. To keep this issue open remove stale label or comment.

@ddelange
Contributor

ping

@ddelange
Contributor

Hi @desertaxle 👋 taking the liberty to ping you for this one :)

There are two scenarios where a flow needs to be marked as failed:

  1. when a flow subprocess receives a SIGKILL (agent keeps running and working from the queue)
  2. when the agent (main PID) receives a SIGKILL

Where in the code should this be orchestrated? Both up to the server? Or 1. up to the agent and 2. up to the server?
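As a rough sketch of why scenario 1 could in principle be handled by the parent: the parent process survives the child's SIGKILL and can inspect the exit code. Scenario 2 differs because SIGKILL cannot be caught, so the killed process itself cannot report anything. The names `run_and_report` and `report_failure` below are hypothetical and not Prefect APIs, and the signal handling assumes a POSIX system.

```python
import signal
import subprocess
import sys


def run_and_report(command, report_failure):
    """Run `command` and call `report_failure(message)` on abnormal exit.

    Hypothetical helper: `report_failure` stands in for whatever would
    update the run state (for example, an API call to the server).
    """
    proc = subprocess.Popen(command)
    returncode = proc.wait()
    if returncode < 0:
        # Negative code: the child was terminated by a signal (POSIX).
        report_failure(f"killed by {signal.Signals(-returncode).name}")
    elif returncode > 0:
        report_failure(f"exited with code {returncode}")
    return returncode


failures = []
# The child kills itself with SIGKILL to imitate an out-of-memory kill.
child = "import os, signal; os.kill(os.getpid(), signal.SIGKILL)"
rc = run_and_report([sys.executable, "-c", child], failures.append)
print(rc, failures)
```

Here the failure messages are only collected in a list; in a real agent the callback would have to reach the server for the run to leave the pending state.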

@github-actions
Contributor

This issue is stale because it has been open 30 days with no activity. To keep this issue open remove stale label or comment.

@ddelange
Contributor

ping

@github-actions
Contributor

This issue is stale because it has been open 30 days with no activity. To keep this issue open remove stale label or comment.

@ddelange
Contributor

ping

@github-actions
Contributor

github-actions bot commented Aug 9, 2023

This issue is stale because it has been open 30 days with no activity. To keep this issue open remove stale label or comment.

@ddelange
Contributor

ddelange commented Aug 9, 2023

ping

@zanieb
Contributor

zanieb commented Aug 9, 2023

@billpalombi could you triage this so it doesn't keep getting stale botted?

@ddelange your best bet for moving this forward is to include an MRE

@billpalombi added the needs:mre (Needs minimal reproduction), bug (Something isn't working), and status:accepted labels Aug 10, 2023
@billpalombi
Contributor

Thanks @zanieb! Not sure how we kept missing this.
