Handle SIGTERM received by agent gracefully #8691
Conversation
Signed-off-by: ddelange <[email protected]>
…erm-testing * 'main' of https://github.com/prefecthq/prefect: (77 commits)
- Update roles and permissions in documentation (PrefectHQ#8263)
- Add Prefect Cloud Quickstart tutorial (PrefectHQ#8227)
- Remove needless log
- Update comment for consistency
- Reorder migrations for clarity
- Refactor cancellation cleanup service
- Uses canonical `CANCELLING` states for run cancellations (PrefectHQ#8245)
- Add cancellation cleanup service (PrefectHQ#8128)
- Improve engine shutdown handling of SIGTERM (PrefectHQ#8127)
- Create a `CANCELLING` state type (PrefectHQ#7794)
- Update KubernetesJob options (PrefectHQ#8261)
- Small work pools UI updates (PrefectHQ#8257)
- Removes migration logic (PrefectHQ#8255)
- Consolidate multi-arch docker builds (PrefectHQ#7902)
- Include nested `pydantic.BaseModel` secret fields in blocks' schema (PrefectHQ#8246)
- Improve contributing documentation with venv instructions (PrefectHQ#8247)
- Update Python tests to use a single test matrix for both databases (PrefectHQ#8171)
- Adds migration logic for work pools (PrefectHQ#8214)
- Add `project_urls` to `setup.py` (PrefectHQ#8224)
- Add `is_schedule_active` to client `Deployment` class (PrefectHQ#7430)
- ...
👷 Deploy request for prefect-docs-preview pending review. Visit the deploys page to approve it.
Thanks for the contribution @ddelange! The implementation looks good to me, but I think we should solve the timeout errors that we're seeing in the new test before merging.
tests/cli/test_start_agent.py (Outdated)
POLL_INTERVAL = 0.5
STARTUP_TIMEOUT = 20
SHUTDOWN_TIMEOUT = 60
You might want to decrease the value of this shutdown timeout. The tests time out at 60 seconds by default, so pytest will cancel the test before this shutdown timeout is reached.
I've also noticed that some of the new tests fail due to a timeout, which is surprising since I wouldn't expect these agent processes to pick up any flow runs. Is it possible that the signals aren't being reliably forwarded to the agent process?
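A minimal sketch of one way to rule out a forwarding problem, assuming the test spawns the CLI as a subprocess (the command and flags below are placeholders, not the exact invocation in this test suite): start the agent in its own process group and deliver the signal to that group, so no intermediate shell can swallow it.

```python
import os
import signal
import subprocess

# Hypothetical test helper: start the agent CLI in a new session so it leads
# its own process group (POSIX only).
process = subprocess.Popen(
    ["prefect", "agent", "start", "-q", "test"],  # placeholder invocation
    start_new_session=True,
)

# Send SIGTERM to the whole process group so the agent itself receives it,
# rather than relying on a parent shell to forward the signal.
os.killpg(os.getpgid(process.pid), signal.SIGTERM)

# The agent should exit well within the shutdown timeout.
process.wait(timeout=20)
```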
Hi! 👋 In our cluster this works super reliably. In CI here the tests also mostly pass, but they do indeed sometimes hang forever (the timeout you mentioned was at 10 secs before; the agent usually exits within 2 secs). I have a feeling this is an artifact of my test orchestration: the anyio wait method should, according to the docs, return immediately if the subprocess has already terminated by the time wait is called, but maybe we're catching some limbo moment where wait hangs forever. I'm lost, as I can't reproduce this 'limbo' locally to debug the state of the agent subprocess...
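For reference, a minimal sketch of the wait-with-timeout pattern described above, assuming the test manages the agent through anyio (the helper name and timeout value are illustrative, not the actual test code):

```python
import anyio
from anyio.abc import Process

SHUTDOWN_TIMEOUT = 20  # keep this below pytest's 60-second test timeout


async def stop_agent(process: Process) -> None:
    """Terminate the agent subprocess and wait for it to exit."""
    process.terminate()  # delivers SIGTERM on POSIX
    with anyio.fail_after(SHUTDOWN_TIMEOUT):
        # Per the anyio docs, wait() returns immediately if the process has
        # already exited; the timeout guards against the rare case where the
        # call appears to hang anyway.
        await process.wait()
```

Capping the wait this way at least turns a hung shutdown into a clear TimeoutError instead of tripping pytest's global test timeout.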
Hi @desertaxle @madkinsz 👋
I've pushed a commit in an attempt to make this more robust.
Could you re-run CI?
Hey @ddelange, I added our
awesome find, many thanks! 💥
Signed-off-by: ddelange <[email protected]>
Co-authored-by: Zanie Adkins <[email protected]>
Co-authored-by: Alexander Streed <[email protected]>
Closes #8270, based on #7948
Example
We tested this on our staging cluster (agent behind a horizontal pod autoscaler) and it is working as expected as long as the terminationGracePeriod is reasonable. If it is too short, k8s sends a SIGKILL and you'll have #6024.
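For illustration, a hedged sketch of the interaction described above (not the code added in this PR): a process that traps SIGTERM can shut down cleanly, but only if Kubernetes' grace period (the terminationGracePeriodSeconds field on the pod spec) is long enough, because the follow-up SIGKILL cannot be caught.

```python
import signal
import sys
import time

shutting_down = False


def handle_sigterm(signum, frame):
    # Kubernetes sends SIGTERM first and waits terminationGracePeriodSeconds
    # before escalating to SIGKILL, so cleanup must fit inside that window.
    global shutting_down
    shutting_down = True


signal.signal(signal.SIGTERM, handle_sigterm)

# Illustrative agent-style loop: stop polling for new work once SIGTERM
# arrives, finish what is in flight, then exit cleanly.
while not shutting_down:
    time.sleep(0.5)  # placeholder for "poll for flow runs"

print("SIGTERM received, shutting down gracefully")
sys.exit(0)
```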
Checklist
- This pull request references any related issue by including "closes <link to issue>"
- This pull request includes a label categorizing the change, e.g. "fix", "feature", "enhancement"