Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle SIGTERM received by agent gracefully #8691

Merged
merged 134 commits into from
May 10, 2023

Conversation

ddelange
Copy link
Contributor

@ddelange ddelange commented Mar 2, 2023

Closes #8270, based on #7948

Example

We tested this on our staging cluster (agent behind horizontal pod autoscaler) and it is working as expected as long as the terminationGracePeriod is reasonable. If it is too short, k8s sends a SIGKILL and you'll have #6024.

$ prefect agent start --match=
Starting v2.8.3+17.g812619be4.dirty agent with ephemeral API...

  ___ ___ ___ ___ ___ ___ _____     _   ___ ___ _  _ _____
 | _ \ _ \ __| __| __/ __|_   _|   /_\ / __| __| \| |_   _|
 |  _/   / _|| _|| _| (__  | |    / _ \ (_ | _|| .` | | |
 |_| |_|_\___|_| |___\___| |_|   /_/ \_\___|___|_|\_| |_|


Agent started! Looking for work from queue(s) that start with the
prefix: ['']...
09:56:16.588 | INFO    | prefect.agent - Matched new work queues: default
Received SIGTERM. Sending SIGINT to the Prefect agent (PID 11847)...
Received SIGINT. Sending SIGINT to the Prefect agent (PID 11847)...
Agent stopped!

Checklist

  • This pull request references any related issue by including "closes <link to issue>"
    • If no issue exists and your change is not a small fix, please create an issue first.
  • This pull request includes tests or only affects documentation.
  • This pull request includes a label categorizing the change e.g. fix, feature, enhancement

Signed-off-by: ddelange <[email protected]>
…erm-testing

* 'main' of https://github.com/prefecthq/prefect: (77 commits)
  Update roles and permissions in documentation (PrefectHQ#8263)
  Add Prefect Cloud Quickstart tutorial (PrefectHQ#8227)
  Remove needless log
  Update comment for consistency
  Reorder migrations for clarity
  Refactor cancellation cleanup service
  Uses canonical `CANCELLING` states for run cancellations (PrefectHQ#8245)
  Add cancellation cleanup service (PrefectHQ#8128)
  Improve engine shutdown handling of SIGTERM (PrefectHQ#8127)
  Create a `CANCELLING` state type (PrefectHQ#7794)
  Update KubernetesJob options (PrefectHQ#8261)
  Small work pools UI updates (PrefectHQ#8257)
  Removes migration logic (PrefectHQ#8255)
  Consolidate multi-arch docker builds (PrefectHQ#7902)
  Include nested `pydantic.BaseModel` secret fields in blocks' schema (PrefectHQ#8246)
  Improve contributing documentation with venv instructions (PrefectHQ#8247)
  Update Python tests to use a single test matrix for both databases (PrefectHQ#8171)
  Adds migration logic for work pools (PrefectHQ#8214)
  Add `project_urls` to `setup.py` (PrefectHQ#8224)
  Add `is_schedule_active` to client `Deployment` class (PrefectHQ#7430)
  ...
@netlify
Copy link

netlify bot commented Apr 11, 2023

👷 Deploy request for prefect-docs-preview pending review.

Visit the deploys page to approve it

Name Link
🔨 Latest commit f8f5968

@desertaxle desertaxle added the enhancement An improvement of an existing feature label Apr 26, 2023
@desertaxle desertaxle assigned desertaxle and unassigned rpeden Apr 26, 2023
Copy link
Member

@desertaxle desertaxle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contribution @ddelange! The implementation looks good to me, but I think we should solve the timeout errors that we're seeing in the new test before merging.


POLL_INTERVAL = 0.5
STARTUP_TIMEOUT = 20
SHUTDOWN_TIMEOUT = 60
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might want to decrease the value of this shutdown timeout. The tests timeout at 60 seconds by default, so pytest will cancel the test before this shutdown time is reached.

I've also noticed that some of the new tests will fail due to timeout, which is surprising since I wouldn't expect these agent processes to pick up any flow runs. Is it possible that the signals aren't being reliably forwarded to the agent process?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hi! 👋 in our cluster, this works super reliably. in CI here they also mostly work, but indeed they sometimes hang forever (the timeout you mentioned was at 10secs before, the agent usually exits within 2secs). I have a feeling this is an artifact of my test orchestration. the anyio wait method (according to docs) should exit immediately if the subprocess has already terminated by the time of calling wait. but maybe we're catching a moment where there's limbo and wait hangs forever. I'm lost, as I can't reproduce this 'limbo' locally to debug the state of the agent subprocess...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hi @desertaxle @madkinsz 👋

I've pushed a commit in an attempt to make this more robust.

Could you re-run CI?

@desertaxle
Copy link
Member

Hey @ddelange, I added our use_hosted_api_server fixture to your tests which helped with the issues I saw in the unit tests run. This way we don't need to use the ephemeral API in the tests.

@desertaxle desertaxle merged commit 93c41a0 into PrefectHQ:main May 10, 2023
@ddelange
Copy link
Contributor Author

awesome find, many thanks! 💥

@desertaxle desertaxle mentioned this pull request May 11, 2023
3 tasks
asmundo pushed a commit to asmundo/prefect that referenced this pull request May 11, 2023
Signed-off-by: ddelange <[email protected]>
Co-authored-by: Zanie Adkins <[email protected]>
Co-authored-by: Alexander Streed <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement An improvement of an existing feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Handle SIGTERM received by prefect-agent gracefully
4 participants