Handle SIGTERM received by agent gracefully #8691
Conversation
Signed-off-by: ddelange <[email protected]>
…erm-testing * 'main' of https://github.com/prefecthq/prefect: (77 commits)
- Update roles and permissions in documentation (PrefectHQ#8263)
- Add Prefect Cloud Quickstart tutorial (PrefectHQ#8227)
- Remove needless log
- Update comment for consistency
- Reorder migrations for clarity
- Refactor cancellation cleanup service
- Uses canonical `CANCELLING` states for run cancellations (PrefectHQ#8245)
- Add cancellation cleanup service (PrefectHQ#8128)
- Improve engine shutdown handling of SIGTERM (PrefectHQ#8127)
- Create a `CANCELLING` state type (PrefectHQ#7794)
- Update KubernetesJob options (PrefectHQ#8261)
- Small work pools UI updates (PrefectHQ#8257)
- Removes migration logic (PrefectHQ#8255)
- Consolidate multi-arch docker builds (PrefectHQ#7902)
- Include nested `pydantic.BaseModel` secret fields in blocks' schema (PrefectHQ#8246)
- Improve contributing documentation with venv instructions (PrefectHQ#8247)
- Update Python tests to use a single test matrix for both databases (PrefectHQ#8171)
- Adds migration logic for work pools (PrefectHQ#8214)
- Add `project_urls` to `setup.py` (PrefectHQ#8224)
- Add `is_schedule_active` to client `Deployment` class (PrefectHQ#7430)
- ...
👷 Deploy request for prefect-docs-preview pending review. Visit the deploys page to approve it.
Thanks for the contribution @ddelange! The implementation looks good to me, but I think we should solve the timeout errors that we're seeing in the new test before merging.
tests/cli/test_start_agent.py (Outdated)
POLL_INTERVAL = 0.5
STARTUP_TIMEOUT = 20
SHUTDOWN_TIMEOUT = 60
You might want to decrease the value of this shutdown timeout. The tests time out at 60 seconds by default, so pytest will cancel the test before this shutdown timeout is reached.
I've also noticed that some of the new tests fail due to a timeout, which is surprising since I wouldn't expect these agent processes to pick up any flow runs. Is it possible that the signals aren't being reliably forwarded to the agent process?
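A minimal sketch of one way to rule out a forwarding problem, assuming the test spawns the CLI as a subprocess (the command and flags below are placeholders, not the exact invocation in this test suite): start the agent in its own process group and deliver the signal to that group, so no intermediate shell can swallow it.

```python
import os
import signal
import subprocess

# Hypothetical test helper: start the agent CLI in a new session so it leads
# its own process group (POSIX only).
process = subprocess.Popen(
    ["prefect", "agent", "start", "-q", "test"],  # placeholder invocation
    start_new_session=True,
)

# Send SIGTERM to the whole process group so the agent itself receives it,
# rather than relying on a parent shell to forward the signal.
os.killpg(os.getpgid(process.pid), signal.SIGTERM)

# The agent should exit well within the shutdown timeout.
process.wait(timeout=20)
```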
Hi! 👋 In our cluster this works super reliably. In CI here the tests also mostly pass, but they do indeed sometimes hang forever (the timeout you mentioned was at 10 secs before; the agent usually exits within 2 secs). I have a feeling this is an artifact of my test orchestration: the anyio wait method should, according to the docs, return immediately if the subprocess has already terminated by the time wait is called, but maybe we're catching some limbo moment where wait hangs forever. I'm lost, as I can't reproduce this 'limbo' locally to debug the state of the agent subprocess...
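For reference, a minimal sketch of the wait-with-timeout pattern described above, assuming the test manages the agent through anyio (the helper name and timeout value are illustrative, not the actual test code):

```python
import anyio
from anyio.abc import Process

SHUTDOWN_TIMEOUT = 20  # keep this below pytest's 60-second test timeout


async def stop_agent(process: Process) -> None:
    """Terminate the agent subprocess and wait for it to exit."""
    process.terminate()  # delivers SIGTERM on POSIX
    with anyio.fail_after(SHUTDOWN_TIMEOUT):
        # Per the anyio docs, wait() returns immediately if the process has
        # already exited; the timeout guards against the rare case where the
        # call appears to hang anyway.
        await process.wait()
```

Capping the wait this way at least turns a hung shutdown into a clear TimeoutError instead of tripping pytest's global test timeout.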
Hi @desertaxle @madkinsz 👋
I've pushed a commit in an attempt to make this more robust.
Could you re-run CI?
Hey @ddelange, I added our
awesome find, many thanks! 💥
Signed-off-by: ddelange <[email protected]>
Co-authored-by: Zanie Adkins <[email protected]>
Co-authored-by: Alexander Streed <[email protected]>
Closes #8270, based on #7948
Example
We tested this on our staging cluster (agent behind a horizontal pod autoscaler) and it is working as expected as long as the terminationGracePeriod is reasonable. If it is too short, k8s sends a SIGKILL and you'll have #6024.
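For illustration, a hedged sketch of the interaction described above (not the code added in this PR): a process that traps SIGTERM can shut down cleanly, but only if Kubernetes' grace period (the terminationGracePeriodSeconds field on the pod spec) is long enough, because the follow-up SIGKILL cannot be caught.

```python
import signal
import sys
import time

shutting_down = False


def handle_sigterm(signum, frame):
    # Kubernetes sends SIGTERM first and waits terminationGracePeriodSeconds
    # before escalating to SIGKILL, so cleanup must fit inside that window.
    global shutting_down
    shutting_down = True


signal.signal(signal.SIGTERM, handle_sigterm)

# Illustrative agent-style loop: stop polling for new work once SIGTERM
# arrives, finish what is in flight, then exit cleanly.
while not shutting_down:
    time.sleep(0.5)  # placeholder for "poll for flow runs"

print("SIGTERM received, shutting down gracefully")
sys.exit(0)
```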
Checklist
- This pull request references any related issue by including "closes <link to issue>"
- This pull request includes a label categorizing the change, e.g. "fix", "feature", "enhancement"