client: allow incomplete allocrunners to be removed on restore #16638

Merged on Dec 7, 2023 (1 commit)

@tgross tgross commented Mar 24, 2023

If an allocrunner is persisted to the client state but the client stops before the task runner can start, we end up with an allocation in the database that has allocrunner state but no taskrunner state. This ends up mimicking old pre-0.9.5 client state, where this state was not recorded, and so it hits a backwards compatibility shim. The result is allocations in the client state that can never be restored but won't ever be removed either.

Update the backwards compatibility shim so that we fail the restore for the allocrunner and remove the allocation from the client state. Taskrunners persist state during graceful shutdown, so it shouldn't be possible to leak tasks that have actually started. This lets us "start over" with the allocation, if the server still wants to place it on the client.
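
To illustrate the behavior described above, here is a minimal sketch of the restore decision. It is not the actual Nomad client code: `allocState`, `stateDB`, `restoreAlloc`, and `restoreAll` are hypothetical names, and the state store is reduced to an in-memory map. The sketch only shows the idea that an allocation with allocrunner state but no taskrunner state fails restore and is deleted from the store instead of being left behind.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// allocState is a hypothetical stand-in for what the client persists per
// allocation: whether allocrunner state was written, plus any taskrunner
// state keyed by task name.
type allocState struct {
	hasAllocRunnerState bool
	taskStates          map[string]string
}

// stateDB is a hypothetical in-memory client state store keyed by alloc ID.
type stateDB map[string]allocState

// errIncomplete marks an allocation that has allocrunner state but no
// taskrunner state, so there is nothing to resume.
var errIncomplete = errors.New("allocrunner state without taskrunner state")

// restoreAlloc fails when the persisted state is incomplete.
func restoreAlloc(s allocState) error {
	if s.hasAllocRunnerState && len(s.taskStates) == 0 {
		return errIncomplete
	}
	return nil
}

// restoreAll tries to restore every allocation and deletes the ones that
// fail, so they do not linger in the store. It returns the sorted IDs of
// restored and removed allocations.
func restoreAll(db stateDB) (restored, removed []string) {
	for id, s := range db {
		if err := restoreAlloc(s); err != nil {
			delete(db, id) // deleting the current key while ranging is safe in Go
			removed = append(removed, id)
			continue
		}
		restored = append(restored, id)
	}
	sort.Strings(restored)
	sort.Strings(removed)
	return restored, removed
}

func main() {
	db := stateDB{
		"alloc-a": {hasAllocRunnerState: true, taskStates: map[string]string{"web": "running"}},
		"alloc-b": {hasAllocRunnerState: true}, // client stopped before the task runner started
	}
	restored, removed := restoreAll(db)
	fmt.Println("restored:", restored)
	fmt.Println("removed:", removed)
}
```

Once the incomplete entry is gone from the store, the client no longer knows about the allocation, which is what allows the server to place it again if it still wants to.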


This work came out of discussions in #16623, where old state was kicking around and producing log noise that wasn't useful to the user and made it harder to debug the real problem.

tgross commented Dec 6, 2023

This PR was neglected for a bit while I worked on other stuff. I've finally got around to rebasing it and it's ready for review now.

@shoenig shoenig left a comment

LGTM!

@schmichael schmichael left a comment

Wow, great debugging. Good example of how ignoring that old compat code can cause issues. 😬

@schmichael

Backport since it's a bug fix?

@tgross tgross force-pushed the client-state-db-missing-tasks branch from 9edbed0 to 9df7224 Compare December 7, 2023 16:18
@tgross tgross added the backport/1.5.x, backport/1.6.x, and backport/1.7.x labels Dec 7, 2023
tgross commented Dec 7, 2023

> Backport since it's a bug fix?

I've updated the changelog to mark it as a bug. Will merge and run the backports once CI is green again.

@tgross tgross merged commit d7a5274 into main Dec 7, 2023
25 checks passed
@tgross tgross deleted the client-state-db-missing-tasks branch December 7, 2023 19:04
tgross added a commit that referenced this pull request Dec 7, 2023
@tgross tgross modified the milestones: 1.7.x, 1.7.1 Dec 8, 2023
Labels
backport/1.5.x, backport/1.6.x, backport/1.7.x, theme/client, theme/client-restart, type/enhancement
3 participants