Backport of client: allow incomplete allocrunners to be removed on restore into release/1.5.x #19360

hc-github-team-nomad-core
Contributor

Backport

This PR is auto-generated from #16638 to be assessed for backporting due to the inclusion of the label backport/1.5.x.

🚨

Warning: automatic cherry-pick of commits failed. If the first commit failed,
you will see a blank no-op commit below. If at least one commit succeeded, you
will see the cherry-picked commits up to, but not including, the commit where
the merge conflict occurred.

The person who merged in the original PR is:
@tgross
This person should manually cherry-pick the original PR into a new backport PR,
and close this one when the manual backport PR is merged in.

merge conflict error: POST https://api.github.com/repos/hashicorp/nomad/merges: 409 Merge conflict []

The below text is copied from the body of the original PR.


If an allocrunner is persisted to the client state but the client stops before its task runners can start, we end up with an allocation in the database that has allocrunner state but no taskrunner state. This mimics an old pre-0.9.5 state, where this state was not recorded, and so it hits a backwards compatibility shim. The result is allocations in the client state that can never be restored but won't ever be removed either.

Update the backwards compatibility shim so that we fail the restore for the allocrunner and remove the allocation from the client state. Taskrunners persist state during graceful shutdown, so it shouldn't be possible to leak tasks that have actually started. This lets us "start over" with the allocation, if the server still wants to place it on the client.
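
To make the intended behavior concrete, here is a minimal sketch of the restore decision described above. It is not the code from this PR: `stateDB`, `allocState`, `hasTaskState`, and `restoreAllocs` are hypothetical names, and the client state database is modeled as a plain in-memory map. The sketch assumes that every entry in the map already has allocrunner state, and that "no taskrunner state" means none of the allocation's tasks has anything persisted.

```go
package main

import (
	"fmt"
	"sort"
)

// allocState is a hypothetical stand-in for what the client persists for one
// allocation. Presence in the stateDB implies allocrunner state was written.
type allocState struct {
	Tasks     []string          // task names in the allocation's task group
	TaskState map[string]string // task name -> persisted taskrunner state
}

// stateDB is a hypothetical in-memory stand-in for the client state database,
// keyed by allocation ID.
type stateDB map[string]*allocState

// hasTaskState reports whether any task of the allocation has persisted
// taskrunner state.
func hasTaskState(a *allocState) bool {
	for _, task := range a.Tasks {
		if _, ok := a.TaskState[task]; ok {
			return true
		}
	}
	return false
}

// restoreAllocs walks the persisted allocations. An allocation with
// allocrunner state but no taskrunner state fails the restore and is deleted
// from the database, so the server can place it again if it still wants to.
func restoreAllocs(db stateDB) (restored, removed []string) {
	for id, a := range db {
		if !hasTaskState(a) {
			delete(db, id) // deleting the current key while ranging is safe in Go
			removed = append(removed, id)
			continue
		}
		restored = append(restored, id)
	}
	// Map iteration order is random; sort so the result is deterministic.
	sort.Strings(restored)
	sort.Strings(removed)
	return restored, removed
}

func main() {
	db := stateDB{
		"alloc-complete":   {Tasks: []string{"web"}, TaskState: map[string]string{"web": "running"}},
		"alloc-incomplete": {Tasks: []string{"web"}, TaskState: map[string]string{}},
	}
	restored, removed := restoreAllocs(db)
	fmt.Println("restored:", restored)
	fmt.Println("removed:", removed)
}
```

The point of the sketch is the second branch: rather than keeping the incomplete entry around forever, the failed restore also cleans it up. This is only safe under the assumption stated above, that taskrunners persist state during graceful shutdown, so an entry with no taskrunner state has no started tasks to leak.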


This work came out of discussions in #16623, where stale state was lingering in the client and producing log noise that was not useful to the user and made it harder to debug the real problem.



hashicorp-cla commented Dec 7, 2023

CLA assistant check
All committers have signed the CLA.

@tgross force-pushed the backport/client-state-db-missing-tasks/needlessly-renewed-joey branch from e45d2da to 27bd931 on December 7, 2023 19:38
@tgross marked this pull request as ready for review on December 7, 2023 19:38
@tgross merged commit 7b9b1de into release/1.5.x on Dec 7, 2023
22 of 24 checks passed
@tgross deleted the backport/client-state-db-missing-tasks/needlessly-renewed-joey branch on December 7, 2023 19:56