
alloc_runner: wait when starting suspicious allocs #6216

Merged: 3 commits into master from b-recognize-pending-allocs on Aug 28, 2019

Conversation

notnoop
Contributor

@notnoop notnoop commented Aug 27, 2019

This commit aims to help users running clients susceptible to the destroyed-alloc-being-restarted bug upgrade to the latest version. Without this, such users will have their tasks run unexpectedly on upgrade and will only see the bug resolved after a subsequent restart.

If, on restore, the client sees a pending alloc without any other persisted info, it errs on the side of treating it as corrupt persisted state rather than assuming the client happened to be killed right when the alloc was assigned to it.

A few reasons motivate this behavior:

Statistically speaking, corruption is the more likely cause. A long-running client has a higher chance of having allocs persisted incorrectly in the pending state; being killed right when an alloc is about to start is relatively unlikely.

Also, delaying the start of an alloc that hasn't started yet (hopefully by seconds) is not as severe as launching too many allocs and potentially bringing the client down.

More importantly, this helps customers upgrade their clients without risking taking their clients down and destabilizing their cluster. We don't want existing users to be forced into triggering the bug while they upgrade and restart their cluster.

This is a follow up to #6207
Fixes #5984 (comment)
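
To make the heuristic concrete, here is a minimal, self-contained sketch of the idea described above. The type and field names are hypothetical stand-ins for illustration only, not Nomad's actual client types.

package main

import "fmt"

// persistedAlloc is a hypothetical stand-in for an allocation record read
// back from the client's state database on restore.
type persistedAlloc struct {
	ID           string
	ClientStatus string // e.g. "pending", "running", "complete"
	HasTaskState bool   // whether any task state was persisted alongside it
}

// isSuspicious reports whether a restored alloc looks like corrupt leftover
// state rather than an alloc that was genuinely about to start: it is still
// marked pending, yet nothing else about it was persisted.
func isSuspicious(a persistedAlloc) bool {
	return a.ClientStatus == "pending" && !a.HasTaskState
}

func main() {
	restored := []persistedAlloc{
		{ID: "a1", ClientStatus: "running", HasTaskState: true},
		{ID: "a2", ClientStatus: "pending", HasTaskState: false},
	}
	for _, a := range restored {
		if isSuspicious(a) {
			// Err on the side of caution: don't start this alloc until the
			// server confirms it should be running.
			fmt.Printf("alloc %s: deferring start until the server confirms it\n", a.ID)
			continue
		}
		fmt.Printf("alloc %s: restoring normally\n", a.ID)
	}
}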

@notnoop notnoop requested a review from schmichael August 27, 2019 01:44
Member

@schmichael schmichael left a comment


The criteria for waiting on the server seem sufficiently narrow that this should never affect allocs that are actually running.

An alternative would be removing the alloc from the local state and returning a !ok or skip bool from AR.Restore to simply drop the alloc from the client. If the client should be running the alloc it will be re-added once the servers are contacted anyway.

The main functional difference (ignoring code complexity & riskiness of change) would be that the client alloc APIs will return information about these pending allocs with this solution, whereas the client would have no knowledge of them anymore if they're dropped.
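
For illustration, a rough sketch of what this alternative could look like, assuming a simplified state layout; allocRunner, Restore, and the types below are hypothetical stand-ins rather than the real allocrunner API.

package main

import "fmt"

// allocState is a hypothetical stand-in for what the client finds in its
// state DB for one alloc.
type allocState struct {
	TaskStates map[string]string // task name -> persisted task state, if any
}

// allocRunner is likewise a simplified stand-in for the real alloc runner.
type allocRunner struct {
	allocID string
	state   *allocState // nil if nothing beyond the alloc struct was persisted
}

// Restore sketches the proposed signature: the extra ok bool tells the caller
// whether to keep the alloc. If there is no local state to reattach to, the
// caller drops the alloc and relies on the server to re-send it if it should
// actually be running.
func (ar *allocRunner) Restore() (ok bool, err error) {
	if ar.state == nil || len(ar.state.TaskStates) == 0 {
		return false, nil // orphaned alloc struct only: drop it
	}
	// ... reattach task runners from ar.state here ...
	return true, nil
}

func main() {
	orphan := &allocRunner{allocID: "a2", state: &allocState{}}
	if ok, _ := orphan.Restore(); !ok {
		fmt.Printf("dropping alloc %s: no local state to restore\n", orphan.allocID)
	}
}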

Contributor Author

@notnoop notnoop left a comment


Great suggestion; I like it way more. I updated the PR so that suspicious allocs get skipped without creating an alloc runner. This approach also avoids creating goroutines unnecessarily, which, depending on client state, may have its own undesired impact.

I have one question though about client status.

client/client.go Outdated
//
// COMPAT(0.12): remove once upgrading from 0.9.5 is no longer supported
func (c *Client) isPotentiallyCompletedAlloc(alloc *structs.Allocation) bool {
	if alloc.ClientStatus != structs.AllocClientStatusPending {
Contributor Author


Thinking about this more, I wonder if this should be checking whether the client status is dead. It's possible for a completed alloc to have a non-pending client status in the following scenario:

  1. Server pushes alloc to client
  2. Client starts alloc
  3. Server updates alloc in a non-disruptive way (e.g. updating reschedule policy, count, etc.)
  4. Client fetches new alloc with clientStatus=running
  5. Client completes, GCs, and destroys the alloc
  6. Client persists alloc with clientStatus=running in periodic persistence
  7. Client restarts.

Though I'm a bit worried that this would increase the scope and impact a bit more. @schmichael, opinion?

To clarify, the proposal is to make it:

if alloc.ClientStatus == structs.AllocClientStatusComplete || alloc.ClientStatus == structs.AllocClientStatusFailed {
	return true
}

Member


This is a good observation, but I think the fix is ignoring alloc.ClientStatus altogether in this check.

What we've identified is that a persisted alloc in the absence of other associated client state is useless. There's no client state for it: nothing to reattach to, nothing to GC, no local state whatsoever other than an orphaned alloc struct in the client's db.

So I think we just remove this alloc.ClientStatus check altogether and rely on the other state checks. If there's any other client state, there's something we can move forward with. If there's no other client state, we're better off dropping it and waiting on the server to tell us about it again if it should be running.

You could rename this method hasLocalState or similar to reflect that logic. I believe the persisted alloc is cached global/remote state and in the absence of related local state, the cache entry should be ignored and purged.
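
A hedged sketch of that suggestion, under a made-up state layout: the check ignores the alloc's ClientStatus entirely and only asks whether any local task state exists. The types here are illustrative, not Nomad's actual ones.

package main

import "fmt"

// stateDB is a toy stand-in for the client's local state database:
// allocID -> set of task names with persisted task-runner state.
type stateDB map[string]map[string]struct{}

// hasLocalState ignores the alloc's ClientStatus entirely: the persisted
// alloc is just cached remote state, so what matters is whether any local
// task state exists to reattach to or garbage-collect.
func hasLocalState(db stateDB, allocID string) bool {
	return len(db[allocID]) > 0
}

func main() {
	db := stateDB{
		"alloc-with-tasks": {"web": {}},
		"orphaned-alloc":   {},
	}
	// The alloc with task state is kept; the orphaned cache entry is ignored
	// and the client waits for the server to re-send it if it should run.
	fmt.Println(hasLocalState(db, "alloc-with-tasks")) // true
	fmt.Println(hasLocalState(db, "orphaned-alloc"))   // false
}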

Member


a persisted alloc ... is useless

To be clear: it could be useful if not for the bug. This fix hurts our "edge compute" story, where clients are more likely to restart while disconnected and a local cache of allocs is useful.

However correctness far outweighs better support for disconnected nodes - especially because there are a number of other behaviors that are already similarly problematic for disconnected nodes.

Contributor Author


If there's any other client state, there's something we can move forward with. If there's no other client state, we're better off dropping it and waiting on the server to tell us about it again if it should be running.

Excellent point. Adding it to the code comment.

Contributor Author

@notnoop notnoop Aug 28, 2019


Actually, a quick question: Should we special case system jobs here like we did with failed task restores?

I'm inclined not to special-case it. Unless the alloc was GCed, we should expect some task state to be persisted, short of super bad things happening.

Member


Yeah, I think your inclination is correct: do not special case system jobs.

This uses an alternative approach where we avoid restoring the alloc runner in the first place if we suspect that the alloc may already have been completed.
@notnoop notnoop force-pushed the b-recognize-pending-allocs branch from f6b9e4a to 493945a Compare August 27, 2019 21:31
@notnoop notnoop requested a review from schmichael August 28, 2019 12:38
The ClientStatus being pending isn't a good criterion, as an alloc may have been updated in-place before it was completed.

Also, updated the logic so we only check for task states.  If an alloc
has deployment state but no persisted tasks at all, restore will still
fail.
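
Putting the discussion together, a hypothetical sketch of the final restore path under these assumptions: the client checks for persisted task state before constructing an alloc runner and otherwise skips the alloc. The names below are illustrative, not the merged code.

package main

import "fmt"

// restoredAlloc is a hypothetical record read from the client state DB.
type restoredAlloc struct {
	ID         string
	TaskStates map[string]string // persisted task states, if any
}

// restoreAllocs skips any alloc with no persisted task state instead of
// creating an alloc runner for it; the server will re-send the alloc if it
// should actually be running. Deployment state alone is intentionally not
// enough to restore.
func restoreAllocs(allocs []restoredAlloc) (kept []string) {
	for _, a := range allocs {
		if len(a.TaskStates) == 0 {
			fmt.Printf("alloc %s: skipping restore, no persisted task state\n", a.ID)
			continue
		}
		// ... create the alloc runner and restore its tasks here ...
		kept = append(kept, a.ID)
	}
	return kept
}

func main() {
	fmt.Println(restoreAllocs([]restoredAlloc{
		{ID: "a1", TaskStates: map[string]string{"web": "running"}},
		{ID: "a2"}, // orphaned entry: dropped
	}))
}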
@notnoop notnoop merged commit b2ef75e into master Aug 28, 2019
@notnoop notnoop deleted the b-recognize-pending-allocs branch August 28, 2019 18:46
notnoop pushed a commit that referenced this pull request Nov 21, 2019
TestClient_RestoreError is very slow, taking ~81 seconds.

It has a few problematic patterns. It's unclear what it tests: it simulates a failure condition where every state DB lookup fails and asserts that the alloc fails, though starting from #6216 we don't fail allocs in that condition but rather restart them.

Also, the drivers used in the second client `c2` are the same singleton instances used in `c1` and are already shut down. We ought to start healthy new driver instances.
@github-actions

github-actions bot commented Feb 3, 2023

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Feb 3, 2023
Development

Successfully merging this pull request may close these issues.

Runaway nomad process after Nomad client reboot