scheduler: stopped-yet-running allocs are still running #10446
@notnoop how would this interact with ephemeral disk migrations? My understanding is that requires we have the alloc with a desired state of stopped on the server but without the alloc runner having stopped on the client. |
I believe their logic remains untouched, but I'll test. This PR only changes the resource accounting for a node and whether an alloc may be placed on a node or not. The other issue that should be tested more extensively is preemption. The server should discount the resources of an alloc that's to be evicted, and the client should sequence it so the evicted alloc completes before the new one starts.
Great work coming up with a solution for #10440. I think your assessment of the bug and that this is an optimal solution is correct.
TODO
Probably obvious since this is marked Draft
but just to be clear it needs at least one unit test to assert the behavior on the scheduler side, and then an e2e test to ensure a well-behaved client interacts as expected with this change.
Client-coordinated 👎
We could fix this on the client, but I don't think there are any benefits there. It would introduce a lot of tricky additional logic to the client and could slow down time-to-running for an allocation to max(kill_timeout)
of every job in the cluster! That could be minutes of startup latency!
So I think client-coordinated collision detection is strictly worse than this server-coordinated approach.
Server-coordinated (this) risk
The big risk of this approach that occurs to me is that it will exacerbate forever pending
issues. Currently if an allocation is stuck in ClientStatus=pending
you can often stop and restart the job to get out of that tricky situation: since the allocation will be DesiredStatus=stop
, the scheduler will ignore its resource utilization. This would obviously change that and ClientStatus=pending
allocations would forever take up cluster resources until cleaned out.
I think we should at least consider (RFC?) escape hatches for ClientStatus=pending
and/or tasks that exceed their kill_timeout
before merging this PR. I don't want to make those efforts a hard blocker, but I don't want to make "poorly behaved clients" (forever pending
, tasks that outlive their kill_timeout
, etc) an even worse issue for operators by merging this without some consideration to remediating these extremely difficult to track down bugs.
-	if alloc.TerminalStatus() {
+	if alloc.ClientTerminalStatus() {
 		continue
 	}
I'm not sure this code is ever hit. The 2 places we call AllocsFit both already filter down to just non-terminal allocs as far as I can tell.
The call in the plan applier calls AllocsByNodeTerminal(false) to generate the list of allocs, so all terminal allocs are already removed AFAICT.
The call in the worker is also passed a list of allocs from AllocsByNodeTerminal(false).
AllocsByNodeTerminal
itself makes this all pretty tricky to trace because it's difficult to know the provenance of any given []*Allocation
and how it's been populated and filtered.
I think leaving this filter logic in place is ok as a defensive measure, but I think we need to ensure the half-terminal allocs aren't filtered out "upstream."
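To make the effect of the filter concrete, here is a minimal sketch of the accounting, assuming simplified stand-in types. `Alloc`, `clientTerminal`, and `allocsFit` below are hypothetical names for illustration and are not the real `nomad/structs` definitions; the real `AllocsFit` tracks more resource dimensions than CPU and memory.

```go
package main

import "fmt"

// Alloc is a simplified, hypothetical stand-in for a Nomad allocation.
type Alloc struct {
	ID            string
	DesiredStatus string // what the server wants: "run", "stop", "evict"
	ClientStatus  string // what the client reports: "pending", "running", "complete", "failed", "lost"
	CPU, MemoryMB int
}

// clientTerminal reports whether the client has said the alloc is finished.
func (a Alloc) clientTerminal() bool {
	switch a.ClientStatus {
	case "complete", "failed", "lost":
		return true
	}
	return false
}

// allocsFit sums the resources of every alloc the client has not reported as
// terminal and compares the totals against the node's capacity. It returns
// the name of the exhausted dimension when the allocs do not fit.
func allocsFit(nodeCPU, nodeMemMB int, allocs []Alloc) (bool, string) {
	var cpu, mem int
	for _, a := range allocs {
		if a.clientTerminal() {
			continue // defensive: callers may already have filtered these out
		}
		cpu += a.CPU
		mem += a.MemoryMB
	}
	if cpu > nodeCPU {
		return false, "cpu"
	}
	if mem > nodeMemMB {
		return false, "memory"
	}
	return true, ""
}

func main() {
	allocs := []Alloc{
		// Stopped by the server, but the client has not confirmed it exited.
		{ID: "old", DesiredStatus: "stop", ClientStatus: "running", CPU: 600, MemoryMB: 512},
		{ID: "new", DesiredStatus: "run", ClientStatus: "pending", CPU: 600, MemoryMB: 512},
	}
	fit, dimension := allocsFit(1000, 2048, allocs)
	fmt.Println(fit, dimension)
}
```

In this sketch the stopped-yet-running alloc still counts, so the new alloc does not fit until the client reports the old one as terminal. The filter only helps if half-terminal allocs reach the function in the first place, which is the "upstream" concern raised above.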
I was wrong. This is hit in the plan applier! Tested against main (c2428e1 - v1.3.2-dev) and rebased this commit on top of it.
I missed that after the call to snap.AllocsByNodeTerminal(ws, nodeID, false)
(which returns only non-terminal allocs) we append the allocs from the Plan which include the failed alloc: https://github.com/hashicorp/nomad/blob/v1.3.1/nomad/plan_apply.go#L699
To repro:
nomad agent -dev
nomad job run example.nomad
nomad alloc signal $allocid
# wait for restart
nomad alloc signal $allocid
# wait for restart
nomad alloc signal $allocid
# eval for rescheduling sees old alloc as DesiredStatus=run + ClientStatus=failed
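The shape of what the repro exercises can be sketched roughly as follows. This is an illustration under assumptions, not the actual plan applier code: `Alloc`, `proposedAllocs`, and `countClientTerminal` are hypothetical names, and the only behavior taken from the discussion is that the state query returns non-terminal allocs while the allocs appended from the plan are not filtered.

```go
package main

import "fmt"

// Alloc is a simplified, hypothetical stand-in for a Nomad allocation.
type Alloc struct {
	ID            string
	DesiredStatus string
	ClientStatus  string
}

// clientTerminal reports whether the client considers the alloc finished.
func clientTerminal(a Alloc) bool {
	return a.ClientStatus == "complete" || a.ClientStatus == "failed" || a.ClientStatus == "lost"
}

// proposedAllocs mimics the input to the fit check: allocs already filtered
// to non-terminal by the state query, plus the plan's allocs appended as-is.
func proposedAllocs(nonTerminalFromState, fromPlan []Alloc) []Alloc {
	out := append([]Alloc{}, nonTerminalFromState...)
	return append(out, fromPlan...)
}

// countClientTerminal counts how many allocs the in-function filter would skip.
func countClientTerminal(allocs []Alloc) int {
	n := 0
	for _, a := range allocs {
		if clientTerminal(a) {
			n++
		}
	}
	return n
}

func main() {
	fromState := []Alloc{{ID: "other", DesiredStatus: "run", ClientStatus: "running"}}
	fromPlan := []Alloc{
		// The rescheduling eval still sees the old alloc as run + failed.
		{ID: "old", DesiredStatus: "run", ClientStatus: "failed"},
		{ID: "replacement", DesiredStatus: "run", ClientStatus: "pending"},
	}
	all := proposedAllocs(fromState, fromPlan)
	fmt.Println(len(all), countClientTerminal(all))
}
```

Even though the state query filtered its results, the combined list contains one client-terminal alloc, so the filter inside the fit check is reached.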
We have been running this patch in our cluster since it was released and it's solved this problem for us; is there any way we can help contribute to confidence in merging something?
Thanks for the report @benbuzbee! It's very helpful in cases like this where a change has pretty subtle yet far-reaching implications. We'll consider getting this into main.
c79161e
to
53462db
Compare
This is officially a WIP that I intend to get merged. Still needs quite a bit more testing, but upon investigation I think most of my concerns were unfounded.
I have observed cases where a Many other Next steps are:
|
tl;dr -> Did some manual testing and confirmed this does not interfere with migrate=true and sticky=true.

This patch does not interfere with the existing alloc replacement (due to deployments or preemption) code paths. If you simply redeploy a job using those, the replacement allocs are created atomically (as part of the same plan) as the old allocs are stopped. This is the same as the old behavior and is optimal: in the case of a deployment, the Nomad client agents handle waiting for the original to exit before allowing the replacement to start. (Preemption also uses this for "original" and "replacement" allocs of 2 different jobs. The scheduler must make the stop-old/place-new placement atomically, otherwise another placement could interleave with the stopping-of-old and placing-of-new.)

But that's not what #10440 is about or what this impacts: this is all about when the stop and placement are discrete operations. First the job is stopped. Then the job (or a job using the same resources) is started. In this case the Nomad client agent's waiting code is completely bypassed as there's no PreviousAllocation set. From the server's perspective there's no causality between the old alloc using a resource and the new alloc wanting it...

...so Mahmood's patch closes this gap by making the scheduler calculate resource usage based on what's actually running on the client (based on ClientStatus), not what the server wants to be running on the client (DesiredStatus | ClientStatus).
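The difference between the two accounting rules can be tabulated with a small sketch. The `state` type and the four helper functions are hypothetical simplifications, not Nomad's actual methods; the status strings follow the names used in this thread, with "complete" standing in for a stopped alloc.

```go
package main

import "fmt"

// state pairs the server's desired status with the client-reported status.
type state struct{ desired, client string }

// serverTerminal: the server no longer wants the alloc to run.
func serverTerminal(s state) bool { return s.desired == "stop" || s.desired == "evict" }

// clientTerminal: the client has reported that the alloc is finished.
func clientTerminal(s state) bool {
	return s.client == "complete" || s.client == "failed" || s.client == "lost"
}

// countedBefore models the old rule: resources are released as soon as
// either the server or the client considers the alloc terminal.
func countedBefore(s state) bool { return !serverTerminal(s) && !clientTerminal(s) }

// countedAfter models the patched rule: resources are held until the client
// reports a terminal status, regardless of what the server desires.
func countedAfter(s state) bool { return !clientTerminal(s) }

func main() {
	states := []state{
		{"run", "pending"},
		{"run", "running"},
		{"stop", "running"}, // the stopped-yet-running window
		{"stop", "complete"},
	}
	for _, s := range states {
		fmt.Printf("%-4s | %-8s before=%-5t after=%t\n",
			s.desired, s.client, countedBefore(s), countedAfter(s))
	}
}
```

Only the stop | running row differs between the two rules, which is exactly the window in which a second, causally unrelated placement could previously overlap with the old alloc.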
Some attempts at trying to visually demonstrate the change: Each box represents the
Original behavior
gantt
title Overlapping allocations
dateFormat x
axisFormat %s
section Job 1
Registered :j1, 0, 2s
run | pending :j1pending, after j1, 2s
run | running :j1running, after j1pending, 2s
stop | running :j1stopping, after j1running, 2s
stop | stopped :j1stopped, after j1stopping, 2s
section Job 2
Registered :j2, after j1pending, 1s
Blocked :j2blocked, after j2, 1s
run | pending :j2pending, after j1running, 1s
run | running :j2running, after j2pending, 1s
Implemented behavior
gantt
title Non-overlapping allocations
dateFormat x
axisFormat %s
section Job 1
Registered :j1, 0, 2s
run | pending :j1pending, after j1, 2s
run | running :j1running, after j1pending, 2s
stop | running :j1stopping, after j1running, 2s
stop | stopped :j1stopped, after j1stopping, 2s
section Job 2
Registered :j2, after j1pending, 1s
Blocked --> :j2blocked, after j2, 1s
run | pending :j2pending, after j1stopping, 1s
run | running :j2running, after j2pending, 1s
|
Also add a simpler Wait test helper to improve line numbers and save a few lines of code.
53462db
to
9d06bd6
Compare
it's not concise... feedback welcome
testutil/wait.go
Outdated
func Wait(t *testing.T, test testFn) {
	t.Helper()
	retries := 500 * TestMultiplier()
	for retries > 0 {
		time.Sleep(10 * time.Millisecond)
		retries--

		success, err := test()
		if success {
			return
		}

		if retries == 0 {
			t.Fatalf("timeout: %v", err)
		}
	}
}
I just got sick of writing the error callback. LMK if I should stop being lazy and switch to WaitForResult. 😅
I was tempted to write a similar helper recently, and I've had a few review comments on other folks' PRs of late where not returning an error in the return false, nil
case ends up making for hard-to-debug test failures. There are ~600 uses of WaitForResult*
helpers, so at some point I'd like to take a pass through and see how many we could move to this simpler helper... I think it's a lot.
I'd also like to see this pattern cleaned up - but this particular helper doesn't prevent the nil
error issue, and also I think the concept of TestMultiplier
can be eliminated in favor of context/timeout
What if I added
if err == nil {
t.Fatalf("timeout waiting for test function to succeed (you should probably return a helpful error instead of nil!)")
}
The file name and line number will take you to the beginning of the test function that returned false, nil
. Since the stack frame responsible for returning false, nil
... returned ... there's really nothing else we can do here.
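For comparison, one way the two suggestions in this thread (a guard against nil errors, and a context/timeout instead of TestMultiplier) might combine is sketched below. This is not the helper that was merged: `waitFor`, `errNoReason`, and `demo` are hypothetical names, and the function returns an error rather than calling t.Fatalf so that it can run outside a test.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// testFn matches the callback shape used by the Wait helper in the diff.
type testFn func() (bool, error)

// errNoReason replaces a nil error so that a timeout always carries a message.
var errNoReason = errors.New("test function returned false with a nil error; return a descriptive error instead")

// waitFor polls fn every interval until it succeeds or ctx is done. On
// timeout it returns the last error fn reported, wrapped for context.
func waitFor(ctx context.Context, interval time.Duration, fn testFn) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		ok, err := fn()
		if ok {
			return nil
		}
		if err == nil {
			err = errNoReason
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("timeout: %w", err)
		case <-ticker.C:
		}
	}
}

// demo polls a function that succeeds on its n-th call and never supplies
// an error of its own, using a 100ms overall timeout.
func demo(n int) error {
	ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
	defer cancel()
	calls := 0
	return waitFor(ctx, 5*time.Millisecond, func() (bool, error) {
		calls++
		return calls >= n, nil
	})
}

func main() {
	fmt.Println(demo(3))       // succeeds on the third poll
	fmt.Println(demo(1 << 30)) // never succeeds within the timeout
}
```

A test helper built on this would call t.Helper() and then t.Fatalf with the returned error. As noted above, the reported line can still only point at the start of the test function, so the guard improves the message rather than the location.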
I'll write an e2e test in a followup PR because I want to get this reviewed ASAP. The job_endpoint test I wrote is unfortunately long, but it does properly exercise this behavior. When run on
Which indicates that the 2nd job registration placed an allocation before the old allocation had a terminal client status. |
(I have to approve my own changes to clear my old review status.)
LGTM!
I think you also need to make the same change to the DeviceAccounter
?
https://github.com/hashicorp/nomad/blob/v1.3.5/nomad/structs/devices.go#L63-L66
A couple of other things I looked into:
- CPU cores will be handled by AllocsFit like other resources.
- CSI volumes have a completely different tracking mechanism (claims) and are released via a Postrun allocrunner hook.
- Host volume usage doesn't seem to be tracked? Or at least I couldn't find where it happens 😅
I haven't checked quotas (yet), but that would have to be in a separate PR anyway.
	s1, cleanupS1 := TestServer(t, func(c *Config) {
	})
-	s1, cleanupS1 := TestServer(t, func(c *Config) {
-	})
+	s1, cleanupS1 := TestServer(t, nil)
LGTM
Interestingly, with this change and by updating the volumewatcher to query allocs and not just volumes (something I've been stewing on), we might be able to simplify the claim unpublish workflow a bit more. |
@@ -7,6 +7,7 @@ import (
 	"testing"

 	"github.com/hashicorp/nomad/ci"
+	"github.com/stretchr/testify/assert"
Just FYI, the top-level test package from shoenig/test is analogous to testify/assert (marking failure without stopping).
Great catch! Fixing.
Good call. Looking into that now as well. |
Fixed in 102af72. Thanks!
Quotas broadly uses If at some point in the future we find a compelling reason to make Quotas usage more precise, this will serve as useful prior art. |
Followup to #10446 Fails (as expected) against 1.3.x at the wait for blocked eval (because the allocs are allowed to overlap). Passes against 1.4.0-beta.1 (as expected).
* test: add e2e for non-overlapping placements Followup to #10446 Fails (as expected) against 1.3.x at the wait for blocked eval (because the allocs are allowed to overlap). Passes against 1.4.0-beta.1 (as expected). * Update e2e/overlap/overlap_test.go Co-authored-by: James Rasell <[email protected]>
This is a straw-man PR to get the discussion going. Needs more testing before merging.
Updates the scheduler so that it treats the alloc as running and using its resources until it truly completes as reported by the client. Typically, an alloc terminates quickly upon being stopped, but an alloc may take a "long" time to properly shut down and free up exclusive resources: e.g. ports, CSI/host volumes.
Previously, the scheduler optimistically treated stopped-yet-running allocs as terminated and could schedule an alloc in their place. This introduced a window where both the stopped-yet-running and the new allocs are running on the client. During the window, the client's resources can be unexpectedly oversubscribed (triggering OOM) or port/volume collisions can occur. Avoiding such a window seems like the correct behavior.
I conjecture that in a typical cluster, the scheduler will simply place the alloc on a new client without user-visible downside. For some constrained jobs, the alloc start may be delayed until the previous alloc has stopped.
An Alternative
Alternatively, the client may order alloc operations so it starts new allocs only after conflicting stopped allocs complete. Ordering the operations is more complex to implement. It also delays starting the task unnecessarily - the new alloc would have started faster had it been placed on another client without a conflicting alloc.
Fixes #10440