document expected evaluation counts #13468

tgross · 2022-06-23T12:49:57Z

The number of evaluations generated by Nomad is an important metric for operators to understand its performance. We should improve the documentation on the expected number of evaluations we create and what this means for how many nodes are processed by the scheduler.

Some example events that create Evaluations:

For one job registration, a single Evaluation is created in raft.
For one allocation failure, a single Evaluation is created in raft.
For each chunk of a deployment, a single Evaluation is created in raft.
For each run of a periodic job, a single Evaluation is created in raft.
For each periodic server garbage collection event, a single Evaluation is created in raft.
For one node registration event (including missed heartbeats):
- One Evaluation is created in raft for each system job in the cluster, plus...
- One Evaluation is created for each non-system job that has an allocation on that node.
- ref node_endpoint.go#L1445-L1548
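
The node registration rule above can be sketched as simple arithmetic. This is only an illustration of the counts described in this issue, not Nomad's implementation; the function and parameter names are hypothetical.

```python
def evals_for_node_event(system_jobs: int, non_system_jobs_on_node: int) -> int:
    """Evaluations written to raft for ONE node registration event.

    One Evaluation per system job in the cluster, plus one per non-system
    job that has an allocation on the node (per the description above).
    """
    return system_jobs + non_system_jobs_on_node


def evals_for_node_events(
    system_jobs: int, non_system_jobs_on_node: int, node_events: int
) -> int:
    """Total Evaluations for several node events, assuming (hypothetically)
    that every affected node hosts the same number of non-system jobs."""
    return node_events * evals_for_node_event(system_jobs, non_system_jobs_on_node)


# 90 system jobs, 100 brand-new (empty) nodes registering once each.
print(evals_for_node_events(90, 0, 100))
```

Note that the count grows with the number of system jobs times the number of node events, which is why node churn is expensive on clusters with many system jobs.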

For each Evaluation in raft:

A scheduler worker must dequeue it from the eval broker on the leader.
For each allocation in a batch job or service job without spread, the scheduler will evaluate (process) nodes until it finds 2 feasible nodes to score.
For each allocation in a batch job or service job with spread, the scheduler will evaluate (process) a number of nodes equal to the task group count, or 100, whichever is less.
For a system job, the scheduler will evaluate (process) all nodes on the cluster.
If there are changes to make, the scheduler worker must submit a plan to the plan applier on the leader. No-op evaluations are not submitted as plans.
If the scheduler can't place all the allocations, one new Evaluation will be created in the "blocked" state. Once that Evaluation is dequeued, it starts from the top on its next pass through a scheduler.
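
As a rough sketch of the per-evaluation rules above, the bounds on nodes processed could be expressed like this. The names and constants are hypothetical and only restate the limits given in this issue (2 feasible nodes, or the lesser of the task group count and 100 for spread); it is not the scheduler's actual code.

```python
SPREAD_NODE_LIMIT = 100       # cap on nodes processed when a spread block is set
FEASIBLE_NODES_TO_SCORE = 2   # non-spread placements stop after 2 feasible nodes


def nodes_processed_bounds(job_type, cluster_nodes, task_group_count=1, has_spread=False):
    """Return (min, max) nodes processed: per Evaluation for system jobs,
    per allocation for batch and service jobs."""
    if job_type == "system":
        # System jobs process every node in the cluster.
        return (cluster_nodes, cluster_nodes)
    if has_spread:
        # Task group count or 100, whichever is less (and never more
        # nodes than the cluster actually has).
        n = min(task_group_count, SPREAD_NODE_LIMIT, cluster_nodes)
        return (n, n)
    # Without spread the scheduler stops at 2 feasible nodes. In the best
    # case the first 2 nodes are feasible; in the worst case it walks
    # the whole cluster looking for them.
    return (min(FEASIBLE_NODES_TO_SCORE, cluster_nodes), cluster_nodes)


print(nodes_processed_bounds("service", 10_000))
print(nodes_processed_bounds("service", 10_000, task_group_count=250, has_spread=True))
print(nodes_processed_bounds("system", 10_100))
```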

Some concrete examples:

If you have 90 system jobs and 10k nodes existing, and add 100 nodes:
- 90 * 100 = 9000 Evaluations will be created in raft (each one written to disk, replicated to a quorum of followers, etc.)
- for each of those Evaluations created in raft, 10100 nodes will be evaluated ("processed") by the scheduler (9000 * 10100 = 90.9M total, all in-memory operations on the scheduler workers)
- we'd expect between 90 and 9000 plans submitted, depending on how close together node updates land
If you have 90 system jobs and 10k nodes existing, and add 100 nodes, but after placement all the allocations fail once because of external dependencies:
- 90 * 100 = 9000 initial Evaluations + 9000 Evaluations for the failed allocs
- for each of those Evaluations created in raft, 10100 nodes will be evaluated ("processed"), for 18000 * 10100 = 181.8M operations in total.
- we'd expect between 180 and 18000 plans submitted, depending on how close together node and alloc updates land
If you have 10k nodes existing, and 100 nodes, each running 100 allocations of unique non-spread service jobs, stop and then restart (2 node update events):
- 100 * 100 * 2 = 20000 Evaluations will be created in raft.
- For each of those Evaluations created in raft a maximum of 2 nodes will be scored because in this case each job has 1 alloc. But this could mean anywhere from 40000 nodes evaluated ("processed") to 200M nodes evaluated ("processed"), depending on how sparse the feasibility of nodes for the job is. (Typically leaning towards the low end of that range outside of extremely strange cluster configurations.)
- we'd expect between 200 and 20000 plans submitted, depending on how close together node updates land
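
The arithmetic behind these three examples can be checked with a short script. This is a sketch under the assumptions stated in the examples (every new node gets one allocation per system job, each failed allocation produces exactly one extra Evaluation); the function names are hypothetical.

```python
def system_job_scenario(system_jobs, existing_nodes, added_nodes, failures_per_alloc=0):
    """Return (evaluations, nodes_processed) for adding nodes to a cluster
    that runs only system jobs."""
    initial_evals = system_jobs * added_nodes
    # Each allocation on a new node fails `failures_per_alloc` times,
    # creating one more Evaluation per failure.
    evals = initial_evals * (1 + failures_per_alloc)
    # Every system job Evaluation processes all nodes in the cluster.
    nodes_processed = evals * (existing_nodes + added_nodes)
    return evals, nodes_processed


def service_restart_scenario(cluster_nodes, restarted_nodes, jobs_per_node, node_events=2):
    """Return (evaluations, min_processed, max_processed) for nodes that
    run unique single-allocation, non-spread service jobs."""
    evals = restarted_nodes * jobs_per_node * node_events
    # Best case: the first 2 nodes are feasible. Worst case: all nodes are walked.
    return evals, evals * 2, evals * cluster_nodes


print(system_job_scenario(90, 10_000, 100))                        # (9000, 90900000)
print(system_job_scenario(90, 10_000, 100, failures_per_alloc=1))  # (18000, 181800000)
print(service_restart_scenario(10_000, 100, 100))                  # (20000, 40000, 200000000)
```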


robloxrob · 2022-06-23T15:54:02Z

Fantastic write up. Add in some charts and diagrams, then you've got a stew going.

tgross · 2022-06-27T12:47:46Z

I've added some missing bits about deployments and periodic jobs. There's a lot more detail to some of these that we'll want to capture in the actual docs, but I want to make sure we're not missing any of the big ones here.

ghshephard · 2022-07-07T05:00:41Z

Amazing write up Tim. I spent a quality 90 minutes carefully reading through it and comparing with actual cluster sizes/allocations/etc...

Follow up on one item:

For each allocation in a batch job or service job with spread, the scheduler will evaluate (process) all nodes on the cluster.

I was wondering if this applies to all our jobs given that we use the spread (instead of binpack) scheduler for our jobs. So, if we had a node with 62 allocations, and that node went down, and there were 10,000 nodes remaining in the cluster, does that mean that 62 * 10,000 or 620,000 evaluations will occur? And, for completeness, if we had 1000 nodes restart each with 62 allocations on them, would we see 620M evaluations occur if 10,000 nodes remained in the cluster?

ChrisL asked:

So be careful with what ‘spread’ means here. I am not sure, and want to follow up to see whether this is specific to a job with a spread stanza or whether it applies all the time when the spread scheduler is in use. I know we’ve had issues with spread stanzas severely impacting placement times for large jobs, and so have removed them. Makes me wonder if this is just for spread stanzas.

tgross · 2022-07-07T15:06:11Z

I was wondering if this applies to all our jobs given that we use the spread (instead of binpack) scheduler for our jobs.

No, they work differently and have different purposes:

The spread block changes the limit on the number of nodes processed from 2 to "many". This makes the scheduler spread a given job out as much as possible. Depending on your topology, this can help reduce the impact of node failures on the application.
The spread scheduling config doesn't change the number of nodes processed. Instead, it changes the scoring algorithm we use for binpacking from "best fit" to "worst fit" (ref funcs.go#L254-L297 or this paper if you really want to dig in), which statistically spreads out all the workloads across the cluster, especially on large clusters. For on-prem installations this can help distribute power and network utilization across your hardware.

This is why adding the spread block has a huge impact on scheduling time for large clusters (as ChrisL noted), whereas the spread scheduling config is just as fast as binpack.
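
To illustrate the "best fit" versus "worst fit" distinction, here is a toy, single-resource sketch. It is not Nomad's scoring function (see the funcs.go reference above for that); the node names, the CPU-only model, and the `pick_node` helper are all assumptions made for illustration. The point it shows is that both algorithms rank the same set of candidate nodes, so the number of nodes processed does not change.

```python
def pick_node(free_cpu_by_node, ask_cpu, algorithm="binpack"):
    """Pick a node for an allocation asking for `ask_cpu` (toy model).

    "binpack" ~ best fit: choose the feasible node left with the LEAST free CPU.
    "spread"  ~ worst fit: choose the feasible node left with the MOST free CPU.
    """
    # Feasibility check is identical for both algorithms.
    feasible = {name: free for name, free in free_cpu_by_node.items() if free >= ask_cpu}
    if not feasible:
        return None

    def leftover(name):
        return feasible[name] - ask_cpu

    if algorithm == "binpack":
        return min(feasible, key=leftover)
    return max(feasible, key=leftover)


nodes = {"node-a": 500, "node-b": 2000, "node-c": 4000}  # free CPU in MHz (made up)
print(pick_node(nodes, 400, "binpack"))  # node-a
print(pick_node(nodes, 400, "spread"))   # node-c
```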

Your question made me remember that we recently changed the behavior so that spread doesn't have to process all the nodes anymore, just a number equal to the count of the task group or 100, whichever is less. That change shipped in Nomad 1.2.4 (#11712). The impact on scheduling time that ChrisL saw was much worse before that change.

tgross · 2022-09-29T20:05:18Z

Fixed by #14750

github-actions · 2023-02-01T02:20:50Z

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

tgross added theme/docs Documentation issues and enhancements theme/scheduling labels Jun 23, 2022

tgross self-assigned this Sep 14, 2022

tgross mentioned this issue Sep 29, 2022

internals documentation with diagrams #14750

Merged

tgross closed this as completed in #14750 Oct 3, 2022

github-actions bot locked as resolved and limited conversation to collaborators Feb 1, 2023

tgross added this to Nomad - Community Issues Triage Jun 24, 2024

tgross moved this to Done in Nomad - Community Issues Triage Jun 24, 2024
