document expected evaluation counts #13468

Closed
tgross opened this issue Jun 23, 2022 · 6 comments · Fixed by #14750
Labels
theme/docs Documentation issues and enhancements theme/scheduling

Comments

@tgross
Member

tgross commented Jun 23, 2022

The number of evaluations generated by Nomad is an important metric for operators to understand its performance. We should improve the documentation on the expected number of evaluations we create and what this means for how many nodes are processed by the scheduler.

Some example events that create Evaluations:

  • For one job registration, a single Evaluation is created in raft.
  • For one allocation failure, a single Evaluation is created in raft.
  • For each chunk of a deployment, a single Evaluation is created in raft.
  • For each run of a periodic job, a single Evaluation is created in raft.
  • For each periodic server garbage collection event, a single Evaluation is created in raft.
  • For one node registration event (including missed heartbeats):
    • One Evaluation is created in raft for each system job in the cluster, plus...
    • One Evaluation is created for each non-system job that has an allocation on that node.
    • ref node_endpoint.go#L1445-L1548
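The node-event rule above can be sketched as a small counting function. This is only an illustration of the stated rule, not Nomad's code from node_endpoint.go: the `Job` type, the `evalsForNodeEvent` function, and the job names are all hypothetical.

```go
package main

import "fmt"

// Job is a hypothetical, simplified stand-in for a Nomad job.
type Job struct {
	ID     string
	System bool
}

// evalsForNodeEvent counts the Evaluations created for a single node event:
// one per system job in the cluster, plus one per non-system job that has
// an allocation on the affected node.
func evalsForNodeEvent(jobs []Job, jobsOnNode map[string]bool) int {
	n := 0
	for _, j := range jobs {
		if j.System || jobsOnNode[j.ID] {
			n++
		}
	}
	return n
}

func main() {
	jobs := []Job{
		{ID: "log-shipper", System: true},
		{ID: "metrics-agent", System: true},
		{ID: "web"},
		{ID: "api"},
	}
	// Only "web" has an allocation on the node that changed.
	onNode := map[string]bool{"web": true}
	fmt.Println(evalsForNodeEvent(jobs, onNode)) // prints 3
}
```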

For each Evaluation in raft:

  • A scheduler worker must dequeue it from the eval broker on the leader.
  • For each allocation in a batch job or service job without spread, the scheduler will evaluate (process) nodes until it finds 2 feasible nodes to score.
  • For each allocation in a batch job or service job with spread, the scheduler will evaluate (process) a number of nodes equal to the task group count, or 100, whichever is less.
  • For a system job, the scheduler will evaluate (process) all nodes on the cluster.
  • If there are changes to make, the scheduler worker must submit a plan to the plan applier on the leader. No-op evaluations are not submitted as plans.
  • If the scheduler can't place all the allocations, one new Evaluation will be created in the "blocked" state. Once a scheduler worker dequeues that Evaluation on a later pass, it goes through these same steps from the top.
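As a rough sketch of the per-allocation rules above, the function below returns the lower and upper bound on nodes processed for one allocation. The names (`JobType`, `nodesProcessed`) are hypothetical, and capping the results at the cluster size is an assumption added here for small clusters, not something stated above.

```go
package main

import "fmt"

// JobType is a hypothetical enum for the job types discussed above.
type JobType int

const (
	Service JobType = iota
	Batch
	System
)

func minInt(a, b int) int {
	if a < b {
		return a
	}
	return b
}

// nodesProcessed returns the fewest and most nodes the scheduler would
// process for one allocation, following the rules listed above.
func nodesProcessed(t JobType, hasSpread bool, groupCount, clusterSize int) (lo, hi int) {
	switch {
	case t == System:
		// System jobs look at every node in the cluster.
		return clusterSize, clusterSize
	case hasSpread:
		// With a spread block: task group count or 100, whichever is less.
		limit := minInt(minInt(groupCount, 100), clusterSize)
		return limit, limit
	default:
		// Without spread: stop once 2 feasible nodes are found. In the
		// worst case every node is checked before that happens.
		return minInt(2, clusterSize), clusterSize
	}
}

func main() {
	fmt.Println(nodesProcessed(Service, false, 1, 10000))
	fmt.Println(nodesProcessed(Service, true, 500, 10000))
	fmt.Println(nodesProcessed(System, false, 0, 10100))
}
```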

Some concrete examples:

  • If you have 90 system jobs and 10k nodes existing, and add 100 nodes:
    • 90 * 100 = 9000 Evaluations will be created in raft (each one written to disk, replicated to a quorum of followers, etc.)
    • for each of those Evaluations created in raft, 10100 nodes will be evaluated ("processed") by the scheduler (9000 * 10100 = 90.9M total, all in-memory operations on the scheduler workers)
    • we'd expect between 90 and 9000 plans submitted, depending on how close together node updates land
  • If you have 90 system jobs and 10k nodes existing, and add 100 nodes, but after placement all the allocations fail once because of external dependencies:
    • 90 * 100 = 9000 initial Evaluations + 9000 Evaluations for the failed allocs = 18000 Evaluations
    • for each of those Evaluations created in raft, 10100 nodes will be evaluated ("processed"), for 18000 * 10100 = 181.8M operations in total.
    • we'd expect between 180 and 18000 plans submitted, depending on how close together node and alloc updates land
  • If you have 10k nodes existing, and 100 nodes that each run 100 allocations belonging to unique non-spread service jobs stop and then restart (2 node update events):
    • 100 * 100 * 2 = 20000 Evaluations will be created in raft.
    • For each of those Evaluations created in raft a maximum of 2 nodes will be scored because in this case each job has 1 alloc. But this could mean anywhere from 40000 nodes evaluated ("processed") to 200M nodes evaluated ("processed"), depending on how sparse the feasibility of nodes for the job is. (Typically leaning towards the low end of that range outside of extremely strange cluster configurations.)
    • we'd expect between 200 and 20000 plans submitted, depending on how close together node updates land
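The arithmetic in these examples can be checked with a short program. The helper names below are hypothetical and the inputs are the figures from the examples. Multiplying them out gives 9000 * 10100 = 90,900,000 nodes processed for the first scenario and 18000 * 10100 = 181,800,000 for the second.

```go
package main

import "fmt"

// systemJobScenario returns the Evaluations written to raft and the total
// nodes processed when nodes are added to a cluster running system jobs.
// rounds is 1 for the initial placement, 2 if every alloc then fails once.
func systemJobScenario(systemJobs, existingNodes, addedNodes, rounds int) (evals, processed int) {
	evals = systemJobs * addedNodes * rounds
	// Each system job Evaluation processes every node in the cluster.
	processed = evals * (existingNodes + addedNodes)
	return evals, processed
}

// serviceRestartScenario covers nodes full of single-alloc, non-spread
// service jobs going through a number of node update events. It returns the
// Evaluations plus the best and worst case for nodes processed.
func serviceRestartScenario(nodes, allocsPerNode, events, clusterSize int) (evals, lo, hi int) {
	evals = nodes * allocsPerNode * events
	lo = evals * 2           // 2 feasible nodes found immediately
	hi = evals * clusterSize // every node checked before 2 are found
	return evals, lo, hi
}

func main() {
	fmt.Println(systemJobScenario(90, 10000, 100, 1))
	fmt.Println(systemJobScenario(90, 10000, 100, 2))
	fmt.Println(serviceRestartScenario(100, 100, 2, 10000))
}
```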
@tgross tgross added theme/docs Documentation issues and enhancements theme/scheduling labels Jun 23, 2022
@robloxrob

Fantastic write up. Add in some charts and diagrams, then you've got a stew going.

@tgross
Member Author

tgross commented Jun 27, 2022

I've added some missing bits about deployments and periodic jobs. There's a lot more detail to some of these that we'll want to capture in the actual docs, but I want to make sure we're not missing any of the big ones here.

@ghshephard

Amazing write up Tim. I spent a quality 90 minutes carefully reading through it and comparing with actual cluster sizes/allocations/etc...

Follow up on one item:

For each allocation in a batch job or service job with spread, the scheduler will evaluate (process) all nodes on the cluster.

I was wondering if this applies to all our jobs given that we use the spread (instead of binpack) scheduler for our jobs. So, if we had a node with 62 allocations, and that node went down, and there were 10,000 nodes remaining in the cluster, does that mean that 62 * 10,000 or 620,000 evaluations will occur? And, for completeness, if we had 1000 nodes restart each with 62 allocations on them, would we see 620M evaluations occur if 10,000 nodes remained in the cluster?

ChrisL asked:

So be careful with what ‘spread’ means here. I am not sure, and want to follow up to see whether this is specific to a job with a spread stanza or whether it applies all the time when using the spread scheduler. I know we’ve had issues with spread stanzas severely impacting placement times for large jobs, and so have removed them. Makes me wonder if this is just for spread stanzas.

@tgross
Member Author

tgross commented Jul 7, 2022

I was wondering if this applies to all our jobs given that we use the spread (instead of binpack) scheduler for our jobs.

No, they work differently and have different purposes:

  • The spread block changes the limit on the number of nodes processed from 2 to "many". This makes the scheduler spread a given job out as much as possible. Depending on your topology, this can help reduce the impact of node failures on the application.
  • The spread scheduling config doesn't change the number of nodes processed. Instead, it changes the scoring algorithm we use for binpacking from "best fit" to "worst fit" (ref funcs.go#L254-L297 or this paper if you really want to dig in), which statistically spreads out all the workloads across the cluster, especially on large clusters. For on-prem installations this can help distribute power and network utilization across your hardware.

This is why adding the spread block has a huge impact on scheduling time for large clusters (as ChrisL noted), whereas the spread scheduling config is just as fast as binpack.
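To make the "best fit" versus "worst fit" distinction concrete, here is a minimal sketch of two scoring functions. The formula shape (powers of 10 of the free CPU and memory fractions, clamped to the range 0 to 18) is written from memory of the referenced funcs.go and should be treated as an assumption. The function names and example values are hypothetical, so consult the referenced lines for the real implementation.

```go
package main

import (
	"fmt"
	"math"
)

func clamp(v, lo, hi float64) float64 {
	return math.Max(lo, math.Min(hi, v))
}

// scoreBinPack is a "best fit" score: the less capacity left free on a node
// after placement, the higher the score. freeCPU and freeMem are fractions
// in [0, 1].
func scoreBinPack(freeCPU, freeMem float64) float64 {
	total := math.Pow(10, freeCPU) + math.Pow(10, freeMem)
	return clamp(20-total, 0, 18)
}

// scoreSpread is the "worst fit" counterpart: the more capacity left free,
// the higher the score. It costs the same amount of work per node.
func scoreSpread(freeCPU, freeMem float64) float64 {
	total := math.Pow(10, freeCPU) + math.Pow(10, freeMem)
	return clamp(total-2, 0, 18)
}

func main() {
	// A nearly full node (10% free) versus a nearly empty one (90% free).
	fmt.Printf("binpack: full=%.2f empty=%.2f\n", scoreBinPack(0.1, 0.1), scoreBinPack(0.9, 0.9))
	fmt.Printf("spread:  full=%.2f empty=%.2f\n", scoreSpread(0.1, 0.1), scoreSpread(0.9, 0.9))
}
```

Both functions do the same constant amount of work per node, which is consistent with the spread scheduling config being just as fast as binpack: only the preference order changes, not the number of nodes examined.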

Your question made me remember that we recently changed the behavior so that spread doesn't have to process all the nodes anymore, just a number equal to the count of the task group or 100, whichever is less. That change shipped in Nomad 1.2.4 (#11712). The impact on scheduling time that ChrisL saw was much worse before that change.
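To put numbers on that change, the sketch below applies both behaviors to the 62-allocation, 10,000-node scenario from the question. It only applies to jobs that use a spread block, and the task group count of 5 is an assumed value chosen purely for illustration. The function name is hypothetical.

```go
package main

import "fmt"

// spreadNodesProcessed returns the nodes processed per allocation for a job
// with a spread block. limited=false models the older behavior (all nodes),
// limited=true models the behavior described above for Nomad 1.2.4 onward.
func spreadNodesProcessed(groupCount, clusterSize int, limited bool) int {
	if !limited {
		return clusterSize
	}
	limit := groupCount
	if limit > 100 {
		limit = 100
	}
	if limit > clusterSize { // assumption: cannot exceed the cluster size
		limit = clusterSize
	}
	return limit
}

func main() {
	const (
		allocs      = 62    // allocations on the failed node
		clusterSize = 10000 // remaining nodes
		groupCount  = 5     // assumed task group count, for illustration only
	)
	before := allocs * spreadNodesProcessed(groupCount, clusterSize, false)
	after := allocs * spreadNodesProcessed(groupCount, clusterSize, true)
	fmt.Println(before, after) // prints 620000 310
}
```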

@tgross
Member Author

tgross commented Sep 29, 2022

Fixed by #14750

@github-actions

github-actions bot commented Feb 1, 2023

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Feb 1, 2023