scheduler: fix quadratic performance with spread blocks #11712
Conversation
Great work. I wonder if we should improve our documentation too. I think this paragraph could be expanded from:

Spread criteria are treated as a soft preference by the Nomad scheduler. If no nodes match a given spread criteria, placement is still successful.

To something like:

Spread criteria are treated as a soft preference by the Nomad scheduler. If no nodes match a given spread criteria, placement is still successful. To avoid scoring every node for every placement, allocations may not be perfectly spread. Spread works best on attributes with a similar number of nodes: identically configured racks or similarly configured datacenters.

If anyone needs absolute precision with spread, we could make the timeout for evaluations configurable. Given the high concurrency of scheduling workers, operators may tolerate a small number of very slow (e.g. 5-minute) evaluations.
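To make the "may not be perfectly spread" caveat concrete, here is a minimal illustrative sketch in plain Go. It is not Nomad code; `evenSpread` is a hypothetical helper that just shows what a perfectly even distribution looks like, which the relaxed scoring may deviate from by a small margin (the PR's tests allow per-value differences in the range [0, 1, 2]).

```go
package main

import "fmt"

// evenSpread shows the "perfect" distribution of allocations across the
// values of a spread attribute, e.g. 10 allocations over 3 racks -> [4 3 3].
// With limited scoring the actual result may differ slightly per value.
func evenSpread(allocs, values int) []int {
	counts := make([]int, values)
	for i := 0; i < allocs; i++ {
		counts[i%values]++
	}
	return counts
}

func main() {
	fmt.Println(evenSpread(10, 3)) // [4 3 3]
	fmt.Println(evenSpread(50, 4)) // [13 13 12 12]
}
```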
I've picked up your suggestions and added the docs change. I've adjusted the test so that it's a bit less strict (effectively allowing for [0, 1, 2], which we need for the large rack counts) and produces more readable error outputs. I've also added a note to the change in
Adds a package `scheduler/benchmarks` with some examples of profiling and benchmarking the scheduler, along with helpers for loading real-world data for profiling. This tooling comes out of work done for #11712. These test benchmarks have not been added to CI because these particular profiles are mostly examples and the runs will add an excessive amount of time to CI runs for code that rarely changes in a way that has any chance of impacting performance.
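For illustration only, here is a minimal sketch of the style of Go benchmark such a package might contain. `pickNodes` and the node/allocation counts are hypothetical stand-ins that merely simulate limited vs. unlimited scoring; this is not Nomad's scheduler code or the actual contents of `scheduler/benchmarks`.

```go
package benchmarks

import (
	"math/rand"
	"testing"
)

// pickNodes simulates one scheduling pass: for each placement the node list
// is shuffled (standing in for the RandomIterator) and the first `limit`
// candidates are "scored".
func pickNodes(nodes []int, allocs, limit int) int {
	if limit > len(nodes) {
		limit = len(nodes)
	}
	scored := 0
	for i := 0; i < allocs; i++ {
		rand.Shuffle(len(nodes), func(a, b int) { nodes[a], nodes[b] = nodes[b], nodes[a] })
		for j := 0; j < limit; j++ {
			scored += nodes[j] % 7 // stand-in for scoring work on one candidate
		}
	}
	return scored
}

func BenchmarkSpreadLimited(b *testing.B) {
	nodes := make([]int, 5000)
	for i := range nodes {
		nodes[i] = i
	}
	for i := 0; i < b.N; i++ {
		pickNodes(nodes, 50, 50) // limit == desired allocation count
	}
}

func BenchmarkSpreadUnlimited(b *testing.B) {
	nodes := make([]int, 5000)
	for i := range nodes {
		nodes[i] = i
	}
	for i := 0; i < b.N; i++ {
		pickNodes(nodes, 50, len(nodes)) // score every eligible node
	}
}
```

A CPU profile for a benchmark like this can be captured with the standard toolchain, e.g. `go test -bench=BenchmarkSpreadLimited -cpuprofile=cpu.out` and then inspected with `go tool pprof`.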
When the scheduler picks a node for each evaluation, the `LimitIterator` provides at most 2 eligible nodes for the `MaxScoreIterator` to choose from. This keeps scheduling fast while producing acceptable results because the results are binpacked.

Jobs with a `spread` block (or node affinity) remove this limit in order to produce correct spread scoring. This means that every allocation within a job with a `spread` block is evaluated against _all_ eligible nodes. Operators of large clusters have reported that jobs with `spread` blocks that are eligible on a large number of nodes can take longer than the nack timeout to evaluate (60s). Typical evaluations are processed in milliseconds.

In practice, it's not necessary to evaluate every eligible node for every allocation on large clusters, because the `RandomIterator` at the base of the scheduler stack produces enough variation in each pass that the likelihood of an uneven spread is negligible. Note that feasibility is checked before the limit, so this only impacts the number of _eligible_ nodes available for scoring, not the total number of nodes.

This changeset sets the iterator limit for "large" `spread` block and node affinity jobs to be equal to the number of desired allocations. This brings an example problematic job evaluation down from ~3min to ~10s. The included tests ensure that we have acceptable spread results across a variety of large cluster topologies.
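As a rough sketch of the limit selection described above (purely illustrative; `scoringLimit` and its parameters are hypothetical names, not the actual Nomad identifiers):

```go
package main

import "fmt"

// scoringLimit sketches how many eligible nodes the scoring stack might
// consider per placement, following the approach described in this PR.
func scoringLimit(eligibleNodes, desiredAllocs int, hasSpreadOrAffinity bool) int {
	const defaultLimit = 2 // bin packing only needs a couple of candidates

	if !hasSpreadOrAffinity {
		return defaultLimit
	}
	// For spread/affinity jobs, score at most one candidate per desired
	// allocation instead of every eligible node, turning roughly
	// allocs*nodes scoring operations into allocs*allocs.
	if desiredAllocs < eligibleNodes {
		return desiredAllocs
	}
	return eligibleNodes
}

func main() {
	fmt.Println(scoringLimit(5000, 50, false)) // 2:  bin-packing default
	fmt.Println(scoringLimit(5000, 50, true))  // 50: capped by desired allocations
	fmt.Println(scoringLimit(30, 50, true))    // 30: never exceeds eligible nodes
}
```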
Note for reviewers: I have some additional PRs to make that'll improve our ability to debug this kind of problem in the future (e.g. instrumenting the iterators, tooling for handling large snapshots), but I've intentionally kept them out of this PR so we have a clean bugfix for backporting. I'll push those PRs up over the next couple of days.