-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
scheduler: cache resource comparisons for each alloc #11726
Conversation
During scheduling, a given allocation will potentially be compared for fit against a large number of nodes, particularly if the job has `spread` blocks or the cluster is large and few nodes meet the feasibility. The scheduler recalculates the `ComparableResources` struct on the allocation for each node compared against, but this data doesn't change for this evaluation. Cache this information to reduce allocations and CPU time. This patch was benchmarked on a large cluster with a `spread` job and reduced heap allocations by ~16MB (which only amounts to ~2% of CPU time, but can't hurt).
@@ -158,7 +158,11 @@ func AllocsFit(node *Node, allocs []*Allocation, netIdx *NetworkIndex, checkDevi | |||
continue | |||
} | |||
|
|||
cr := alloc.ComparableResources() | |||
if alloc.comparableResources == nil { | |||
alloc.comparableResources = alloc.ComparableResources() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is technically a race condition because 2 or more worker goroutines in the same process could be calculating the resources for the same allocs on the same node concurrently. It is the "safest" race I can imagine because (a) each goroutine will write the same values, and (b) cross cpu-core visibility of these writes isn't even necessary for the optimization to be beneficial.
However, I think it also breaks an invariant: objects in memdb are never mutated and only overwritten by the FSM. Again, I think this is the "safest" form of breaking that invariant. I can't think of a problem it could cause with existing code.
So I think we should consider another approach. It'd be nice to reliably use the race detector someday. 😬 (An even greater dream is some sort of immutability checker to assert the memdb invariant, but unless Go gains first class immutability support I don't see that happening.)
...so now I feel obligated to try to come up with an alternative and regretting everything. 😅 So many architectural aspects of Nomad work against us here:
- No universal Alloc constructor where we could memoize
- No thread locals or context bags to memoize into
alloc
in this func come from 3 distinct sources! Worker->Memdb, Worker, and Plan Applier. Makes it harder to make this the caller's problem.- Throwing a Mutex into
ComparableResources()
and memoizing in it doesn't fix theno mutating memdb objects
invariant and sets a dangerous precedent that might lead people to break that invariant more in future. - We presumably calculate this on the fly to save memory and bytes on the wire, so promoting it to a "real" field created by Workers and persisted by the FSM may cost more performance than memoization saves?
map[alloc.ID]*ComparableResources
on servers is one way to fix this by storing the state elsewhere. Sadly either the state needs to be guarded by a mutex (due to access by multiple Workers) or it can't be shared. Not to mention the code would be far more complicated than this 4 liner. An LRU would let us use only a fixed amount of memory, but I have no idea how to choose a useful number for it.
Having the Worker generate alloc.comparableResources
only for the allocs it creates is another solution, but I'm afraid leaking the calculated state to memdb is actually a really nice feature for real clusters (especially using spread{}
where the same nodes with the same allocs might be iterated over multiple times).
If we intentionally persisted the results we'd obviously always save on CPU and memory allocations/gc, but I'm not sure how to judge that CPU vs disk/network/memory tradeoff.
Here's hoping I'm missing something and either this approach is ok or a slight variation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oof... I was thinking that we don't run into concurrency or memdb persistence issues because in the scheduler the Allocation object hasn't actually been persisted to memdb yet and is entirely owned by that scheduler worker... but I'd totally neglected to remember we call AllocFits
in the plan applier as well. So this is definitely not safe as is.
I'm hesitant to trade-off a tiny benefit in scheduler CPU fo disk/network because we know disk IOPS in particular has been a source of trouble for end users. But in any event my assumption was that these values were only unchanging within a single scheduler stack iteration, and persisting them would break that assumption.
I was happy to try to get this in for an hour or two's worth of work for a tiny benefit, but I suspect we'd be better served using any larger amount of time to make the scheduler more observable so we can target bigger architectural issues there. So I think I'm going to close this out if you don't object?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was happy to try to get this in for an hour or two's worth of work for a tiny benefit, but I suspect we'd be better served using any larger amount of time to make the scheduler more observable so we can target bigger architectural issues there. So I think I'm going to close this out if you don't object?
Yeah. I suspect we'll be referencing this PR for years to come as an example of how tricky these things are, so your efforts are not wasted!
I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions. |
During scheduling, a given allocation will potentially be compared for
fit against a large number of nodes, particularly if the job has
spread
blocks or the cluster is large and few nodes meet thefeasibility. The scheduler recalculates the
ComparableResources
struct on the allocation for each node compared against, but this data
doesn't change for this evaluation.
Cache this information to reduce allocations and CPU time. This patch
was benchmarked on a large cluster with a
spread
job and reducedheap allocations by ~16MB (which only amounts to ~2% of CPU time, but
can't hurt).
This popped up while investigating #11712 and obviously is small potatoes,
but a little extra efficiency without any tradeoff sounds good to me.