scheduler: cache resource comparisons for each alloc #11726
Closed
I think this is technically a race condition because 2 or more worker goroutines in the same process could be calculating the resources for the same allocs on the same node concurrently. It is the "safest" race I can imagine because (a) each goroutine will write the same values, and (b) cross cpu-core visibility of these writes isn't even necessary for the optimization to be beneficial.
However, I think it also breaks an invariant: objects in memdb are never mutated and only overwritten by the FSM. Again, I think this is the "safest" form of breaking that invariant. I can't think of a problem it could cause with existing code.
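For concreteness, a minimal sketch of the memoization pattern under discussion, using stub types and a hypothetical cache field rather than the PR's actual diff. It shows where the unsynchronized write lands and why it also counts as mutating a memdb-resident object:

```go
package structs

// Stub types only; the real Allocation and ComparableResources live in
// nomad/structs. They are simplified here to make the pattern concrete.
type ComparableResources struct {
	CPU      int64
	MemoryMB int64
}

type Allocation struct {
	AllocatedResources *ComparableResources

	// Hypothetical cache field standing in for what the PR effectively adds.
	comparableResources *ComparableResources
}

// ComparableResources memoizes its result on the alloc itself. If the alloc
// came out of memdb, two scheduler workers (or a worker and the plan applier)
// can run this concurrently: both compute the same value, but the
// unsynchronized write below is still a data race, and it mutates an object
// the "memdb objects are never mutated" invariant says must stay read-only.
func (a *Allocation) ComparableResources() *ComparableResources {
	if a.comparableResources != nil { // unsynchronized read
		return a.comparableResources
	}
	cr := &ComparableResources{ // stand-in for the real flattening logic
		CPU:      a.AllocatedResources.CPU,
		MemoryMB: a.AllocatedResources.MemoryMB,
	}
	a.comparableResources = cr // unsynchronized write into a shared object
	return cr
}
```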
So I think we should consider another approach. It'd be nice to reliably use the race detector someday. 😬 (An even greater dream is some sort of immutability checker to assert the memdb invariant, but unless Go gains first class immutability support I don't see that happening.)
...so now I feel obligated to try to come up with an alternative and regretting everything. 😅 So many architectural aspects of Nomad work against us here:
- `alloc`s in this func come from 3 distinct sources! Worker->Memdb, Worker, and Plan Applier. Makes it harder to make this the caller's problem.
- `ComparableResources()` and memoizing in it doesn't fix the "no mutating memdb objects" invariant, and sets a dangerous precedent that might lead people to break that invariant more in future.
- `map[alloc.ID]*ComparableResources` on servers is one way to fix this by storing the state elsewhere (see the sketch after this list). Sadly either the state needs to be guarded by a mutex (due to access by multiple Workers) or it can't be shared. Not to mention the code would be far more complicated than this 4 liner. An LRU would let us use only a fixed amount of memory, but I have no idea how to choose a useful number for it.
- Having the Worker generate `alloc.comparableResources` only for the allocs it creates is another solution, but I'm afraid leaking the calculated state to memdb is actually a really nice feature for real clusters (especially using `spread{}` where the same nodes with the same allocs might be iterated over multiple times).
- If we intentionally persisted the results we'd obviously always save on CPU and memory allocations/gc, but I'm not sure how to judge that CPU vs disk/network/memory tradeoff.
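As a rough illustration of the "store the state elsewhere" option above, here is a hedged sketch of a server-side cache keyed by alloc ID and guarded by a mutex; all names are illustrative and none of this is code from the PR:

```go
package scheduler

import "sync"

// Illustrative stand-in for nomad/structs.ComparableResources.
type ComparableResources struct {
	CPU      int64
	MemoryMB int64
}

// resourceCache stores comparable resources outside memdb so no memdb
// object is ever mutated. The mutex is required because multiple scheduler
// workers (and the plan applier) would share one cache.
type resourceCache struct {
	mu    sync.RWMutex
	cache map[string]*ComparableResources // alloc.ID -> cached comparison
}

func newResourceCache() *resourceCache {
	return &resourceCache{cache: make(map[string]*ComparableResources)}
}

// get returns the cached value for allocID, computing and storing it on a
// miss. compute stands in for the existing ComparableResources() logic.
func (c *resourceCache) get(allocID string, compute func() *ComparableResources) *ComparableResources {
	c.mu.RLock()
	cr, ok := c.cache[allocID]
	c.mu.RUnlock()
	if ok {
		return cr
	}

	cr = compute()
	c.mu.Lock()
	c.cache[allocID] = cr // unbounded growth; an LRU would cap memory
	c.mu.Unlock()
	return cr
}
```

The downside called out above is visible even in this small sketch: every lookup takes a lock that all workers contend on, the map grows without bound unless it is replaced with an LRU of some hard-to-choose size, and the whole thing is considerably more code than the original 4-line memoization.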
Here's hoping I'm missing something and either this approach is ok or a slight variation.
Oof... I was thinking that we don't run into concurrency or memdb persistence issues because in the scheduler the Allocation object hasn't actually been persisted to memdb yet and is entirely owned by that scheduler worker... but I'd totally neglected to remember we call `AllocFits` in the plan applier as well. So this is definitely not safe as is.

I'm hesitant to trade off a tiny benefit in scheduler CPU for disk/network because we know disk IOPS in particular has been a source of trouble for end users. But in any event my assumption was that these values were only unchanging within a single scheduler stack iteration, and persisting them would break that assumption.

I was happy to try to get this in for an hour or two's worth of work for a tiny benefit, but I suspect we'd be better served using any larger amount of time to make the scheduler more observable so we can target bigger architectural issues there. So I think I'm going to close this out if you don't object?
Yeah. I suspect we'll be referencing this PR for years to come as an example of how tricky these things are, so your efforts are not wasted!