System scheduler blocked evals #5900
Conversation
I made a first pass, mostly by inspecting the code, and have some suggestions. While I follow the (un-)blocking logic here, I'd like to do a deeper review or have @preetapan do a closer examination.
nomad/blocked_evals.go (outdated)

// node maps a node id to the set of all blocked evals, mapped to their authentication tokens
// node map[string]map[*structs.Evaluation]string
node map[string]map[string]*wrappedEval
This seems to map the node ID and eval ID to the evaluation and authentication token?
Also, I don't quite follow the field names: they are singular but refer to collections, and they don't indicate the map structure.
I'd suggest clarifying the struct fields a bit more with a comment, something like: "systemEvals stores the blocked evaluations for system jobs, indexed by node and job to ease lookups in the scheduler." Then name the fields byJobID and byNodeID?
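For illustration, a minimal sketch of the suggested shape; the field names follow the comment above and the types mirror the existing node field, so this is not the PR's final code:

```go
// Sketch only: field names follow the review suggestion; types mirror the existing node field.

// systemEvals stores the blocked evaluations for system jobs, indexed by
// node and job to ease lookups in the scheduler.
type systemEvals struct {
	// byNodeID maps a node ID to the blocked evals (keyed by eval ID) on that node.
	byNodeID map[string]map[string]*wrappedEval

	// byJobID maps a job ID to the blocked eval ID for that job on each node.
	byJobID map[string]map[string]string
}
```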
nomad/blocked_evals.go (outdated)

}

// setSystemEval creates the inner map if necessary
func (b *BlockedEvals) setSystemEval(eval *structs.Evaluation, token string) {
setSystemEval and its siblings here operate purely on b.system and nothing else. I would consider making these functions methods on *systemEvals and using more conventional names: Get, Add, Remove.
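Roughly what that refactor could look like; the signatures and the byNodeID field are assumptions carried over from the sketch above, not the PR's actual code:

```go
// Sketch only: these signatures are illustrative, not the PR's actual code.

// Add records a blocked system eval for its node, creating the inner map lazily.
func (s *systemEvals) Add(nodeID, evalID string, w *wrappedEval) {
	if s.byNodeID[nodeID] == nil {
		s.byNodeID[nodeID] = make(map[string]*wrappedEval)
	}
	s.byNodeID[nodeID][evalID] = w
}

// Get returns the blocked eval for a node, if any.
func (s *systemEvals) Get(nodeID, evalID string) (*wrappedEval, bool) {
	w, ok := s.byNodeID[nodeID][evalID]
	return w, ok
}

// Remove drops a blocked eval and prunes the inner map when it empties.
func (s *systemEvals) Remove(nodeID, evalID string) {
	delete(s.byNodeID[nodeID], evalID)
	if len(s.byNodeID[nodeID]) == 0 {
		delete(s.byNodeID, nodeID)
	}
}
```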
nomad/blocked_evals_test.go (outdated)

}, func(err error) {
	t.Fatalf("err: %s", err)
})
waitBrokerEnqueued(t, blocked, broker)
Thank you so much for extracting this code!
nomad/blocked_evals_test.go (outdated)

func waitBrokerEnqueued(t *testing.T, blocked *BlockedEvals, broker *EvalBroker) {
Thank you so much for extracting this code! It was quite repetitive.
Looking at the function, it seems to assert that the blocked evaluations made it to the eval broker, so maybe requireBlockedEvalsEnqueued is a more appropriate name.
Also, is brokerStats.TotalReady always expected to be 1, or does it depend on how many evaluations are blocked in the test?
Also, while you are here, I'd suggest updating the error message to be more meaningful than "bad:".
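A sketch of how the renamed helper could look, with the expected count parameterized and clearer error messages; the testutil.WaitForResult pattern matches what these tests already use, but the exact signature here is an assumption, not the PR's final code:

```go
// Sketch: the name and expected-count parameter follow the review suggestions above.

// requireBlockedEvalsEnqueued asserts that the blocked evaluations were handed
// off to the eval broker's ready queue.
func requireBlockedEvalsEnqueued(t *testing.T, blocked *BlockedEvals, broker *EvalBroker, expected int) {
	testutil.WaitForResult(func() (bool, error) {
		// The evals should no longer be tracked as blocked.
		if bStats := blocked.Stats(); bStats.TotalBlocked != 0 {
			return false, fmt.Errorf("expected 0 blocked evals, found %d", bStats.TotalBlocked)
		}

		// They should all be ready in the broker instead.
		if brokerStats := broker.Stats(); brokerStats.TotalReady != expected {
			return false, fmt.Errorf("expected %d ready evals in the broker, found %d",
				expected, brokerStats.TotalReady)
		}
		return true, nil
	}, func(err error) {
		t.Fatalf("err: %s", err)
	})
}
```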
scheduler/system_sched.go (outdated)

@@ -390,3 +394,28 @@ func (s *SystemScheduler) computePlacements(place []allocTuple) error {

	return nil
}

// addBlocked creates a new blocked eval for this job on this node
// - Keep blocked evals in the scheduler, but just for grins
Not sure I understand this comment: why do we do it "just for grins"? Is this following a pattern used in non-system jobs? I'd suggest avoiding it if it's not needed.
@@ -18,6 +18,7 @@ const (

// SystemScheduler is used for 'system' jobs. This scheduler is
// designed for services that should be run on every client.
// One for each job, containing an allocation for each node
This is true for all schedulers I believe. Each scheduler instance is created to schedule a single eval.
scheduler/system_sched.go (outdated)

if s.blocked == nil {
	s.blocked = map[string]*structs.Evaluation{}
}
s.blocked[node.ID] = blocked
I had a harder time reading this code because the name blocked is reused to refer both to the blocked evaluation and to the container of blocked evals.
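For example, a hypothetical rename along those lines; blockedEval and blockedByNode are illustrative names, not the PR's actual identifiers:

```go
// Illustrative rename only; not the PR's actual identifiers.
if s.blockedByNode == nil {
	s.blockedByNode = map[string]*structs.Evaluation{}
}
s.blockedByNode[node.ID] = blockedEval
```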
The code LGTM. I have a couple of clarifying questions before marking it as good.
Also, I'd highly encourage rebasing and squashing some of the changes. I personally use git blame a lot and do some repository archeology when debugging issues or trying to understand code; having a lot of commits for a relatively small number of changed lines adds speed bumps to that process, especially when some of the changes are renames of variables introduced in this PR. I would suggest either squashing into logical commits before merging or using GitHub's "Squash & Merge" feature.
nomad/blocked_evals.go (outdated)

	return
}

// QUESTION is it dangerous to delete this first?
I believe this is fine, but @preetapan should know best. If we want to be extra safe, we may need to have EnqueueAll return an error on failure.
This should be fine and follows a similar pattern to other places in the eval broker where we need to unblock evals and re-enqueue them.
Currently EnqueueAll can silently not enqueue anything if this node lost leadership between when this was called and the point where it needs to enqueue. But that's an expected case, handled by the method called during leadership transitions:

Line 290 in 9134274:
func (s *Server) restoreEvals() error {
// systemEvals are handled specially, each job may have a blocked eval on each node
type systemEvals struct {
	// byJob maps a jobID to a nodeID to that job's single blocked evalID on that node
	byJob map[structs.NamespacedID]map[string]string
More of a question: Can a system job have multiple TaskGroups - would they always be associated with the same eval id?
The eval is always at the job level, but if the system job has multiple task groups the scheduler will create multiple allocations. We always evaluate every task group in the job in the reconciler.
nomad/worker.go (outdated)

@@ -254,6 +254,7 @@ func (w *Worker) invokeScheduler(snap *state.StateSnapshot, eval *structs.Evalua
}

// Create the scheduler, or use the special system scheduler
// QUESTION: does this mean "special core scheduler"
Yes! This is the special core scheduler that runs some administrative tasks (e.g. GC).
nomad/blocked_evals.go (outdated)

@@ -47,6 +47,11 @@ type BlockedEvals struct {
	// classes.
	escaped map[string]wrappedEval

	// system is the set of system evaluations that failed to start on nodes because of
	// resource constraints. Retried on any change for those nodes
	// map[node.ID][token]eval
Is this type declaration needed? If so, maybe elaborate on how one should read it?

// map[node.ID][token]eval
	b.evalBroker.EnqueueAll(evals)
}
You'll need to clear out the system evals from the in-memory state in the flush method. That gets called when the eval broker is disabled (when the node loses leadership).
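A minimal sketch of what that could look like, assuming the PR keeps the system evals on a b.system field with a newSystemEvals constructor (both assumptions, along with the lock field name):

```go
// Sketch: b.system and newSystemEvals are assumed names from this PR's discussion.

// Flush clears the in-memory blocked-eval state; it is called when the eval
// broker is disabled, e.g. on loss of leadership.
func (b *BlockedEvals) Flush() {
	b.l.Lock()
	defer b.l.Unlock()

	// ... existing resets of the captured/escaped state ...

	// Drop the per-node system evals so a node that loses leadership does
	// not hold on to stale blocked system evaluations.
	b.system = newSystemEvals()
}
```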
Also, add a test to eval_broker_test.go where some blocked evals are in the state map; when SetDisabled is called, they should no longer be in memory.
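A rough sketch of such a test; the testBlockedEvals helper, mock.Eval, and the use of SetEnabled(false) to model disabling are assumptions based on existing test patterns, not the PR's final code:

```go
// Sketch: testBlockedEvals, mock.Eval, and SetEnabled(false) are assumed from existing test patterns.
func TestBlockedEvals_Disable_FlushesSystemEvals(t *testing.T) {
	blocked, _ := testBlockedEvals(t)

	// Block an eval, then disable the subsystem as happens on leadership loss.
	eval := mock.Eval()
	blocked.Block(eval)
	blocked.SetEnabled(false)

	// Nothing should remain tracked in memory after the flush.
	if stats := blocked.Stats(); stats.TotalBlocked != 0 {
		t.Fatalf("expected 0 blocked evals after disabling, found %d", stats.TotalBlocked)
	}
}
```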
	b.evalBroker.EnqueueAll(evals)
}

// watchCapacity is a long lived function that watches for capacity changes in
// nodes and unblocks the correct set of evals.
func (b *BlockedEvals) watchCapacity(stopCh <-chan struct{}, changeCh <-chan *capacityUpdate) {
We need a way to persist these such that a leadership transition causes the system evals to live on the new leader node.
The blocked evals are created through the FSM, so when a new leader is elected they will be locally replayed into memory via nomad.restoreEvals in leader.go.
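For context, that replay is roughly the following; this is a paraphrase from memory of restoreEvals, not an exact copy of the code at the permalink above:

```go
// Paraphrase from memory; see the linked permalink for the actual code.

// restoreEvals walks every evaluation in the state store and rebuilds the
// new leader's in-memory broker and blocked-eval state.
func (s *Server) restoreEvals() error {
	ws := memdb.NewWatchSet()
	iter, err := s.fsm.State().Evals(ws)
	if err != nil {
		return fmt.Errorf("failed to get evaluations: %v", err)
	}

	for raw := iter.Next(); raw != nil; raw = iter.Next() {
		eval := raw.(*structs.Evaluation)

		// Re-enqueue runnable evals and re-block the rest.
		if eval.ShouldEnqueue() {
			s.evalBroker.Enqueue(eval)
		} else if eval.ShouldBlock() {
			s.blockedEvals.Block(eval)
		}
	}
	return nil
}
```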
@langmartin there's one outstanding suggestion (#5900 (comment)). Otherwise LGTM.
Will this solve #4267?
I'll have to investigate that; it may.
I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
System jobs now support blocking evaluations by node when a placement can't be made due to resource constraints. We keep a minimal set of blocked evals, indexed by job and by node, so they can be re-evaluated when node resources become available.

Fixes #5579