
System scheduler blocked evals #5900

Merged — 7 commits merged into master from f-system-sched-blocked-evals on Jul 18, 2019

Conversation

Contributor

@langmartin langmartin commented Jun 27, 2019

System jobs now support blocking evaluations by node, in the case that a placement can't be made due to resource constraints. We keep a minimal set of blocked evals, indexed by job and by node, so they can be re-evaluated when node resources become available.

  • Changelog

fixes #5579
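
For illustration only, the two-way index described above can be sketched roughly as follows. The type and field names are stand-ins (the real structs.NamespacedID and wrappedEval types appear later in this thread's diff), not the merged code:

package sketch

// namespacedID is a stand-in for Nomad's structs.NamespacedID.
type namespacedID struct {
	ID        string
	Namespace string
}

// wrappedEval is a stand-in for the blocked-evals wrapper that pairs an
// evaluation with its authentication token.
type wrappedEval struct {
	evalID string
	token  string
}

// systemEvalsSketch indexes blocked system evals two ways, so the scheduler
// can find them by job and unblock them by node.
type systemEvalsSketch struct {
	byJob  map[namespacedID]map[string]string // jobID -> nodeID -> blocked eval ID
	byNode map[string]map[string]*wrappedEval // nodeID -> eval ID -> eval plus token
}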

Contributor

@notnoop notnoop left a comment

I made a first pass, mostly by inspecting the code, and have some suggestions. While I follow the (un-)blocking logic here, I'd like to do a deeper review or have @preetapan do a closer examination.


// node maps a node id to the set of all blocked evals, mapped to their authentication tokens
// node map[string]map[*structs.Evaluation]string
node map[string]map[string]*wrappedEval
Contributor

This seems to be mapping the node ID and eval ID to the evaluation and authentication token?

Also, I don't quite follow the field names: they are singular but refer to collections, and they don't indicate the map structure.

I'd suggest clarifying the structure fields a bit more with a comment. Something like

systemEvals stores the blocked evaluations for system jobs, indexed by node and job to ease lookups in scheduler.

Then name the fields byJobID and byNodeID?

}

// setSystemEval creates the inner map if necessary
func (b *BlockedEvals) setSystemEval(eval *structs.Evaluation, token string) {
Contributor

setSystemEval and its siblings here operate purely on b.system and nothing else. I would consider making these functions methods on *systemEvals and using the more conventional names Get|Add|Remove.
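
A rough sketch of what that refactor could look like, reusing the stand-in types from the sketch near the top of this thread; the method names follow the reviewer's suggestion and the signatures are assumptions, not the merged code:

// Add records a blocked eval for a node, creating the inner maps as needed.
// (The real helpers would also keep the byJob index in sync.)
func (s *systemEvalsSketch) Add(nodeID, evalID string, w *wrappedEval) {
	if s.byNode == nil {
		s.byNode = map[string]map[string]*wrappedEval{}
	}
	if s.byNode[nodeID] == nil {
		s.byNode[nodeID] = map[string]*wrappedEval{}
	}
	s.byNode[nodeID][evalID] = w
}

// Get returns all blocked evals recorded for a node.
func (s *systemEvalsSketch) Get(nodeID string) map[string]*wrappedEval {
	return s.byNode[nodeID]
}

// Remove drops a single blocked eval for a node.
func (s *systemEvalsSketch) Remove(nodeID, evalID string) {
	delete(s.byNode[nodeID], evalID)
}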

nomad/blocked_evals.go
}, func(err error) {
t.Fatalf("err: %s", err)
})
waitBrokerEnqueued(t, blocked, broker)
Contributor

Thank you so much for extracting this code!


func waitBrokerEnqueued(t *testing.T, blocked *BlockedEvals, broker *EvalBroker) {
Contributor

Thank you so much for extracting this code! It was quite repetitive.

Looking at the function, it seems to assert that the blocked evaluations made it to eval broker. So maybe requireBlockedEvalsEnqueued is more appropriate.

Also, is brokerStats.TotalReady always expected to be 1? Does it depend on how many evaluations are blocked in the test?

Also, while you are here, I'd suggest updating the error message to be more meaningful than "bad".
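
Putting those three suggestions together, the helper might look roughly like this. It's only a sketch: it assumes the testutil.WaitForResult pattern and the Stats fields quoted in this thread, plus the usual testing/fmt/testutil imports:

// requireBlockedEvalsEnqueued waits until every blocked eval has been
// unblocked and the expected number of evals is ready in the broker.
func requireBlockedEvalsEnqueued(t *testing.T, blocked *BlockedEvals, broker *EvalBroker, expected int) {
	testutil.WaitForResult(func() (bool, error) {
		// Everything should have left the blocked tracker...
		if stats := blocked.Stats(); stats.TotalBlocked != 0 {
			return false, fmt.Errorf("expected 0 blocked evals, found %d", stats.TotalBlocked)
		}
		// ...and be ready in the broker.
		if stats := broker.Stats(); stats.TotalReady != expected {
			return false, fmt.Errorf("expected %d ready evals, found %d", expected, stats.TotalReady)
		}
		return true, nil
	}, func(err error) {
		t.Fatal(err)
	})
}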

@@ -390,3 +394,28 @@ func (s *SystemScheduler) computePlacements(place []allocTuple) error {

return nil
}

// addBlocked creates a new blocked eval for this job on this node
// - Keep blocked evals in the scheduler, but just for grins
Contributor

Not sure I understand this comment. Why do we do it "just for grins"? Is this following a pattern used in non-system jobs? I'd suggest removing it if it's not needed.

@@ -18,6 +18,7 @@ const (

// SystemScheduler is used for 'system' jobs. This scheduler is
// designed for services that should be run on every client.
// One for each job, containing an allocation for each node
Contributor

This is true for all schedulers I believe. Each scheduler instance is created to schedule a single eval.

if s.blocked == nil {
s.blocked = map[string]*structs.Evaluation{}
}
s.blocked[node.ID] = blocked
Contributor

I had a harder time reading this code because the name blocked is reused to refer to both the single blocked evaluation and the container of blocked evals.
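
Purely to illustrate the point, giving the container and the single eval distinct names makes the assignment easier to scan; the identifiers below are hypothetical, not the names used in this PR:

// recordBlockedEval stores the single blocked eval produced for a node,
// keeping the container's name visually distinct from the eval itself.
func (s *SystemScheduler) recordBlockedEval(nodeID string, blockedEval *structs.Evaluation) {
	if s.blockedEvalsByNode == nil {
		s.blockedEvalsByNode = map[string]*structs.Evaluation{}
	}
	s.blockedEvalsByNode[nodeID] = blockedEval
}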

@langmartin langmartin requested a review from notnoop July 10, 2019 22:14
Contributor

@notnoop notnoop left a comment

The code lgtm. I have a couple of clarifying questions before marking it as good.

Also, I'd highly encourage rebasing and squashing some changes. I personally use git blame a lot and do some repository archeology when debugging issues or trying to understand code; having a lot of commits for a relatively small number of changed lines adds speed bumps to that process, especially when some changes are renames of variables introduced in this PR. I'd suggest either squashing before merging to produce logical commits or using the "Squash & Merge" GitHub feature when merging.

return
}

// QUESTION is it dangerous to delete this first?
Contributor

I believe this is fine, but @preetapan would know best. If we want to be extra safe, we may need to have EnqueueAll return an error on failure.

Contributor

This should be fine and follows a similar pattern as other places in eval broker where we need to unblock evals and re-enqueue them.

Currently EnqueueAll can silently not enqueue anything if this node lost leadership between the time this was called and the point where it needs to enqueue. But that's an expected case, handled by this method, which is called during leadership transitions:

func (s *Server) restoreEvals() error {

// systemEvals are handled specially, each job may have a blocked eval on each node
type systemEvals struct {
// byJob maps a jobID to a nodeID to that job's single blocked evalID on that node
byJob map[structs.NamespacedID]map[string]string
Contributor

More of a question: Can a system job have multiple TaskGroups - would they always be associated with the same eval id?

Contributor

The eval is always at the job level, but if the system job has multiple task groups the scheduler will create multiple allocations. We always evaluate every task group in the job in the reconciler.

nomad/worker.go
@@ -254,6 +254,7 @@ func (w *Worker) invokeScheduler(snap *state.StateSnapshot, eval *structs.Evalua
}

// Create the scheduler, or use the special system scheduler
// QUESTION: does this mean "special core scheduler"
Contributor

Yes! This is the special core scheduler that runs administrative tasks (e.g. GC).

@@ -47,6 +47,11 @@ type BlockedEvals struct {
// classes.
escaped map[string]wrappedEval

// system is the set of system evaluations that failed to start on nodes because of
// resource constraints. Retried on any change for those nodes
// map[node.ID][token]eval
Contributor

Is this type declaration needed? If so, maybe elaborate how one should read it?

Suggested change
// map[node.ID][token]eval


b.evalBroker.EnqueueAll(evals)
}

Contributor

You'll need to clear out the system evals from in-memory state in the flush method. That gets called when the eval broker is disabled (when the node loses leadership).
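
A minimal sketch of that change, assuming the systemEvals index from the diff above and a constructor for it (neither the method nor the constructor name is guaranteed to match the merged code):

// Called from the flush path when the eval broker is disabled: drop the
// in-memory index of blocked system evals along with the other maps.
func (b *BlockedEvals) flushSystemState() {
	b.system = newSystemEvals()
}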

Contributor

Also, add a test to eval_broker_test.go where some blocked evals are in the state map; after SetDisabled is called, they should no longer be in memory.
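
Something along these lines could cover it; the helper and mock names are assumptions borrowed from Nomad's existing test patterns, not the test that was actually added:

func TestBlockedEvals_SystemDisableFlushes(t *testing.T) {
	blocked, _ := testBlockedEvals(t)

	// Block a system eval so it lands in the in-memory system index.
	eval := mock.Eval()
	eval.Type = structs.JobTypeSystem
	blocked.Block(eval)

	// Disabling the tracker (as happens on leadership loss) should flush it.
	blocked.SetEnabled(false)

	if stats := blocked.Stats(); stats.TotalBlocked != 0 {
		t.Fatalf("expected 0 blocked evals after disable, found %d", stats.TotalBlocked)
	}
}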


b.evalBroker.EnqueueAll(evals)
}

// watchCapacity is a long lived function that watches for capacity changes in
// nodes and unblocks the correct set of evals.
func (b *BlockedEvals) watchCapacity(stopCh <-chan struct{}, changeCh <-chan *capacityUpdate) {
Contributor

We need a way to persist these such that a leadership transition causes the system evals to live on the new leader node.

Contributor Author

The blocked evals are created through the FSM, so when a new leader is elected they will be locally replayed into memory via nomad.restoreEvals in leader.go.

@langmartin langmartin force-pushed the f-system-sched-blocked-evals branch from 7e9de21 to 645be7e on July 17, 2019 19:25
@preetapan
Contributor

@langmartin there's one outstanding suggestion: #5900 (comment)

Otherwise LGTM

@langmartin langmartin force-pushed the f-system-sched-blocked-evals branch from 0d8437f to e0edd11 on July 18, 2019 14:32
@langmartin langmartin merged commit 5f44755 into master Jul 18, 2019
@langmartin langmartin deleted the f-system-sched-blocked-evals branch July 18, 2019 20:13
@shantanugadgil
Contributor

Will this solve #4267 ?

@langmartin
Contributor Author

Will this solve #4267 ?

I'll have to investigate that; it may.

@github-actions

github-actions bot commented Feb 6, 2023

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Feb 6, 2023
Successfully merging this pull request may close these issues.

System jobs may not be allocated on newly initialized nodes