scheduler: stop allocs in unrelated nodes #11391
Conversation
The system scheduler should leave allocs on draining nodes as-is, but stop allocs on nodes that are no longer part of the job's datacenters. Previously, the scheduler did not make that distinction and left system job allocs intact if they were already running.
I've added a failing test first, which you can see in https://app.circleci.com/jobs/github/hashicorp/nomad/179661 .
Fixes #11373
if _, ok := eligibleNodes[nodeID]; !ok {
	if _, ok := notReadyNodes[nodeID]; ok {
		goto IGNORE
This is the source of the bug. Previously, if the node was not in the list of eligibleNodes, we ignored it. The assumption was that the node was draining or had been marked ineligible for scheduling. Now, we explicitly check whether the node is not ready but still in the datacenters that the job targets.
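To make the intended behavior concrete, here is a minimal, self-contained sketch (not the actual Nomad code) of the decision this branch implements for an alloc that is already running on a node. The helper name classifyExistingAlloc and the simplified map types are hypothetical; the real diffSystemAllocsForNode works on map[string]*structs.Node and does considerably more.

package main

import "fmt"

// allocDecision is a hypothetical label used only for this sketch.
type allocDecision string

const (
	decisionKeep   allocDecision = "keep"   // node is eligible: keep/update the alloc
	decisionIgnore allocDecision = "ignore" // node is not ready (e.g. draining): leave as-is
	decisionStop   allocDecision = "stop"   // node left the job's datacenters: stop the alloc
)

// classifyExistingAlloc mirrors the check above: an ineligible node is only
// ignored when it is known to be not ready; otherwise the alloc is stopped.
func classifyExistingAlloc(nodeID string, eligibleNodes, notReadyNodes map[string]struct{}) allocDecision {
	if _, ok := eligibleNodes[nodeID]; ok {
		return decisionKeep
	}
	if _, ok := notReadyNodes[nodeID]; ok {
		return decisionIgnore
	}
	return decisionStop
}

func main() {
	eligible := map[string]struct{}{"node-a": {}}
	notReady := map[string]struct{}{"node-b": {}} // e.g. draining
	for _, id := range []string{"node-a", "node-b", "node-c"} {
		fmt.Printf("%s -> %s\n", id, classifyExistingAlloc(id, eligible, notReady))
	}
}

Before this change, any node missing from eligibleNodes fell into the "ignore" case, which is why system allocs lingered on nodes that had been removed from the job's datacenters.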
scheduler/util.go
Outdated
@@ -65,7 +65,8 @@ func diffSystemAllocsForNode(
 	job *structs.Job, // job whose allocs are going to be diff-ed
 	nodeID string,
 	eligibleNodes map[string]*structs.Node,
-	taintedNodes map[string]*structs.Node, // nodes which are down or in drain (by node name)
+	notReadyNodes map[string]struct{}, // nodes that are not ready, e.g. draining
+	taintedNodes map[string]*structs.Node, // nodes which are down (by node name)
The taintedNodes logic is actually confusing IMO. The taintedNodes function selects nodes that are down (via ShouldDrainNode) as well as nodes marked for draining:
Lines 351 to 377 in b0ce684
// taintedNodes is used to scan the allocations and then check if the
// underlying nodes are tainted, and should force a migration of the allocation.
// All the nodes returned in the map are tainted.
func taintedNodes(state State, allocs []*structs.Allocation) (map[string]*structs.Node, error) {
	out := make(map[string]*structs.Node)
	for _, alloc := range allocs {
		if _, ok := out[alloc.NodeID]; ok {
			continue
		}

		ws := memdb.NewWatchSet()
		node, err := state.NodeByID(ws, alloc.NodeID)
		if err != nil {
			return nil, err
		}

		// If the node does not exist, we should migrate
		if node == nil {
			out[alloc.NodeID] = nil
			continue
		}

		if structs.ShouldDrainNode(node.Status) || node.DrainStrategy != nil {
			out[alloc.NodeID] = node
		}
	}
	return out, nil
}
However, nodes that are up but marked for draining were already filtered out by readyNodesInDCs:
Lines 277 to 313 in b0ce684
// readyNodesInDCs returns all the ready nodes in the given datacenters and a
// mapping of each data center to the count of ready nodes.
func readyNodesInDCs(state State, dcs []string) ([]*structs.Node, map[string]struct{}, map[string]int, error) {
	// Index the DCs
	dcMap := make(map[string]int, len(dcs))
	for _, dc := range dcs {
		dcMap[dc] = 0
	}

	// Scan the nodes
	ws := memdb.NewWatchSet()
	var out []*structs.Node
	notReady := map[string]struct{}{}
	iter, err := state.Nodes(ws)
	if err != nil {
		return nil, nil, nil, err
	}
	for {
		raw := iter.Next()
		if raw == nil {
			break
		}

		// Filter on datacenter and status
		node := raw.(*structs.Node)
		if !node.Ready() {
			notReady[node.ID] = struct{}{}
			continue
		}

		if _, ok := dcMap[node.Datacenter]; !ok {
			continue
		}

		out = append(out, node)
		dcMap[node.Datacenter]++
	}
	return out, notReady, dcMap, nil
}
So taintedNodes ends up containing only the down nodes. Reasoning through the code is a bit more complex than it needs to be, but I didn't feel confident restructuring that logic to be more explicit about node state.
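For illustration only, a more explicit structure might classify each node into a single state up front. This is a hypothetical sketch, not code from this PR: the state names and the classifyNode helper are invented, and a real version would derive the inputs from *structs.Node fields such as Status and DrainStrategy.

package main

import "fmt"

// nodeState is a hypothetical enum that makes the scheduler's view of a node explicit.
type nodeState int

const (
	nodeReady    nodeState = iota // up, eligible, and in the job's datacenters
	nodeNotReady                  // up but draining or marked ineligible
	nodeDown                      // down, or no longer in the state store
)

func (s nodeState) String() string {
	return []string{"ready", "not-ready", "down"}[s]
}

// classifyNode collapses the separate checks spread across readyNodesInDCs and
// taintedNodes into one explicit decision.
func classifyNode(exists, down, draining, ineligible bool) nodeState {
	switch {
	case !exists || down:
		return nodeDown
	case draining || ineligible:
		return nodeNotReady
	default:
		return nodeReady
	}
}

func main() {
	fmt.Println(classifyNode(true, false, false, false))  // ready
	fmt.Println(classifyNode(true, false, true, false))   // not-ready (draining)
	fmt.Println(classifyNode(false, false, false, false)) // down (missing node)
}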
Just to add a small nitpick on this: the comment says (by node name), but the code indexes them by ID.
It's interesting that the structs.TerminalByNodeByName struct is actually grouped by node ID too. I'll update the comment here, and rename the struct in a follow-up PR.
LGTM
Just missing a changelog entry.
We use node IDs rather than node names when hashing or grouping.
I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.