Not placing system job allocations on new nodes #6960
Comments
@jorgemarey I've just tested this with the latest changes from #6968 and it seems to be working now. Prior to the fix, the scheduler would return an error and exit early when it came upon the ineligible node; now it adds a placement error but continues scheduling the rest of the nodes.
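The behavior change described above can be sketched as follows. This is a simplified illustration, not Nomad's actual scheduler code; the `node` struct and `scheduleAll` helper are made up for the example:

```go
package main

import "fmt"

type node struct {
	id       string
	eligible bool
}

// scheduleAll records a per-node failure instead of aborting the whole
// pass when it hits an ineligible node (the behavior change described
// above, in simplified form).
func scheduleAll(nodes []node) (placed []string, failures map[string]string) {
	failures = make(map[string]string)
	for _, n := range nodes {
		if !n.eligible {
			failures[n.id] = "node is not eligible" // recorded, not fatal
			continue
		}
		placed = append(placed, n.id)
	}
	return placed, failures
}

func main() {
	nodes := []node{{"a", true}, {"b", false}, {"c", true}}
	placed, failures := scheduleAll(nodes)
	fmt.Println(placed, failures) // [a c] map[b:node is not eligible]
}
```

Before the fix, the equivalent loop would return on the first ineligible node, so nodes after it were never considered.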
Hi @drewbailey. Thanks for the fix. With that I guess that scheduling will continue, but I believe the code here is not "correct". In here:

```go
// Create the required task groups.
required := materializeTaskGroups(job)

result := &diffResult{}
for nodeID, allocs := range nodeAllocs {
	diff := diffAllocs(job, taintedNodes, required, allocs, terminalAllocs)
```

`required` is a map whose keys have the form `<job>.<taskgroup>[0]`, so it has as many keys as there are task groups in the job. Then, for every node that has some allocation of this job deployed, a diff is computed. The problem is that `diffAllocs`:

```go
func diffAllocs(job *structs.Job, taintedNodes map[string]*structs.Node,
	required map[string]*structs.TaskGroup, allocs []*structs.Allocation,
	terminalAllocs map[string]*structs.Allocation) *diffResult {
	result := &diffResult{}

	// Scan the existing updates
	existing := make(map[string]struct{})
	for _, exist := range allocs {
		// Index the existing node
		name := exist.Name
		existing[name] = struct{}{}
		// ... omitted ...
	}

	// Scan the required groups
	for name, tg := range required {
		// Check for an existing allocation
		_, ok := existing[name]

		// Require a placement if no existing allocation. If there
		// is an existing allocation, we would have checked for a potential
		// update or ignore above.
		if !ok {
			result.place = append(result.place, allocTuple{
				Name:      name,
				TaskGroup: tg,
				Alloc:     terminalAllocs[name],
			})
		}
	}
	return result
}
```

generates a placement for every "required" allocation that is not on the node. But that is not correct in many cases: you can set a constraint so that a system job task group only deploys to a certain pool of nodes. The group is not required on every node, so that placement shouldn't happen. You can see that by looking at the logs, for example, as I pasted before. Here Nomad is generating placements that are not needed.
In this example, I have 50 nodes with node class A and another 50 with node class B. The job has 2 task groups; one has a constraint to class A and the other to class B. Let's say I disable eligibility on one node of class B and try to add a new node of class A. As you can see in the log, Nomad will try to add a new allocation of every task group to every node, and that's not needed. The fix should help, at least, to continue placing allocations on the new nodes, but maybe this "problem" leads to another set of issues? I don't know if I explained the problem correctly... Thanks!
Hi @jorgemarey, the debug log line you shared actually occurs before placements are computed. Here is a quick example of what I think outlines your scenario.

job file

```hcl
job "redis" {
  datacenters = ["dc1"]
  type        = "system"
  # type = "service"

  group "cache2" {
    constraint {
      attribute = "${node.class}"
      value     = "class-2"
    }

    count = 1

    restart {
      attempts = 10
      interval = "5m"
      delay    = "25s"
      mode     = "delay"
    }

    ephemeral_disk {
      size = 10
    }

    task "redis" {
      driver = "docker"

      config {
        image = "redis:3.2"

        port_map {
          db = 6379
        }
      }

      env {
        version = "3"
      }

      logs {
        max_files     = 1
        max_file_size = 9
      }

      resources {
        cpu    = 20 # 500 MHz
        memory = 40 # 256MB

        network {
          mbits = 1
          port "db" {}
        }
      }
    }
  }

  group "cache" {
    constraint {
      attribute = "${node.class}"
      value     = "class-1"
    }

    count = 1

    restart {
      attempts = 10
      interval = "5m"
      delay    = "25s"
      mode     = "delay"
    }

    ephemeral_disk {
      size = 10
    }

    task "redis" {
      driver = "docker"

      config {
        image = "redis:3.2"

        port_map {
          db = 6379
        }
      }

      env {
        version = "3"
      }

      logs {
        max_files     = 1
        max_file_size = 9
      }

      resources {
        cpu    = 20 # 500 MHz
        memory = 40 # 256MB

        network {
          mbits = 1
          port "db" {}
        }
      }
    }
  }
}
```

output before adding a new node
Mark one node ineligible, then add a new node of class-1; I'll have 3 running allocations as expected.
So even though the reconcile debug output shows a higher placement count, we still only end up with one new allocation.
Let me know if that's not what you are expecting, thanks!
Hi @drewbailey. Yes, I was expecting this behaviour from the fix, and I know it works. It was simply a little weird that a placement was generated for something that shouldn't be there. With this behaviour, using the example you provided, in the code here wouldn't a new element be added to the failedTGAllocs map? That would set the eval status to some failed allocs, but that is not something that failed — it's something that shouldn't have been generated in the first place. Anyway, if it works fine and doesn't conflict with other parts, I guess that's ok. Thanks for the fix!
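One direction such a fix could take, as a rough illustration only (this is not the actual change in the linked PR; the `group`, `node`, and `feasible` helpers are simplified assumptions), is to filter out placements whose constraints can never match the node before they are recorded as failed allocations:

```go
package main

import "fmt"

type group struct {
	name  string
	class string // required node class ("" means unconstrained)
}

type node struct{ class string }

// feasible reports whether the group's class constraint allows the node.
func feasible(g group, n node) bool {
	return g.class == "" || g.class == n.class
}

// filterPlacements drops groups that can never run on the node, so they
// are neither placed nor reported as failed allocations.
func filterPlacements(groups []group, n node) []group {
	var out []group
	for _, g := range groups {
		if feasible(g, n) {
			out = append(out, g)
		}
	}
	return out
}

func main() {
	groups := []group{{"cache", "class-1"}, {"cache2", "class-2"}}
	// On a class-1 node, only the class-1 group should be considered.
	fmt.Println(filterPlacements(groups, node{"class-1"}))
}
```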
@jorgemarey, currently in master there would only be a … We are working on a potential solution here: #6996
That sounds great! Thanks!
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
Nomad version
Nomad v0.10.2
Issue
When a new node is registered into the cluster, a new allocation of a system job should be placed, but it isn't. There was an issue tracking this, but it's closed, which is why I created a new one.
I think this only happens when the system job has two (or more) task groups.
I was looking at the code and I think the problem occurs when making the diff: `required`, after calling `materializeTaskGroups`, returns a map keyed by `<job>.<taskgroup>[0]`, with one entry per task group. That's why here, when calling `diffAllocs` and checking all the placed allocations on the nodes, it tries to place all task groups on all nodes (here, if a task group is not on the node it will end up in the `diff.place` list, even if it has a constraint against that node). In the logs I can see the following:
I have a cluster with 100 nodes: 50 with a meta value of blue and another 50 with a meta value of green. Each task group goes to only one of those (by constraint).
When I add a new node, if another node is not eligible, the diff will try to place both task groups on all nodes, failing to do it.
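The `<job>.<taskgroup>[index]` naming scheme mentioned above is what makes the same `required` keys get checked against every node. A small sketch of that pattern (the `allocName` helper here is illustrative, not Nomad's actual function):

```go
package main

import "fmt"

// allocName builds the "<job>.<group>[index]" key described above
// (illustrative helper; Nomad has its own equivalent internally).
func allocName(job, group string, index int) string {
	return fmt.Sprintf("%s.%s[%d]", job, group, index)
}

func main() {
	// A system job with two groups yields one required key per group,
	// and the same keys are diffed against every node's allocations —
	// including nodes a group's constraint can never match.
	fmt.Println(allocName("redis", "cache", 0))  // redis.cache[0]
	fmt.Println(allocName("redis", "cache2", 0)) // redis.cache2[0]
}
```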
I don't know if I explained myself correctly....
Reproduction steps