-Inf in allocation scoring JSON output #8863
Comments
Hi @mrkurt! Sorry to hear about that. Can you clarify which command or HTTP API you're hitting to get that JSON back? |
@tgross it seems like it comes from the |
Ok, thanks @mrkurt. It looks like the last logic change to that area of the code was in ea843a5#diff-7c9a6a9bdf41c1724b4f427d5c39daec which shipped in 0.9.4, so it looks like there's an unfortunate edge case here. Can you share the jobspec for the allocation involved? That might help us narrow the problem down. |
These are Spreads/Constraints/Affinities from the job (I can email you the whole thing if it's helpful):
"Spreads": [
  {
    "Weight": 100,
    "Attribute": "${meta.fly_region}",
    "SpreadTarget": [
      {
        "Value": "vin",
        "Percent": 100
      }
    ]
  }
],
"Affinities": [
  {
    "Weight": 50,
    "LTarget": "${meta.fly_region}",
    "Operand": "set_contains_any",
    "RTarget": "vin"
  },
  {
    "Weight": -100,
    "LTarget": "${meta.fly_region}",
    "Operand": "set_contains_any",
    "RTarget": "scl1,syd"
  }
],
"Constraints": [
  {
    "LTarget": "${meta.fly_region}",
    "Operand": "set_contains_any",
    "RTarget": "scl1,syd,vin"
  }
]
If it matters, there aren't any nodes that have
I dove into the relevant section of the code and it looks like the place this could potentially bubble up from is I've written some hacky tests to see what conditions I can trigger a
So far the only way I've been able to trigger this in a real job is one where the spread percent is 0:
job "example" {
  datacenters = ["dc1"]
  spread {
    attribute = "${node.datacenter}"
    target "dc1" {
      percent = 0
    }
  }
  group "cache" {
    task "redis" {
      driver = "docker"
      config {
        image = "redis:6.0"
      }
      resources {
        cpu = 400
        memory = 256
      }
    }
  }
}
@tgross it definitely could be another job; the output was huge, so I might've eyeball-parsed it wrong. We do have other spreads set that look like this (our understanding is this spreads evenly across an attribute):
"Spreads": [
  {
    "Weight": 100,
    "Attribute": "${meta.fly_region}",
    "SpreadTarget": null
  }
]
So I tried to come up with a case where a
spread {
  attribute = "${node.datacenter}"
  weight = 100
}
Querying
But this invalid config does pass validation:
spread {
  attribute = "${node.datacenter}"
  weight = 100
  target "dc1" {}
}
Querying
Which effectively is the same invalid config I had in the previous comment. This results in the error you're seeing:
There's clearly some better validation to do around this particular bug. In the meantime, it's probably worth iterating over your running jobs to verify that none of them result in a 0% spread target.
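One way to do that check is sketched below in Go. It is only an illustration: it assumes the spreads sit under a top-level "Spreads" key shaped like the fragments quoted earlier in this thread, and `jobSpreads` and `zeroPercentTargets` are hypothetical names, not part of Nomad. A real job may also carry spreads in other places, which this sketch does not cover.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// SpreadTarget and Spread mirror only the fields shown in the JSON
// fragments quoted in this thread.
type SpreadTarget struct {
	Value   string
	Percent int
}

type Spread struct {
	Weight       int
	Attribute    string
	SpreadTarget []SpreadTarget
}

// jobSpreads is a hypothetical wrapper that assumes a top-level
// "Spreads" key in the document being checked.
type jobSpreads struct {
	Spreads []Spread
}

// zeroPercentTargets returns "attribute=value" for every explicit spread
// target whose percent is 0. A null SpreadTarget (even spread) has no
// explicit targets and therefore yields nothing.
func zeroPercentTargets(raw []byte) ([]string, error) {
	var js jobSpreads
	if err := json.Unmarshal(raw, &js); err != nil {
		return nil, err
	}
	var out []string
	for _, s := range js.Spreads {
		for _, t := range s.SpreadTarget {
			if t.Percent == 0 {
				out = append(out, s.Attribute+"="+t.Value)
			}
		}
	}
	return out, nil
}

func main() {
	raw := []byte(`{"Spreads":[{"Weight":100,"Attribute":"${node.datacenter}","SpreadTarget":[{"Value":"dc1","Percent":0}]}]}`)
	hits, err := zeroPercentTargets(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println(hits)
}
```

A target with no percent set decodes to 0 here as well, which matches the `target "dc1" {}` case described above.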
This is very frustrating since the cli commands to look at a node and the ui to inspect a node both fail if an allocation that has been placed on it is in this state. So you have to think of and know to use the HTTP API to look at the allocations on a node and find the offending job. |
Another example of this issue #11100. |
I dug into this a bit with @tgross and wanted to document what we found and decided.
The root cause of the issue is a division by zero, happening here. If the “desiredCount” is 0, this number becomes negative infinity. Instead, we need to check whether desiredCount is 0 and apply a maximum penalty in that case.
The one caveat is that if you have a desiredCount below 1, you can get very large negative scoreBoosts, and whatever number you set the 0 score boost to must always be larger (in absolute value) than that number. For instance, say I have a target percentage of 99% for “foo”, 1% for “bar”, and 0% for “baz”, and I have 10 allocations being placed. If I’m evaluating “bar” I might have a scoreBoost of
So if you do have a desiredCount of 0, you must make the negative score boost relative to the highest previously seen used count. This is because the previously seen used count gives the relative weighting for previously selected options that we need to overcome.
I'm not 100% positive, but I think I figured out a good way to set the bounds. The lowest possible non-zero desired count is 1% of totalCount (since “percentage” must be an int). So you could make
Also worth noting, we should also set the max possible penalty to something beyond -1 here, as this is incorrect logic.
When spread targets have a percent value of zero it's possible for them to return -Inf scoring because of a float divide by zero. This is very hard for operators to debug because the string "-Inf" is returned in the API and that breaks the presentation of debugging data. Most scoring iterators are bracketed to -1/+1, but spread iterators are not, so that they can handle greatly unbalanced scoring. That means we can't simply return a -1 score without generating a score that might be greater than the negative scores set by other spread targets. Instead, track the lowest-seen spread boost and use that as the spread boost for any cases where we'd divide by zero. Fixes: #8863
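A minimal sketch of that approach in Go, under stated assumptions: `boostTracker` and its method are hypothetical names, the boost formula is an assumed shape rather than the actual Nomad code, and starting the tracked minimum at -1 is a choice made for this illustration.

```go
package main

import "fmt"

// boostTracker remembers the lowest boost computed so far.
type boostTracker struct {
	lowest float64
}

// boost returns the spread boost for a target. When desiredCount is 0 it
// returns the lowest boost seen so far instead of dividing by zero.
func (t *boostTracker) boost(desiredCount, usedCount float64) float64 {
	if desiredCount == 0 {
		return t.lowest
	}
	b := (desiredCount - usedCount) / desiredCount
	if b < t.lowest {
		t.lowest = b
	}
	return b
}

func main() {
	// Assumption: start at -1 so a zero-percent target is never scored
	// above the usual lower bound.
	tr := &boostTracker{lowest: -1}
	// A 1% target out of 10 allocations: a large negative, finite boost.
	fmt.Println(tr.boost(0.1, 1))
	// A 0% target: reuses the lowest boost seen, never -Inf.
	fmt.Println(tr.boost(0, 1))
}
```

In this sketch the result for a zero-percent target depends on which targets were scored before it, so it is never better than the worst score already seen but is always finite.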
I've just landed a fix for this issue in #17198 and that'll be released as a backport to 1.5.x, 1.4.x, and 1.3.x in the near future. |
This looks identical to #5698 to me, but we have allocations/nodes that suddenly cause errors with both nomad and the web UI.
Nomad version
Nomad v0.11.0 (5f8fe0afc894d254e4d3baaeaee1c7a70a44fdc6)
Operating system and Environment details
Ubuntu 18.04, physical Intel Xeon based server.
Issue
JSON returned from nomad server includes -Inf: