Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

-Inf in allocation scoring JSON output #8863

Closed
mrkurt opened this issue Sep 10, 2020 · 11 comments · Fixed by #17198
Closed

-Inf in allocation scoring JSON output #8863

mrkurt opened this issue Sep 10, 2020 · 11 comments · Fixed by #17198
Assignees
Labels
stage/accepted Confirmed, and intend to work on. No timeline committment though. theme/api HTTP API and SDK issues type/bug
Milestone

Comments

@mrkurt
Copy link

mrkurt commented Sep 10, 2020

This looks identical to #5698 to me, but we have allocations/nodes that suddenly cause errors with both nomad and the web UI.

Nomad version

Nomad v0.11.0 (5f8fe0afc894d254e4d3baaeaee1c7a70a44fdc6)

Operating system and Environment details

Ubuntu 18.04, physical Intel Xeon based server.

Issue

JSON returned from nomad server includes -Inf:

{"NodeID":"233150e3-6991-9a0c-101b-64124eca941a","Scores":{"binpack":0.15161654193467708,"job-anti-affinity":0.0,"node-reschedule-penalty":0.0,"node-affinity":0.3333333333333333,"allocation-spread":-Inf},"NormScore":-Inf}
@tgross
Copy link
Member

tgross commented Sep 10, 2020

Hi @mrkurt! Sorry to hear about that. Can you clarify which command or HTTP API you're hitting to get that JSON back?

@mrkurt
Copy link
Author

mrkurt commented Sep 10, 2020

@tgross it seems like it comes from the allocations lookup for a node. nomad node status -self would do it, and the client wouldn't load in the UI at all (console errors parsing what I'm pretty sure was the allocations response).

@tgross tgross changed the title -Inf in spread JSON output -Inf in allocation scoring JSON output Sep 10, 2020
@tgross
Copy link
Member

tgross commented Sep 10, 2020

Ok, thanks @mrkurt. It looks like the last logic change to that area of the code was in ea843a5#diff-7c9a6a9bdf41c1724b4f427d5c39daec which shipped in 0.9.4, so it looks like there's an unfortunate edge case here. Can you share the jobspec for the allocation involved? That might help us narrow the problem down.

@mrkurt
Copy link
Author

mrkurt commented Sep 10, 2020

These are Spreads/Constraints/Affinities from the job (I can email you the whole thing if it's helpful):

  "Spreads": [
    {
      "Weight": 100,
      "Attribute": "${meta.fly_region}",
      "SpreadTarget": [
        {
          "Value": "vin",
          "Percent": 100
        }
      ]
    }
  ],
  "Affinities": [
    {
      "Weight": 50,
      "LTarget": "${meta.fly_region}",
      "Operand": "set_contains_any",
      "RTarget": "vin"
    },
    {
      "Weight": -100,
      "LTarget": "${meta.fly_region}",
      "Operand": "set_contains_any",
      "RTarget": "scl1,syd"
    }
  ],
  "Constraints": [
    {
      "LTarget": "${meta.fly_region}",
      "Operand": "set_contains_any",
      "RTarget": "scl1,syd,vin"
    }
  ]

If it matters, there aren't any nodes that have fly_region: scl1.

@tgross
Copy link
Member

tgross commented Sep 10, 2020

I dove into the relevant section of the code and it looks like the place this could potentially bubble up from is spread.go#L230-L257, which sets the counts that get used in spread.go#L134. See https://play.golang.org/p/sermtiG3PgC for a minimal example.

I've written some hacky tests to see what conditions I can trigger a -Inf from that area in the code. I can hit it in two cases:

  • If the task group count is 0, but in a real job that never hits the spread ranking code so I don't think that's the problem.
  • If the spread percentage is 0.

So far the only way I've been able to trigger this in a real job is one where the spread percent is 0:

job "example" {
  datacenters = ["dc1"]

  spread {
    attribute = "${node.datacenter}"
    target "dc1" {
      percent = 0
    }
  }

  group "cache" {

    task "redis" {
      driver = "docker"

      config {
        image = "redis:6.0"
      }
      resources {
        cpu    = 400
        memory = 256
      }
    }
  }
}

You mentioned you're hitting this error on `nomad node status`. Could there be another job running on that node that has an improperly configured `spread` value?

But in any case it definitely looks like we should validate that `spread.target.percentage != 0`!

@mrkurt
Copy link
Author

mrkurt commented Sep 10, 2020

@tgross it definitely could be another job, the output was huge so I might've eyeball parsed it wrong. We do have other spreads set that look like this (our understanding is this spreads evenly across an attribute):

"Spreads": [
    {
      "Weight": 100,
      "Attribute": "${meta.fly_region}",
      "SpreadTarget": null
    }
  ]

@tgross
Copy link
Member

tgross commented Sep 14, 2020

So I tried to come up with a case where a spread without a target could trigger the bug. Modifying the job from earlier this one doesn't trigger the bug:

  spread {
    attribute = "${node.datacenter}"
    weight    = 100
  }

Querying /v1/job/example for that shows:

  "Spreads": [
    {
      "Attribute": "${node.datacenter}",
      "SpreadTarget": null,
      "Weight": 100
    }

But this invalid config does pass validation:

  spread {
    attribute = "${node.datacenter}"
    weight    = 100
    target "dc1" {}
  }

Querying /v1/job/example for that shows:

  "Spreads": [
    {
      "Attribute": "${node.datacenter}",
      "SpreadTarget": [
        {
          "Percent": 0,
          "Value": "dc1"
        }
      ],
      "Weight": 100
    }
  ],

Which effectively is the same invalid config I had in the previous comment. This results in the error you're seeing:

$ nomad alloc status -verbose ccf
Error querying allocation: invalid character 'I' in numeric literal

There's clearly some better validation to do around this particular bug. In the meantime, it's probably be worth iterating over your running jobs to verify there aren't any that are resulting in a 0% spread target.

@tgross tgross added stage/accepted Confirmed, and intend to work on. No timeline committment though. and removed stage/waiting-reply labels Sep 14, 2020
@tgross tgross removed their assignment Oct 12, 2020
@nugend
Copy link

nugend commented Feb 3, 2021

This is very frustrating since the cli commands to look at a node and the ui to inspect a node both fail if an allocation that has been placed on it is in this state. So you have to think of and know to use the HTTP API to look at the allocations on a node and find the offending job.

@tgross
Copy link
Member

tgross commented Feb 1, 2023

Another example of this issue #11100.

@mikenomitch
Copy link
Contributor

mikenomitch commented May 8, 2023

I dug into this a bit with @tgross and wanted to document what we found and decided.

The root cause of the issue is a division by zero, happening here. If you had a situation where the “desiredCount” was 0, this number becomes negative infinity.

Instead of doing this, we need to check if the desirectCount is 0 and instead apply a maximum penalty to that instead.

The one caveat is that if you have a desiredCount below 1, you can get very large negative scoreBoosts, and you’ll need whatever number you set the 0 score boost to to always be larger (in absolute value) than that number.

For instance, if I have a target percentage of 99% for “foo”, 1% for “bar”, and 0% for “baz”, and I have 10 allocations being placed. If I’m evaluating “bar” I might have a scoreBoost of ((desiredCount - float64(usedCount)) / desiredCount) * spreadWeight which would be (0.1 - 1) / 0.1, which is -9. The score for “baz” would have to be set with a larger absolute value than that.

So if you do have a desiredCount of 0, you must make the negative score boost relative to the highest previously seen used count. This is because the previously seen used count gives the relative weighting for previously selected options that we need to overcome.

I'm not 100% positive, but I think I figured out a good way to set the bounds. The lowest possible non-zero desired count is 1% of totalCount (since “percentage” must be an int). So you could make scoreBoost = (0 - (highestPreviousUsedCount + 1))/(totalCount * 0.01) in the case of an actual desiredCount of 0.

Also worth noting, we should also set the max possible penalty to something beyond -1 here, as this is incorrect logic.

tgross added a commit that referenced this issue May 15, 2023
When spread targets have a percent value of zero it's possible for them to
return -Inf scoring because of a float divide by zero. This is very hard for
operators to debug because the string "-Inf" is returned in the API and that
breaks the presentation of debugging data.

Most scoring iterators are bracketed to -1/+1, but spread iterators do not so
that they can handle greatly unbalanced scoring so we can't simply return a -1
score without generating a score that might be greater than the negative scores
set by other spread targets. Instead, track the lowest-seen spread boost and use
that as the spread boost for any cases where we'd divide by zero.

Fixes: #8863
@tgross tgross added this to the 1.5.x milestone May 15, 2023
tgross added a commit that referenced this issue May 16, 2023
When spread targets have a percent value of zero it's possible for them to
return -Inf scoring because of a float divide by zero. This is very hard for
operators to debug because the string "-Inf" is returned in the API and that
breaks the presentation of debugging data.

Most scoring iterators are bracketed to -1/+1, but spread iterators do not so
that they can handle greatly unbalanced scoring so we can't simply return a -1
score without generating a score that might be greater than the negative scores
set by other spread targets. Instead, track the lowest-seen spread boost and use
that as the spread boost for any cases where we'd divide by zero.

Fixes: #8863
tgross added a commit that referenced this issue May 16, 2023
When spread targets have a percent value of zero it's possible for them to
return -Inf scoring because of a float divide by zero. This is very hard for
operators to debug because the string "-Inf" is returned in the API and that
breaks the presentation of debugging data.

Most scoring iterators are bracketed to -1/+1, but spread iterators do not so
that they can handle greatly unbalanced scoring so we can't simply return a -1
score without generating a score that might be greater than the negative scores
set by other spread targets. Instead, track the lowest-seen spread boost and use
that as the spread boost for any cases where we'd divide by zero.

Fixes: #8863
tgross added a commit that referenced this issue May 16, 2023
When spread targets have a percent value of zero it's possible for them to
return -Inf scoring because of a float divide by zero. This is very hard for
operators to debug because the string "-Inf" is returned in the API and that
breaks the presentation of debugging data.

Most scoring iterators are bracketed to -1/+1, but spread iterators do not so
that they can handle greatly unbalanced scoring so we can't simply return a -1
score without generating a score that might be greater than the negative scores
set by other spread targets. Instead, track the lowest-seen spread boost and use
that as the spread boost for any cases where we'd divide by zero.

Fixes: #8863
tgross added a commit that referenced this issue May 16, 2023
When spread targets have a percent value of zero it's possible for them to
return -Inf scoring because of a float divide by zero. This is very hard for
operators to debug because the string "-Inf" is returned in the API and that
breaks the presentation of debugging data.

Most scoring iterators are bracketed to -1/+1, but spread iterators do not so
that they can handle greatly unbalanced scoring so we can't simply return a -1
score without generating a score that might be greater than the negative scores
set by other spread targets. Instead, track the lowest-seen spread boost and use
that as the spread boost for any cases where we'd divide by zero.

Fixes: #8863
@tgross
Copy link
Member

tgross commented May 16, 2023

I've just landed a fix for this issue in #17198 and that'll be released as a backport to 1.5.x, 1.4.x, and 1.3.x in the near future.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
stage/accepted Confirmed, and intend to work on. No timeline committment though. theme/api HTTP API and SDK issues type/bug
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants