`-Inf` in allocation scoring JSON output #8863

mrkurt · 2020-09-10T00:00:51Z

This looks identical to #5698 to me, but we have allocations/nodes that suddenly cause errors with both nomad and the web UI.

Nomad version

Nomad v0.11.0 (5f8fe0afc894d254e4d3baaeaee1c7a70a44fdc6)

Operating system and Environment details

Ubuntu 18.04, physical Intel Xeon based server.

Issue

JSON returned from nomad server includes -Inf:

{"NodeID":"233150e3-6991-9a0c-101b-64124eca941a","Scores":{"binpack":0.15161654193467708,"job-anti-affinity":0.0,"node-reschedule-penalty":0.0,"node-affinity":0.3333333333333333,"allocation-spread":-Inf},"NormScore":-Inf}
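For context on why this payload breaks clients: `-Inf` is not a legal JSON number, so strict parsers reject the whole document. The following is a minimal sketch using Go's standard `encoding/json`; the `scoreMeta` type and `decodeScore` helper are hypothetical stand-ins that mirror only the fields visible above, not Nomad's real types or serialization path.

```go
package main

import (
	"encoding/json"
	"fmt"
	"math"
)

// scoreMeta mirrors only the fields visible in the payload above.
// It is a hypothetical stand-in, not Nomad's real type.
type scoreMeta struct {
	NodeID    string
	Scores    map[string]float64
	NormScore float64
}

// decodeScore parses a score payload with the standard library decoder.
func decodeScore(payload string) (scoreMeta, error) {
	var m scoreMeta
	err := json.Unmarshal([]byte(payload), &m)
	return m, err
}

func main() {
	// -Inf is not a legal JSON number, so decoding fails.
	_, err := decodeScore(`{"NodeID":"n1","Scores":{"allocation-spread":-Inf},"NormScore":-Inf}`)
	fmt.Println("decode error:", err)

	// The standard library encoder refuses to emit the value at all.
	_, err = json.Marshal(scoreMeta{NormScore: math.Inf(-1)})
	fmt.Println("encode error:", err)
}
```

The decode error here has the same wording as the CLI error reported later in this thread.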


tgross · 2020-09-10T12:49:34Z

Hi @mrkurt! Sorry to hear about that. Can you clarify which command or HTTP API you're hitting to get that JSON back?

mrkurt · 2020-09-10T13:44:03Z

@tgross it seems like it comes from the allocations lookup for a node. `nomad node status -self` would do it, and the client wouldn't load in the UI at all (console errors parsing what I'm pretty sure was the allocations response).

tgross · 2020-09-10T14:10:24Z

Ok, thanks @mrkurt. It looks like the last logic change to that area of the code was in ea843a5#diff-7c9a6a9bdf41c1724b4f427d5c39daec which shipped in 0.9.4, so it looks like there's an unfortunate edge case here. Can you share the jobspec for the allocation involved? That might help us narrow the problem down.

mrkurt · 2020-09-10T14:34:00Z

These are Spreads/Constraints/Affinities from the job (I can email you the whole thing if it's helpful):

  "Spreads": [
    {
      "Weight": 100,
      "Attribute": "${meta.fly_region}",
      "SpreadTarget": [
        {
          "Value": "vin",
          "Percent": 100
        }
      ]
    }
  ],
  "Affinities": [
    {
      "Weight": 50,
      "LTarget": "${meta.fly_region}",
      "Operand": "set_contains_any",
      "RTarget": "vin"
    },
    {
      "Weight": -100,
      "LTarget": "${meta.fly_region}",
      "Operand": "set_contains_any",
      "RTarget": "scl1,syd"
    }
  ],
  "Constraints": [
    {
      "LTarget": "${meta.fly_region}",
      "Operand": "set_contains_any",
      "RTarget": "scl1,syd,vin"
    }
  ]

If it matters, there aren't any nodes that have fly_region: scl1.

tgross · 2020-09-10T15:54:14Z

I dove into the relevant section of the code and it looks like the place this could potentially bubble up from is spread.go#L230-L257, which sets the counts that get used in spread.go#L134. See https://play.golang.org/p/sermtiG3PgC for a minimal example.

I've written some hacky tests to see under what conditions I can trigger a -Inf from that area of the code. I can hit it in two cases:

- If the task group count is 0. In a real job that never reaches the spread ranking code, so I don't think that's the problem.
- If the spread percentage is 0.

So far the only way I've been able to trigger this in a real job is one where the spread percent is 0:

job "example" {
  datacenters = ["dc1"]

  spread {
    attribute = "${node.datacenter}"
    target "dc1" {
      percent = 0
    }
  }

  group "cache" {

    task "redis" {
      driver = "docker"

      config {
        image = "redis:6.0"
      }
      resources {
        cpu    = 400
        memory = 256
      }
    }
  }
}


You mentioned you're hitting this error on `nomad node status`. Could there be another job running on that node that has an improperly configured `spread` value?

But in any case it definitely looks like we should validate that `spread.target.percent != 0`!

mrkurt · 2020-09-10T21:13:03Z

@tgross it definitely could be another job, the output was huge so I might've eyeball parsed it wrong. We do have other spreads set that look like this (our understanding is this spreads evenly across an attribute):

"Spreads": [
    {
      "Weight": 100,
      "Attribute": "${meta.fly_region}",
      "SpreadTarget": null
    }
  ]

tgross · 2020-09-14T14:07:27Z

So I tried to come up with a case where a spread without a target could trigger the bug. This modified version of the earlier job doesn't trigger it:

  spread {
    attribute = "${node.datacenter}"
    weight    = 100
  }

Querying /v1/job/example for that shows:

  "Spreads": [
    {
      "Attribute": "${node.datacenter}",
      "SpreadTarget": null,
      "Weight": 100
    }
  ],

But this invalid config does pass validation:

  spread {
    attribute = "${node.datacenter}"
    weight    = 100
    target "dc1" {}
  }

Querying /v1/job/example for that shows:

  "Spreads": [
    {
      "Attribute": "${node.datacenter}",
      "SpreadTarget": [
        {
          "Percent": 0,
          "Value": "dc1"
        }
      ],
      "Weight": 100
    }
  ],

This is effectively the same invalid config as in my previous comment, and it results in the error you're seeing:

$ nomad alloc status -verbose ccf
Error querying allocation: invalid character 'I' in numeric literal

There's clearly some better validation to do around this particular bug. In the meantime, it's probably worth iterating over your running jobs to verify that none of them end up with a 0% spread target.
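One way to do that check is sketched below, assuming each job definition has already been saved as JSON from `/v1/job/<id>`. The struct types and the `zeroPercentTargets` helper are hypothetical and cover only the fields shown in this thread; the `TaskGroups` field is included on the assumption that group-level spreads appear there in the same shape.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// These types cover only the fields shown in this thread (hypothetical).
type spreadTarget struct {
	Value   string
	Percent int
}

type spread struct {
	Attribute    string
	Weight       int
	SpreadTarget []spreadTarget
}

type taskGroup struct {
	Name    string
	Spreads []spread
}

type job struct {
	ID         string
	Spreads    []spread
	TaskGroups []taskGroup // assumption: group-level spreads live here
}

// zeroPercentTargets describes every spread target whose percent is 0.
func zeroPercentTargets(raw []byte) ([]string, error) {
	var j job
	if err := json.Unmarshal(raw, &j); err != nil {
		return nil, err
	}
	var found []string
	check := func(scope string, spreads []spread) {
		for _, s := range spreads {
			// A null SpreadTarget decodes to a nil slice and is skipped.
			for _, t := range s.SpreadTarget {
				if t.Percent == 0 {
					found = append(found, fmt.Sprintf("%s: %s=%s", scope, s.Attribute, t.Value))
				}
			}
		}
	}
	check("job", j.Spreads)
	for _, tg := range j.TaskGroups {
		check("group "+tg.Name, tg.Spreads)
	}
	return found, nil
}

func main() {
	raw := []byte(`{"ID":"example","Spreads":[{"Attribute":"${node.datacenter}",
		"SpreadTarget":[{"Percent":0,"Value":"dc1"}],"Weight":100}]}`)
	found, err := zeroPercentTargets(raw)
	fmt.Println(found, err)
}
```

A spread with `"SpreadTarget": null` produces no findings, which matches the observation above that a spread without targets does not trigger the bug.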

nugend · 2021-02-03T15:44:58Z

This is very frustrating, since both the CLI commands and the UI for inspecting a node fail if an allocation placed on it is in this state. You have to know to fall back to the HTTP API to list the allocations on the node and find the offending job.

tgross · 2023-02-01T19:27:04Z

Another example of this issue: #11100.

mikenomitch · 2023-05-08T20:02:56Z

I dug into this a bit with @tgross and wanted to document what we found and decided.

The root cause of the issue is a division by zero, happening here. If you had a situation where the “desiredCount” was 0, this number becomes negative infinity.

Instead of doing this, we need to check whether desiredCount is 0 and apply a maximum penalty in that case.

The one caveat is that a desiredCount below 1 can produce very large negative scoreBoosts, so whatever value you assign for the zero case must always be larger in absolute value than those.

For instance, say I have a target percentage of 99% for “foo”, 1% for “bar”, and 0% for “baz”, and 10 allocations being placed. When evaluating “bar” with a usedCount of 1, the scoreBoost is ((desiredCount - float64(usedCount)) / desiredCount) * spreadWeight. Leaving spreadWeight aside, that is (0.1 - 1) / 0.1, which is -9. The score for “baz” would have to be set with a larger absolute value than that.

So if you do have a desiredCount of 0, you must make the negative score boost relative to the highest previously seen used count. This is because the previously seen used count gives the relative weighting for previously selected options that we need to overcome.

I'm not 100% positive, but I think I figured out a good way to set the bounds. The lowest possible non-zero desired count is 1% of totalCount (since “percentage” must be an int). So you could make scoreBoost = (0 - (highestPreviousUsedCount + 1))/(totalCount * 0.01) in the case of an actual desiredCount of 0.
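The proposed bound can be sketched as follows, using the foo/bar/baz numbers from above. This is an illustration of the idea only: `boostWithFloor` and its parameter names are hypothetical, spreadWeight is left out (treated as 1), and totalCount is assumed to be greater than 0.

```go
package main

import "fmt"

// boostWithFloor is a hypothetical sketch of the proposal above.
// spreadWeight is omitted and totalCount is assumed to be > 0.
func boostWithFloor(percent, totalCount, usedCount, highestPrevUsed int) float64 {
	desiredCount := float64(percent) / 100.0 * float64(totalCount)
	if desiredCount == 0 {
		// The smallest non-zero desired count is 1% of totalCount, so
		// dividing by it bounds every penalty a non-zero target can get.
		minDesired := float64(totalCount) * 0.01
		return (0 - float64(highestPrevUsed+1)) / minDesired
	}
	return (desiredCount - float64(usedCount)) / desiredCount
}

func main() {
	bar := boostWithFloor(1, 10, 1, 1) // 1% target with one allocation used
	baz := boostWithFloor(0, 10, 1, 1) // 0% target
	fmt.Printf("bar=%.2f baz=%.2f\n", bar, baz)
	fmt.Println("baz penalised more than bar:", baz < bar)
}
```

With these inputs “bar” scores roughly -9 and “baz” roughly -20, so the zero-percent target stays finite while still ranking below the worst non-zero target.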

Also worth noting: we should set the max possible penalty to something beyond -1 here, as this is incorrect logic.

When spread targets have a percent value of zero, they can return -Inf scores because of a float divide by zero. This is very hard for operators to debug because the string "-Inf" is returned in the API, which breaks the presentation of debugging data. Most scoring iterators are bracketed to -1/+1, but spread iterators are not, so that they can handle greatly unbalanced scoring. That means we can't simply return a -1 score: it might be greater than the negative scores set by other spread targets. Instead, track the lowest-seen spread boost and use that as the spread boost for any case where we'd otherwise divide by zero. Fixes: #8863
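The approach described in that commit message might look roughly like the sketch below. It is not the code from the actual fix: `spreadScorer`, `newSpreadScorer`, and `boost` are hypothetical names, and starting the lowest-seen boost at -1 is an assumption made for illustration.

```go
package main

import "fmt"

// spreadScorer is a hypothetical scorer that remembers the lowest
// (most negative) boost it has produced so far.
type spreadScorer struct {
	lowestBoost float64
}

// newSpreadScorer starts the lowest-seen boost at -1 (an assumption).
func newSpreadScorer() *spreadScorer {
	return &spreadScorer{lowestBoost: -1.0}
}

// boost returns the spread boost for one target, never dividing by zero.
func (s *spreadScorer) boost(desiredCount float64, usedCount int) float64 {
	if desiredCount == 0 {
		// Dividing would yield -Inf; reuse the worst finite boost instead.
		return s.lowestBoost
	}
	b := (desiredCount - float64(usedCount)) / desiredCount
	if b < s.lowestBoost {
		s.lowestBoost = b
	}
	return b
}

func main() {
	s := newSpreadScorer()
	fmt.Println(s.boost(9.9, 1)) // 99% target: mild boost
	fmt.Println(s.boost(0.1, 1)) // 1% target: large negative boost
	fmt.Println(s.boost(0, 1))   // 0% target: reuses the lowest boost seen
}
```

The zero-percent target ends up with the same score as the worst target seen so far, so the result stays finite and serializable.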

tgross · 2023-05-16T20:12:30Z

I've just landed a fix for this issue in #17198 and that'll be released as a backport to 1.5.x, 1.4.x, and 1.3.x in the near future.

tgross added theme/api HTTP API and SDK issues type/bug stage/needs-investigation stage/waiting-reply labels Sep 10, 2020

tgross changed the title ~~-Inf in spread JSON output~~ -Inf in allocation scoring JSON output Sep 10, 2020

tgross self-assigned this Sep 10, 2020

tgross removed the stage/needs-investigation label Sep 10, 2020

tgross added stage/accepted Confirmed, and intend to work on. No timeline committment though. and removed stage/waiting-reply labels Sep 14, 2020

tgross removed their assignment Oct 12, 2020

tgross mentioned this issue Nov 10, 2021

Node affinity range is squashed when mixing positive and negative weights #11130

Open

tgross mentioned this issue Feb 1, 2023

Error querying node allocations: invalid character 'I' in numeric literal #11100

Closed

mikenomitch assigned tgross May 9, 2023

tgross mentioned this issue May 15, 2023

scheduler: prevent -Inf in spread scoring #17198

Merged

tgross added this to the 1.5.x milestone May 15, 2023

tgross closed this as completed in #17198 May 16, 2023
