Spread/affinity values in allocations may generate invalid JSON (and may not be working correctly) #5698
Comments
We've done some more digging - this doesn't seem to be impacting every allocation for a job, only some. We haven't been able to identify what the cause is yet.
We've been seeing this too with v0.9.1, but like you not for every job. I noticed it come up when tasks are migrated while draining a client. Currently, the only thing that seems to fix it is nuking the
We ran into this problem as well, with the allocation-spread:
I did a quick dive through the code and
There are a couple of things in that method that could cause a +Inf. First is
If
So if you never specify any weights, that would cause a +Inf. The second thing I found was a call to:
In that function, we have a divide by
Note that if minCount is 0 and currentAttributeCount isn't 0 and maxCount isn't 0, we fall through to the second to last line, which does a division by 0, which will result in a +Inf. The iteration over
@rsigrest: interesting. We went straight from v0.8.7 to v0.9.1, skipping v0.9.0. I'm curious if you were on v0.9.0 previously and didn't see the problem? Similar question about version to you @jonbodner. We're actually in a much worse place today, and we're trying to figure out if it's related to the spread/+Inf issue or not. All new allocations are stuck in pending, with no reason (there are placement metrics for some of them though). We basically can't reschedule anything, can't drain nodes, etc. Not sure if it's related to spreads or just a v0.9.X related issue at the moment.
(Speaking in regards to @jonbodner's comment, we work together) @chrisboulton we moved from 0.8.7 to 0.9.0, experienced the issues with the malformed JSON, and then moved from 0.9.0 to 0.9.1 and continued to witness it. Eventually we removed the
I was also using 0.8.7 and upgraded directly to 0.9.1, specifically because of the Based on what @jonbodner posted earlier, I'm going to try some deployments that specify a |
For what it's worth - we were using the weight attribute. Our spread stanza (at a task group level) was:
@jonbodner Thanks for your analysis, wanted to clarify that the weight is set to 50 here (line 272 in 5ea4382).
PR #5713 fixes the divide by zero bug.
Just wanted to drop a quick note - I cherry-picked #5713 and built a release on top of v0.9.1. We've had it in our test environments for 48 hours now, and have had spreads enabled for the past 24 hours. Can confirm this is fixed now. 🎉
Nomad version
Issue
We've just rolled spreads out in our environment, and for some of our services, whenever we run {{nomad alloc status..}} we receive the following:
I grabbed one of the evaluations via the API, and it looks like for whatever reason some of the score metadata is returning invalid JSON (snip of - see +Inf references):
We're not sure how to reproduce this as yet - we just know it's happening consistently in our environment. All nomad-clients (and servers) are running Nomad 0.9.1.
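As a small illustration of why a bare +Inf breaks clients, the Go sketch below feeds a hypothetical score payload through the standard `encoding/json` decoder. The payloads, the `allocation-spread` key placement, and the `decodeScores` helper are assumptions made up for this example; they are not the actual API response from this report.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// decodeScores parses a score payload with Go's standard JSON decoder.
// It is a hypothetical helper for this example, not part of Nomad.
func decodeScores(payload []byte) (map[string]float64, error) {
	var scores map[string]float64
	err := json.Unmarshal(payload, &scores)
	return scores, err
}

func main() {
	// Illustrative payloads only; the real response has more structure.
	bad := []byte(`{"allocation-spread": +Inf}`)
	good := []byte(`{"allocation-spread": 0.5}`)

	// +Inf is not a valid JSON number, so decoding fails with a syntax error.
	if _, err := decodeScores(bad); err != nil {
		fmt.Println("bad payload rejected:", err)
	}

	// A finite score decodes normally.
	if scores, err := decodeScores(good); err == nil {
		fmt.Println("good payload:", scores["allocation-spread"])
	}
}
```

JSON has no literal for infinity, so any strict parser is expected to fail on such output in a similar way.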
Job file
Emailed along with full un-redacted allocation output