Return FailedTGAlloc metric instead of no node err #6968
Conversation
}

s.failedTGAllocs[missing.TaskGroup.Name] = s.ctx.Metrics()
continue
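For readers outside the scheduler code, here is a minimal, self-contained Go sketch of the pattern this hunk follows; the types (AllocMetric, placement, scheduler) are illustrative stand-ins, not Nomad's actual ones:

```go
package main

import "fmt"

// Illustrative stand-ins for the scheduler's real types; the actual change
// lives in Nomad's system scheduler and records s.ctx.Metrics().
type AllocMetric struct {
	NodesEvaluated int
}

type placement struct {
	TaskGroup string
	NodeID    string
}

type scheduler struct {
	nodes          map[string]bool         // node IDs that are still eligible
	failedTGAllocs map[string]*AllocMetric // per-task-group failure metrics
}

// placeAll mirrors the shape of the hunk: when the target node is gone or
// ineligible, record a failure metric under the task group's name and
// continue, instead of aborting the whole evaluation with an error.
func (s *scheduler) placeAll(missing []placement) {
	for _, m := range missing {
		if !s.nodes[m.NodeID] {
			s.failedTGAllocs[m.TaskGroup] = &AllocMetric{NodesEvaluated: 0}
			continue
		}
		// ... normal placement path would run here ...
	}
}

func main() {
	s := &scheduler{
		nodes:          map[string]bool{"node-a": true},
		failedTGAllocs: map[string]*AllocMetric{},
	}
	s.placeAll([]placement{{TaskGroup: "cache", NodeID: "node-b"}})
	fmt.Printf("failed task groups: %d\n", len(s.failedTGAllocs)) // 1
}
```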
Alternatively, we could just log the error here and then call s.stack.SetNodes(nodes) with zero nodes to allow stack.Select to occur and possibly populate more Metrics.
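A rough, self-contained sketch of that alternative, using hypothetical stand-ins for the stack types (the real code would go through s.stack.SetNodes and the generic stack's Select):

```go
package main

import "log"

// Hypothetical stand-ins for the scheduler stack; the real suggestion targets
// s.stack.SetNodes(...) and the stack's Select(...) in Nomad's scheduler.
type Node struct{ ID string }

type Stack struct{ nodes []*Node }

func (st *Stack) SetNodes(nodes []*Node) { st.nodes = nodes }

// Select returns nil when no feasible node exists, but running it anyway
// gives it a chance to record richer "why it failed" metrics.
func (st *Stack) Select(taskGroup string) *Node {
	if len(st.nodes) == 0 {
		log.Printf("no feasible nodes for task group %q", taskGroup)
		return nil
	}
	return st.nodes[0]
}

func main() {
	st := &Stack{}
	// The alternative: log the missing-node error, hand the stack an empty
	// node list, and still call Select so metrics get populated.
	log.Printf("node for existing alloc is no longer eligible")
	st.SetNodes(nil)
	_ = st.Select("cache")
}
```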
This way we get a record of which node was missing when we tried to place the alloc, though, right? That seems better than the alternative.
Previously we'd return early, but now we're collecting the metrics info (which is great!) and then continuing on. From the operator's perspective, does this mean we're going to have a partial placement?
Yes. For a system job running on 2 nodes, marking 1 node ineligible produces the following plan:
→ nomad job plan repro.hcl
+/- Job: "redis"
+/- Task Group: "cache" (2 create/destroy update)
+/- Task: "redis" (forces create/destroy update)
+/- Env[version]: "1" => "2"
Scheduler dry-run:
- WARNING: Failed to place allocations on all nodes.
Task Group "cache" (failed to place 1 allocation):
Job Modify Index: 13
To submit the job with version verification run:
nomad job run -check-index 13 repro.hcl
When running the job with the check-index flag, the job will only be run if the
server side version matches the job modify index returned. If the index has
changed, another user has modified the job and the plan's results are
potentially invalid.
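For completeness, a hedged sketch of how the same failure surfaces programmatically through Nomad's Go API client; it assumes a reachable agent, the "redis" job from the repro above already being registered, and that JobPlanResponse.FailedTGAllocs carries the per-task-group metrics:

```go
package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/nomad/api"
)

func main() {
	// Assumes a reachable Nomad agent (default address/env handling) and that
	// the "redis" job from the repro above is already registered.
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	job, _, err := client.Jobs().Info("redis", nil)
	if err != nil {
		log.Fatal(err)
	}

	// Dry-run plan: with this change an ineligible node shows up as a failed
	// task-group allocation in the response rather than an RPC error.
	plan, _, err := client.Jobs().Plan(job, true, nil)
	if err != nil {
		log.Fatal(err)
	}

	for tg, metric := range plan.FailedTGAllocs {
		fmt.Printf("task group %q: failed placements (coalesced: %d), nodes evaluated: %d\n",
			tg, metric.CoalescedFailures, metric.NodesEvaluated)
	}
}
```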
👍 Makes sense.
If an existing system allocation is running and the node it's running on
is marked as ineligible, subsequent plan/applies return an RPC error
instead of a more helpful plan result.
This change logs the error and appends a failedTGAlloc for the
placement.
fixes #5169