
Return FailedTGAlloc metric instead of no node err #6968

Merged: drewbailey merged 2 commits into master from b-system-sched-plan-ineligible on Jan 22, 2020

Conversation

@drewbailey (Contributor) commented Jan 21, 2020:

If an existing system allocation is running and the node it's running on
is marked as ineligible, subsequent plan/applies return an RPC error
instead of a more helpful plan result.

This change logs the error and appends a failedTGAlloc for the
placement.

Fixes #5169 (system scheduler: error when performing plan/run with disabled nodes).

}

s.failedTGAllocs[missing.TaskGroup.Name] = s.ctx.Metrics()
continue
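
To make the behavioral change concrete, here is a minimal, self-contained Go sketch of the record-and-continue pattern this PR introduces. The type and function names below are illustrative stand-ins, not Nomad's actual scheduler types:

package main

import "fmt"

// allocMetric stands in for Nomad's allocation metrics; here it only
// marks that a placement attempt failed.
type allocMetric struct{}

type placement struct {
	TaskGroup string
	NodeID    string
}

// computePlacements mimics the new behavior: when a node cannot be
// found, it records a per-task-group failure and continues with the
// remaining placements instead of aborting the whole plan.
func computePlacements(place []placement, nodeByID map[string]bool) map[string]*allocMetric {
	failed := map[string]*allocMetric{}
	for _, missing := range place {
		if !nodeByID[missing.NodeID] {
			// Old behavior: return an error here, failing the entire plan.
			// New behavior: log, record the failed task group, and continue.
			fmt.Printf("could not find node %q for task group %q\n",
				missing.NodeID, missing.TaskGroup)
			failed[missing.TaskGroup] = &allocMetric{}
			continue
		}
		// ... the normal placement path would run here ...
	}
	return failed
}

func main() {
	place := []placement{
		{TaskGroup: "cache", NodeID: "ineligible-node"},
		{TaskGroup: "cache", NodeID: "healthy-node"},
	}
	nodes := map[string]bool{"healthy-node": true}
	failed := computePlacements(place, nodes)
	fmt.Printf("failed task groups: %d\n", len(failed)) // prints 1
}

The key point is the continue: one missing node no longer poisons placements on the remaining eligible nodes.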
@drewbailey (Contributor, author) commented:

Alternatively, we could just log the error here and then call s.stack.SetNodes(nodes) with zero nodes to allow stack.Select to occur and possibly populate more Metrics.
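
A rough, self-contained sketch of that alternative, purely for illustration; the stack below is a toy stand-in, and SetNodes/Select only mirror the names mentioned above, not Nomad's real scheduler stack:

package main

import "fmt"

// stack is a toy stand-in for the scheduler's selection stack.
type stack struct{ nodes []string }

// SetNodes replaces the candidate node set.
func (s *stack) SetNodes(nodes []string) { s.nodes = nodes }

// Select returns the chosen node, or nil when nothing can be placed;
// a real Select would also record scoring metrics as it runs.
func (s *stack) Select(taskGroup string) *string {
	if len(s.nodes) == 0 {
		return nil
	}
	return &s.nodes[0]
}

func main() {
	st := &stack{}
	st.SetNodes(nil) // zero nodes: Select still runs but cannot place
	if option := st.Select("cache"); option == nil {
		fmt.Println("no placement for task group cache; metrics would be populated here")
	}
}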

A reviewer (Contributor) commented:

This way we get a record of which node was missing when we tried to place the alloc, though, right? That seems better than the alternative.

A reviewer (Member) commented:

Previously we'd return early, but now we're collecting the metrics info (which is great!) and then continuing on. From the operator's perspective, does this mean that we're going to have a partial placement?

@drewbailey (Contributor, author) replied Jan 22, 2020:

Yes. In the case of a system job running on 2 nodes, marking 1 node ineligible yields the following plan:

→ nomad job plan repro.hcl
+/- Job: "redis"
+/- Task Group: "cache" (2 create/destroy update)
  +/- Task: "redis" (forces create/destroy update)
    +/- Env[version]: "1" => "2"

Scheduler dry-run:
- WARNING: Failed to place allocations on all nodes.
  Task Group "cache" (failed to place 1 allocation):


Job Modify Index: 13
To submit the job with version verification run:

nomad job run -check-index 13 repro.hcl

When running the job with the check-index flag, the job will only be run if the
server side version matches the job modify index returned. If the index has
changed, another user has modified the job and the plan's results are
potentially invalid.
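
For reference, the ineligible state in this repro can be produced with the standard node eligibility command before planning (the node ID is a placeholder):

→ nomad node eligibility -disable <node-id>
→ nomad job plan repro.hcl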

A reviewer (Member) replied:

👍 Makes sense.

@drewbailey force-pushed the b-system-sched-plan-ineligible branch from 19f8302 to abde9f9 on January 22, 2020, 15:10
@drewbailey merged commit 15b782c into master on Jan 22, 2020
@drewbailey deleted the b-system-sched-plan-ineligible branch on January 22, 2020, 16:53
@github-actions (bot) commented:

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions bot locked as resolved and limited conversation to collaborators on Jan 19, 2023