Return FailedTGAlloc metric instead of no node err #6968

Merged
merged 2 commits on Jan 22, 2020
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -21,6 +21,7 @@ BUG FIXES:
* consul: Fixed a bug where script-based health checks would fail if the service configuration included interpolation. [[GH-6916](https://github.com/hashicorp/nomad/issues/6916)]
* consul/connect: Fixed a bug where Connect-enabled jobs failed to validate when service names used interpolation. [[GH-6855](https://github.com/hashicorp/nomad/issues/6855)]
* scheduler: Fixed a bug that caused evicted allocs on a lost node to be stuck in running. [[GH-6902](https://github.com/hashicorp/nomad/issues/6902)]
* scheduler: Fixed a bug where `nomad job plan/apply` returned errors instead of a partial placement warning for ineligible nodes. [[GH-6968](https://github.com/hashicorp/nomad/issues/6968)]

## 0.10.2 (December 4, 2019)

9 changes: 8 additions & 1 deletion scheduler/system_sched.go
@@ -275,7 +275,13 @@ func (s *SystemScheduler) computePlacements(place []allocTuple) error {
for _, missing := range place {
node, ok := nodeByID[missing.Alloc.NodeID]
if !ok {
return fmt.Errorf("could not find node %q", missing.Alloc.NodeID)
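// The alloc references a node that no longer exists in the eligible node set
// (e.g. it was drained or marked ineligible). Record a failed placement metric
// for the task group and skip it rather than failing the whole evaluation.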
s.logger.Debug("could not find node", "node", missing.Alloc.NodeID)
if s.failedTGAllocs == nil {
s.failedTGAllocs = make(map[string]*structs.AllocMetric)
}

s.failedTGAllocs[missing.TaskGroup.Name] = s.ctx.Metrics()
continue
Contributor Author

Alternatively, we could just log the error here and then call s.stack.SetNodes(nodes) with zero nodes to allow stack.Select to occur and possibly populate more Metrics.
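A rough sketch of that alternative, for illustration only (this is not what the PR implements; the SetNodes/Select calls and the empty SelectOptions are assumed from the surrounding scheduler code):

    s.logger.Debug("could not find node", "node", missing.Alloc.NodeID)

    // Hand the stack an empty node set so Select still runs; it will fail to
    // place the alloc, but it records richer placement metrics while doing so.
    s.stack.SetNodes([]*structs.Node{})
    if option := s.stack.Select(missing.TaskGroup, &SelectOptions{}); option == nil {
        if s.failedTGAllocs == nil {
            s.failedTGAllocs = make(map[string]*structs.AllocMetric)
        }
        s.failedTGAllocs[missing.TaskGroup.Name] = s.ctx.Metrics()
    }
    continue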

Contributor

This way we get a record of which node was missing when we tried to place the alloc, though, right? That seems better than the alternative.

Member

Previously we'd return early, but now we're collecting the metrics info (which is great!) and then continuing on. From the operator's perspective, does this mean that we're going to have a partial placement?

Contributor Author
@drewbailey Jan 22, 2020

Yes. In the case of a system job running on 2 nodes, marking 1 node ineligible produces the following plan:

→ nomad job plan repro.hcl
+/- Job: "redis"
+/- Task Group: "cache" (2 create/destroy update)
  +/- Task: "redis" (forces create/destroy update)
    +/- Env[version]: "1" => "2"

Scheduler dry-run:
- WARNING: Failed to place allocations on all nodes.
  Task Group "cache" (failed to place 1 allocation):


Job Modify Index: 13
To submit the job with version verification run:

nomad job run -check-index 13 repro.hcl

When running the job with the check-index flag, the job will only be run if the
server side version matches the job modify index returned. If the index has
changed, another user has modified the job and the plan's results are
potentially invalid.

Member

👍 Makes sense.

}

// Update the set of placement nodes
@@ -327,6 +333,7 @@ func (s *SystemScheduler) computePlacements(place []allocTuple) error {
// Actual failure to start this task on this candidate node, report it individually
s.failedTGAllocs[missing.TaskGroup.Name] = s.ctx.Metrics()
s.addBlocked(node)

continue
}

62 changes: 62 additions & 0 deletions scheduler/system_sched_test.go
@@ -1310,6 +1310,68 @@ func TestSystemSched_Queued_With_Constraints(t *testing.T) {

}

// No errors reported when no nodes are available for placement
func TestSystemSched_NoNodes(t *testing.T) {
h := NewHarness(t)

// Create a node
node := mock.Node()
node.ComputeClass()
require.Nil(t, h.State.UpsertNode(h.NextIndex(), node))

// Make a job
job := mock.SystemJob()
require.Nil(t, h.State.UpsertJob(h.NextIndex(), job))

// Evaluate the job
eval := &structs.Evaluation{
Namespace: structs.DefaultNamespace,
ID: uuid.Generate(),
Priority: job.Priority,
TriggeredBy: structs.EvalTriggerJobRegister,
JobID: job.ID,
Status: structs.EvalStatusPending,
}

require.Nil(t, h.State.UpsertEvals(h.NextIndex(), []*structs.Evaluation{eval}))
require.Nil(t, h.Process(NewSystemScheduler, eval))
require.Equal(t, "complete", h.Evals[0].Status)

// QueuedAllocations is drained
val, ok := h.Evals[0].QueuedAllocations["web"]
require.True(t, ok)
require.Equal(t, 0, val)

// A single plan was submitted
require.Equal(t, 1, len(h.Plans))

// Mark the node as ineligible
node.SchedulingEligibility = structs.NodeSchedulingIneligible

// Create a new job version, deploy
job2 := job.Copy()
job2.Meta["version"] = "2"
require.Nil(t, h.State.UpsertJob(h.NextIndex(), job2))

eval2 := &structs.Evaluation{
Namespace: structs.DefaultNamespace,
ID: uuid.Generate(),
Priority: job2.Priority,
TriggeredBy: structs.EvalTriggerJobRegister,
JobID: job2.ID,
Status: structs.EvalStatusPending,
}

// Ensure the new eval is complete
require.Nil(t, h.State.UpsertEvals(h.NextIndex(), []*structs.Evaluation{eval2}))
require.Nil(t, h.Process(NewSystemScheduler, eval2))
require.Equal(t, "complete", h.Evals[1].Status)

// Ensure there is a FailedTGAlloc metric
require.Equal(t, 1, len(h.Evals[1].FailedTGAllocs))
}

// No errors reported when constraints prevent placement
func TestSystemSched_ConstraintErrors(t *testing.T) {
h := NewHarness(t)