
Automatically mark repeatedly rejected nodes in plans as ineligible #13017

Closed
schmichael opened this issue May 13, 2022 · 1 comment · Fixed by #13421
Comments

@schmichael (Member)

The plan applier should track nodes whose placements are rejected from plans, since repeatedly rejected nodes are a sign of the pathological "plan for node rejected" bug (#9506).

Bug

Docs: https://www.nomadproject.io/docs/operations/monitoring-nomad#progress
Related issue: #9506

Since plans for nodes may be rejected for a number of reasons, both legitimate and erroneous, we should be defensive about handling the pathological condition of workers and the plan applier disagreeing about the validity of a node. Nomad does optimistic concurrent scheduling: multiple concurrent scheduling workers may submit conflicting plans, from which the serializing plan applier must choose a "winner" (e.g. accepting both plans would exhaust the cpu resources on the node). The worker whose plan "loses" then refreshes its cluster state and attempts to schedule the workload again.

Unfortunately this means the workers and plan applier have different logic for assessing the feasibility of nodes. Any slight divergence in their node feasibility assessment may lead to a pathological loop:

  1. Worker submits a Plan containing placements for Node A
  2. Plan Applier validates Plan and disagrees that Node A is a valid placement
  3. Worker sees the last plan was only partially applied, so it retries, still thinking Node A is valid
  4. Goto Step 2

While there are various backoffs and limits that eventually cause evaluations to clear, each evaluation can take minutes to do so. In a busy cluster, especially one recovering from a partition or another event that causes a flurry of activity, minutes-per-evaluation can mean it takes days, weeks, or even months for the queue to clear. Even an eval queue that's effectively stuck on these pathological evals for hours is enough to cause severe service disruption.

Proposal

Since the existing eval backoff and retry mechanism is insufficient to deal with "bad" nodes (nodes for whom the workers and plan applier disagree about their feasibility), a new mechanism for detecting this pathological condition must be added. Since it is impossible to differentiate between a correctness issue (worker/applier disagreement) and intentional conflict resolution due to concurrent schedulers, we must rely on heuristics to make progress in the face of bad nodes.

The Plan Applier is the ideal place to implement the heuristic bad node detection mechanism for 2 reasons:

  1. Since the Plan Applier is single threaded and sees all plans, no coordination around the heuristic's state is necessary across goroutines or processes. The Plan Applier can track bad nodes and invalidate them without having to coordinate such tracking with Workers. Workers can remain unchanged.
  2. The Plan Applier commits Plan Results to Raft while Workers avoid making Raft commits until an Evaluation is fully processed to allow them to have as little coordination and overhead as possible. Therefore whatever decisions the heuristic mechanism makes can be committed to Raft by the Plan Applier and be globally visible not only to the workers and plan applier, but to users who need to observe issues in their cluster.

Heuristic Mechanism

Placements are intentionally rejected from nodes to account for the following race condition, which is possible due to Nomad's optimistically concurrent scheduling:

  1. Worker W1 dequeues Eval E1 for a low priority batch job
  2. Worker W2 dequeues Eval E2 for a high priority system job
  3. W1 submits a plan with placement on Node N1 that uses all remaining cpu on N1
  4. W2 working concurrently with W1 submits a plan with a placement on N1 that also uses all remaining cpu on N1
  5. The Plan Applier dequeues W1's plan first and accepts it.
  6. The Plan Applier then dequeues W2's plan and rejects its placement on N1, since N1 no longer has the cpu available.

There are actually a number of properties of Nomad and its common use cases that make these sorts of conflicts more likely:

  1. New nodes coming up create evals for all system jobs
  2. Treating Nomad as a work queue by submitting more low priority batch jobs than a cluster has capacity to process concurrently is a common and suggested use case
  3. On modern servers with many cores, it is likely that when a node is added, all of the system evals and unblocked batch evals will be processed by concurrent workers more or less simultaneously
  4. Nomad's preemption feature is a great solution for the above case of a new node receiving low priority batch placements before high priority system jobs. However since preemption is costly to calculate Nomad puts it in the concurrent workers for optimal scalability. While this makes preemption and the plan applier's node rejection work well together, it does complicate the heuristic for detecting pathological node rejections as some level of rejections is expected for preemption to work optimally.
  5. Nomad's default bin packing algorithm means concurrent workers are likely to select the same nearly-at-capacity nodes for new placements, increasing the likelihood that some will be rejected by the plan applier.

All of this to say: the plan applier is expected to reject some node placements for correctness, so how do we detect pathological node rejections?

The general heuristic should be:

If a node is rejected N times in D seconds, mark it as ineligible.

The heuristic will have the following parameters:

  1. Maximum number of nodes to track rejection stats for, to cap the amount of memory used.
  2. Time Window (D) within which to track rejections.
  3. Threshold (N) of rejections per Window before a node is marked ineligible.

Time Window (D) and Threshold (N) should be configurable.

The maximum number of nodes tracked (#1) does not need to be configurable and can instead be hardcoded to something like 50 nodes. Even in very large clusters (thousands of nodes), the maximum number of bad nodes ever observed at once was 20-30. Even if the cap is too low to track all bad nodes simultaneously, bad nodes cease being tracked once they are marked ineligible, making space for more nodes to be tracked.

The PlanResult struct can encode the nodes to be marked ineligible:

type PlanResult struct {
	// ...

	// IneligibleNodes are nodes the plan applier has repeatedly rejected
	// placements for and should therefore be considered ineligible by
	// workers to avoid retrying them repeatedly.
	IneligibleNodes []string
}

The FSM handler for the ApplyPlanResultsRequestType Raft log message will need to be updated to mark nodes ineligible.
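A minimal sketch of that FSM change might look like the following. Only the IneligibleNodes field comes from this proposal; the state store, node type, and method names are simplified stand-ins for Nomad's real implementation:

```go
package main

import "fmt"

// Stand-in eligibility values, mirroring the eligible/ineligible states
// a node can be in.
const (
	NodeSchedulingEligible   = "eligible"
	NodeSchedulingIneligible = "ineligible"
)

type Node struct {
	ID                    string
	SchedulingEligibility string
}

// PlanResult carries the IneligibleNodes field described above.
type PlanResult struct {
	IneligibleNodes []string
}

// stateStore is a simplified stand-in for Nomad's state store.
type stateStore struct {
	nodes map[string]*Node
}

// applyPlanResults sketches the ApplyPlanResultsRequestType handler: besides
// applying the plan's allocations (elided here), it marks every node listed
// in IneligibleNodes as ineligible so workers stop placing work on it.
func (s *stateStore) applyPlanResults(res *PlanResult) {
	for _, id := range res.IneligibleNodes {
		if n, ok := s.nodes[id]; ok {
			n.SchedulingEligibility = NodeSchedulingIneligible
		}
	}
}

func main() {
	s := &stateStore{nodes: map[string]*Node{
		"node-a": {ID: "node-a", SchedulingEligibility: NodeSchedulingEligible},
	}}
	s.applyPlanResults(&PlanResult{IneligibleNodes: []string{"node-a"}})
	fmt.Println(s.nodes["node-a"].SchedulingEligibility)
}
```

Since this runs in the FSM, the eligibility change is committed to Raft and becomes globally visible, per the second reason given above for placing the heuristic in the plan applier.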

When retrying evaluations, workers are sent the PlanResult, but the worker need not read IneligibleNodes explicitly: the RefreshIndex will cause the worker to use a snapshot of cluster state in which those nodes are already marked ineligible.

Visibility

A metric and log line should be emitted every time a new node is marked ineligible by the heuristic.

The existing nomad.plan.node_rejected metric and associated log line should be sufficient to guide users on how best to tune the Time Window and Threshold parameters, although opportunities to improve visibility should be taken.
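The visibility hook might be sketched as below. The metric name nomad.plan.node_marked_ineligible, the counters map, and the helper names are hypothetical stand-ins for a real metrics sink (e.g. a go-metrics IncrCounter call), used here only to illustrate emitting both signals at the moment a node is marked:

```go
package main

import (
	"fmt"
	"log"
)

// counters is a toy in-memory metrics sink standing in for a real one.
var counters = map[string]int{}

func incrCounter(name string) { counters[name]++ }

// onNodeMarkedIneligible emits a metric and a log line each time the
// heuristic marks a node ineligible, as proposed above.
func onNodeMarkedIneligible(nodeID string) {
	incrCounter("nomad.plan.node_marked_ineligible")
	log.Printf("marking node ineligible after repeated plan rejections: node_id=%s", nodeID)
}

func main() {
	onNodeMarkedIneligible("node-a")
	fmt.Println(counters["nomad.plan.node_marked_ineligible"])
}
```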

@github-actions (bot)
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked this issue as resolved and limited conversation to collaborators on Nov 10, 2022.