Since plans for nodes may be rejected for both legitimate and incorrect reasons, we should be defensive about handling the pathological condition where workers and the plan applier disagree about the validity of a node. Nomad does optimistic concurrent scheduling: multiple concurrent scheduling workers may submit conflicting plans, from which the serializing plan applier must choose a "winner" (e.g. when accepting both would exhaust the CPU resources on a node). The worker whose plan "loses" then refreshes its cluster state and attempts to schedule the workload again.
Unfortunately this means the workers and plan applier have different logic for assessing the feasibility of nodes. Any slight divergence in their node feasibility assessment may lead to a pathological loop:
1. Worker submits a Plan containing placements for Node A.
2. Plan Applier validates the Plan and disagrees that Node A is a valid placement.
3. Worker sees the last plan was only partially applied, so it retries, still thinking Node A is valid.
4. Go to step 2.
While there are various backoffs and limits that cause evaluations to eventually clear, clearing a single evaluation can take minutes. In a busy cluster, especially one recovering from a partition or another event that causes a flurry of activity, minutes-per-evaluation can mean it takes days, weeks, or even months for the queue to clear. Even an eval queue that's effectively stuck on these pathological evals for hours is enough to cause severe service disruption.
Proposal
Since the existing eval backoff and retry mechanism is insufficient to deal with "bad" nodes (nodes for which the workers and plan applier disagree about feasibility), a new mechanism for detecting this pathological condition must be added. Since it is impossible to differentiate between a correctness issue (worker/applier disagreement) and intentional conflict resolution due to concurrent schedulers, we must rely on heuristics to make progress in the face of bad nodes.
The Plan Applier is the ideal place to implement the heuristic bad node detection mechanism for two reasons:
1. Since the Plan Applier is single-threaded and sees all plans, no coordination around the heuristic's state is necessary across goroutines or processes. The Plan Applier can track bad nodes and invalidate them without having to coordinate such tracking with Workers. Workers can remain unchanged.
2. The Plan Applier commits Plan Results to Raft, while Workers avoid making Raft commits until an Evaluation is fully processed so that they have as little coordination and overhead as possible. Therefore whatever decisions the heuristic mechanism makes can be committed to Raft by the Plan Applier and become globally visible not only to the workers and plan applier, but also to users who need to observe issues in their cluster.
Heuristic Mechanism
The intentional reason placements are rejected from nodes is to handle the following race condition, which is possible due to Nomad's optimistically concurrent scheduling:
1. Worker W1 dequeues Eval E1 for a low priority batch job.
2. Worker W2 dequeues Eval E2 for a high priority system job.
3. W1 submits a plan with a placement on Node N1 that uses all remaining CPU on N1.
4. W2, working concurrently with W1, submits a plan with a placement on N1 that also uses all remaining CPU on N1.
5. The Plan Applier dequeues W1's plan first and accepts it.
6. The Plan Applier then processes W2's plan and must reject its placement on N1, since W1's plan has already consumed N1's remaining CPU.
There are actually a number of properties of Nomad and its common use cases that make these sorts of conflicts more likely:
- New nodes coming up create evals for all system jobs.
- Treating Nomad as a work queue by submitting more low priority batch jobs than a cluster has capacity to process concurrently is a common and suggested use case.
- On modern servers with many cores, it is likely that when a node is added, all of the system evals and unblocked batch evals will be processed by concurrent workers more or less simultaneously.
- Nomad's preemption feature is a great solution for the above case of a new node receiving low priority batch placements before high priority system jobs. However, since preemption is costly to calculate, Nomad puts it in the concurrent workers for optimal scalability. While this makes preemption and the plan applier's node rejection work well together, it does complicate the heuristic for detecting pathological node rejections, as some level of rejections is expected for preemption to work optimally.
- Nomad's default bin packing algorithm means concurrent workers are likely to select the same nearly-at-capacity nodes for new placements, increasing the likelihood that some will be rejected by the plan applier.
All of this to say: the plan applier is expected to reject some node placements for correctness, so how do we detect pathological node rejections?
The general heuristic should be:
If a node is rejected N times in D seconds, mark it as ineligible.
The heuristic will have the following parameters:
1. Maximum number of nodes to track rejection stats for, to cap the amount of memory used.
2. Time Window (D) within which to track rejections.
3. Threshold (N) of rejections per Window before a node is marked ineligible.
Time Window (D) and Threshold (N) should be configurable.
The maximum number of nodes tracked (parameter 1 above) does not need to be configurable and can instead be hardcoded to something like 50 nodes. Even in very large clusters (thousands of nodes), the maximum number of bad nodes ever observed at once was 20-30. Even if the cap is too low to track all bad nodes simultaneously, as bad nodes are marked ineligible they will cease being tracked, making space for more nodes to be tracked.
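A minimal sketch of what such a tracker could look like inside the plan applier is shown below. The type and package names (badNodeTracker, planner) are illustrative assumptions, not the actual Nomad implementation:

```go
package planner // illustrative package name, not Nomad's real layout

import "time"

// badNodeTracker sketches the proposed heuristic: it counts plan rejections
// per node and reports when a node crosses the threshold within the window.
type badNodeTracker struct {
	window    time.Duration // D: how far back rejections are counted
	threshold int           // N: rejections within the window before ineligibility
	maxNodes  int           // cap on tracked nodes to bound memory use (e.g. 50)

	rejections map[string][]time.Time // node ID -> recent rejection times
}

// Add records a rejection for nodeID and returns true if the node should
// now be marked ineligible.
func (t *badNodeTracker) Add(nodeID string, now time.Time) bool {
	times, tracked := t.rejections[nodeID]
	if !tracked && len(t.rejections) >= t.maxNodes {
		// At capacity: don't start tracking new nodes until a slot frees up.
		return false
	}

	// Drop rejections that fall outside the window, then record this one.
	cutoff := now.Add(-t.window)
	recent := times[:0]
	for _, ts := range times {
		if ts.After(cutoff) {
			recent = append(recent, ts)
		}
	}
	recent = append(recent, now)
	t.rejections[nodeID] = recent

	if len(recent) >= t.threshold {
		// Once marked ineligible the node no longer needs tracking, which
		// frees a slot for another node.
		delete(t.rejections, nodeID)
		return true
	}
	return false
}
```

Because the plan applier is single-threaded, no locking is needed around this state; the applier can record each rejected node and include any nodes that cross the threshold in the plan result's IneligibleNodes.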
The PlanResult struct can encode the nodes to be marked ineligible:
```go
type PlanResult struct {
	// ...

	// IneligibleNodes are nodes the plan applier has repeatedly rejected
	// placements for and should therefore be considered ineligible by
	// workers to avoid retrying them repeatedly.
	IneligibleNodes []string
}
```
The FSM handler for the ApplyPlanResultsRequestType Raft log message will need to be updated to mark nodes ineligible.
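As a rough sketch, the FSM change could look something like the fragment below. It assumes the new IneligibleNodes field is also carried on the Raft request, and the state-store call is an assumption for illustration; the real method and signature may differ:

```go
// Sketch of the ApplyPlanResultsRequestType handler change (fragment of the
// existing FSM handler, not a complete implementation).
func (n *nomadFSM) applyPlanResults(buf []byte, index uint64) interface{} {
	var req structs.ApplyPlanResultsRequest
	if err := structs.Decode(buf, &req); err != nil {
		panic(fmt.Errorf("failed to decode ApplyPlanResultsRequest: %v", err))
	}

	// ... existing handling of allocations, deployments, evals, etc. ...

	// New: persist the plan applier's decision so that every worker's
	// state snapshot sees these nodes as ineligible.
	for _, nodeID := range req.IneligibleNodes {
		// Hypothetical state-store helper for illustration only.
		if err := n.state.UpdateNodeEligibility(index, nodeID, structs.NodeSchedulingIneligible); err != nil {
			return err
		}
	}
	return nil
}
```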
When retrying evaluations, workers are sent the PlanResult, but it is not necessary for the worker to explicitly read IneligibleNodes because the RefreshIndex will cause the worker to use a snapshot of cluster state in which those nodes are already marked ineligible.
Visibility
A metric and log line should be emitted every time a new node is marked ineligible by the heuristic.
The existing nomad.plan.node_rejected metric and associated log line should be sufficient to guide users on how best to tune the Time Window and Threshold parameters, although opportunities to improve visibility should be taken.
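For example, the new signal could be emitted with the go-metrics and hclog libraries Nomad already uses; the metric name and helper below are illustrative assumptions, not an existing Nomad metric or API:

```go
// Illustrative helper: emit a counter metric and a warning log each time
// the heuristic marks a node ineligible.
func emitNodeMarkedIneligible(logger hclog.Logger, nodeID string, rejections int, window time.Duration) {
	metrics.IncrCounterWithLabels(
		[]string{"nomad", "plan", "node_marked_ineligible"}, 1,
		[]metrics.Label{{Name: "node_id", Value: nodeID}})

	logger.Warn("marking node ineligible after repeated plan rejections",
		"node_id", nodeID, "rejections", rejections, "window", window)
}
```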
Docs: https://www.nomadproject.io/docs/operations/monitoring-nomad#progress
Related issue: #9506