
Nomad scheduling halting #10289

Closed

jonathanrcross opened this issue Apr 2, 2021 · 9 comments · Fixed by #13407
Labels: stage/accepted (Confirmed, and intend to work on; no timeline commitment though.), theme/scheduling, type/bug

Comments

@jonathanrcross commented Apr 2, 2021

Nomad version

Nomad v1.0.4 (9294f35)

Operating system and Environment details

Centos 7
Linux 3.10.0-1160.15.2.el7.x86_64 #1 SMP Wed Feb 3 15:06:38 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

Issue

We are seeing the Nomad brokers suddenly stop, or evaluations become blocked (we think), even though we have capacity capable of fulfilling the required constraints. Attempting to restart already-running services also appears to be blocked. As soon as we restart the leader, scheduling resumes (and allocates) as normal with no changes in the available clients. Since Nomad is optimistically concurrent, this has us thinking it is a leader-related issue, either with the evaluation broker or the plan broker, perhaps some mutex/locking issue, or the leader being unresponsive in some partial way (lots of guesses). Below are some of the observations we've tracked; we will be sending an email with pprofs, the leader log, and an operator debug archive.

Observations

[Three screenshots of metrics graphs, captured Apr 2, 2021 around 8:45 AM, attached in the original issue.]

Server configuration

autopilot {
  cleanup_dead_servers = true
  last_contact_threshold = "4s"
}

server {
  enabled = true
  bootstrap_expect = 3
  encrypt = "..."
  server_join {
    ...
  }

  node_gc_threshold = "15s"
  job_gc_interval = "1m"
  job_gc_threshold = "30s"

  heartbeat_grace = "15s"
  max_heartbeats_per_second = 45.0
  raft_multiplier = 7
}

Reproduction steps

Unable to reproduce

Expected Result

Actual Result

All jobs are stuck in pending; restarting a job also does not take effect.

Job file (if appropriate)

@drewbailey (Contributor)

@jonathanrcross Sorry that you are running into this, and thank you for the detailed report and logs.

The team will be taking a look at this and we'll hopefully have a follow-up soon.

@jonathanrcross (Author)

We're adding the autopilot stanza and updating raft_protocol to 3. Surprised we didn't do that beforehand; we'll let you know if we still see the same behavior.

@drewbailey (Contributor)

@jonathanrcross we were able to spend some time looking at the log bundle you sent as well as the graphs you provided. We have a few follow-up questions and would like to learn more about your servers' network topology, as well as why you have tuned the heartbeat parameters and raft_multiplier to the values you shared.

The logs you provided show 200+ of the following error log message pairs. Yamux (our connection multiplexer) is configured with a 30-second keepalive interval, so it seems as though there was some form of network degradation during this time period:

[ERROR] nomad.rpc: yamux: keepalive failed: i/o deadline reached
[ERROR] nomad.rpc: multiplex_v2 conn accept failed: error="keepalive timeout"

We are also wondering whether you were scaling up the number of client nodes during this time. The logs additionally show over 3,000 unique node TTL expired warnings.
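
For readers unfamiliar with the keepalive mechanism mentioned above, here is a minimal sketch of where that interval lives, using the public github.com/hashicorp/yamux library. This is not Nomad's actual RPC wiring; the address is a hypothetical placeholder and the 30-second value simply mirrors the interval described above. When a keepalive round-trip does not complete in time, the session fails with errors like the two log lines shown.

package main

import (
	"log"
	"net"
	"time"

	"github.com/hashicorp/yamux"
)

// dialMultiplexed wraps a raw TCP connection in a yamux session with a
// 30-second keepalive. If a keepalive fails, the whole session (and every
// stream multiplexed on it) is torn down.
func dialMultiplexed(addr string) (*yamux.Session, error) {
	conn, err := net.Dial("tcp", addr)
	if err != nil {
		return nil, err
	}

	cfg := yamux.DefaultConfig()
	cfg.EnableKeepAlive = true
	cfg.KeepAliveInterval = 30 * time.Second

	return yamux.Client(conn, cfg)
}

func main() {
	// 127.0.0.1:4647 is only a placeholder for a server RPC address.
	sess, err := dialMultiplexed("127.0.0.1:4647")
	if err != nil {
		log.Fatal(err)
	}
	defer sess.Close()
	log.Println("yamux session established; keepalives every 30s")
}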

@jonathanrcross (Author) commented Apr 6, 2021

Hi @drewbailey and team, thanks for looking all of that over!

We tuned the heartbeat parameters and raft_multiplier after having reached around 3,500+ instances in the past; at that scale the servers would struggle (outage or potential outage) with rapid changes in instances being added/removed alongside scheduling. By loosening the timings we thought we would relieve CPU and network demands at those peaks. We are running the servers on GCP n1-standard-8s, with the idea of avoiding a move to a larger machine size by using more relaxed timings. Our workloads are quite elastic and on this particular cloud require placement in specific AZs, so instances are being initialized/removed quite frequently.

As for the topology, each server runs in a different AZ (GCP) within a region. Clients are in the same region and VPC.

Our autoscaler was scaling up at the time in an attempt to catch up to the backlog, before we noticed that nothing was being allocated/placed.

As an update to the previous message: we haven't run into this issue since adding autopilot. We also increased the server quorum to 5. We haven't seen any changes in leader election that would explain why this "fixed" the behavior we were observing.

@jonathanrcross (Author)

Following up on adding autopilot and increasing the server quorum size: we just ran into the same behavior again (it took much longer to occur). Since one of the first questions you asked was about timings, we'll set those back to the defaults and see if that helps.

@schmichael (Member)

The graph of nomad.nomad.broker.total_blocked mostly growing without bound makes me wonder if you're hitting #9506.

Have you seen the following message in your leader logs?

nomad: plan for node rejected: node_id=0fa84370-c713-b914-d329-f6485951cddc reason="reserved port collision" eval_id=098a5

Unfortunately it is a DEBUG-level log line, so if you have not enabled debug logs you won't be able to search your history. nomad monitor allows you to stream logs live at any log level, so if the issue is currently happening that might be an easy way to observe it without restarting servers. Likewise, nomad operator debug captures debug-level logs in its debug bundle and could also be used to observe the log line.

If you are seeing it: try restarting the Nomad client agent for the node_id in the log line. This has been seen to reliably fix evals being blocked forever.
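
If you want to confirm that evaluations are piling up without digging through metrics, here is a minimal sketch using the public Nomad Go API client (github.com/hashicorp/nomad/api). It assumes NOMAD_ADDR/NOMAD_TOKEN are set in the environment, and it only confirms the symptom; it does not identify the offending node.

package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/nomad/api"
)

func main() {
	// DefaultConfig picks up NOMAD_ADDR, NOMAD_TOKEN, etc. from the environment.
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// List all evaluations and tally them by status; "blocked" and "pending"
	// counts that only ever grow match the behavior described in this issue.
	evals, _, err := client.Evaluations().List(nil)
	if err != nil {
		log.Fatal(err)
	}

	counts := map[string]int{}
	for _, e := range evals {
		counts[e.Status]++
	}
	fmt.Println("evaluations by status:", counts)
}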

@tgross (Member) commented Jun 17, 2022

While working on #13407 for a customer, I searched back through open issues to see if we'd seen that before. On your leader nmd-pool-8-servers-eet0f638-c04n.us-central1 I see the following goroutine blocked in a select at plan_apply.go#L156:

goroutine 338696 [chan receive, 37 minutes]:
github.com/hashicorp/nomad/nomad.(*planner).planApply(0xc000416940)
	github.com/hashicorp/nomad/nomad/plan_apply.go:156 +0x529
created by github.com/hashicorp/nomad/nomad.(*Server).establishLeadership
	github.com/hashicorp/nomad/nomad/leader.go:250 +0x254

This exactly matches the behavior we saw on the cluster impacted by #13407, so I would expect this issue to be fixed there as well.
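
To make the failure mode concrete, here is a heavily simplified sketch of why one blocked receive in that loop halts scheduling cluster-wide. This is not Nomad's actual plan_apply.go, only an illustration of the pattern: the leader applies plans serially, so if the result for one in-flight plan never arrives, nothing behind it is ever dequeued until the leader (and with it this goroutine) is restarted.

package main

import (
	"fmt"
	"time"
)

type plan struct{ id string }

// planApply mimics the serialized loop the goroutine above is parked in:
// it dequeues one plan at a time and waits for its result before moving on.
func planApply(pending <-chan plan, apply func(plan) <-chan error) {
	for p := range pending {
		// If this receive never completes (the goroutine that should respond
		// is itself wedged), every later plan stays queued and evaluations
		// pile up as blocked.
		if err := <-apply(p); err != nil {
			fmt.Println("plan rejected:", p.id, err)
			continue
		}
		fmt.Println("plan applied:", p.id)
	}
}

func main() {
	pending := make(chan plan, 2)
	pending <- plan{id: "a"}
	pending <- plan{id: "b"} // never reached: the loop is stuck on "a"
	close(pending)

	// An apply function whose result channel is never written to, standing
	// in for the wedged state captured in the goroutine dump.
	stuck := func(p plan) <-chan error { return make(chan error) }

	go planApply(pending, stuck)

	time.Sleep(500 * time.Millisecond)
	fmt.Println("scheduling halted: planApply is still waiting on plan \"a\"")
}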

@tgross (Member) commented Jun 23, 2022

The fix for this has been merged and will ship in Nomad 1.3.2 (plus backports).

@github-actions (bot)

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked this issue as resolved and limited conversation to collaborators on Oct 22, 2022