
Nomad scheduling halting #10289

Closed

jonathanrcross opened this issue Apr 2, 2021 · 9 comments · Fixed by #13407
Labels: stage/accepted (Confirmed, and intend to work on; no timeline commitment though.), theme/scheduling, type/bug

Comments

@jonathanrcross commented Apr 2, 2021

Nomad version

Nomad v1.0.4 (9294f35)

Operating system and Environment details

Centos 7
Linux 3.10.0-1160.15.2.el7.x86_64 #1 SMP Wed Feb 3 15:06:38 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

Issue

We are seeing the Nomad brokers suddenly stop, or evaluations become blocked (we think), even though we have capacity capable of fulfilling the required constraints. Attempting to restart already-running services also appears to be blocked. As soon as we restart the leader, scheduling resumes (and allocates) as normal with no changes in the available clients. Since Nomad is optimistically concurrent, this has us thinking it is a leader-related issue, either with the evaluation broker or the plan broker, perhaps some mutex/locking issue, or the leader being unresponsive in some partial way (lots of guesses). Below are some of the observations we've tracked; we will be sending an email with pprofs, the leader log, and an operator debug archive.

Observations

[Three screenshots of metrics graphs, captured Apr 2, 2021 around 8:45 AM, attached in the original issue.]

Server configuration

autopilot {
  cleanup_dead_servers = true
  last_contact_threshold = "4s"
}

server {
  enabled = true
  bootstrap_expect = 3
  encrypt = "..."
  server_join {
    ...
  }

  node_gc_threshold = "15s"
  job_gc_interval = "1m"
  job_gc_threshold = "30s"

  heartbeat_grace = "15s"
  max_heartbeats_per_second = 45.0
  raft_multiplier = 7
}

Reproduction steps

Unable to reproduce

Expected Result

Actual Result

All jobs are stuck in pending; restarting a job also does not take effect.

Job file (if appropriate)

@drewbailey (Contributor)

@jonathanrcross Sorry that you are running into this, and thank you for the detailed report and logs.

The team will be taking a look at this and we'll hopefully have a follow-up soon.

@jonathanrcross (Author)

We're adding the autopilot stanza and updating raft_protocol to 3. Surprised we didn't do that beforehand; we'll let you know if we still see the same behavior.

@drewbailey (Contributor)

@jonathanrcross we were able to spend some time looking at the log bundle you sent as well as the graphs you provided. We have a few follow-up questions and would like to learn more about your servers' network topology, as well as why you have tuned the heartbeat parameters and raft_multiplier to the values you shared.

The logs you provided show 200+ of the following error log message pairs. Yamux (our connection multiplexer) is configured with a 30-second keepalive interval, so it seems as though there was some form of network degradation during this time period:

[ERROR] nomad.rpc: yamux: keepalive failed: i/o deadline reached
[ERROR] nomad.rpc: multiplex_v2 conn accept failed: error="keepalive timeout"

We are also wondering whether you were scaling up the number of client nodes during this time. The logs additionally show over 3,000 unique node TTL expired warnings.
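
For readers unfamiliar with the keepalive mechanism mentioned above, here is a minimal sketch of where that interval lives, using the public github.com/hashicorp/yamux library. This is not Nomad's actual RPC wiring; the address is a hypothetical placeholder and the 30-second value simply mirrors the interval described above. When a keepalive round-trip does not complete in time, the session fails with errors like the two log lines shown.

package main

import (
	"log"
	"net"
	"time"

	"github.com/hashicorp/yamux"
)

// dialMultiplexed wraps a raw TCP connection in a yamux session with a
// 30-second keepalive. If a keepalive fails, the whole session (and every
// stream multiplexed on it) is torn down.
func dialMultiplexed(addr string) (*yamux.Session, error) {
	conn, err := net.Dial("tcp", addr)
	if err != nil {
		return nil, err
	}

	cfg := yamux.DefaultConfig()
	cfg.EnableKeepAlive = true
	cfg.KeepAliveInterval = 30 * time.Second

	return yamux.Client(conn, cfg)
}

func main() {
	// 127.0.0.1:4647 is only a placeholder for a server RPC address.
	sess, err := dialMultiplexed("127.0.0.1:4647")
	if err != nil {
		log.Fatal(err)
	}
	defer sess.Close()
	log.Println("yamux session established; keepalives every 30s")
}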

@jonathanrcross (Author) commented Apr 6, 2021

Hi @drewbailey and team, thanks for looking all of that over!

We tuned the heartbeat parameters and raft_multiplier after having reached around 3,500+ instances in the past; at that scale the servers would struggle (outage or potential outage) with rapid changes in instances being added/removed alongside scheduling. By loosening the timings we thought we would relieve CPU and network demands at those peaks. We are running the servers on GCP n1-standard-8s, with the idea of avoiding a move to a larger machine size by using more relaxed timings. Our workloads are quite elastic and on this particular cloud require placement in specific AZs, so instances are being initialized/removed quite frequently.

As for the topology, each server runs in a different AZ (GCP) within a region. Clients are in the same region and VPC.

Our autoscaler was scaling up at the time in an attempt to catch up to the backlog, before we noticed that nothing was being allocated/placed.

As an update to the previous message: we haven't run into this issue since adding autopilot. We also increased the server quorum to 5. We haven't seen any changes in leader election that would explain why this "fixed" the behavior we were observing.

@jonathanrcross (Author)

Following up on adding autopilot and increasing the server quorum size: we just ran into the same behavior again (it took much longer to occur). Since one of the first questions you asked was about timings, we'll set those back to the defaults and see if that helps.

@schmichael (Member)

The graph of nomad.nomad.broker.total_blocked mostly growing without bound makes me wonder if you're hitting #9506.

Have you seen the following message in your leader logs?

nomad: plan for node rejected: node_id=0fa84370-c713-b914-d329-f6485951cddc reason="reserved port collision" eval_id=098a5

Unfortunately it is a DEBUG-level log line, so if you have not enabled debug logs you won't be able to search your history. nomad monitor allows you to stream logs live at any log level, so if the issue is currently happening that might be an easy way to observe it without restarting servers. Likewise, nomad operator debug captures debug-level logs in its debug bundle and could also be used to observe the log line.

If you are seeing it: try restarting the Nomad client agent for the node_id in the log line. This has been seen to reliably fix evals being blocked forever.
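
If you want to confirm that evaluations are piling up without digging through metrics, here is a minimal sketch using the public Nomad Go API client (github.com/hashicorp/nomad/api). It assumes NOMAD_ADDR/NOMAD_TOKEN are set in the environment, and it only confirms the symptom; it does not identify the offending node.

package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/nomad/api"
)

func main() {
	// DefaultConfig picks up NOMAD_ADDR, NOMAD_TOKEN, etc. from the environment.
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// List all evaluations and tally them by status; "blocked" and "pending"
	// counts that only ever grow match the behavior described in this issue.
	evals, _, err := client.Evaluations().List(nil)
	if err != nil {
		log.Fatal(err)
	}

	counts := map[string]int{}
	for _, e := range evals {
		counts[e.Status]++
	}
	fmt.Println("evaluations by status:", counts)
}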

@tgross (Member) commented Jun 17, 2022

While working on #13407 for a customer, I searched back through open issues to see if we'd seen that before. On your leader nmd-pool-8-servers-eet0f638-c04n.us-central1 I see the following goroutine blocked in a select at plan_apply.go#L156:

goroutine 338696 [chan receive, 37 minutes]:
github.com/hashicorp/nomad/nomad.(*planner).planApply(0xc000416940)
	github.com/hashicorp/nomad/nomad/plan_apply.go:156 +0x529
created by github.com/hashicorp/nomad/nomad.(*Server).establishLeadership
	github.com/hashicorp/nomad/nomad/leader.go:250 +0x254

This exactly matches the behavior we saw on the cluster impacted by #13407, so I would expect this issue to be fixed there as well.
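
To make the failure mode concrete, here is a heavily simplified sketch of why one blocked receive in that loop halts scheduling cluster-wide. This is not Nomad's actual plan_apply.go, only an illustration of the pattern: the leader applies plans serially, so if the result for one in-flight plan never arrives, nothing behind it is ever dequeued until the leader (and with it this goroutine) is restarted.

package main

import (
	"fmt"
	"time"
)

type plan struct{ id string }

// planApply mimics the serialized loop the goroutine above is parked in:
// it dequeues one plan at a time and waits for its result before moving on.
func planApply(pending <-chan plan, apply func(plan) <-chan error) {
	for p := range pending {
		// If this receive never completes (the goroutine that should respond
		// is itself wedged), every later plan stays queued and evaluations
		// pile up as blocked.
		if err := <-apply(p); err != nil {
			fmt.Println("plan rejected:", p.id, err)
			continue
		}
		fmt.Println("plan applied:", p.id)
	}
}

func main() {
	pending := make(chan plan, 2)
	pending <- plan{id: "a"}
	pending <- plan{id: "b"} // never reached: the loop is stuck on "a"
	close(pending)

	// An apply function whose result channel is never written to, standing
	// in for the wedged state captured in the goroutine dump.
	stuck := func(p plan) <-chan error { return make(chan error) }

	go planApply(pending, stuck)

	time.Sleep(500 * time.Millisecond)
	fmt.Println("scheduling halted: planApply is still waiting on plan \"a\"")
}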

@tgross (Member) commented Jun 23, 2022

The fix for this has been merged and will ship in Nomad 1.3.2 (plus backports).

@github-actions (bot)

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked this issue as resolved and limited conversation to collaborators on Oct 22, 2022