Nomad scheduling halting #10289
@jonathanrcross Sorry that you are running into this, and thank you for the detailed report and logs. The team will be taking a look at this and we'll hopefully have a follow-up soon.
We're adding the autopilot stanza and updating the raft_protocol to 3. Surprised we didn't do that beforehand; will let you know if we still see the same behavior.
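For readers following along, a minimal sketch of the kind of change described above, assuming a standard Nomad server agent configuration file; the autopilot values shown mirror the documented defaults and are illustrative rather than this cluster's actual settings:

```hcl
# Illustrative Nomad server agent config fragment (example values only).
server {
  enabled       = true
  raft_protocol = 3   # Autopilot's advanced features require Raft protocol 3
}

# Top-level autopilot stanza; the values here are the documented defaults.
autopilot {
  cleanup_dead_servers      = true
  last_contact_threshold    = "200ms"
  max_trailing_logs         = 250
  server_stabilization_time = "10s"
}
```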
@jonathanrcross we were able to spend some time looking at the log bundle you sent as well as the graphs you provided. The logs show 200+ pairs of the following error log messages. Yamux (our connection multiplexer) is configured with a 30 second keepalive interval, so it seems as though there was some form of network degradation during this time period.
We are also wondering if you were scaling up the number of client nodes during this time. The logs additionally show over 3000 unique …
Hi @drewbailey and team, thanks for looking all of that over! We tuned the heartbeat and raft_multiplier settings after having reached around 3500+ instances in the past: the servers would struggle (outage or potential outage) with rapid changes in instances being added/removed along with scheduling. By loosening the timings we hoped to relieve CPU and network demands at those peaks. We are running the servers on GCP n1-standard-8s, with the idea of not having to go a size bigger by using more relaxed timings. Our workloads are quite elastic and on this particular cloud require a specific AZ for placement, so instances are being initialized/removed quite frequently. As for the topology, each server runs in a different AZ (GCP) within a region. Clients are in the same region and VPC. Our autoscaler was scaling up at the time in an attempt to catch up to the backlog before we noticed that nothing was being allocated/placed.
Update from the previous message: we haven't run into this issue since adding autopilot. We also increased the server quorum to 5. We haven't seen any changes in leader election that would explain why this "fixed" the behavior we were observing.
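As context for the timing discussion above, the knobs involved are server-stanza parameters. A rough sketch of what "loosened" settings could look like follows; the specific values are invented for illustration and are not the reporter's configuration (the documented defaults are raft_multiplier = 1, heartbeat_grace = "10s", min_heartbeat_ttl = "10s"):

```hcl
# Illustrative only: relaxed server timings intended to reduce leader CPU and
# network pressure on a large, elastic cluster. Values are examples, not the
# settings from this issue.
server {
  enabled           = true
  raft_multiplier   = 5      # scales Raft election and heartbeat timeouts
  heartbeat_grace   = "30s"  # extra grace before a missed client heartbeat marks a node down
  min_heartbeat_ttl = "20s"  # lower bound on client heartbeat TTLs
}
```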
Follow-up on the autopilot change and the increased server quorum size: we just ran into the same behavior again (it took much longer to occur). Since one of the first questions you asked was about timings, we'll set those back to the defaults and see if that helps.
Have you seen the following message in your leader logs?
Unfortunately it is a … If you are seeing it, try restarting the Nomad client agent for the …
While working on #13407 for a customer, I searched back through open issues to see if we'd seen this before. On your leader …
This exactly matches the behavior we saw on the cluster impacted by #13407, so I would expect this issue will be fixed there as well.
The fix for this has been merged and will ship in Nomad 1.3.2 (plus backports).
Nomad version
Nomad v1.0.4 (9294f35)
Operating system and Environment details
Centos 7
Linux 3.10.0-1160.15.2.el7.x86_64 #1 SMP Wed Feb 3 15:06:38 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
Issue
We are seeing the Nomad brokers suddenly stop, or evaluations become blocked (we think), even though we have capacity capable of fulfilling the required constraints. Attempting to restart already-running services also appears to be blocked. As soon as we restart the leader, scheduling resumes (and allocates) as normal with no changes in the available clients. Since Nomad is optimistically concurrent, this has us thinking it is a leader-related issue, either with the evaluation broker or the plan broker: maybe some mutex/locking issue, or the leader being somehow unresponsive while still appearing responsive (lots of guesses). Below are some of the observations we've tracked; we will also be sending an email with pprofs, the leader log, and an operator debug archive.
Observations
Server configuration
Reproduction steps
Unable to reproduce
Expected Result
Actual Result
All jobs are stuck in pending; restarting a job also does not take effect.
Job file (if appropriate)