Nomad Lead Server at 100% CPU Usage Leading to Total System Death #4864
Hey @Miserlou sorry you're experiencing issues again. Please share the logs for this node as well as other server nodes spanning this incident. Debug level would be especially helpful. Could we also get your nomad server configuration if possible and the version of nomad you're running? As for connections, can you provide the open file handles and what your file handle limit is? Further, can you enable go tool pprof -proto -output=prof.proto http://127.0.0.1:4646/debug/pprof/profile?seconds=30 What is the state of the system during the screenshot? It appears Nomad only has a few seconds of process time. Is something restarting nomad? Also, is one process a child of another or are there actually two nomad agents running on the node? I know some details are spread across a few of your other tickets but a summary of your deployment architecture here would be helpful. How are the servers deployed, what are the machine sizes, rate of the incoming dispatch requests etc. Additionally, could you elaborate as to your use case of Nomad? I understand the frustration of coming in first thing to a crashed system, but there's really nothing we can do with the information you've provided in the initial ticket. Please be as detailed as possible so we can help as best we can. In the future please follow the issue template when opening issues. |
Viewing the UI directly gives this: This looks like it was the crash: I can't enable debug now since this is a prod system and it's already happened. File max is at This time it looks like logging stopped, as I can find no actionable logs. In the working directory there is a 226M
Obviously, this isn't the first time that Nomad server failure has caused total system collapse: #4323 although this time the outage cost us about $4,000 as it happened undetected over a weekend. I'm responsible for maintaining a system which is completely incapable of doing what it claims. The decision to use this technology was made before I arrived. It is too late to turn around. Every day is hell for me. I feel like the victim of fraud who has no choice but to wake up every day and be defrauded over and over again. |
As a reminder, when this problem occurs, it takes down all servers in the cluster. Restarting the server manually appears to cause a
Restarting all of the other servers gives just:
I have asked this so many times and nobody has ever answered me - how do you load test Nomad? It seems like this is the sort of thing that would have been discovered if you did any serious workload on Nomad at all. |
After about 15 minutes it actually did start an election again, but one of the agents simply died in the process for no reason and so everything failed:
Hey @Miserlou I'm working on going through this info. What immediately sticks out at me is in your server config you have:
Does this mean this value is set to 3 on other servers and if so why? Additionally it looks like you're running 4 servers?
This shouldn't be an issue, but 3 servers will actually yield better performance without sacrificing any reliability. You can only lose 1 server in either scenario and still maintain quorum, but 3 servers means one less participant in raft. When you say that your workers (I assume nomad clients) are elastic, does this mean they're managed by an autoscaling group? If so, do you happen to have a graph of the number of clients during the time of the crash/lockup as well as a graph of the |
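The 3-versus-4 server remark can be checked with a little arithmetic. The sketch below assumes the usual Raft majority rule, quorum = floor(n/2) + 1; the function names are illustrative and not part of Nomad.

```python
def quorum(servers: int) -> int:
    """Votes needed for a majority under the standard Raft rule."""
    return servers // 2 + 1


def failures_tolerated(servers: int) -> int:
    """Servers that can be lost while a majority is still reachable."""
    return servers - quorum(servers)


for n in (3, 4, 5):
    print(f"{n} servers: quorum={quorum(n)}, "
          f"tolerates {failures_tolerated(n)} failure(s)")
```

Both 3 and 4 servers tolerate a single failure, but 4 servers need 3 votes per commit instead of 2, which is the extra raft participant mentioned above.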
Sorry, it is actually 3 servers, not 4.
This is because it says to in the documentation. Queue length looks like you'd expect. I can't seem to find a metric for how many clients were online at the time of the crash. It was probably between 0 and 20. =/ |
Restarting keeps causing that same
Additionally enabling debug mode and setting the logging level to There are several threads to pull at here. If you could provide the logs from the weekend, perhaps via dropbox or S3 if they're too large to share here. If you'd be willing to let us look at the raft.db and state.bin files that could help as well. That |
Would you share a graph of the memory usage as well? Any chance there are OOM killer events in dmesg? |
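A rough way to look for OOM killer events is to scan saved dmesg output for the kernel's usual phrases. This is only a sketch: the exact wording varies by kernel version, the pattern below covers the common "invoked oom-killer" and "Out of memory: Kill/Killed process" forms, and the sample lines are made up for illustration rather than taken from this incident.

```python
import re

# Common kernel phrasings; treat this pattern as an assumption, not a full list.
OOM_PATTERN = re.compile(
    r"invoked oom-killer|Out of memory: Kill(ed)? process", re.IGNORECASE
)


def find_oom_events(dmesg_text: str) -> list:
    """Return the lines of dmesg output that look like OOM killer activity."""
    return [line for line in dmesg_text.splitlines() if OOM_PATTERN.search(line)]


# Hypothetical sample lines, for illustration only.
sample = """\
[100.000000] eth0: link up
[200.000000] nomad invoked oom-killer: gfp_mask=0x0, order=0
[200.000100] Out of memory: Kill process 1234 (nomad) score 900 or sacrifice child
"""

for event in find_oom_events(sample):
    print(event)
```

On a real server the input would come from something like `dmesg > dmesg.txt`, read back with `open("dmesg.txt").read()`.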
Actually that time it happened on two servers:
It's way, way, way too noisy to use DEBUG on our production systems - especially when this particular error occurs, it generated gigabytes of log output in a few seconds. It does look like those
CWL doesn't automatically track memory usage over time (AFAIK), so I can't give you that graph. "Throw more memory at it" was our solution to this problem last time but I have to say that I'm very unhappy with that as we still have the problem, only now it's more expensive. |
Progress! So Nomad is running out of memory. Now let's try a few things to figure out why. Try grepping for this in your server logs of the incident:
Also could you share the job you're dispatching? |
Do you have graphs for all of Nomad's metrics? Seeing the following would be really helpful:
We have lots of different jobs, many of them are in this directory: https://github.com/AlexsLemonade/refinebio/blob/dev/workers/nomad-job-specs/salmon.nomad.tpl - that's a template, it ends up being about 70 job specs total at registration time. I'm not sure about sharing the files because a) they're very big and b) I don't know if they're safe to share. What information do they contain? Obviously I want to make sure that they don't contain any keys or environment variables that are passed around. My plan is to just implement a governor to put a ceiling on the number of dispatch jobs at once and add How do I get those values you requested? We have disabled telemetry due to it being significantly worse than useless for dispatch jobs. |
Yeah I don't want you sharing those if you're templating in secret data. Many people use the vault integration so secrets aren't leaked into job specs. I think we may have reproduced a condition that can cause memory bloat under your use case. I'll share in more detail tomorrow, but wanted to get a reply to you today. |
Sounds good, looking forward to the fix. |
Did you see my note about your internal load testing? Because it still seems that you're advertising a product for production workloads that you've never actually tested at scale yourselves, which is obviously pretty astounding. |
As an observer, I feel for Miserlou's pain and frustration as he keeps tackling the various issues he is facing. Though, @Miserlou I have one thing to add; "You are lucky" ... Having HashiStack is "actually a good thing" ™️ 😁 as Nomad is extremely flexible yet adequately rigid. I have been trying to convince folks to adopt HashiStack, though without much luck, mostly due to their off-the-radar reasons like: From personal experience having All that said ... coming back to issue at hand and some brainstorming ...
*** FWIW, I have my Nomad agents under systemd with a 10 second delay between restarts and a separate cron job which checks 'systemctl status nomad' to determine if a datdir wipe is needed HTH, |
Before I dig into the details of the fix I think it's important to highlight that you may still be misusing Nomad. Nomad is a workload scheduler and is suitable as a replacement for a classic job queue system in many applications. However, it is not designed to be a distributed durable queuing system such as Kafka, for example. Nomad must hold all pending Jobs in memory to schedule them; it currently has no swap-to-disk feature and will crash if you DOS it. This can happen, for example, if your elastic client pool suddenly needs to expand in response to an unexpected burst in workload. While new nodes are coming online, Jobs are piling up in memory, waiting to be placed. I think an architecture where you have a feedback loop and some backpressure into what's sending dispatch requests would add some resilience to your deployment. Testing the upper bound of your deployment and using that data to watch metrics like As for the bug you really already nailed it in #4422. Each Job in Nomad maintains a small overhead of memory to track the
I'm introducing a couple new flags to the
Specifically When doing this with my local cluster I was able to reduce the memory usage of my leader by over 40%. I'm working to get this merged into our 0.8.7 branch so we can get this out ASAP. That PR is here: #4878 |
Hey Nick, big thanks for the deep dive on this. We've added a limiter for the number of dispatched jobs, which are all stored in a database anyway, so hopefully that gives us a safety buffer. We are now well aware that Nomad isn't a suitable tool for our needs - however we got this impression because you are lying to people about Nomad's capabilities both online and at conferences. Dispatch jobs are a hack and it's dishonest to advertise them as a replacement for queues, even though the title of your blog post is "Replacing Queues with Nomad Dispatch", without any mention of these crippling limitations. I'm not even going to start on the inability to handle job timeouts, the lack of forward-looking prioritized job scheduling, lack of constraints across dispatched jobs, lack of recommendations for server:job ratios, lack of load testing results, lack of clients publicly using the stack, etc.. I also don't think that Anyway, I gotta say I really do appreciate you taking the time to finally address this issue and I'm really looking forward to the patch landing, which will likely be a big help to us. |
Yep, fell into the same trap. I've been fighting with the dispatch jobs for the last week. My current idea is to simply schedule consumers with nomad and leave the actual job metadata in a queue from which they will pull the tasks. Sadly this means complete re-engineering of the entire architecture. Maybe I'll try the backpressure trick as mentioned by @Miserlou (store jobs in a queue and have nomad periodically fetch some and dispatch them). With an aggressive GC this might somewhat work. |
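The backpressure idea discussed in this thread (keep jobs in your own queue and hand only a bounded number to Nomad at a time) could look roughly like the sketch below. Everything here is hypothetical: `DispatchGovernor`, `max_in_flight`, and the `dispatch` callable are illustrative names, and in a real deployment `dispatch` would wrap whatever actually submits the job (for example a call out to `nomad job dispatch`).

```python
from collections import deque
from typing import Callable


class DispatchGovernor:
    """Holds jobs in a local backlog and caps how many are dispatched at once."""

    def __init__(self, dispatch: Callable[[dict], None], max_in_flight: int):
        self.dispatch = dispatch          # hypothetical hook that submits one job
        self.max_in_flight = max_in_flight
        self.backlog = deque()            # jobs waiting outside the scheduler
        self.in_flight = 0                # jobs currently handed to the scheduler

    def submit(self, payload: dict) -> None:
        """Queue a job locally instead of dispatching it immediately."""
        self.backlog.append(payload)

    def job_finished(self) -> None:
        """Call when a dispatched job completes, freeing one slot."""
        self.in_flight = max(0, self.in_flight - 1)

    def pump(self) -> int:
        """Dispatch backlog jobs until the ceiling is reached; return the count."""
        sent = 0
        while self.backlog and self.in_flight < self.max_in_flight:
            self.dispatch(self.backlog.popleft())
            self.in_flight += 1
            sent += 1
        return sent


# Usage with a stand-in dispatcher that just records what it was given.
dispatched = []
governor = DispatchGovernor(dispatched.append, max_in_flight=2)
for i in range(5):
    governor.submit({"job": i})
print(governor.pump())      # first pass fills the two available slots
governor.job_finished()
print(governor.pump())      # one slot was freed, so one more job goes out
```

Calling `pump()` on a timer keeps the number of pending jobs inside the scheduler bounded, while the durable backlog lives in a queue or database that is designed for it.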
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues. |
You know the drill by now - use Nomad for a heavy workload, come back from the weekend to find that the entire system has completely collapsed because the lead server is using 100% CPU and won't accept any connections, even to nomad status itself. Classic.