Sustained CPU Spike and Death of Lead Server Leads to Total Collapse of the System #4323
Comments
The unresponsive secondary server also has a 29 GB log file which consists entirely of:
So it looks like this is actually two bugs that have combined to cause total system death.
@Miserlou How many times did you dispatch jobs? You say there are 1.7M similar lines before this, but each looks to be scheduling a new dispatch job: DOWNLOADER/dispatch-1527025593-82387453, DOWNLOADER/dispatch-1527025593-efbec847, ... Are your clients DDoSing Nomad by asking it to schedule millions of jobs?
You can see the number of jobs dispatched in the second graph; there are only 13,400, which is what we want. The "desired changes" spam looks like it's happening every 0.001 seconds, which the application is certainly not doing. I think it all occurred during that CPU spike window.
@Miserlou Can you share the full logs? I would like to see the behavior before, during, and after the CPU spike. What is your file descriptor limit for these processes? Can you list the file descriptors that Nomad has open?
I am going to close this issue as we never received reproduction steps and nothing of the sort has been reported since. If this does happen again, the following details would be useful:
Thanks!
The second crash occurred in 0.8.3, the first in 0.7.0.
Hey @Miserlou, thanks for the information. We still would like the information Alex requested to help root-cause this:
3 nodes.
You can even think of this issue as "die gracefully" if it helps, but one-too-many-jobs-and-the-whole-system-dies obviously isn't acceptable behavior for a distributed job scheduler.
This ticket shouldn't be closed.
@Miserlou Can you provide the crash output? How many jobs can the cluster run at the same time? I believe the larger problem is that you are using dispatch as a high-volume queueing system, which it is not designed to be. I would suggest you use a dedicated work queue at the volume you are dispatching.
It's literally dozens of gigabytes of log output of the kind I already linked, with lines every thousandth of a second. I described the number of jobs above. This was with 3x m4.xlarge.
@Miserlou I am asking for the output of the agent crashing. Was it a panic, an out-of-memory kill, etc.?
Those servers were terminated spot instances. I'm guessing it was an OOM, since we solved the problem last time by throwing more RAM at it. It should be easy to reproduce yourself by starting a cluster and throwing jobs at it until it falls over like this.
@Miserlou There really is not a lot to go on with this issue. You are reporting a server that is unresponsive and/or crashing, but there is no log of the crash to determine its root cause. The log lines you are showing indicate that Nomad has hit its file descriptor limit, but neither the list of open files nor the limit set for Nomad has been given.
You are queuing 180,000 jobs onto 3 instances, so high CPU is not that surprising. Again, you should likely be using a work queue.
The fd limit was raised to 65536. I don't have the memory graphs since those servers were rotated. The server nodes are dedicated and there are no other services on them. There are 10 clients connected. This isn't an issue about what I should do; this is an issue about how Nomad shouldn't fail so catastrophically. It's not one server that this happens to: when it happens, it happens to every single dedicated server in the system. The reason we migrated from a work queue to using Nomad here is because we believed your own, frankly deceptive, marketing materials. However, the lack of forward capacity planning, the lack of dispatch job constraints, and the basic stability issues are crushing us right now. I don't know what else to tell you. Can you tell me a bit about how you're testing/using Nomad at scale internally? I assumed that when your docs said "Enterprise" they meant big, but I don't know what kind of sizes you're actually talking about.
@Miserlou Sorry you feel like it wasn't clear. We never state that dispatch is meant for high volume: https://www.hashicorp.com/blog/replacing-queues-with-nomad-dispatch#caveats. The reason all servers fail is that Nomad uses Raft as its consensus protocol. All data is therefore replicated, and if you have more working-set data than can fit in memory, Nomad will crash with an out-of-memory error. Unfortunately I am going to close this issue again. If this happens to you again, please collect the various bits of information asked for in this issue. I also highly suggest you rethink the usage of Nomad as a high-volume work queue, as that is not what it is designed for.
That isn't what the blog you linked says at all. It says:
These aren't jobs which take a short time to process; these are jobs which take hours to run. There's just a lot of them. They aren't tiny jobs either; they have reasonable resources allocated to them. You still haven't said anything about the scale at which you test, but you use the words "enterprise", "production ready" and "scalable" all over your marketing materials without ever defining what you actually mean by that. Are those just... mistruths? For heaven's sake, the title of the blog post is "Replacing Queues with Nomad Dispatch"! You should leave this ticket open so somebody else who stumbles onto this limitation can add their experience; it is a serious problem.
Is there a reason this is closed? At the very least there should be some published guidance around the scaling limitations here (e.g., memory required per X concurrent jobs scheduled, hard limits that should be observed, or configurations that avoid the issue, such as using an external queue in front of Nomad).
A job with 3,800 tasks caused a Nomad OOM halfway through; raft.db reached ~10 GB and never goes back down, even after purging the job and forcing GC. It is reproducible every time. That's on a cluster of 3x 64 GB RAM servers and 10 agents. TBH, I can't see how Nomad can handle the claimed 1M jobs.
@amfern I'm terribly sorry you've hit such a catastrophic failure! I know our resource utilization guidelines are sorely lacking. It's something we intend to improve, but it varies widely depending on use patterns.
Please open a new issue with as many details as possible. In the issue template we mention a [email protected] address you can send unredacted logs to if you don't want to share all of your cluster information publicly. Please do not interpret this issue being closed as us ignoring all server-related resource utilization issues! This is absolutely not the case, and we absolutely want to ensure servers perform reliably and predictably for a wide range of use cases.
@dadgar mentioned that every Task instance contains the entire Job spec, so it is important to keep a job spec small. If your 3,800 tasks are structured as 3,800 distinct TaskGroup/Task combos in one single job spec, you'll run into memory issues at the servers very quickly. We launch thousands of tasks that are distinct and use this 1 Task/TaskGroup pattern. The only way to keep the Nomad servers sane is to break up a large job into smaller ones, so we have a typical job spec containing only about 50 distinct Task/TaskGroups. That has worked and kept memory consumption reasonable.
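As a rough illustration of that splitting pattern (a sketch only; the job names, driver, script path, and resource figures below are hypothetical, not taken from this thread), each submitted job would carry a bounded number of groups:

```hcl
# Hypothetical sketch: rather than one job spec with thousands of task groups,
# submit several smaller batch jobs, each capped at roughly 50 groups.
job "downloader-batch-1" {
  datacenters = ["dc1"]
  type        = "batch"

  # One group per unit of work; keep the total per job around 50.
  group "chunk-1" {
    task "download" {
      driver = "exec"
      config {
        command = "/usr/local/bin/download.sh"
        args    = ["--chunk", "1"]
      }
      resources {
        cpu    = 500 # MHz
        memory = 256 # MB
      }
    }
  }

  group "chunk-2" {
    task "download" {
      driver = "exec"
      config {
        command = "/usr/local/bin/download.sh"
        args    = ["--chunk", "2"]
      }
      resources {
        cpu    = 500
        memory = 256
      }
    }
  }

  # ... continue up to ~50 groups, then start "downloader-batch-2", and so on.
}
```

If, as described above, each allocation effectively carries its job's full spec, keeping each spec to a few dozen groups bounds that per-allocation overhead on the servers.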
@schmichael I am sorry, I think I wasn't using Nomad correctly. I tested with a parameterized job and could run 100k jobs on a single Nomad server and 8 agents; the server consumed 9 GB of RAM and raft.db is 217 MB. So what is the use case for the non-parameterized batch job?
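For comparison, here is a minimal sketch of the parameterized approach described above (the job name, meta key, script path, and resource figures are hypothetical):

```hcl
# Hypothetical sketch: one small parameterized job, registered once and then
# dispatched many times; each dispatch only carries its own metadata.
job "downloader" {
  datacenters = ["dc1"]
  type        = "batch"

  parameterized {
    payload       = "forbidden"
    meta_required = ["chunk_id"]
  }

  group "work" {
    task "download" {
      driver = "exec"
      config {
        command = "/usr/local/bin/download.sh"
        # Dispatch metadata is exposed as NOMAD_META_* and can be interpolated.
        args    = ["--chunk", "${NOMAD_META_chunk_id}"]
      }
      resources {
        cpu    = 500 # MHz
        memory = 256 # MB
      }
    }
  }
}
```

Each unit of work is then started with something like `nomad job dispatch -meta chunk_id=42 downloader`, so the spec copied per invocation stays tiny, which is consistent with the much lower memory and raft.db footprint reported above.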
@amfern You just need to break up a big job into smaller ones. You can see that each Allocation embeds a pointer to the Job.
@amfern I'm sorry Nomad isn't meeting your expectations! If you're interested in diagnosing the memory usage of your setup, please open a new issue and include as many details as possible:
While I can't think of any log lines in particular that would be useful for diagnosing memory usage, logs may help us ensure everything is operating as expected. Please feel free to email logs to [email protected] if you don't wish to post them publicly on the issue.
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
Nomad version
Nomad v0.7.1 (0b295d399d00199cfab4621566babd25987ba06e)
Operating system and Environment details
Ubuntu
Issue
I ran a job overnight. I came back to find that, during the night, the lead server had a sustained and unexplained CPU spike which led to the complete collapse of the entire system.
The logs end like this:
And there are 1.7 million similar lines leading up to this.
The CPU usage of the lead server over time looks like this:
And the combined length of the enqueued tasks looks like this:
At the application layer, there is obviously lots of:
In the network, there is one lead server, two secondary servers, and 20 clients, all of which are running on m4.large instances.
On the lead server, calling nomad status:
On a client:
And on a secondary server:
(it just hangs completely.)
Reproduction steps
Unsure. Actually use Nomad for a resource-intensive job at scale with the resources provided.
This is a really major issue for us. A bug in the lead server left the Nomad system completely unable to self-recover. Even secondary server instances are left in an unresponsive, unusable state.