Sustained CPU Spike and Death of Lead Server Leads to Total Collapse of the System #4323
Comments
The unresponsive secondary server also has a 29 GB log file which consists entirely of:
So it looks like this is actually two bugs that have combined to cause total system death.
@Miserlou How many times did you dispatch jobs? You say there are 1.7M similar lines before this, but each looks to be scheduling a new dispatch job: DOWNLOADER/dispatch-1527025593-82387453, DOWNLOADER/dispatch-1527025593-efbec847, ... Are your clients DDoSing Nomad by asking it to schedule millions of jobs?
You can see the number of jobs dispatched in the second graph; there are only 13,400, which is what we want. The "desired changes" spam looks like it's happening every 0.001 seconds, which the application is certainly not doing. I think it all occurred during that CPU spike window.
@Miserlou Can you share the full logs? I would like to see the behavior before, during, and after the CPU spike. What is your file descriptor limit for these processes? Can you list the file descriptors that Nomad has open?
I am going to close this issue as we never received reproduction steps and nothing of the sort has been reported since. If this does happen again, the following details would be useful:
Thanks!
The second crash occurred in 0.8.3, the first in 0.7.0.
Hey @Miserlou, thanks for the information. We still would like the information Alex requested to help root-cause this:
3 nodes.
You can even think of this issue as "die gracefully" if it helps, but one-too-many-jobs-and-the-whole-system-dies obviously isn't acceptable behavior for a distributed job scheduler.
This ticket shouldn't be closed.
@Miserlou Can you provide the crash output? How many jobs can the cluster run at the same time? I believe the larger problem is that you are using dispatch as a high-volume queueing system, which it is not designed to be. I would suggest you use a dedicated work queue at the volume you are dispatching.
It's literally dozens of gigabytes of log output of the kind I already linked, with lines every thousandth of a second. I described the number of jobs above. This was with 3x m4.xlarge.
@Miserlou I am asking for the output of the agent crashing. Was it a panic, an out-of-memory kill, etc.?
Those servers were terminated spot instances. I'm guessing it was an OOM, since we solved the problem last time by throwing more RAM at it. It should be easy to reproduce yourself by starting a cluster and throwing jobs at it until it falls over like this.
@Miserlou There really is not a lot to go on with this issue. You are reporting a server that is unresponsive and/or crashing, but there is no log of the crash to determine its root cause. The log lines you are showing indicate that Nomad has hit its file descriptor limit, but neither the list of open files nor the limit set for Nomad has been given.
You are queuing 180,000 jobs onto 3 instances, so high CPU is not that surprising. Again, you should likely be using a work queue.
The fd limit was raised to 65536. I don't have the memory graphs since those servers were rotated. The server nodes are dedicated and there are no other services on them. There are 10 clients connected. This isn't an issue about what I should do; this is an issue about how Nomad shouldn't fail so catastrophically. It's not one server that this happens to: when it happens, it happens to every single dedicated server in the system. The reason we migrated from a work queue to using Nomad here is because we believed your own, frankly deceptive, marketing materials. However, the lack of forward capacity planning, the lack of dispatch job constraints, and the basic stability issues are crushing us right now. I don't know what else to tell you. Can you tell me a bit about how you're testing/using Nomad at scale internally? I assumed that when your docs said "Enterprise" they meant big, but I don't know what kind of sizes you're actually talking about.
@Miserlou Sorry you feel like it wasn't clear. We never state that dispatch is meant for high volume: https://www.hashicorp.com/blog/replacing-queues-with-nomad-dispatch#caveats. The reason all servers fail is that Nomad uses Raft as its consensus protocol. All data is therefore replicated, and if you have more working-set data than can fit in memory, Nomad will crash with an out-of-memory error. Unfortunately I am going to close this issue again. If this happens to you again, please collect the various bits of information asked for in this issue. I also highly suggest you rethink the usage of Nomad as a high-volume work queue, as that is not what it is designed for.
That isn't what the blog you linked says at all. It says:
These aren't jobs which take a short time to process; these are jobs which take hours to run. There's just a lot of them. They aren't tiny jobs either; they have reasonable resources allocated to them. You still haven't said anything about the scale at which you test, but you use the words "enterprise", "production ready" and "scalable" all over your marketing materials without ever defining what you actually mean by that. Are those just... mistruths? For heaven's sake, the title of the blog post is "Replacing Queues with Nomad Dispatch"! You should leave this ticket open so somebody else who stumbles onto this limitation can add their experience; it is a serious problem.
Is there a reason this is closed? At the very least there should be some published guidance around the scaling limitations here (e.g., memory required per X concurrent jobs scheduled, hard limits that should be observed, or configurations that avoid the issue, such as using an external queue in front of Nomad).
A job with 3,800 tasks caused a Nomad OOM halfway through; raft.db reached ~10 GB and never goes back down, even after purging the job and forcing GC. It is reproducible every time. That's on a cluster of 3x 64 GB RAM servers and 10 agents. TBH, I can't see how Nomad can handle the claimed 1M jobs.
@amfern I'm terribly sorry you've hit such a catastrophic failure! I know our resource utilization guidelines are sorely lacking. It's something we intend to improve, but it varies widely depending on use patterns.
Please open a new issue with as many details as possible. In the issue template we mention a [email protected] address you can send unredacted logs to if you don't want to share all of your cluster information publicly. Please do not interpret this issue being closed as us ignoring all server-related resource utilization issues! This is absolutely not the case, and we absolutely want to ensure servers perform reliably and predictably for a wide range of use cases.
@dadgar mentioned that every Task instance contains the entire Job spec, so it is important to keep a job spec small. If your 3,800 tasks are structured as 3,800 distinct TaskGroup/Task combos in one single job spec, you'll run into memory issues at the servers very quickly. We launch thousands of tasks that are distinct and use this 1 Task/TaskGroup pattern. The only way to keep the Nomad servers sane is to break up a large job into smaller ones, so we have a typical job spec containing only about 50 distinct Task/TaskGroups. That has worked and kept memory consumption reasonable.
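As a rough illustration of that splitting pattern (a sketch only; the job names, driver, script path, and resource figures below are hypothetical, not taken from this thread), each submitted job would carry a bounded number of groups:

```hcl
# Hypothetical sketch: rather than one job spec with thousands of task groups,
# submit several smaller batch jobs, each capped at roughly 50 groups.
job "downloader-batch-1" {
  datacenters = ["dc1"]
  type        = "batch"

  # One group per unit of work; keep the total per job around 50.
  group "chunk-1" {
    task "download" {
      driver = "exec"
      config {
        command = "/usr/local/bin/download.sh"
        args    = ["--chunk", "1"]
      }
      resources {
        cpu    = 500 # MHz
        memory = 256 # MB
      }
    }
  }

  group "chunk-2" {
    task "download" {
      driver = "exec"
      config {
        command = "/usr/local/bin/download.sh"
        args    = ["--chunk", "2"]
      }
      resources {
        cpu    = 500
        memory = 256
      }
    }
  }

  # ... continue up to ~50 groups, then start "downloader-batch-2", and so on.
}
```

If, as described above, each allocation effectively carries its job's full spec, keeping each spec to a few dozen groups bounds that per-allocation overhead on the servers.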
@schmichael I am sorry, I think I wasn't using Nomad correctly. I tested with a parameterized job and could run 100k jobs on a single Nomad server and 8 agents; the server consumed 9 GB of RAM and raft.db is 217 MB. So what is the use case for the non-parameterized batch job?
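For comparison, here is a minimal sketch of the parameterized approach described above (the job name, meta key, script path, and resource figures are hypothetical):

```hcl
# Hypothetical sketch: one small parameterized job, registered once and then
# dispatched many times; each dispatch only carries its own metadata.
job "downloader" {
  datacenters = ["dc1"]
  type        = "batch"

  parameterized {
    payload       = "forbidden"
    meta_required = ["chunk_id"]
  }

  group "work" {
    task "download" {
      driver = "exec"
      config {
        command = "/usr/local/bin/download.sh"
        # Dispatch metadata is exposed as NOMAD_META_* and can be interpolated.
        args    = ["--chunk", "${NOMAD_META_chunk_id}"]
      }
      resources {
        cpu    = 500 # MHz
        memory = 256 # MB
      }
    }
  }
}
```

Each unit of work is then started with something like `nomad job dispatch -meta chunk_id=42 downloader`, so the spec copied per invocation stays tiny, which is consistent with the much lower memory and raft.db footprint reported above.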
@amfern You just need to break up a big job into smaller ones. You can see that each Allocation embeds a pointer to the Job.
@amfern I'm sorry Nomad isn't meeting your expectations! If you're interested in diagnosing the memory usage of your setup, please open a new issue and include as many details as possible:
While I can't think of any log lines in particular that would be useful for diagnosing memory usage, logs may help us ensure everything is operating as expected. Please feel free to email logs to [email protected] if you don't wish to post them publicly on the issue.
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
Nomad version
Nomad v0.7.1 (0b295d399d00199cfab4621566babd25987ba06e)
Operating system and Environment details
Ubuntu
Issue
I ran a job overnight. I came back to find that, during the night, the lead server had a sustained and unexplained CPU spike which led to the complete collapse of the entire system.
The logs end like this:
And there are 1.7 million similar lines leading up to this.
The CPU usage of the lead server over time looks like this:
And the combined length of the enqueued tasks looks like this:
At the application layer, there is obviously lots of:
In the network, there is one lead server, two secondary servers, and 20 clients, all of which are running on m4.large instances.
On the lead server, calling nomad status:
On a client:
And on a secondary server:
(it just hangs completely.)
Reproduction steps
Unsure. Actually use Nomad for a resource-intensive job at scale with the resources provided.
This is a really major issue for us. A bug in the lead server left the Nomad system completely unable to self-recover. Even secondary server instances are left in an unresponsive, unusable state.