Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sustained CPU Spike and Death of Lead Server Leads to Total Collapse of the System #4323

Closed
Miserlou opened this issue May 23, 2018 · 28 comments

Comments

@Miserlou
Copy link

Miserlou commented May 23, 2018

Nomad version

Nomad v0.7.1 (0b295d399d00199cfab4621566babd25987ba06e)

Operating system and Environment details

Ubuntu

Issue

I ran a job overnight. I came back to find that overnight, the lead server had a sustained and unexplained CPU spike which lead to the complete collapse of the total system.

The logs end like this:

Desired Changes for "jobs": (place 1) (inplace 0) (destructive 0) (stop 0) (migrate 0) (ignore 0) (canary 0)
    2018/05/23 01:31:05.230446 [DEBUG] worker: reblocked evaluation <Eval "a569463e-7375-c5de-44ef-225938aeb909" JobID: "DOWNLOADER/dispatch-1527025593-82387453" Namespace: "default">
    2018/05/23 01:31:05.230520 [DEBUG] worker: ack for evaluation a569463e-7375-c5de-44ef-225938aeb909
    2018/05/23 01:31:05.230624 [DEBUG] worker: dequeued evaluation f317dec3-c2bf-77b1-5894-cbe6b4e94716
    2018/05/23 01:31:05.230765 [DEBUG] sched: <Eval "f317dec3-c2bf-77b1-5894-cbe6b4e94716" JobID: "DOWNLOADER/dispatch-1527025593-efbec847" Namespace: "default">: Total changes: (place 1) (destructive 0) (inplace 0) (stop 0)
Desired Changes for "jobs": (place 1) (inplace 0) (destructive 0) (stop 0) (migrate 0) (ignore 0) (canary 0)
    2018/05/23 01:31:05.231449 [DEBUG] worker: reblocked evaluation <Eval "f317dec3-c2bf-77b1-5894-cbe6b4e94716" JobID: "DOWNLOADER/dispatch-1527025593-efbec847" Namespace: "default">
    2018/05/23 01:31:05.231508 [DEBUG] worker: ack for evaluation f317dec3-c2bf-77b1-5894-cbe6b4e94716
    2018/05/23 01:31:05.231590 [DEBUG] worker: dequeued evaluation b9cd617f-8fea-be11-717d-d3dc3a1f3f05
    2018/05/23 01:31:05.231692 [DEBUG] sched: <Eval "b9cd617f-8fea-be11-717d-d3dc3a1f3f05" JobID: "DOWNLOADER/dispatch-1527025593-7692e6d0" Namespace: "default">: Total changes: (place 1) (destructive 0) (inplace 0) (stop 0)
Desired Changes for "jobs": (place 1) (inplace 0) (destructive 0) (stop 0) (migrate 0) (ignore 0) (canary 0)
    2018/05/23 01:31:05.232335 [DEBUG] worker: reblocked evaluation <Eval "b9cd617f-8fea-be11-717d-d3dc3a1f3f05" JobID: "DOWNLOADER/dispatch-1527025593-7692e6d0" Namespace: "default">
    2018/05/23 01:31:05.232398 [DEBUG] worker: ack for evaluation b9cd617f-8fea-be11-717d-d3dc3a1f3f05
    2018/05/23 01:31:05.232482 [DEBUG] worker: dequeued evaluation 0bd5e956-1b7f-e44f-0e2b-da791c4b5e58
    2018/05/23 01:31:05.232585 [DEBUG] sched: <Eval "0bd5e956-1b7f-e44f-0e2b-da791c4b5e58" JobID: "DOWNLOADER/dispatch-1527025593-33b8b90e" Namespace: "default">: Total changes: (place 1) (destructive 0) (inplace 0) (stop 0)
Desired Changes for "jobs": (place 1) (inplace 0) (destructive 0) (stop 0) (migrate 0) (ignore 0) (canary 0)

And there are 1.7 million lines similar lines leading up to this.

The CPU usage of the lead server over time looks like this:
screen shot 2018-05-23 at 9 40 26 am

And the combined length of the enqueued tasks looks like this:

screen shot 2018-05-23 at 9 25 08 am

At the application layer, there is obviously lots of:

data-refinery-log-group-rich-zebra5-dev-prod log-stream-downloader-docker-rich-zebra5-dev-prod nomad.api.exceptions.BaseNomadException: None
data-refinery-log-group-rich-zebra5-dev-prod log-stream-downloader-docker-rich-zebra5-dev-prod     raise nomad.api.exceptions.BaseNomadException(response)
data-refinery-log-group-rich-zebra5-dev-prod log-stream-downloader-docker-rich-zebra5-dev-prod nomad.api.exceptions.BaseNomadException: None
data-refinery-log-group-rich-zebra5-dev-prod log-stream-downloader-docker-rich-zebra5-dev-prod     raise nomad.api.exceptions.BaseNomadException(response)
data-refinery-log-group-rich-zebra5-dev-prod log-stream-downloader-docker-rich-zebra5-dev-prod nomad.api.exceptions.BaseNomadException: None
data-refinery-log-group-rich-zebra5-dev-prod log-stream-downloader-docker-rich-zebra5-dev-prod     raise nomad.api.exceptions.BaseNomadException(response)
data-refinery-log-group-rich-zebra5-dev-prod log-stream-downloader-docker-rich-zebra5-dev-prod nomad.api.exceptions.BaseNomadException: None
data-refinery-log-group-rich-zebra5-dev-prod log-stream-downloader-docker-rich-zebra5-dev-prod     raise nomad.api.exceptions.BaseNomadException(response)
data-refinery-log-group-rich-zebra5-dev-prod log-stream-downloader-docker-rich-zebra5-dev-prod nomad.api.exceptions.BaseNomadException: None

In the network, there is one lead server, two secondary servers, and 20 clients - all of which are running on m4.large type instances.

On the lead server, calling nomad status:

$ nomad status
Error querying jobs: Get http://127.0.0.1:4646/v1/jobs: dial tcp 127.0.0.1:4646: getsockopt: connection refused

On a client:

$ nomad status
Error querying jobs: Unexpected response code: 500 (2 error(s) occurred:

* RPC failed to server 10.0.145.110:4647: rpc error: No cluster leader
* RPC failed to server 10.0.71.66:4647: rpc error: No cluster leader)

And on a secondary server:

$ nomad status


(it just hangs completely.)

Reproduction steps

Unsure. Actually use Nomad for a resource intensive job at scale with the resources provided.

This is a really major issue for us. A bug in the lead server left the Nomad system completely unable to self-recover. Even secondary server instances are left in an unresponsive, unusable state.

@Miserlou
Copy link
Author

The unresponsive secondary server also has a 29 Gigabyte log file which consists entirely of:

    2018/05/23 14:29:03 [ERR] memberlist: Error accepting TCP connection: accept tcp [::]:4648: accept4: too many open files
    2018/05/23 14:29:03 [ERR] memberlist: Error accepting TCP connection: accept tcp [::]:4648: accept4: too many open files
    2018/05/23 14:29:03 [ERR] memberlist: Error accepting TCP connection: accept tcp [::]:4648: accept4: too many open files
    2018/05/23 14:29:03 [ERR] memberlist: Error accepting TCP connection: accept tcp [::]:4648: accept4: too many open files
    2018/05/23 14:29:03 [ERR] memberlist: Error accepting TCP connection: accept tcp [::]:4648: accept4: too many open files
    2018/05/23 14:29:03 [ERR] memberlist: Error accepting TCP connection: accept tcp [::]:4648: accept4: too many open files
    2018/05/23 14:29:03 [ERR] memberlist: Error accepting TCP connection: accept tcp [::]:4648: accept4: too many open files
    2018/05/23 14:29:03 [ERR] memberlist: Error accepting TCP connection: accept tcp [::]:4648: accept4: too many open files
    2018/05/23 14:29:03 [ERR] memberlist: Error accepting TCP connection: accept tcp [::]:4648: accept4: too many open files
    2018/05/23 14:29:03 [ERR] memberlist: Error accepting TCP connection: accept tcp [::]:4648: accept4: too many open files
    2018/05/23 14:29:03 [ERR] memberlist: Error accepting TCP connection: accept tcp [::]:4648: accept4: too many open files
    2018/05/23 14:29:03 [ERR] memberlist: Error accepting TCP connection: accept tcp [::]:4648: accept4: too many open files
    2018/05/23 14:29:03 [ERR] memberlist: Error accepting TCP connection: accept tcp [::]:4648: accept4: too many open files

So, it looked like this is actually two bugs which have combined to cause total system death.

@dadgar
Copy link
Contributor

dadgar commented May 23, 2018

@Miserlou How many times did you dispatch jobs? You say there are 1.7M lines before but each looks to be scheduling a new dispatch job: DOWNLOADER/dispatch-1527025593-82387453, DOWNLOADER/dispatch-1527025593-efbec847, ...

Are your clients DDoSing Nomad by asking it to scheduling millions of jobs?

@Miserlou
Copy link
Author

You can see the number of jobs dispatched in the second graph, there are only 13400, which is what we want. The desired changes spam looks like it's happening every .001 seconds, which the application is certainly not doing. I think it all occurred during that CPU spike window.

@dadgar
Copy link
Contributor

dadgar commented May 24, 2018

@Miserlou Can you share the full logs. I would like to see the behavior before the CPU spike, during and after.

What is your file descriptor limit for these process? Can you list the file descriptors that Nomad has open?

@dadgar
Copy link
Contributor

dadgar commented Aug 28, 2018

I am going to close this issue as we never received reproduction steps and nothing of the sort has been reported since.

If this does happen again the following details would be useful:

  • A list of the number of nodes
  • The job being dispatched
  • A dump of the job's evaluations
  • The output of the /v1/metrics endpoint on the servers.

Thanks!

@dadgar dadgar closed this as completed Aug 28, 2018
@Miserlou
Copy link
Author

Miserlou commented Oct 1, 2018

This ticket should be re-opened as this problem is not fixed. We have also had this issue occur since, and people should be aware that Nomad has this fundamental flaw.

image 1

@Miserlou
Copy link
Author

Miserlou commented Oct 1, 2018

And here's a graph of the queue length for that second total system crash, peaking at 180,000 jobs:

image 2

@Miserlou
Copy link
Author

Miserlou commented Oct 1, 2018

The second crash occurred in 0.8.3, the first in 0.7.0.

@nickethier
Copy link
Member

Hey @Miserlou

Thanks for the information. We still would like the information Alex requested to help root cause this:

  • A list of the number of nodes
  • The job being dispatched
  • A dump of the job's evaluations
  • The output of the /v1/metrics endpoint on the servers.

@Miserlou
Copy link
Author

Miserlou commented Oct 1, 2018

3 nodes.
There are 67 different job specifications and 180,000 jobs dispatches.
The servers are dead so there are no /v1/metrics and nomad status is unresponsive. This was reported before. Logging information was previously provided.

@Miserlou
Copy link
Author

Miserlou commented Oct 1, 2018

You can even think of this issue as "die gracefully" if it helps, but one-too-many-jobs-and-the-whole-system-dies obviously isn't an acceptable behavior for a distributed job scheduler.

@Miserlou
Copy link
Author

Miserlou commented Oct 2, 2018

This ticket shouldn't be closed.

@dadgar dadgar reopened this Oct 2, 2018
@dadgar
Copy link
Contributor

dadgar commented Oct 2, 2018

@Miserlou Can you provide the crash output? How many jobs can the cluster run at the same time?

I believe the larger problem is you are using dispatch as a high volume queueing system which it is not designed for. I would suggest you use a work queue at the volume you are dispatching.

@Miserlou
Copy link
Author

Miserlou commented Oct 2, 2018

It's literally dozens of gigabytes of log output of the kind I already linked with the lines every thousandth of a millisecond. I described the number of jobs above. This was with 3xm4xlarge

@dadgar
Copy link
Contributor

dadgar commented Oct 2, 2018

@Miserlou I am asking for the output of the agent crashing. Was it a panic, an out of memory, etc.

@Miserlou
Copy link
Author

Miserlou commented Oct 2, 2018

Those servers were terminated spot instances. I'm guessing it was an OOM since we solved the problem last time by throwing more RAM at it, it should be easy to reproduce yourself by starting a cluster and throwing jobs at it until it falls over like this.

@dadgar
Copy link
Contributor

dadgar commented Oct 2, 2018

@Miserlou There really is not a lot to go off with this issue. You are reporting a server that is unresponsive and or crashing. There is no log of it crashing to know the root cause of that. The log lines you are showing is that Nomad has hit its file descriptors limit but the list of open files is not given or the limit set for Nomad.

  1. Do you have a graph of the memory usage of Nomad over this period
  2. What is the file descriptor limit
  3. Are you running the agents in both client and server mode? (1 machine running both the Nomad server and Client)?

You are queuing a 180,000 jobs to 3 instances. High CPU not that surprising. Again you should likely be using a work queue.

@Miserlou
Copy link
Author

Miserlou commented Oct 2, 2018

The fd limit was raised to 65536. I don't have the memory graphs since those servers were rotated. The server nodes are dedicated and there are no other services on them. There are 10 clients connected.

This isn't an issue about what I should do, this is an issue about Nomad shouldn't fail so catastrophically. It's not one server that this happens to, when it happens, it happens to every single dedicated server in the system.

The reason we migrated from a work queue to using Nomad here is because we believed your own - frankly deceptive - marketing materials. However the lack of forward capacity planning, dispatch job constraints, and basic stability issues are crushing us right now. I don't know what else to tell you.

Can you tell me a bit about how you're testing/using Nomad at scale internally? I assumed that when your docs said "Enterprise" they meant big, but I don't know what kind of sizes you're actually talking about.

@dadgar
Copy link
Contributor

dadgar commented Oct 2, 2018

@Miserlou Sorry you feel like it wasn't clear. We never state that dispatch is meant for high volume: https://www.hashicorp.com/blog/replacing-queues-with-nomad-dispatch#caveats.

The reason all servers fail is that Nomad uses Raft as a consensus protocol. Thus all data is replicated, and if you have more working set data than can fit in memory, Nomad will crash with out of memory.

Unfortunately I am going to close this issue again. If this happens to you again, please collect the various bits of information asked for in this issue and I highly suggest you rethink the usage of Nomad as a high volume work queue as it is not what it is designed for.

@dadgar dadgar closed this as completed Oct 2, 2018
@Miserlou
Copy link
Author

Miserlou commented Oct 2, 2018

That isn't what the blog you linked says at all. It says:

Nomad Dispatch is not as suitable for high volumes of requests which take very short time to process. Nomad can schedule thousands of tasks per second, but the overhead of scheduling is very high. For most long-running processes, this time is amortized over the life of the job, but for processes that complete very quickly, the scheduling component may increase total processing time, especially in large quantities.

These aren't jobs which take a short time to process, these are jobs which take hours to run. There's just a lot of them. They aren't tiny jobs either, they have reasonable resources allocated to them. You still haven't said anything about the scale which you test at, but you use the words "enterprise", "production ready" and "scalable" all over your marketing materials without ever defining what you actually mean by that. Are those just.. mistruths? For heaven's sake, the title of the blog post is "Replacing Queues with Nomad Dispatch"!

You should leave this ticket open so somebody else who stumbles onto this limitation can add their - it is a serious problem that needs-info, but I don't know why you'd close a ticket for such a huge unaddressed flaw in your software.

@nvx
Copy link
Contributor

nvx commented Oct 23, 2018

Is there a reason this is closed? At the very least there should be some published guidance around the scaling limitations of this (eg, memory required per x number of concurrent jobs scheduled or hard limits that should be observed/different configurations that avoid the issue like using an external queue in front of nomad/etc).

@amfern
Copy link

amfern commented Apr 19, 2019

A job with 3800 tasks caused nomad OOM halfway through, the raft.db reached ~10gb and never goes down, even after purging the job and forcing gc. It is reproducible every time. that's a cluster of 3x64gb ram servers and 10 agents, TBH I can't see how nomad can handle the claimed 1M jobs.

@schmichael
Copy link
Member

@amfern I'm terribly sorry you've hit such a catastrophic failure! I know our resource utilization guidelines are sorely lacking. It's something we intend to improve but varies widely depending on use patterns.

raft.db is a BoltDB file and as such will never shrink in size on disk. The OS should properly page sections of the file in and out on use, so Nomad should not need to have the entire file resident in memory. Nomad does keep its own representation of data resident in memory at all times, so that is likely the source of your OOMs. GCs should help the memory usage, so if they're not it could be due to a bug.

Please open a new issue with as many details as possible. In the issue template we mention a [email protected] address you can send unredacted logs to if you don't want to share all of your cluster information publicly.

Please do not interpret this issue being closed as us ignoring all server-related resource utilization issues! This is absolutely not the case, and we absolutely want to ensure server perform reliably and predictably for a wide range of use cases.

@kcwong-verseon
Copy link
Contributor

@dadgar mentioned that every Task instance contains the entire Job spec. So, it is important to keep a job spec small. If your 3800 tasks are structured as 3800 distinct TaskGroup/Task combo in one since job spec, you'll run into memory issues at the servers very quickly. We launch thousands of tasks that are distinct and has this 1 Task/TaskGroup pattern. The only way to keep Nomad servers sane is to break up a large job into smaller ones. So, we have a typical job spec containing only about 50 distinct Task/TaskGroups. That has worked and kept the memory consumption reasonable.

@amfern
Copy link

amfern commented Apr 20, 2019

@schmichael I am sorry I think I wasn't using nomad correctly, I tested with a parametrized job and I could run 100k jobs on single nomad server and 8 agents, the server consumed 9gb ram and raft.db is 217m. So what is the use case for the non-parametrized batch job?

@kcwong-verseon
Copy link
Contributor

@amfern You just need to break up a big job into smaller ones. You can see each Allocation embeds a pointer to Job struct. So, if you have 100 tasks in distinct task-groups from one job, allocated to the same node, they'll all have distinct allocations and the same job spec will be referenced a 100 times. The bigger the job, the larger the Job struct. I have a suspicion they don't all point to the same Job struct, given how the memory footprint grows.

@schmichael
Copy link
Member

@amfern I'm sorry Nomad isn't meeting your expectations! If you're interested in diagnosing the memory usage of your setup please open a new issue and include as many details as possible:

  • Nomad version
  • Job file
  • Memory usage (from ps and/or curl -s http://localhost:4646/v1/metrics | jq '.Gauges[] | select(.Name=="nomad.runtime.alloc_bytes") | .Value')
  • RSS being the measure from ps that matters as Nomad uses a large amount of VSZ that does not represent physical memory in use.
  • Attach the file generated by http://localhost:4646/debug/pprof/heap

While I can't think of any log lines in particular that would be useful for diagnosing memory usage, logs may help us ensure everything is operating as expected. Please feel free to email logs to [email protected] if you don't wish to post them publicly on the issue.

@github-actions
Copy link

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 24, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

7 participants