
High CPU usage #2169

Closed
kak-tus opened this issue Jan 9, 2017 · 44 comments

@kak-tus

kak-tus commented Jan 9, 2017

Nomad version

0.5.2

Operating system and Environment details

Ubuntu 16.04

Issue

High CPU usage with Nomad (before 15:00 in the screenshots) and low usage without Nomad (after 15:00 in the screenshots). Around 15:00 Nomad was killed, but the tasks continued to execute.
May be related to #1995.

LA (screenshot)

CPU metrics (screenshot)

Other server (screenshots)

Reproduction steps

Ran about 20 tasks (more tasks per server means more load).

@dadgar
Contributor

dadgar commented Jan 9, 2017

To be clear, are these the clients?

@kak-tus
Author

kak-tus commented Jan 9, 2017

They run on the same nodes as the server nodes (it's a small 3-node cluster).
I know it is not recommended in the docs, but I only have 3 nodes for the whole cluster.

@dadgar
Contributor

dadgar commented Jan 9, 2017

@kak-tus And when you say you are stopping Nomad, are you stopping both the server and the client? Can you show the CPU per PID?
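(For anyone collecting this, a hedged sketch of one way to show per-process CPU; the nomad process name is the only assumption, and pidstat comes from the sysstat package.)

    # Sample CPU usage of the Nomad agent process(es) every 5 seconds.
    pidstat -u -p "$(pgrep -d, -x nomad)" 5

    # Or take a one-shot snapshot with plain top.
    top -b -n 1 -p "$(pgrep -d, -x nomad)"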

@kak-tus
Author

kak-tus commented Jan 9, 2017

@dadgar Yes, both client and server (they run in the same process).
Unfortunately, I don't have per-process CPU graphs, but see the graphs at the top: there was only one change at 15:00 - Nomad was killed and the managed jobs (Docker containers) continued running.

@jtuthehien

Hi,
I'm seeing a similar problem. The Nomad executor is taking too much CPU.

(two screenshots attached)

@diptanu
Contributor

diptanu commented Jan 13, 2017

@jtuthehien Which driver are you using?

@diptanu diptanu added this to the v0.5.3 milestone Jan 13, 2017
@dadgar
Contributor

dadgar commented Jan 13, 2017

@jtuthehien What version of Nomad are you on? Can you run "go tool pprof http://localhost:4646/debug/pprof/profile"? It should output something like:

Fetching profile from http://localhost:4646/debug/pprof/profile
Please wait... (30s)
Saved profile in /home/vagrant/pprof/pprof.nomad.localhost:4646.samples.cpu.002.pb.gz
Entering interactive mode (type "help" for commands)

Can you attach the profile?
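(For anyone collecting these profiles without a local Go toolchain on the node, a hedged sketch; the agent address is an assumption, and the agent has to be running with enable_debug = true for the pprof endpoints to be exposed.)

    # Capture a 30-second CPU profile and a heap snapshot from the agent's HTTP API.
    curl -o nomad-cpu.pb.gz  http://localhost:4646/debug/pprof/profile
    curl -o nomad-heap.pb.gz http://localhost:4646/debug/pprof/heap

    # The saved files can then be inspected on any machine with Go installed:
    go tool pprof nomad-cpu.pb.gz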

@jtuthehien

I'm on 0.4.6, Docker driver.
I'll send the pprof when I get it.

@vietwow

vietwow commented Jan 20, 2017

Hi, I'm on 0.4.1, Docker driver 1.11.2.

This is my pprof debug output (attached): pprof.nomad.10.50.10.20:4646.samples.cpu.001.pb.gz

@rejeep

rejeep commented Jan 22, 2017

I saw similar issues with the exec driver on Nomad 0.4.1. I haven't used Nomad since, but will hopefully spend some more time on it. If I find some useful debug information, I'll get back to you.

@iconara
Contributor

iconara commented Jan 23, 2017

Sorry, @rejeep got the version wrong (my fault for confusing Chef recipes); we were running 0.5.1 when we saw the issues.

@danielbenzvi

Seeing the same issue here with Nomad 0.5.2.

@dadgar
Contributor

dadgar commented Feb 15, 2017

@danielbenzvi Are you seeing the 100% CPU usage as well? What are your nodes running?

@danielbenzvi

@dadgar It was one node out of two in a cluster we are POC'ing - all the tasks were Docker images. The machine is CentOS 7.2 running on AWS (kernel 3.10.0-514.2.2.el7.x86_64).
The Docker version is 1.12.6.

During the time of the issue, the Nomad client was taking 600% CPU, attempting to start and stop Docker containers all the time, and was very slow to respond. Also, over 249 tasks accumulated as "lost" and the node health was flapping (we normally have 34 tasks running in the cluster).

Nothing in the logs suggested the cause of the issue, and the Docker daemon responded fine.

Here are some graphs from this time:

Load averages (screenshot)

CPU usage (screenshot)

@dadgar
Contributor

dadgar commented Feb 15, 2017

@danielbenzvi Hmm, did this just happen randomly or could you get it in this state reproducibly?

@danielbenzvi

@dadgar Randomly. We've been playing with Nomad for the past two weeks, so I'm guessing this will happen again if we choose to take Nomad further.

@kak-tus
Author

kak-tus commented Feb 15, 2017

I also have an issue with frequent container restarts, but it is not the same cause as the one in this topic.
In this topic I have a high load average, but it is stable over time.

The frequent container restarts happen when the Nomad servers temporarily lose their connection to each other. They begin to restart tasks, and then the connection is restored and all tasks end up started as normal.
In Nomad 0.5-0.5.2 the restart sometimes did not succeed, and I used this workaround: a script that periodically does "nomad run" for all jobs (sketched below).
In Nomad 0.5.4 the restart situation is better.
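A minimal sketch of that workaround, assuming the job files live in one directory (the path and interval are hypothetical):

    #!/bin/sh
    # Periodically re-submit every job so allocations lost during a server
    # partition get rescheduled. A stop-gap only, not a fix.
    while true; do
      for job in /etc/nomad-jobs/*.nomad; do
        nomad run "$job"
      done
      sleep 300
    done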

@jippi
Contributor

jippi commented Feb 16, 2017

@danielbenzvi what is your file descriptor limit for the Nomad client? For me, too low an FD limit created a great deal of issues over the last year or so; bumping it to 65536 made all of them go away :)
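For reference, a hedged sketch of one way to raise that limit, assuming the Nomad agent runs under systemd (the unit name and value are assumptions):

    # Raise the file descriptor limit via a systemd drop-in, then restart the agent.
    sudo mkdir -p /etc/systemd/system/nomad.service.d
    printf '[Service]\nLimitNOFILE=65536\n' | sudo tee /etc/systemd/system/nomad.service.d/limits.conf
    sudo systemctl daemon-reload
    sudo systemctl restart nomad

    # Verify the limit of the running process.
    grep 'open files' /proc/"$(pgrep -o -x nomad)"/limits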

@danielbenzvi

@jippi our limit is 131072 open files and we're far below it...

@danielbenzvi

Seeing this again... this is crazy. Nomad is starting and stopping containers all the time with no clear explanation in the logs.

@dadgar
Contributor

dadgar commented Mar 13, 2017

@danielbenzvi A few questions:

  1. Are you using service checks? If so, what type are they?
  2. What behavior are you seeing at the job level? Are the allocations dying and then being replaced by the scheduler, are they just restarting locally, etc.? Maybe show nomad status <job> and nomad alloc-status <alloc> for some of the misbehaving allocs (commands sketched below).
  3. Could you share logs and the time period in which this happened?

I have not been able to replicate this.
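(For reference, a hedged sketch of the commands being requested; the job name, alloc ID, and timestamps below are placeholders, and the journalctl line assumes the agent runs under systemd.)

    # Job-level view: allocation summary and any placement failures.
    nomad status example-job

    # Detailed view of one misbehaving allocation.
    nomad alloc-status -verbose 1b2c3d4e

    # Agent logs around the incident window.
    journalctl -u nomad --since "2017-03-13 14:00" --until "2017-03-13 16:00"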

@OferE

OferE commented Apr 27, 2017

Look at the strace of the threads that are causing Nomad to sit at 100% CPU in issue #2590, in @dovka's comment. Thanks.

@OferE

OferE commented Apr 27, 2017

This issue can be easily reproduced by creating a batch job that requires more resources than the cluster can handle at first. Once some jobs are queued, all the CPU on the workers is occupied by Nomad. In version 0.5.3 the CPU remains occupied even if the batch job is stopped. However, in version 0.5.6 the CPU usage goes to 0 after stopping the job (I am guessing this has something to do with garbage collection, but I cannot be sure).

This is described in #2590.

Edit - we limited the Nomad agent and its many child processes to one CPU using taskset (sketched below). This made Nomad consume only one CPU. It didn't change the scheduling, which is still poor. After a while, at peak, Nomad utilizes just a third of the cluster in a good scenario, and most of the time even less. This means that Nomad scheduling for batch jobs is really broken.
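A hedged sketch of the taskset pinning mentioned in the edit above (the CPU number is illustrative; this only caps Nomad's own CPU use and does not change the underlying behaviour):

    # Pin a running Nomad agent, and all of its threads, to CPU 0.
    # Child processes started after this point inherit the affinity.
    taskset -a -c -p 0 "$(pgrep -o -x nomad)"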

@burdandrei
Contributor

(screenshot attached)
We saw high CPU and memory usage from the Nomad client 0.5.6 when running ~90 service groups on the machines.
All tasks are Docker, running the same image.

@burdandrei
Contributor

@OferE - #2771 should help you run Docker with the Docker driver instead of having to work around it with exec.

@OferE

OferE commented Jul 4, 2017

@burdandrei - thanks for sharing.

I started using raw_exec for this purpose and I am so happy that I chose it.
You can make many adjustments to your infrastructure once you control the Docker launch yourself.
I strongly recommend staying with raw_exec.

My exec script now handles timeouts for batch jobs, pulling containers from ECR, startup dependencies, logging to S3, clean destruction of containers, and reporting of Nomad anomalies.

This is invaluable.
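To make the raw_exec approach concrete, a heavily simplified, hypothetical sketch of such a wrapper (the image, region, and timeout are placeholders; the real script described above does much more):

    #!/bin/bash
    set -euo pipefail

    IMAGE="123456789012.dkr.ecr.us-east-1.amazonaws.com/myapp:latest"   # placeholder
    NAME="myapp-${NOMAD_ALLOC_ID:-local}"

    # Log in to ECR and pull the image ourselves instead of relying on the driver.
    eval "$(aws ecr get-login --region us-east-1)"
    docker pull "$IMAGE"

    # Stop the container cleanly when Nomad sends the wrapper SIGTERM/SIGINT.
    cleanup() { docker stop -t 30 "$NAME" >/dev/null 2>&1 || true; }
    trap cleanup TERM INT

    # Run in the foreground with a hard timeout so stuck batch jobs get reaped.
    timeout 3600 docker run --rm --name "$NAME" "$IMAGE" &
    wait $!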

@burdandrei
Contributor

Strange - I like the Docker driver because I can stream logs via syslog to ELK; the only thing I needed was soft rather than hard memory limits.

@OferE

OferE commented Jul 4, 2017

You'll need more once things get complex. Also, soft memory limits are not enough; there is also IO.
I'll give you an example of something else: you spoke about Elasticsearch - how are you stopping it without losing data? Graceful shutdown of services is a must...

@jzvelc

jzvelc commented Jul 5, 2017

@OferE I use dumb-init for that, which allows me to rewrite signals (e.g. SIGINT -> SIGQUIT).
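For context, dumb-init's signal rewriting looks roughly like this (the image and command are placeholders; 2 is SIGINT and 3 is SIGQUIT):

    # Run the workload under dumb-init and translate incoming SIGINT into SIGQUIT.
    docker run --rm my-image dumb-init --rewrite 2:3 my-app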

@burdandrei
Contributor

@jzvelc you can add STOPSIGNAL to the Dockerfile, and Docker will honor it. It works like a charm together with kill_timeout.
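A minimal sketch of that combination (the signal choice is illustrative):

    # In the image: have "docker stop" send SIGQUIT instead of the default SIGTERM.
    printf 'STOPSIGNAL SIGQUIT\n' >> Dockerfile

On the Nomad side, kill_timeout is set per task in the job file (e.g. kill_timeout = "30s") so the process gets that long to shut down before it is force-killed.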

@dadgar
Contributor

dadgar commented Jul 28, 2017

@kak-tus Wanted to bump this issue as 0.6 is now out! Let me know if this has been resolved.

@burdandrei
Contributor

I'll deploy 0.6 to our staging on Sunday and will let you know how it runs. As described in #2771, we have a machine with ~80 tasks running, and I can see both high CPU and memory usage there while running 0.5.6.

@dadgar
Contributor

dadgar commented Jul 28, 2017

@burdandrei Thank you! Would you mind setting enable_debug = true on the client so that we can do some perf inspection in case we want to investigate CPU/memory usage?

https://www.nomadproject.io/docs/agent/configuration/index.html#enable_debug
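A minimal sketch of that setting, assuming a standard HCL config directory (the path is an assumption):

    # Drop the flag into the agent's config directory and restart; this exposes
    # the /debug/pprof endpoints used for CPU and memory profiling.
    printf 'enable_debug = true\n' | sudo tee /etc/nomad.d/debug.hcl
    sudo systemctl restart nomad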

@burdandrei
Contributor

(screenshot attached)
This is a 0.6 client running 117 jobs.
The virtual memory footprint is half of what we had in 0.5.6, and RSS is one fifth - less than 1 GB.

@kak-tus
Author

kak-tus commented Jul 30, 2017

@dadgar Thank you. In 0.6 the load average is better than in 0.5.2/0.5.4, but still not as low as with the Nomad server/client killed and only the jobs running.
But maybe that CPU usage is normal for the Nomad process.

@dadgar
Contributor

dadgar commented Aug 1, 2017

@burdandrei Did CPU usage come down as well?

@kak-tus I apologize but I am having a hard time following your comment. Did CPU utilization go down?

@burdandrei
Contributor

It looks like yes.

@dadgar
Contributor

dadgar commented Aug 1, 2017

Nice! I am going to close this issue now! Thank you so much for testing!

@dadgar dadgar closed this as completed Aug 1, 2017
@kak-tus
Author

kak-tus commented Aug 1, 2017

@dadgar Same for me - yes.

@burdandrei
Contributor

@dadgar I have something to report:
It looks like the high memory and CPU usage in my case was generated by a job that had 114 groups.
I had the Nomad client v0.6.3 consuming 8-12 GB RSS.
In the last few days I've been working on adding Replicator to our stack, and it had problems with this particular job.
I split the job into 114 jobs, each with 1 group (see the sketch below), and magically the Nomad agent is now at ~8 GB virtual but ~150 MB RSS.

The docs should state that a huge job definition with lots of groups is not desirable for the cluster. Just as consul-template throws a warning when it is watching more than 100 keys, maybe it's worth warning at 10 or more groups.
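A hypothetical sketch of that kind of split: instead of one job with N groups, generate one single-group job per service from a template and submit each (the names, paths, and resources are all illustrative):

    #!/bin/sh
    # Render a minimal one-group job per service and submit it.
    for svc in service-01 service-02 service-03; do   # ...and so on
      cat > "/tmp/${svc}.nomad" <<EOF
    job "${svc}" {
      datacenters = ["dc1"]
      group "${svc}" {
        task "${svc}" {
          driver = "docker"
          config {
            image = "my-registry/common-image:latest"
          }
          resources {
            cpu    = 100
            memory = 128
          }
        }
      }
    }
    EOF
      nomad run "/tmp/${svc}.nomad"
    done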

@jippi
Contributor

jippi commented Oct 8, 2017

@burdandrei did it also fix your hashi-ui memory issues?

@burdandrei
Contributor

@jippi, yes. When I open the cluster tab for the region without the 100-group job, memory usage jumps from 100 to 300. When I switch to the region with the 100-group job, it jumps to 2 GB in 20 seconds.

@jippi
Contributor

jippi commented Oct 8, 2017

Must be due to the Nomad structs or something in the SDK/server since 3rd parties like hashi-ui and replicator also see the same behaviour :)

@tgross tgross removed this from the near-term milestone Jan 9, 2020
@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 15, 2022