
High CPU usage #2169

Closed
kak-tus opened this issue Jan 9, 2017 · 44 comments

@kak-tus

kak-tus commented Jan 9, 2017

Nomad version

0.5.2

Operating system and Environment details

Ubuntu 16.04

Issue

High CPU usage with Nomad (before 15:00 in the screenshots) and low usage without Nomad (after 15:00 in the screenshots). Around 15:00 Nomad was killed, but the tasks continued to execute.
May be related to #1995.

LA (screenshot)

CPU metrics (screenshot)

Other server (screenshots)

Reproduction steps

Ran about 20 tasks (more tasks per server means more load).

@dadgar
Contributor

dadgar commented Jan 9, 2017

To be clear, are these the clients?

@kak-tus
Author

kak-tus commented Jan 9, 2017

They run on the same nodes as the server nodes (it's a small 3-node cluster).
I know it is not recommended in the docs, but I only have 3 nodes for the whole cluster.

@dadgar
Contributor

dadgar commented Jan 9, 2017

@kak-tus And when you say you are stopping Nomad, are you stopping both the server and the client? Can you show the CPU per PID?
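(For anyone collecting this, a hedged sketch of one way to show per-process CPU; the nomad process name is the only assumption, and pidstat comes from the sysstat package.)

    # Sample CPU usage of the Nomad agent process(es) every 5 seconds.
    pidstat -u -p "$(pgrep -d, -x nomad)" 5

    # Or take a one-shot snapshot with plain top.
    top -b -n 1 -p "$(pgrep -d, -x nomad)"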

@kak-tus
Author

kak-tus commented Jan 9, 2017

@dadgar Yes, both client and server (they run in the same process).
Unfortunately, I don't have per-process CPU graphs, but see the graphs at the top: there was only one change at 15:00 - Nomad was killed and the managed jobs (Docker containers) continued running.

@jtuthehien

Hi,
I'm seeing a similar problem. The Nomad executor is taking too much CPU.

(two screenshots attached)

@diptanu
Contributor

diptanu commented Jan 13, 2017

@jtuthehien Which driver are you using?

@diptanu diptanu added this to the v0.5.3 milestone Jan 13, 2017
@dadgar
Contributor

dadgar commented Jan 13, 2017

@jtuthehien What version of Nomad are you on? Can you run "go tool pprof http://localhost:4646/debug/pprof/profile"? It should output something like:

Fetching profile from http://localhost:4646/debug/pprof/profile
Please wait... (30s)
Saved profile in /home/vagrant/pprof/pprof.nomad.localhost:4646.samples.cpu.002.pb.gz
Entering interactive mode (type "help" for commands)

Can you attach the profile?
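(For anyone collecting these profiles without a local Go toolchain on the node, a hedged sketch; the agent address is an assumption, and the agent has to be running with enable_debug = true for the pprof endpoints to be exposed.)

    # Capture a 30-second CPU profile and a heap snapshot from the agent's HTTP API.
    curl -o nomad-cpu.pb.gz  http://localhost:4646/debug/pprof/profile
    curl -o nomad-heap.pb.gz http://localhost:4646/debug/pprof/heap

    # The saved files can then be inspected on any machine with Go installed:
    go tool pprof nomad-cpu.pb.gz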

@jtuthehien

I'm on 0.4.6, Docker driver.
I'll send the pprof when I get it.

@vietwow

vietwow commented Jan 20, 2017

Hi, I'm on 0.4.1, Docker driver 1.11.2.

This is my pprof debug output (attached): pprof.nomad.10.50.10.20:4646.samples.cpu.001.pb.gz

@rejeep

rejeep commented Jan 22, 2017

I saw similar issues with the exec driver on Nomad 0.4.1. I haven't used Nomad since, but will hopefully spend some more time on it. If I find some useful debug information, I'll get back to you.

@iconara
Contributor

iconara commented Jan 23, 2017

Sorry, @rejeep got the version wrong (my fault for confusing Chef recipes); we were running 0.5.1 when we saw the issues.

@danielbenzvi

Seeing the same issue here with Nomad 0.5.2.

@dadgar
Contributor

dadgar commented Feb 15, 2017

@danielbenzvi Are you seeing the 100% CPU usage as well? What are your nodes running?

@danielbenzvi

@dadgar It was one node out of two in a cluster we are POC'ing - all the tasks were Docker images. The machine is CentOS 7.2 running on AWS (kernel 3.10.0-514.2.2.el7.x86_64).
The Docker version is 1.12.6.

During the time of the issue, the Nomad client was taking 600% CPU, attempting to start and stop Docker containers all the time, and was very slow to respond. Also, over 249 tasks accumulated as "lost" and the node health was flapping (we normally have 34 tasks running in the cluster).

Nothing in the logs suggested the cause of the issue, and the Docker daemon responded fine.

Here are some graphs from this time:

Load averages (screenshot)

CPU usage (screenshot)

@dadgar
Contributor

dadgar commented Feb 15, 2017

@danielbenzvi Hmm, did this just happen randomly or could you get it in this state reproducibly?

@danielbenzvi

@dadgar Randomly. We've been playing with Nomad for the past two weeks, so I'm guessing this will happen again if we choose to take Nomad further.

@kak-tus
Author

kak-tus commented Feb 15, 2017

I also have an issue with frequent container restarts, but it is not the same cause as the one in this topic.
In this topic I have a high load average, but it is stable over time.

The frequent container restarts happen when the Nomad servers temporarily lose their connection to each other. They begin to restart tasks, and then the connection is restored and all tasks end up started as normal.
In Nomad 0.5-0.5.2 the restart sometimes did not succeed, and I used this workaround: a script that periodically does "nomad run" for all jobs (sketched below).
In Nomad 0.5.4 the restart situation is better.
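A minimal sketch of that workaround, assuming the job files live in one directory (the path and interval are hypothetical):

    #!/bin/sh
    # Periodically re-submit every job so allocations lost during a server
    # partition get rescheduled. A stop-gap only, not a fix.
    while true; do
      for job in /etc/nomad-jobs/*.nomad; do
        nomad run "$job"
      done
      sleep 300
    done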

@jippi
Contributor

jippi commented Feb 16, 2017

@danielbenzvi what is your file descriptor limit for the Nomad client? For me, too low an FD limit created a great deal of issues over the last year or so; bumping it to 65536 made all of them go away :)
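For reference, a hedged sketch of one way to raise that limit, assuming the Nomad agent runs under systemd (the unit name and value are assumptions):

    # Raise the file descriptor limit via a systemd drop-in, then restart the agent.
    sudo mkdir -p /etc/systemd/system/nomad.service.d
    printf '[Service]\nLimitNOFILE=65536\n' | sudo tee /etc/systemd/system/nomad.service.d/limits.conf
    sudo systemctl daemon-reload
    sudo systemctl restart nomad

    # Verify the limit of the running process.
    grep 'open files' /proc/"$(pgrep -o -x nomad)"/limits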

@danielbenzvi

@jippi our limit is 131072 open files and we're far below it...

@danielbenzvi

Seeing this again... this is crazy. Nomad is starting and stopping containers all the time with no clear explanation in the logs.

@dadgar
Contributor

dadgar commented Mar 13, 2017

@danielbenzvi A few questions:

  1. Are you using service checks? If so, what type are they?
  2. What behavior are you seeing at the job level? Are the allocations dying and then being replaced by the scheduler, are they just restarting locally, etc.? Maybe show nomad status <job> and nomad alloc-status <alloc> for some of the misbehaving allocs (commands sketched below).
  3. Could you share logs and the time period in which this happened?

I have not been able to replicate this.
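(For reference, a hedged sketch of the commands being requested; the job name, alloc ID, and timestamps below are placeholders, and the journalctl line assumes the agent runs under systemd.)

    # Job-level view: allocation summary and any placement failures.
    nomad status example-job

    # Detailed view of one misbehaving allocation.
    nomad alloc-status -verbose 1b2c3d4e

    # Agent logs around the incident window.
    journalctl -u nomad --since "2017-03-13 14:00" --until "2017-03-13 16:00"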

@OferE

OferE commented Apr 27, 2017

Look at the strace of the threads that are causing Nomad to sit at 100% CPU in issue #2590, in @dovka's comment. Thanks.

@OferE

OferE commented Apr 27, 2017

This issue can be easily reproduced by creating a batch job that requires more resources than the cluster can handle at first. Once some jobs are queued, all the CPU on the workers is occupied by Nomad. In version 0.5.3 the CPU remains occupied even if the batch job is stopped. However, in version 0.5.6 the CPU usage goes to 0 after stopping the job (I am guessing this has something to do with garbage collection, but I cannot be sure).

This is described in #2590.

Edit - we limited the Nomad agent and its many child processes to one CPU using taskset (sketched below). This made Nomad consume only one CPU. It didn't change the scheduling, which is still poor. After a while, at peak, Nomad utilizes just a third of the cluster in a good scenario, and most of the time even less. This means that Nomad scheduling for batch jobs is really broken.
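A hedged sketch of the taskset pinning mentioned in the edit above (the CPU number is illustrative; this only caps Nomad's own CPU use and does not change the underlying behaviour):

    # Pin a running Nomad agent, and all of its threads, to CPU 0.
    # Child processes started after this point inherit the affinity.
    taskset -a -c -p 0 "$(pgrep -o -x nomad)"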

@burdandrei
Contributor

(screenshot attached)
We saw high CPU and memory usage from the Nomad client 0.5.6 when running ~90 service groups on the machines.
All tasks are Docker, running the same image.

@burdandrei
Contributor

@OferE - #2771 should help you run Docker with the Docker driver instead of having to work around it with exec.

@OferE

OferE commented Jul 4, 2017

@burdandrei - thanks for sharing.

I started using raw_exec for this purpose and I am so happy that I chose it.
You can make many adjustments to your infrastructure once you control the Docker launch yourself.
I strongly recommend staying with raw_exec.

My exec script now handles timeouts for batch jobs, pulling containers from ECR, startup dependencies, logging to S3, clean destruction of containers, and reporting of Nomad anomalies.

This is invaluable.
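To make the raw_exec approach concrete, a heavily simplified, hypothetical sketch of such a wrapper (the image, region, and timeout are placeholders; the real script described above does much more):

    #!/bin/bash
    set -euo pipefail

    IMAGE="123456789012.dkr.ecr.us-east-1.amazonaws.com/myapp:latest"   # placeholder
    NAME="myapp-${NOMAD_ALLOC_ID:-local}"

    # Log in to ECR and pull the image ourselves instead of relying on the driver.
    eval "$(aws ecr get-login --region us-east-1)"
    docker pull "$IMAGE"

    # Stop the container cleanly when Nomad sends the wrapper SIGTERM/SIGINT.
    cleanup() { docker stop -t 30 "$NAME" >/dev/null 2>&1 || true; }
    trap cleanup TERM INT

    # Run in the foreground with a hard timeout so stuck batch jobs get reaped.
    timeout 3600 docker run --rm --name "$NAME" "$IMAGE" &
    wait $!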

@burdandrei
Contributor

Strange - I like the Docker driver because I can stream logs via syslog to ELK; the only thing I needed was soft rather than hard memory limits.

@OferE

OferE commented Jul 4, 2017

You'll need more once things get complex. Also, soft memory limits are not enough; there is also IO.
I'll give you an example of something else: you spoke about Elasticsearch - how are you stopping it without losing data? Graceful shutdown of services is a must...

@jzvelc

jzvelc commented Jul 5, 2017

@OferE I use dumb-init for that, which allows me to rewrite signals (e.g. SIGINT -> SIGQUIT).
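For context, dumb-init's signal rewriting looks roughly like this (the image and command are placeholders; 2 is SIGINT and 3 is SIGQUIT):

    # Run the workload under dumb-init and translate incoming SIGINT into SIGQUIT.
    docker run --rm my-image dumb-init --rewrite 2:3 my-app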

@burdandrei
Contributor

@jzvelc you can add STOPSIGNAL to the Dockerfile, and Docker will honor it. It works like a charm together with kill_timeout.
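A minimal sketch of that combination (the signal choice is illustrative):

    # In the image: have "docker stop" send SIGQUIT instead of the default SIGTERM.
    printf 'STOPSIGNAL SIGQUIT\n' >> Dockerfile

On the Nomad side, kill_timeout is set per task in the job file (e.g. kill_timeout = "30s") so the process gets that long to shut down before it is force-killed.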

@dadgar
Contributor

dadgar commented Jul 28, 2017

@kak-tus Wanted to bump this issue as 0.6 is now out! Let me know if this has been resolved.

@burdandrei
Contributor

I'll deploy 0.6 to our staging on Sunday and will let you know how it runs. As described in #2771, we have a machine with ~80 tasks running, and I can see both high CPU and memory usage there while running 0.5.6.

@dadgar
Contributor

dadgar commented Jul 28, 2017

@burdandrei Thank you! Would you mind setting enable_debug = true on the client so that we can do some perf inspection in case we want to investigate CPU/memory usage?

https://www.nomadproject.io/docs/agent/configuration/index.html#enable_debug
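A minimal sketch of that setting, assuming a standard HCL config directory (the path is an assumption):

    # Drop the flag into the agent's config directory and restart; this exposes
    # the /debug/pprof endpoints used for CPU and memory profiling.
    printf 'enable_debug = true\n' | sudo tee /etc/nomad.d/debug.hcl
    sudo systemctl restart nomad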

@burdandrei
Contributor

(screenshot attached)
This is a 0.6 client running 117 jobs.
The virtual memory footprint is half of what we had in 0.5.6, and RSS is one fifth - less than 1 GB.

@kak-tus
Author

kak-tus commented Jul 30, 2017

@dadgar Thank you. In 0.6 the load average is better than in 0.5.2/0.5.4, but still not as low as with the Nomad server/client killed and only the jobs running.
But maybe that CPU usage is normal for the Nomad process.

@dadgar
Contributor

dadgar commented Aug 1, 2017

@burdandrei Did CPU usage come down as well?

@kak-tus I apologize but I am having a hard time following your comment. Did CPU utilization go down?

@burdandrei
Contributor

It looks like yes.

@dadgar
Contributor

dadgar commented Aug 1, 2017

Nice! I am going to close this issue now! Thank you so much for testing!

@dadgar dadgar closed this as completed Aug 1, 2017
@kak-tus
Author

kak-tus commented Aug 1, 2017

@dadgar Same for me - yes.

@burdandrei
Contributor

@dadgar I have something to report:
It looks like the high memory and CPU usage in my case was generated by a job that had 114 groups.
I had the Nomad client v0.6.3 consuming 8-12 GB RSS.
In the last few days I've been working on adding Replicator to our stack, and it had problems with this particular job.
I split the job into 114 jobs, each with 1 group (see the sketch below), and magically the Nomad agent is now at ~8 GB virtual but ~150 MB RSS.

The docs should state that a huge job definition with lots of groups is not desirable for the cluster. Just as consul-template throws a warning when it is watching more than 100 keys, maybe it's worth warning at 10 or more groups.
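A hypothetical sketch of that kind of split: instead of one job with N groups, generate one single-group job per service from a template and submit each (the names, paths, and resources are all illustrative):

    #!/bin/sh
    # Render a minimal one-group job per service and submit it.
    for svc in service-01 service-02 service-03; do   # ...and so on
      cat > "/tmp/${svc}.nomad" <<EOF
    job "${svc}" {
      datacenters = ["dc1"]
      group "${svc}" {
        task "${svc}" {
          driver = "docker"
          config {
            image = "my-registry/common-image:latest"
          }
          resources {
            cpu    = 100
            memory = 128
          }
        }
      }
    }
    EOF
      nomad run "/tmp/${svc}.nomad"
    done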

@jippi
Contributor

jippi commented Oct 8, 2017

@burdandrei did it also fix your hashi-ui memory issues?

@burdandrei
Contributor

@jippi, yes. When I open the cluster tab for the region without the 100-group job, memory usage jumps from 100 to 300. When I switch to the region with the 100-group job, it jumps to 2 GB in 20 seconds.

@jippi
Contributor

jippi commented Oct 8, 2017

Must be due to the Nomad structs or something in the SDK/server since 3rd parties like hashi-ui and replicator also see the same behaviour :)

@tgross tgross removed this from the near-term milestone Jan 9, 2020
@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 15, 2022