High CPU usage #2169
Comments
To be clear, these are the clients? |
On the same nodes as the servers (it's a small 3-node cluster). |
@kak-tus And when you say you are stopping Nomad, are you stopping both server and client? Can you show the CPU per pid? |
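For anyone wanting to answer the "CPU per pid" question on Linux, here is a minimal sketch that reads accumulated CPU time from `/proc/<pid>/stat`. The helper names `parse_cpu_ticks` and `cpu_seconds` are made up for illustration and are not part of Nomad; sampling twice and taking the difference gives usage over an interval.

```python
import os


def parse_cpu_ticks(stat_line: str) -> int:
    """Return utime + stime (in clock ticks) from one /proc/<pid>/stat line."""
    # The command name (field 2) is wrapped in parentheses and may contain
    # spaces, so split on the last ')' before reading the numeric fields.
    rest = stat_line.rsplit(")", 1)[1].split()
    # rest[0] is field 3 (state); utime and stime are fields 14 and 15.
    utime, stime = int(rest[11]), int(rest[12])
    return utime + stime


def cpu_seconds(pid: int) -> float:
    """Total user + system CPU seconds consumed so far by a process (Linux only)."""
    with open(f"/proc/{pid}/stat") as f:
        ticks = parse_cpu_ticks(f.read())
    return ticks / os.sysconf("SC_CLK_TCK")
```

Calling `cpu_seconds(pid)` for the agent and for each executor child separately should show which process is actually burning the CPU.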
@dadgar Yes, both client and server (they are running in the same process). |
@jtuthehien Which driver are you using? |
@jtuthehien What version of Nomad are you on? Can you run:
Can you attach the profile |
I'm on 0.4.6. Docker driver. |
Hi, I'm on 0.4.1, Docker driver 1.11.2. This is my pprof debug [output](url |
I saw similar issues with the exec driver on Nomad 0.4.1. I haven't used Nomad since, but will hopefully spend some more time on it. If I have some useful debug information I'll get back to you. |
Sorry, @rejeep got the version wrong (my fault for confusing Chef recipes), we ran 0.5.1 when we saw the issues. |
seeing the same issue here with nomad 0.5.2 |
@danielbenzvi Are you seeing the 100% CPU usage as well? What are your nodes running? |
@dadgar It was one node out of two in a cluster we are POC'ing. All the tasks were docker images. The machine is CentOS 7.2 running on AWS (kernel 3.10.0-514.2.2.el7.x86_64). During the time of the issue, the nomad client was taking 600% CPU, attempting to start and stop docker images all the time, and was very slow to respond. Also, over 249 tasks accumulated as "lost" and the node health was flapping (we normally have 34 tasks running in the cluster). Nothing in the logs suggested the cause of the issue and the docker daemon responded fine. |
@danielbenzvi Hmm, did this just happen randomly or could you get it in this state reproducibly? |
@dadgar randomly. we've been playing with nomad for the past two weeks so I'm guessing this will happen again if we choose to take nomad further. |
I also have an issue with frequent container restarts, but its cause is not the one in this topic. The frequent restarts happen when the nomad servers temporarily lose connection to each other. They begin to restart tasks, and then the connection is restored and all tasks start as normal. |
@danielbenzvi What is your file descriptor limit for the nomad client? For me, a too-low FD limit has created a great deal of issues over the last year or so; bumping it to 65536 made all of them go away :) |
@jippi our limit is 131072 open files and we're far below it... |
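A rough way to compare a running process's open-file limit with its actual usage on Linux is to read `/proc/<pid>/limits` and count the entries in `/proc/<pid>/fd`. The sketch below assumes a Linux host; the function names are hypothetical and not something Nomad ships.

```python
import os
from typing import Tuple


def parse_max_open_files(limits_text: str) -> Tuple[str, str]:
    """Return the (soft, hard) 'Max open files' values from /proc/<pid>/limits."""
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            # Columns: "Max open files  <soft>  <hard>  files"
            soft, hard = line.split()[3:5]
            return soft, hard  # kept as strings, since a value may be "unlimited"
    raise ValueError("no 'Max open files' line found")


def open_fd_count(pid: int) -> int:
    """Number of file descriptors the process currently has open (Linux only)."""
    return len(os.listdir(f"/proc/{pid}/fd"))


def read_fd_limits(pid: int) -> Tuple[str, str]:
    """Read the open-file limits for a running process (Linux only)."""
    with open(f"/proc/{pid}/limits") as f:
        return parse_max_open_files(f.read())
```

Reading another user's process this way usually needs root or the same UID as that process.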
Seeing this again.. this is crazy. Nomad starting and stopping images all the time with no clear explanation in the logs. |
@danielbenzvi Few questions:
I have not been able to replicate this. |
This issue can be easily reproduced by creating a batch job that requires more resources than the cluster can provide at first. Once some jobs are queued, all the CPU on the workers is consumed by nomad. In version 0.5.3 the CPU remains busy even if the batch job is stopped. In version 0.5.6, however, CPU usage drops to 0 after the job is stopped (I am guessing this has something to do with garbage collection, but I cannot be sure). This is described in #2590. Edit: we limited the nomad agent and its many child processes to one CPU using taskset, which made nomad consume only one CPU. This didn't change the scheduling, which is still poor. After a while, at peak, nomad utilizes just 1/3 of the cluster in a good scenario; most of the time it's even less. This means that nomad scheduling for batch jobs is really broken. |
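The taskset workaround mentioned above can be approximated from Python with `os.sched_setaffinity`, which is Linux only. This is just a sketch of the idea, not how the commenter set it up; `run_pinned` is a hypothetical helper. The affinity mask is inherited by any child processes the command spawns, which is what keeps the whole process tree on one CPU.

```python
import os
import subprocess


def run_pinned(cmd, cpus=(0,), **kwargs):
    """Run cmd with its CPU affinity restricted to `cpus` (Linux only)."""
    def pin():
        # Runs in the child just before exec; pid 0 means "this process".
        os.sched_setaffinity(0, set(cpus))

    return subprocess.run(cmd, preexec_fn=pin, **kwargs)
```

As the comment notes, pinning only caps how much CPU the agent can take; it does not address whatever is causing the load.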
@burdandrei - thanks for sharing. I started using raw-exec for this purpose and I am so happy that I chose it. My exec script now handles timeouts for batch jobs, pulling containers from ECR, startup dependencies, logging into S3, clean destruction of containers, and reporting of nomad anomalies. This is priceless. |
Strange, I like the docker driver because I can stream logs with syslog to ELK; the only thing I needed was soft rather than hard memory limits. |
You'll need more once things get complex. Also, soft memory limits are not enough; there is also IO. |
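The commenter's script is not shown, so the following is only a guess at what the timeout-handling piece of such a raw-exec wrapper could look like. `run_with_timeout` is a hypothetical name; the sketch relies on the fact that `subprocess.run` kills the child when its `timeout` expires.

```python
import subprocess


def run_with_timeout(cmd, timeout_s):
    """Run cmd; return its exit code, or None if it exceeded timeout_s seconds."""
    try:
        return subprocess.run(cmd, timeout=timeout_s).returncode
    except subprocess.TimeoutExpired:
        # subprocess.run has already killed and reaped the child at this point.
        return None
```

A wrapper could map the `None` case to a distinct exit code so that timed-out batch runs are distinguishable from ordinary failures.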
@OferE I use dumb-init for that which allows me to rewrite signals (e.g. SIGINT -> SIGQUIT). |
@jzvelc you can add STOPSIGNAL to Dockerfile, and docker will honor it. Works like a charm together with kill_timeout. |
@kak-tus Wanted to bump this issue as 0.6 is now out! Let me know if this has been resolved. |
I'll deploy 0.6 to our staging on Sunday and will let you know how it's running. As described in #2771, we have a machine with ~80 tasks running, and I can see both high CPU and memory usage there while running 0.5.6. |
@burdandrei Thank you! Would you mind setting https://www.nomadproject.io/docs/agent/configuration/index.html#enable_debug |
@dadgar Thank you. In 0.6 the LA is better than in 0.5.2/0.5.4, but not as low as with the nomad server/client killed and the jobs still running. |
@burdandrei Did CPU usage come down as well? @kak-tus I apologize but I am having a hard time following your comment. Did CPU utilization go down? |
It looks like yes. |
Nice! I am going to close this issue now! Thank you so much for testing! |
@dadgar Same for me - yes. |
@dadgar I have something to report: the docs should state that a huge job definition with lots of groups is not desirable for the cluster. Just as consul-template throws a warning if it's watching more than 100 keys, maybe it's worth warning on 10 or more groups. |
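Until something like that exists in Nomad itself, a check along these lines could run client-side before submitting a job. This is a minimal sketch: it assumes the job is available as a dict in the JSON API shape with a `TaskGroups` list, the threshold of 10 is simply the number suggested above, and `check_group_count` is a hypothetical helper.

```python
import warnings


def check_group_count(job: dict, threshold: int = 10) -> int:
    """Warn if a job has `threshold` or more task groups; return the group count."""
    groups = job.get("TaskGroups") or []
    if len(groups) >= threshold:
        warnings.warn(
            f"job {job.get('ID', '<unknown>')!r} defines {len(groups)} task groups "
            f"(threshold {threshold}); consider splitting it into smaller jobs"
        )
    return len(groups)
```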
@burdandrei did it also fix your hashi-ui memory issues? |
@jippi, yes. When I open the cluster tab for the region without the 100-group job, memory usage jumps from 100 to 300. When I switch to the 100-group job it jumps to 2GB in 20 seconds. |
Must be due to the Nomad structs or something in the SDK/server since 3rd parties like hashi-ui and replicator also see the same behaviour :) |
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues. |
Nomad version
0.5.2
Operating system and Environment details
Ubuntu 16.04
Issue
High CPU usage with nomad (before 15:00 in the screenshots) and low without nomad (after 15:00 in the screenshots). Around 15:00 nomad was killed, but the tasks continued executing.
May be related to #1995.
LA
CPU metrics
Other server
Reproduction steps
Ran about 20 tasks (the more tasks per server, the higher the load).