nomad large batch jobs slowdown on clients #2590
A few discoveries I made: I launched a smaller configuration with 20 m3.medium machines, running a dummy command with sleep 30.
This is a serious issue, and it's a blocker for us. |
Some of the bugs reported above are being caused by #2169. |
I ran strace on one of the CPU-spinning threads and saw some IO waits: |
Another experiment: |
Another update: |
@OferE There is a lot here, but there is one critical issue: your job file size of 85 MB. That just will not work. The job goes through Raft and is part of the payload the schedulers send around, which will cause the eval failures you are seeing. Can you instead have one task group and set the count to 50000? |
First of all, thanks for the reply! I highly appreciate it (as I built a lot of infrastructure around Nomad). I did the experiment with 5000 groups, about 8.5 MB, still with the same results. |
@OferE I was suggesting something like this:
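A minimal sketch of that suggestion, assuming a raw_exec wrapper script like the one described in this thread (the job name, script path, and resource values are placeholders, not the actual job file):

```hcl
job "batch-50k" {
  datacenters = ["dc1"]
  type        = "batch"

  # One group with a count, instead of 50,000 separate groups, keeps the
  # job file small, so the payload that goes through Raft stays manageable.
  group "worker" {
    count = 50000

    task "work" {
      driver = "raw_exec"

      config {
        # Placeholder for the wrapper script that launches the container.
        command = "/usr/local/bin/run-task.sh"
      }

      resources {
        cpu = 18400 # 8 * 2300 MHz, matching the limit described in the issue
      }
    }
  }
}
```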
I am going to try to reproduce the high CPU usage and will report back any findings. |
@OferE Can you set
|
Also, you shouldn't expect a totally flat rate of allocations on a client, especially when the tasks finish at roughly the same time: the client has to mark the tasks as finished, then the scheduler has to detect that and place more work. So it should dip and then go up again, etc. |
Thanks for the advice; I will try your suggestion on Sunday and report the results. A few comments, though:
|
@OferE yeah I have reproduced it as well. I have some fixes and will try to get you a build with them applied on top of 0.5.6! I have some more testing and work to do on it though. Hopefully I can get it to you by EOD Monday. |
Thank you so much for this! Will do. |
I used your latest executable (thanks again). I launched 3 m4.xlarge machines, each with 4 × 2300 MHz and 16 GB RAM. Here is what my task looks like:
CPU usage is still very high on the agent.
I finally got the profiling right (I hope). |
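For reference, a sketch of how such a profile can be collected from a Nomad agent, assuming the built-in debug endpoints are used (the address is a placeholder, and this is not necessarily how the profile above was actually captured):

```hcl
# Agent configuration snippet: enable_debug exposes the Go pprof endpoints
# on the agent's HTTP API, e.g. http://<client-ip>:4646/debug/pprof/profile
# for a CPU profile and /debug/pprof/heap for a heap profile.
enable_debug = true
```

With that set, `go tool pprof http://<client-ip>:4646/debug/pprof/profile` pulls a CPU profile that can be shared on an issue like this one.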
@OferE So here is the first PR that addresses some of the performance issues. It will be a little harder to rebase onto 0.5.6, but if you want I can build a binary from that branch for you to test: #2610. There are other optimizations I have noted as TODOs for 0.6.0 but will not be able to get to in the immediate time frame. |
Actually just built that branch. Here it is if you want to test: |
Thank you so much for this! I did two experiments:
I think the memory leak is also relevant to the first experiment, but there it does not reach higher numbers, though the memory is not getting freed. I will test it on the original 50K real jobs and let you know... Also, regarding this unofficial executable you sent me:
Thanks again. |
Hi, it started fine but then began to degrade, and when it reached 12K tasks the results were poor again. The nomad status command showed 360 tasks running (which was exactly what I expected), even though only ~100 tasks were actually running (I am checking this with docker swarm, and also with nomad status, counting the completed tasks and comparing with the count from a minute earlier). Also, the memory leak I described earlier started to become significant (6% of 16 GB after 13K tasks completed). These issues need to be solved for Nomad batch to be production ready... |
@OferE You can run it just on the workers. As for the memory usage, that is part 2 of the optimization I would like to do, so that should go away. I believe you are running raw_exec to launch docker containers. I would not be surprised if the issue in your 50K-task test was due to the docker engine misbehaving. Have you tried that test with
Also, why aren't you using the docker driver? |
I don't believe it's docker. I'm almost sure it's Nomad, though it might be a bug in my wrapping script (very unlikely too, since in the first ~4000 tasks it works perfectly every time, and the script is small...). The reason is that whenever I start a new big test on the same cluster, it behaves correctly for the first 4000 tasks and becomes poor around 13000 tasks. If it were docker, one run after the other would show poor results from the beginning of the second run. I think the problem is that some of the tasks finish and Nomad does not receive the exit signal of the script that wraps the containers in time, for some reason. So when you check the status, Nomad shows you a good running count, but there are fewer and fewer real processes in the cluster. The check you did that queries running allocations is not good enough; you need to count how many actually finish, and then you will see it clearly.

The docker driver is not good for us since it forces cgroups, which would force me to set memory and CPU limits for my batches. That is not good for us since our tasks have unpredictable footprints; I prefer that the OS handle resource utilization for them (they have peaks, etc.). Cgroup usage is good for data-center utilization, not for cloud usage with static allocation (where you allocate machines to a specific purpose such as batch/service). |
@OferE That isn't quite how the cgroups work when using CPU shares. Under CPU contention the limits are enforced, but otherwise the processes are allowed to use as much CPU as is available, with the shares applying as a weight. When we get the memory use sorted out, let's run the benchmark again and grab some profiling data early in the run and then later when you see the slowdown, so we can see what is happening! |
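For context, a hedged sketch of what the docker driver alternative would look like; the image name and numbers are placeholders, and the comments summarize the cgroup behavior described in the comment above:

```hcl
task "work" {
  driver = "docker"

  config {
    # Placeholder image; in this thread the containers are actually
    # launched by a raw_exec wrapper script instead.
    image = "example/batch-worker:latest"
  }

  resources {
    # CPU is applied as cgroup shares: a weight under contention rather
    # than a hard cap, so idle CPU on the host can still be used.
    cpu = 18400

    # Memory, by contrast, is a hard cgroup limit; exceeding it gets the
    # task killed, which is the concern raised in the reply below.
    memory = 1024 # MB, placeholder
  }
}
```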
For CPU you are right, but for memory cgroups enforce a hard limit, which is not good enough for us. Like I said, under high load the problem is most probably Nomad's signal handling (the scheduler always shows a good number of running tasks, but in reality the numbers are poor). I'll be happy to help with any profiling you ask for, and again I am grateful for your engagement in this issue! |
Hmm, yeah, I don't think it is that, as the code is quite simple there: it is literally waiting on a Go Process, see https://github.com/hashicorp/nomad/blob/master/client/driver/executor/executor.go#L402. If it is easy enough, do you want to grab the profiling data without the memory fixes, since it looks like in the 50K test you weren't affected by memory too much? |
I'll do it, though reaching 50K is very expensive for us because of the performance issue. I'll also try to dig more; maybe I'm wrong ;-) (I really don't know the Nomad code.) |
Interesting solution to the memory-limiting problem, @OferE. We're running ~500 workers on shared machines and were forced to add memory usage monitoring before starting to move them to Nomad; unfortunately, after running the first 100, we saw high CPU and memory usage on the host. BTW, our Nomad file is ~700 KB; it's one job with 100 groups, and some of them have counts inside. |
I would like to revisit this once 0.6.0 is out. A good number of performance enhancements came out of debugging with what @OferE provided. Splitting the job would help, but 0.6.0 should mitigate the need to do so! |
@OferE Bumping this since 0.6.0 is now out! Are you all seeing improved performance? |
I haven't had a chance to upgrade yet; I will keep you posted once I get back to this project (might take a while). The ad hoc version you gave back then really helped. Thank you so much for this. |
@OferE Awesome! Glad it helped you! I am going to close this because in the related high-CPU bug, people are reporting that 0.6.0 has fixed the issue. When you upgrade, we can re-open it if it is still an issue! I would appreciate it if you could post your findings either way! Thanks 👍 |
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues. |
Nomad version
v0.5.3
(we later reproduced the issue with version 0.5.6)
Operating system and Environment details
Ubuntu 14.04 on AWS
Issue
We launched the following configuration:
3 Nomad masters on m4.4xlarge machines.
97 Nomad slaves on m4.4xlarge machines.
m4.4xlarge machines have 16 cores per host, each running at 2300 MHz.
Any idea how to make such a thing work?
Reproduction steps
We created a nomad batch job with 50K tasks to run.
Each task is in its own group, since we don't want to bundle many tasks onto the same machine, as we don't know how long each will take.
We use raw_exec to launch our jobs, and we gave each task a scheduler limit of 8 × 2300 = 18,400 CPU units.
Since each node offers 16 × 2300 = 36,800 CPU units, this means we would expect to schedule 2 groups/tasks on each node simultaneously.
In total, we scheduled 50K groups.
The critical problems we saw are:
1. We didn't see 97 × 2 tasks running simultaneously. What we saw instead was a batch of tasks (say 171), then the number dropped to zero, and only then did we see the next tasks (say 120). This is not as expected; we expected to see a flat 97 × 2 tasks running all the time...
2. The batch ended up dead after only ~500 groups, with all the remaining groups/tasks left in the pending state.
3. When running the job again (nomad run on the same job file), it got into a pending state with the following placement failures:
For some reason it thinks the CPU is exhausted on 97 nodes, even though all 97 nodes are running nothing at all.
Nomad Server logs (if appropriate)
(after I tried to run it again, the log looks like this)
Nomad Client logs (if appropriate)
Job file (if appropriate)
It's an 85 MB file.
This is what our configuration looks like:
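The file itself is omitted here. Purely as an illustration of the structure described above (one group per task, repeated ~50,000 times, which is what pushes the file to 85 MB), it would look roughly like this, with hypothetical names and commands:

```hcl
job "batch-50k" {
  datacenters = ["dc1"]
  type        = "batch"

  # A stanza like this is repeated once per task, ~50,000 times.
  group "task-00001" {
    task "work" {
      driver = "raw_exec"

      config {
        # Placeholder for the wrapper script that runs the real workload.
        command = "/usr/local/bin/run-task.sh"
        args    = ["task-00001"]
      }

      resources {
        cpu = 18400 # 8 * 2300 MHz, as described above
      }
    }
  }

  # group "task-00002" { ... } and so on, up to task-50000
}
```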