Reduce CPU utilization in executors when scanning for PIDs. #5832
Comments
@picoDoc Thanks for the detailed report and repro steps. I tried replicating what you saw, and I do see the CPU utilization you describe. This is very much an internal implementation detail of the executor and not currently designed to be overridden; I will discuss internally and report back with our concrete plans.
We discussed this internally, and rather than making pidScanInterval configurable we think there are other optimization approaches for the code that scans PIDs; it can be done much more efficiently than our current approach. We will target this in a future release.
Could pidScanInterval be made configurable?
Ok cool, thanks. Any idea on a timescale for this improvement? This currently has quite a large impact for us.
Since the changes in #5951, the test described above seems to fail to launch any jobs. All jobs fail with a timeline something like:
Could this be because this change assumes cgroups are being used? In our case I think they are not used, because nomad is not launched as root. Can we re-open this ticket? @langmartin @preetapan
@picoDoc thanks very much for the follow-up; I left a testing gap around non-root raw_exec use. To take advantage of the fix implemented in #5951, you'll need to run nomad with cgroup creation privileges. This should be possible outside of nomad if you use a root script to create a cgroup that allows the nomad user to create cgroups. Without that permission, #5991 will allow nomad to start, but it won't improve your CPU usage.
Awesome, thank you! For reference, to give nomad the appropriate cgroup permissions I had to run:
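A minimal sketch of this kind of root-run setup, assuming cgroup v1 mounted under /sys/fs/cgroup, a dedicated nomad user and group, and the freezer controller (adjust the controller and paths to whatever your Nomad version and distribution actually use):

```sh
# Run as root. Creates a cgroup the nomad user can manage, so the executor
# can create per-task child cgroups without the agent itself running as root.
# The controller (freezer), mount point, and user/group names are assumptions.
mkdir -p /sys/fs/cgroup/freezer/nomad
chown -R nomad:nomad /sys/fs/cgroup/freezer/nomad

# Roughly equivalent, if cgroup-tools (libcgroup) is installed:
# cgcreate -a nomad:nomad -t nomad:nomad -g freezer:nomad
```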
After this I re-ran the tests above using
So thanks, appreciate the help! Would it be worth noting in the documentation that you need to set up cgroups appropriately to take advantage of this optimization?
Great, glad to hear it! This should be documented; I've just added some documentation.
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
Nomad version
Nomad v0.9.2 (028326684b9da489e0371247a223ef3ae4755d87)
Operating system and Environment details
Issue
I've found that when running large numbers of nomad jobs per host (>100), the CPU overhead of the nomad executor processes becomes a major problem. This seems to be due to each nomad executor frequently scanning the process tree (via collectPids). Reducing this frequency seems to greatly improve the situation for us (see reproduction below), but it's currently hard-coded to 5 seconds. Could this be made configurable? As far as I can tell, the collection of pids is only required for telemetry collection. Any help here would be greatly appreciated!
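To see why this cost scales with the whole host rather than with an individual job, note that each executor walks the full process table on every scan, so the work per host per scan is roughly (number of executors) x (total PIDs). A quick, purely illustrative way to surface both factors on a client node (the "nomad executor" process name is an assumption):

```sh
# Both numbers grow together: more allocations mean more executors and more
# total PIDs, and every executor rescans the whole table each interval.
ls -d /proc/[0-9]* | wc -l       # total PIDs visible on the host
pgrep -cf 'nomad executor'       # number of running executor processes (name assumed)
```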
Reproduction steps

Running the nomad agent with the config file below, and starting up 10 instances of our test job (just a simple sleep command) using nomad job run test.nomad, the executors settle down to using about 0.3% CPU each.

On the other hand, if I repeat this with the count in the job config below increased to 500, each executor now uses something closer to 2% CPU. This is an issue for us, as in this case the executors start to use the majority of the CPU cycles on our hosts. I assume this is because each executor spawns a large number of go threads and so increases the overall pid count on the system, which each executor is scanning every 5 seconds. I tried changing pidScanInterval to 120s and recompiling, and this brought the CPU usage per executor down to less than when there were only 10 processes, so if we were able to tweak this it would solve our issue.

Nomad config
Job file