Use cpu_total_compute configuration for CPU usage too #17628
Hi @TrueBrain I see where you're going with this and how you've threaded it down into the executor. Architecturally this seems like we're having to expose fingerprinted data to the plugin, when normally I'd expect that data to flow the other direction.
I'm going to tag-in @shoenig as I know he's looked at this area of the code a bunch recently to do #16672 and I want to get his thoughts here.
The total cpu compute available is used by the scheduler though (minus reserved), and is a shared resource among all task drivers. I think @TrueBrain is on the right track, but I'd go even further and attempt to purge any reference to the @TrueBrain let us know if you want to keep working on this, you're definitely poking at some old and surprisingly twisty code with this one 🙂
After fiddling with this for a while, I noticed that the Because of this, the I also decided to name the function Curious what you think of this work! PS: I can't seem to get
Thanks @TrueBrain. We're in the middle of shipping the release candidate for 1.6.0 but will circle back to re-review once that work is done.
This looks good @TrueBrain! The refactoring definitely makes sense. I think we can focus on this as-is and backport it (going into 1.6.1, 1.5.x, 1.4.x). Just FYI in Nomad 1.7 we're planning some significant work around NUMA node detection, where we might go back to plumbing the cpu_total_compute through the RPC layer (along with the rest of the cpu topology). The trick here of passing it through directly to the I tested these changes on ec2 and it seems Task utilization is still broken,
and cpu utilization for the task is still 0, though the system CPU stats are calculated correctly.
Where the task indicates Likewise the node status utilization metric is zero
(for comparison, on my amd64 machine)
Ah, yes, I only tested when you set the configuration Edit: meh; bit of a pita to fix properly .. AWS fingerprint is run after CPU fingerprint, and can change the Edit 2: fixed and tested on a t4g on AWS.
Before this commit, cpu_total_compute was only used for fingerprinting, but not for CPU stats on nodes or tasks. This meant that if the auto-detection failed, setting cpu_total_compute didn't resolve the issue. This was most noticeable on ARM64, where auto-detection always failed.
I think this PR is looking good now, thanks @TrueBrain!
Indeed! That refactor (in combination with refactoring cgroups) is a prerequisite for what I'm working on for 1.7
👍 @shoenig the original ticket is classified as enhancement but this really feels more like a bugfix and as you've noted we'll need this for upcoming work. Should we just go ahead and backport this to 1.5.x and 1.4.x too?
(manual backport of e190eae) Before this commit, cpu_total_compute was only used for fingerprinting, but not for CPU stats on nodes or tasks. This meant that if the auto-detection failed, setting cpu_total_compute didn't resolve the issue. This was most noticeable on ARM64, where auto-detection always failed.
… (#17992) (manual backport of e190eae) Before this commit, cpu_total_compute was only used for fingerprinting, but not for CPU stats on nodes or tasks. This meant that if the auto-detection failed, setting cpu_total_compute didn't resolve the issue. This was most noticeable on ARM64, where auto-detection always failed. Co-authored-by: Patric Stout <[email protected]>
Fixes #17577.
Currently, cpu_total_compute is only used for fingerprinting / capacity, and doesn't influence TotalTicksAvailable. But TotalTicksAvailable is used to calculate CPU usage ticks. If, for example, no CPU MHz can be detected on ARM64, TotalTicksAvailable remains zero, and as a result all CPU usage remains zero.

This PR sets out to address that problem by having cpu_total_compute override TotalTicksAvailable too.

A good way to test whether this works is to set cpu_total_compute to 100. All CPU metrics in the UI etc. should then show the percentage your CPU is loaded.

Next, it is a matter of loading a job that uses 100% of a single core; one should see 1 / CPU-cores of load. The easiest way I could find was a task that does dd if=/dev/zero of=/dev/null. That burns a single core for as long as you keep it running.
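
To make the arithmetic concrete, here is a minimal Go sketch of the idea. It is not the code from this PR: the cpuStats type, newCPUStats, and ticksConsumed are hypothetical names, and the percentage is assumed to be relative to the whole machine for simplicity. It only shows why a total of zero ticks forces every usage figure to zero, and how an override of 100 turns ticks into a plain percentage.

```go
package main

import "fmt"

// cpuStats is a hypothetical stand-in for the client's CPU accounting.
// It is NOT Nomad's actual type; it exists only for this illustration.
type cpuStats struct {
	totalTicks float64 // total compute in MHz ("ticks"); 0 when detection failed
}

// newCPUStats prefers the operator-supplied override (the role played by
// cpu_total_compute) and falls back to the auto-detected value otherwise.
func newCPUStats(detectedMHz, overrideMHz float64) cpuStats {
	if overrideMHz > 0 {
		return cpuStats{totalTicks: overrideMHz}
	}
	return cpuStats{totalTicks: detectedMHz}
}

// ticksConsumed converts a usage percentage (assumed here to be 0-100 of the
// whole machine) into ticks. With totalTicks == 0 the result is always 0.
func (c cpuStats) ticksConsumed(percent float64) float64 {
	return (percent / 100) * c.totalTicks
}

func main() {
	const cores = 4
	busy := 100.0 / cores // one fully loaded core on a 4-core machine

	broken := newCPUStats(0, 0)  // detection failed, no override
	fixed := newCPUStats(0, 100) // detection failed, override set to 100

	fmt.Println(broken.ticksConsumed(busy)) // 0
	fmt.Println(fixed.ticksConsumed(busy))  // 25
}
```

With the override set to 100, a single busy core on a four-core machine reports 25 ticks, which matches the 1 / CPU-cores load described above.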