Use cpu_total_compute configuration for CPU usage too #17628

TrueBrain · 2023-06-20T20:13:10Z

Currently, cpu_total_compute is only used for fingerprinting / capacity and doesn't influence TotalTicksAvailable. But TotalTicksAvailable is used to calculate CPU usage ticks. If, for example, no CPU MHz can be detected on ARM64, TotalTicksAvailable remains zero, and as a result all CPU usage remains zero.

This PR sets out to address that problem by having cpu_total_compute override TotalTicksAvailable too.
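To make the intended behaviour concrete, here is a minimal sketch of the override logic. The helper name `effectiveTotalTicks` and its signature are hypothetical and are not the code in this PR; the sketch only assumes that a cpu_total_compute value of zero means "not set".

```go
package main

import "fmt"

// effectiveTotalTicks is a hypothetical helper, not Nomad's actual code.
// It returns the operator-supplied cpu_total_compute when it is set (> 0)
// and otherwise falls back to detected cores * MHz, which is zero when
// frequency detection fails.
func effectiveTotalTicks(override uint64, cores int, detectedMHz float64) uint64 {
	if override > 0 {
		return override
	}
	if cores <= 0 || detectedMHz <= 0 {
		return 0
	}
	return uint64(float64(cores) * detectedMHz)
}

func main() {
	// ARM64 case from the description: no MHz detected, nothing configured.
	fmt.Println(effectiveTotalTicks(0, 8, 0))
	// Same host with cpu_total_compute configured: the override wins.
	fmt.Println(effectiveTotalTicks(20000, 8, 0))
	// Detection works and no override is set: cores * MHz is used.
	fmt.Println(effectiveTotalTicks(0, 8, 2500))
}
```

With the first case returning zero, every usage figure derived from it is zero too, which is the symptom described above.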

A good way to test whether this works is to set cpu_total_compute to 100. All CPU metrics in the UI etc. should then show the percentage of your CPU that is loaded.
Next, load a job that uses 100% of a single core; you should see a load of 1 / CPU-cores of the total. The easiest way I could find was a task that runs dd if=/dev/zero of=/dev/null, which burns a single core for as long as you keep it running.

drivers/shared/executor/executor.go

tgross

Hi @TrueBrain I see where you're going with this and how you've threaded it down into the executor. Architecturally this seems like we're having to expose fingerprinted data to the plugin, when normally I'd expect that data to flow the other direction.

I'm going to tag-in @shoenig as I know he's looked at this area of the code a bunch recently to do #16672 and I want to get his thoughts here.

drivers/shared/executor/utils.go

client/fingerprint/cpu.go

shoenig · 2023-06-22T16:40:23Z

> Architecturally this seems like we're having to expose fingerprinted data to the plugin, when normally I'd expect that data to flow the other direction.

The total cpu compute available is used by the scheduler though (minus reserved), and is a shared resource among all task drivers. I think @TrueBrain is on the right track, but I'd go even further and attempt to purge any reference to the client/stats package from driver packages. The new NomadDriverConfig.CpuCompute RPC field would then represent the final value the Nomad client will use, whether the value originated from fingerprinting or from client config. There's an interesting side effect here where TaskResourceUsage utilization metrics would become dependent on cpu_total_compute being set correctly (if set), but I think that's probably fine.

@TrueBrain let us know if you want to keep working on this, you're definitely poking at some old and surprisingly twisty code with this one 🙂

TrueBrain · 2023-07-11T12:45:43Z

After fiddling with this for a while, I noticed that stats/cpu.go in the client folder contains things used by both client and driver, while helper/cpu.go contains things only used by the client. So in the end, I decided to mostly flip those two files around, with some exceptions.

Because of this, the driver modules should no longer need client/stats, and all use helper/stats.

I also decided to name the function cpuTotalTicks; it was named in several different ways, which was a bit confusing (what does totalAvailable mean?). The only thing I did not change is the name of the configuration entry. That is still cpu_total_compute.

Curious what you think of this work!

PS: I can't seem to get make test to work locally, so I am just hoping nothing is broken. If something is, I will address it as I spot it :)

tgross · 2023-07-11T19:13:50Z

Thanks @TrueBrain. We're in the middle of shipping the release candidate for 1.6.0 but will circle back to re-review once that work is done.

shoenig · 2023-07-17T14:46:45Z

This looks good @TrueBrain! The refactoring definitely makes sense. I think we can focus on this as-is and backport it (going into 1.6.1, 1.5.x, 1.4.x). Just FYI in Nomad 1.7 we're planning some significant work around NUMA node detection, where we might go back to plumbing the cpu_total_compute through the RPC layer (along with the rest of the cpu topology). The trick here of passing it through directly to the NewExecutor constructor doesn't help external plugins.

I tested these changes on EC2 and it seems task utilization is still broken:

ubuntu@ip-172-31-80-227:~$ ./nomad.tb node status -self -verbose | grep -E '(cpu\.)|(instance-type)'
cpu.arch                                 = arm64
cpu.frequency                            = 2500
cpu.modelname                            = Neoverse-N1
cpu.numcores                             = 8
cpu.reservablecores                      = 8
cpu.totalcompute                         = 20000
platform.aws.instance-type               = t4g.2xlarge

job "example" {
  group "group" {
    task "task" {
      driver = "exec"
      config {
        command = "stress"
        args    = ["-c", "2"] # burn 2 cores
      }
      resources {
        cpu    = 500
        memory = 256
      }
    }
  }
}

and cpu utilization for the task is still 0, though the system CPU stats are calculated correctly.

ubuntu@ip-172-31-80-227:~$ ./nomad.tb alloc status -stats a8

Task "task" is "running"
Task Resources:
CPU        Memory           Disk     Addresses
0/500 MHz  480 KiB/256 MiB  300 MiB

Memory Stats
Cache  Swap     Usage
0 B    480 KiB  480 KiB

CPU Stats
Percent  System Mode  Throttled Periods  Throttled Time  User Mode
199.91%  0.00%        0                  0               199.91%

Where the task indicates 0/500 MHz I'd expect to see 5000/500 MHz (2 cores * 2500 MHz each).
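The expected figure can be reproduced with a small sketch. The helper `ticksConsumed` below is hypothetical, not the function Nomad uses; it assumes the task's CPU percent is scaled so that 100 means one fully busy core, and that each core contributes an equal share of cpu.totalcompute.

```go
package main

import "fmt"

// ticksConsumed is a hypothetical helper, not Nomad's actual code.
// percent is the task's CPU usage where 100 means one fully busy core;
// totalTicks is cpu.totalcompute in MHz; cores is cpu.numcores.
func ticksConsumed(percent float64, totalTicks uint64, cores int) float64 {
	if cores <= 0 {
		return 0
	}
	perCore := float64(totalTicks) / float64(cores)
	return percent / 100 * perCore
}

func main() {
	// Values from the t4g.2xlarge output above: 199.91% across 8 cores
	// sharing 20000 MHz, which lands just under the expected 5000 MHz.
	fmt.Printf("%.0f MHz\n", ticksConsumed(199.91, 20000, 8))
	// If the total used by the stats code is zero, usage is zero
	// no matter how busy the task is.
	fmt.Printf("%.0f MHz\n", ticksConsumed(199.91, 0, 8))
}
```

The second call mirrors the 0/500 MHz reading: the percentage is correct, but the total it is multiplied by is still zero.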

Likewise the node status utilization metric is zero

ubuntu@ip-172-31-80-227:~$ ./nomad.tb node status -self -stats

Allocated Resources
CPU            Memory          Disk
500/20000 MHz  256 MiB/31 GiB  300 MiB/29 GiB

Allocation Resource Utilization
CPU          Memory
0/20000 MHz  480 KiB/31 GiB

Host Resource Utilization
CPU          Memory          Disk
0/20000 MHz  353 MiB/31 GiB  (/dev/root)

(for comparison, on my amd64 machine)

➜ nomad alloc status -stats 39

Task "task" is "running"
Task Resources:
CPU           Memory           Disk     Addresses
6000/500 MHz  592 KiB/256 MiB  300 MiB

Memory Stats
Cache  Swap     Usage
0 B    592 KiB  592 KiB

CPU Stats
Percent  System Mode  Throttled Periods  Throttled Time  User Mode
200.01%  0.00%        0                  0               200.01%

➜ nomad node status -self -stats

Node Events
Time                  Subsystem  Message
2023-07-17T14:31:16Z  Cluster    Node registered

Allocated Resources
CPU            Memory           Disk
500/96000 MHz  256 MiB/126 GiB  300 MiB/202 GiB

Allocation Resource Utilization
CPU             Memory
5998/96000 MHz  592 KiB/126 GiB

Host Resource Utilization
CPU             Memory           Disk
6119/96000 MHz  685 MiB/126 GiB  19 GiB/233 GiB

TrueBrain · 2023-07-18T08:01:11Z

Ah, yes, I only tested when you set the configuration cpu_total_compute manually. For AWS there is of course the auto-detect part, and so it seems that is not routed through correctly with my PR .. let's see where that information gets lost :)

Edit: meh; bit of a pita to fix properly .. AWS fingerprint is run after CPU fingerprint, and can change the cpu.totalcompute attribute (and resources.CPU). The more I work on this, the more it feels it needs a bit bigger refactor :D But that is not within my scope of expertise atm. So I am going to fix this in the best way I can :)
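A minimal sketch of why the ordering matters. The types and the `runInOrder` helper are hypothetical and much simpler than Nomad's fingerprint manager; the values are taken from the t4g output above, and the sketch assumes the CPU fingerprinter leaves the total at zero when detection fails.

```go
package main

import "fmt"

// attrs stands in for the node attributes (hypothetical type).
type attrs map[string]uint64

// fingerprinter is a hypothetical stand-in for one fingerprint step.
type fingerprinter struct {
	name string
	run  func(a attrs)
}

// runInOrder applies each fingerprinter in turn and records the value of
// cpu.totalcompute as it looked right after each step.
func runInOrder(fps []fingerprinter) (final attrs, seen []uint64) {
	final = attrs{}
	for _, fp := range fps {
		fp.run(final)
		seen = append(seen, final["cpu.totalcompute"])
	}
	return final, seen
}

// exampleChain models the order described above: CPU first, then AWS.
func exampleChain() []fingerprinter {
	return []fingerprinter{
		// CPU detection fails on this host, so the total stays zero.
		{name: "cpu", run: func(a attrs) { a["cpu.totalcompute"] = 0 }},
		// The AWS fingerprinter runs later and overwrites the value.
		{name: "aws", run: func(a attrs) { a["cpu.totalcompute"] = 20000 }},
	}
}

func main() {
	final, seen := runInOrder(exampleChain())
	fmt.Println("captured after cpu step:", seen[0])
	fmt.Println("final value:", final["cpu.totalcompute"])
}
```

Anything that captures the total right after the CPU step keeps the stale zero, so the stats code has to read the value after the whole chain has run.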

Edit 2: fixed and tested on a t4g on AWS.

Before this commit, cpu_total_compute was only used for fingerprinting, but not for CPU stats on nodes or tasks. This meant that if the auto-detection failed, setting cpu_total_compute didn't resolve the issue. This issue was most noticeable on ARM64, as auto-detection always failed there.

shoenig · 2023-07-19T17:58:44Z

I think this PR is looking good now, thanks @TrueBrain!

> The more I work on this, the more it feels it needs a bit bigger refactor

Indeed! That refactor (in combination with refactoring cgroups) is a prerequisite for what I'm working on for 1.7

tgross

👍 @shoenig the original ticket is classified as enhancement but this really feels more like a bugfix and as you've noted we'll need this for upcoming work. Should we just go ahead and backport this to 1.5.x and 1.4.x too?

… (#17992) (manual backport of e190eae) Before this commit, cpu_total_compute was only used for fingerprinting, but not for CPU stats on nodes or tasks. This meant that if the auto-detection failed, setting cpu_total_compute didn't resolve the issue. This issue was most noticeable on ARM64, as auto-detection always failed there. Co-authored-by: Patric Stout <[email protected]>

TrueBrain changed the title ~~Use cpu_total_compute setting also for CPU usage information~~ Use cpu_total_compute setting for CPU usage information too Jun 20, 2023

TrueBrain changed the title ~~Use cpu_total_compute setting for CPU usage information too~~ Use cpu_total_compute configuration for CPU usage information too Jun 20, 2023

TrueBrain changed the title ~~Use cpu_total_compute configuration for CPU usage information too~~ Use cpu_total_compute configuration for CPU usage too Jun 20, 2023

TrueBrain force-pushed the fix-arm64-cpu branch from ab2676f to 35c5ea1 Compare June 20, 2023 21:56

TrueBrain force-pushed the fix-arm64-cpu branch from 35c5ea1 to ecf3cfb Compare June 20, 2023 22:20

tgross requested changes Jun 21, 2023

TrueBrain force-pushed the fix-arm64-cpu branch from ecf3cfb to a85c392 Compare July 11, 2023 12:42

TrueBrain marked this pull request as ready for review July 11, 2023 12:46

TrueBrain force-pushed the fix-arm64-cpu branch from a85c392 to 403bbaa Compare July 11, 2023 13:12

TrueBrain force-pushed the fix-arm64-cpu branch from 403bbaa to b6c5d52 Compare July 11, 2023 13:32

shoenig self-requested a review July 11, 2023 15:44

tgross added this to the 1.6.x milestone Jul 11, 2023

TrueBrain force-pushed the fix-arm64-cpu branch from b6c5d52 to 615e76e Compare July 18, 2023 08:36

shoenig approved these changes Jul 19, 2023

tgross approved these changes Jul 19, 2023

shoenig added the backport/1.4.x, backport/1.5.x, and backport/1.6.x labels Jul 19, 2023

shoenig merged commit e190eae into hashicorp:main Jul 19, 2023

TrueBrain deleted the fix-arm64-cpu branch August 15, 2023 12:55
