env_aws: use best-effort lookup table for CPU performance in EC2 #7828

Merged: shoenig merged 4 commits into master from b-ec2-speeds on Apr 29, 2020

shoenig
Member

@shoenig shoenig commented Apr 29, 2020

Fixes #7681

The CPU fingerprinter on AWS currently reads the **current** speed
from `/proc/cpuinfo` (the `cpu MHz` field).

This is because the max CPU frequency is not available by reading
anything on the EC2 instance itself. Normally on Linux one would
look at e.g. `/sys/devices/system/cpu/cpuN/cpufreq/cpuinfo_max_freq`
or perhaps parse the values from the CPU max MHz field in
/proc/cpuinfo, but those values are not available.
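For reference, a minimal sketch of the usual Linux approach that does not work on EC2. It assumes the sysfs file holds a single integer in kHz, as the kernel's cpufreq interface documents; the helper names `parseMaxFreqMHz` and `readMaxFreqMHz` are hypothetical and are not part of Nomad.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseMaxFreqMHz parses the contents of a cpuinfo_max_freq file, which
// holds one integer in kHz, and returns the value in MHz.
func parseMaxFreqMHz(contents string) (int, error) {
	khz, err := strconv.Atoi(strings.TrimSpace(contents))
	if err != nil {
		return 0, fmt.Errorf("unexpected cpuinfo_max_freq contents: %w", err)
	}
	return khz / 1000, nil
}

// readMaxFreqMHz reads the max frequency from the given sysfs path.
// On EC2 this file is typically absent, so an error is the expected result.
func readMaxFreqMHz(path string) (int, error) {
	b, err := os.ReadFile(path)
	if err != nil {
		return 0, err
	}
	return parseMaxFreqMHz(string(b))
}

func main() {
	const path = "/sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq"
	mhz, err := readMaxFreqMHz(path)
	if err != nil {
		fmt.Println("max frequency unavailable:", err)
		return
	}
	fmt.Println("max frequency MHz:", mhz)
}
```

On a machine where the file is missing, the program reports that the max frequency is unavailable, which is the situation described above.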

Furthermore, no metadata about the CPU is made available in the
EC2 metadata service.
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-categories.html

Since `gopsutil` cannot determine the max CPU speed, it falls back to
the current CPU speed, which can be anywhere between 0 and the true
max. This is particularly bad on large, powerful reserved instances,
which often idle at ~800 MHz while Nomad does its fingerprinting
(typically IO bound). Nomad then treats that idle speed as the max,
which results in a severe loss of available resources.

Since the CPU specification is unavailable programmatically (at least
not without sudo), this change uses a best-effort lookup table. This table was
generated by going through every instance type in AWS documentation
and copy-pasting the numbers.
https://aws.amazon.com/ec2/instance-types/

This approach obviously is not ideal as future instance types will
need to be added as they are introduced to AWS. However, using the
table should only be an improvement over the status quo since right
now Nomad miscalculates available CPU resources on all instance types.
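A minimal sketch of what such a best-effort table could look like. The type and function names (`cpuSpec`, `ec2CPUTable`, `resolveMHz`) are hypothetical and not taken from the merged code, and the single entry uses the c5.24xlarge values reported in this PR's test logs. Falling back to the detected speed for unknown instance types is an assumption about what "best-effort" means here.

```go
package main

import "fmt"

// cpuSpec is a hypothetical record for one instance type.
type cpuSpec struct {
	Model string
	MHz   int
	Cores int
}

// ec2CPUTable maps instance type to its documented CPU spec.
// Only one entry is shown; a real table would cover every instance type.
var ec2CPUTable = map[string]cpuSpec{
	"c5.24xlarge": {Model: "3.6 GHz Intel Xeon Scalable", MHz: 3600, Cores: 96},
}

// resolveMHz returns the table frequency when the instance type is known.
// Otherwise it returns the detected frequency and false, so unknown or
// newly introduced instance types behave as they did before.
func resolveMHz(instanceType string, detectedMHz int) (int, bool) {
	if spec, ok := ec2CPUTable[instanceType]; ok {
		return spec.MHz, true
	}
	return detectedMHz, false
}

func main() {
	for _, it := range []string{"c5.24xlarge", "not-in-table.large"} {
		mhz, found := resolveMHz(it, 1335)
		fmt.Printf("%s: MHz=%d fromTable=%t\n", it, mhz, found)
	}
}
```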

@shoenig shoenig force-pushed the b-ec2-speeds branch 2 times, most recently from c7e35d4 to 2589b85 Compare April 29, 2020 00:47
@shoenig
Member Author

shoenig commented Apr 29, 2020

Running Nomad with this change on a c5.24xlarge like the original reporter:

run 1

2020-04-29T01:22:55.187Z [DEBUG] client.fingerprint_mgr.cpu: detected cpu frequency: MHz=3346
2020-04-29T01:22:55.187Z [DEBUG] client.fingerprint_mgr.cpu: detected core count: cores=96
2020-04-29T01:22:55.201Z [DEBUG] client.fingerprint_mgr.env_aws: lookup ec2 cpu model name: model="3.6 GHz Intel Xeon Scalable"
2020-04-29T01:22:55.201Z [DEBUG] client.fingerprint_mgr.env_aws: lookup ec2 cpu frequency: MHz=3600
2020-04-29T01:22:55.201Z [DEBUG] client.fingerprint_mgr.env_aws: lookup ec2 cpu cores: cores=96
2020-04-29T01:22:55.201Z [DEBUG] client.fingerprint_mgr.env_aws: setting ec2 cpu ticks: ticks=345600

run 2

2020-04-29T01:23:58.743Z [DEBUG] client.fingerprint_mgr.cpu: detected cpu frequency: MHz=1335
2020-04-29T01:23:58.743Z [DEBUG] client.fingerprint_mgr.cpu: detected core count: cores=96
2020-04-29T01:23:58.755Z [DEBUG] client.fingerprint_mgr.env_aws: lookup ec2 cpu model name: model="3.6 GHz Intel Xeon Scalable"
2020-04-29T01:23:58.755Z [DEBUG] client.fingerprint_mgr.env_aws: lookup ec2 cpu frequency: MHz=3600
2020-04-29T01:23:58.755Z [DEBUG] client.fingerprint_mgr.env_aws: lookup ec2 cpu cores: cores=96
2020-04-29T01:23:58.755Z [DEBUG] client.fingerprint_mgr.env_aws: setting ec2 cpu ticks: ticks=345600

run 3

2020-04-29T01:25:19.395Z [DEBUG] client.fingerprint_mgr.cpu: detected cpu frequency: MHz=2218
2020-04-29T01:25:19.395Z [DEBUG] client.fingerprint_mgr.cpu: detected core count: cores=96
2020-04-29T01:25:19.407Z [DEBUG] client.fingerprint_mgr.env_aws: lookup ec2 cpu model name: model="3.6 GHz Intel Xeon Scalable"
2020-04-29T01:25:19.407Z [DEBUG] client.fingerprint_mgr.env_aws: lookup ec2 cpu frequency: MHz=3600
2020-04-29T01:25:19.407Z [DEBUG] client.fingerprint_mgr.env_aws: lookup ec2 cpu cores: cores=96
2020-04-29T01:25:19.407Z [DEBUG] client.fingerprint_mgr.env_aws: setting ec2 cpu ticks: ticks=345600
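To make the difference concrete, the sketch below recomputes total ticks from the numbers in the three runs above. It assumes ticks are simply per-core MHz multiplied by core count, which matches the logged 3600 x 96 = 345600; the `ticks` helper is hypothetical.

```go
package main

import "fmt"

// ticks is total compute in MHz: per-core frequency times core count.
func ticks(mhz, cores int) int { return mhz * cores }

func main() {
	const cores = 96
	const tableMHz = 3600
	// "detected cpu frequency" values from runs 1 to 3 above.
	detected := []int{3346, 1335, 2218}

	for i, mhz := range detected {
		fmt.Printf("run %d: detected %d MHz -> %d ticks; table %d MHz -> %d ticks\n",
			i+1, mhz, ticks(mhz, cores), tableMHz, ticks(tableMHz, cores))
	}
}
```

Under that assumption the detected values alone would have produced 321216, 128160, and 212928 ticks across the three runs, while the table value gives a stable 345600.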

plumbing works

ubuntu@ip-172-31-84-116:~$ ./nomad node status 17
ID              = 1707aadb-c741-b6ac-b487-b1d713466260
Name            = ip-172-31-84-116
Class           = <none>
DC              = dc1
Drain           = false
Eligibility     = eligible
Status          = ready
CSI Controllers = <none>
CSI Drivers     = <none>
Uptime          = 7m3s
Host Volumes    = <none>
CSI Volumes     = <none>
Driver Status   = mock_driver,raw_exec

Node Events
Time                  Subsystem  Message
2020-04-29T01:25:37Z  Cluster    Node registered

Allocated Resources
CPU           Memory       Disk
0/345600 MHz  0 B/185 GiB  0 B/6.5 GiB

Allocation Resource Utilization
CPU           Memory
0/345600 MHz  0 B/185 GiB

Host Resource Utilization
CPU             Memory           Disk
244/345600 MHz  669 MiB/185 GiB  1.2 GiB/7.7 GiB

Allocations
No allocations placed

@shoenig shoenig marked this pull request as ready for review April 29, 2020 01:31
@shoenig shoenig requested review from notnoop and jrasell April 29, 2020 01:31
Contributor

@notnoop notnoop left a comment

I would love to see some automation tips/scripts for updating the list, and I have some nitpicky logging suggestions. LGTM otherwise.

Review threads on client/fingerprint/env_aws.go: 4 (all resolved, 3 outdated)
@jippi
Contributor

jippi commented Apr 29, 2020

Will the client settings still take precedence when configured?

@shoenig
Member Author

shoenig commented Apr 29, 2020

@jippi Yep, configuring cpu_total_compute continues to take the highest precedence.
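The precedence described here can be sketched as follows. This is an illustration rather than Nomad's actual code: `resolveTotalCompute` is a hypothetical helper, and treating a zero value as "not set" is an assumption.

```go
package main

import "fmt"

// resolveTotalCompute picks total CPU compute (MHz) by precedence:
// an operator-configured cpu_total_compute first, then the EC2 lookup
// table, then whatever the generic CPU fingerprinter detected.
// A value of 0 means "not set" in this sketch.
func resolveTotalCompute(configured, lookedUp, detected int) (int, string) {
	switch {
	case configured > 0:
		return configured, "cpu_total_compute"
	case lookedUp > 0:
		return lookedUp, "ec2 lookup table"
	default:
		return detected, "detected"
	}
}

func main() {
	value, source := resolveTotalCompute(0, 345600, 128160)
	fmt.Printf("total compute = %d MHz (source: %s)\n", value, source)

	value, source = resolveTotalCompute(200000, 345600, 128160)
	fmt.Printf("total compute = %d MHz (source: %s)\n", value, source)
}
```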

@shoenig shoenig merged commit a12eb8f into master Apr 29, 2020
@shoenig shoenig deleted the b-ec2-speeds branch April 29, 2020 17:25
@github-actions
github-actions bot commented Jan 8, 2023

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jan 8, 2023
Successfully merging this pull request may close these issues.

Available CPU MHz Varying Wildly for Same Instance Type
3 participants