When a container orchestration engine such as Kubernetes is used, CPU resources are allocated from a pool of platforms based entirely on availability, without taking into account specific features such as Intel Speed Select Technology (SST).
The Kubernetes Power Manager is a Kubernetes Operator that gives cluster users a mechanism to dynamically request adjustments to the worker node power management settings applied to the cores allocated to their Pods. The power management settings can be applied to individual cores or to groups of cores, and each may have a different policy applied. Not every core in the system needs to be explicitly managed by the Kubernetes Power Manager. When the Power Manager is used to specify core power policies, it overrides the default settings.
The Intel SST package provides powerful features that give users more precise control over CPU performance and power use on a per-core basis. However, as a workload orchestrator, Kubernetes is purposefully built to act as an abstraction layer between the workload and such hardware capabilities. As a consequence, Kubernetes users running performance-critical workloads with specific requirements that depend on hardware capabilities face a challenge.
The Kubernetes Power Manager bridges the gap between the container orchestration layer and hardware features enablement, specifically Intel SST.
- The Kubernetes Power Manager consists of two main components: the overarching manager, which is deployed anywhere on a cluster, and the power node agent, which is deployed on each node that requires power management capabilities.
- The overarching operator is responsible for the configuration and deployment of the power node agent, while the power node agent is responsible for the tuning of the cores as requested by the user.
- Users may want to pre-schedule nodes to move to a performance profile during peak times to minimize spin up. At times not during peak, they may want to move to a power saving profile.
- Unpredictable machine use. Users may use machine learning through monitoring to determine profiles that predict a peak need for compute, to spin up ahead of time.
- Power Optimization over Performance. A user may be interested in fast response time, but not in maximal response time, so may choose to spin up cores on demand and only those cores used but want to remain in power-saving mode the rest of the time.
Please see the diagrams-docs directory for diagrams with a visual breakdown of the power manager and its components.
-
SST-CP - (Speed Select Technology - Core Power)
The user can arrange cores according to priority levels using this capability. When the system has extra power, it can be distributed among the cores according to their priority level. Although it cannot be guaranteed, the system will try to apply the additional power to the cores with the highest priority. There are four levels of priority available:
- Performance
- Balance Performance
- Balance Power
- Power
The Priority level for a core is defined using its EPP (Energy Performance Preference) value, which is one of the options in the Power Profiles. If not all the power is utilized on the CPU, the CPU can put the higher priority cores up to Turbo Frequency (allows the cores to run faster).
-
Frequency Tuning
Frequency tuning allows the individual cores on the system to be sped up or slowed down by changing their frequency. This tuning is done via the Intel Power Optimization Library. The min and max values for a core are defined in the Power Profile, and the tuning is done after the core has been assigned by the Native CPU Manager. The frequency of a core is changed by writing the new frequency value to the /sys/devices/system/cpu/cpuN/cpufreq/scaling_max_freq or scaling_min_freq file for the given core.
-
Time of Day
Time of Day allows the user to select a specific time of day at which to put all their unused CPUs into a “sleep” state, and another time at which to reverse the process and return them to an “active” state.
-
Scaling Drivers
-
P-State
Modern Intel CPUs automatically employ the Intel P-State CPU power scaling driver. This driver is built into the kernel rather than loaded as a module, giving it precedence over other drivers. It is currently used automatically for Sandy Bridge and newer CPUs. The BIOS P-State settings might be disregarded by Intel P-State. The Intel P-State driver utilizes the "performance" and "powersave" governors.
Performance: The CPUfreq governor "performance" sets the CPU statically to the highest frequency within the borders of scaling_min_freq and scaling_max_freq.
Powersave: The CPUfreq governor "powersave" sets the CPU statically to the lowest frequency within the borders of scaling_min_freq and scaling_max_freq.
-
acpi-cpufreq
The acpi-cpufreq driver operates much like the P-State driver but has a different set of available governors. For more information see here. One thing to note is that acpi-cpufreq reports the base clock as the hardware frequency limit, whereas the P-State driver uses turbo frequency limits. Both drivers can make use of turbo frequency; however, acpi-cpufreq can exceed its reported hardware frequency limits when using turbo frequency. This is important to take into account when setting frequencies for profiles.
-
Uncore
The largest part of modern CPUs is outside the actual cores. On Intel CPUs this part is called the "Uncore" and contains the last-level caches, PCI-Express, memory controller, QPI, power management and other functionality. The previous deployment pattern was that an uncore setting was applied to sets of servers allocated as capacity for handling a particular type of workload. This is typically a one-time configuration today. The Kubernetes Power Manager now makes this dynamic, following a cloud native pattern. The implication is that the cluster-level capacity for the workload can then be configured dynamically, as well as scaled dynamically. Uncore frequency tuning applies to Xeon Scalable and Xeon D processors and could save up to 40% of CPU power or deliver improved performance.
-
SST-BF - (Speed Select Technology - Base Frequency)
This feature allows the user to change the base frequency of certain cores. The CPU's performance is guaranteed at the base frequency (a CPU will never go below its base frequency). Priority cores can run their critical workloads with guaranteed performance at a base frequency that is higher than that of most other cores on the system.
-
SST-TF - (Speed Select Technology - Turbo Frequency)
This feature allows the user to set different “All-Core Turbo Frequency” values to individual cores based on their priority. All-Core Turbo is the Turbo Frequency at which all cores can run on the system at the same time. The user can set certain cores to have a higher All-Core Turbo Frequency by lowering this value for other cores or setting them to no value at all.
This feature is only useful when all cores on the system are being utilized, but the user still wants to be able to configure certain cores to get a higher performance than others.
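The Frequency Tuning and Scaling Drivers descriptions above can be sketched in a few lines of Python. This is only an illustration and not the Intel Power Optimization Library's implementation: the helper names and the `sysfs_root` parameter are hypothetical, and the sketch assumes that profile frequencies are given in MHz while the cpufreq sysfs files hold values in kHz. Writing to the real sysfs tree requires root privileges.

```python
from pathlib import Path

SYSFS_CPU = "/sys/devices/system/cpu"
KHZ_PER_MHZ = 1000  # cpufreq sysfs files store frequencies in kHz


def cpufreq_dir(core, sysfs_root=SYSFS_CPU):
    """Return the cpufreq directory for one core (hypothetical helper)."""
    return Path(sysfs_root) / f"cpu{core}" / "cpufreq"


def read_scaling_driver(core=0, sysfs_root=SYSFS_CPU):
    """Return the active scaling driver, e.g. intel_pstate or acpi-cpufreq."""
    return (cpufreq_dir(core, sysfs_root) / "scaling_driver").read_text().strip()


def set_core_frequency(core, min_mhz, max_mhz, sysfs_root=SYSFS_CPU):
    """Write the min/max scaling frequencies for one core.

    Assumes min_mhz and max_mhz are in MHz, as in the PowerProfile examples.
    """
    if min_mhz > max_mhz:
        raise ValueError("min frequency must not exceed max frequency")
    base = cpufreq_dir(core, sysfs_root)
    (base / "scaling_max_freq").write_text(str(max_mhz * KHZ_PER_MHZ))
    (base / "scaling_min_freq").write_text(str(min_mhz * KHZ_PER_MHZ))
```

For example, `set_core_frequency(2, 3300, 3700)` would request a 3300 to 3700 MHz range for core 2. Checking `read_scaling_driver()` first is useful because, as noted above, the two drivers report different hardware limits.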
- Node Feature Discovery (NFD) should be deployed in the cluster before running the Kubernetes Power Manager. NFD is used to detect node-level features such as Intel Speed Select Technology - Base Frequency (SST-BF). Once detected, the user can instruct the Kubernetes Power Manager to deploy the Power Node Agent to Nodes with SST-specific labels, allowing the Power Node Agent to take advantage of such features by configuring cores on the host to optimise performance for containerized workloads. Note: NFD is recommended, but not essential. Node labels can also be applied manually. See the NFD repo for a full list of feature labels.
- Important: In the kubelet configuration file, cpuManagerPolicy has to be set to "static", and reservedSystemCPUs must be set to the desired value:
apiVersion: kubelet.config.k8s.io/v1beta1
authentication:
  anonymous:
    enabled: false
  webhook:
    cacheTTL: 0s
    enabled: true
  x509:
    clientCAFile: /etc/kubernetes/pki/ca.crt
authorization:
  mode: Webhook
  webhook:
    cacheAuthorizedTTL: 0s
    cacheUnauthorizedTTL: 0s
cgroupDriver: systemd
clusterDNS:
  - 10.96.0.10
clusterDomain: cluster.local
cpuManagerPolicy: "static"
cpuManagerReconcilePeriod: 0s
evictionPressureTransitionPeriod: 0s
fileCheckFrequency: 0s
healthzBindAddress: 127.0.0.1
healthzPort: 10248
httpCheckFrequency: 0s
imageMinimumGCAge: 0s
kind: KubeletConfiguration
logging:
  flushFrequency: 0
  options:
    json:
      infoBufferSize: "0"
  verbosity: 0
memorySwap: { }
nodeStatusReportFrequency: 0s
nodeStatusUpdateFrequency: 0s
reservedSystemCPUs: "0"
resolvConf: /run/systemd/resolve/resolv.conf
rotateCertificates: true
runtimeRequestTimeout: 0s
shutdownGracePeriod: 0s
shutdownGracePeriodCriticalPods: 0s
staticPodPath: /etc/kubernetes/manifests
streamingConnectionIdleTimeout: 0s
syncFrequency: 0s
volumeStatsAggPeriod: 0s
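A quick sanity check for the two settings highlighted above could look like the following sketch. It assumes the kubelet configuration has already been loaded into a Python dictionary (for example with a YAML parser), and the function names are hypothetical rather than part of the Kubernetes Power Manager.

```python
def parse_cpu_list(spec):
    """Expand a CPU list string such as "0-1,4" into a set of CPU ids."""
    cpus = set()
    for part in str(spec).split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start, end = part.split("-", 1)
            cpus.update(range(int(start), int(end) + 1))
        else:
            cpus.add(int(part))
    return cpus


def check_kubelet_config(config):
    """Return a list of problems found in a kubelet configuration dictionary."""
    problems = []
    if config.get("cpuManagerPolicy") != "static":
        problems.append('cpuManagerPolicy must be "static"')
    if not parse_cpu_list(config.get("reservedSystemCPUs", "")):
        problems.append("reservedSystemCPUs must list at least one CPU")
    return problems
```

The set returned by `parse_cpu_list` can later be compared with the reservedCPUs of the Shared PowerWorkload to keep the two values consistent.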
The Kubernetes Power Manager includes a helm chart for the latest releases, allowing the user to easily deploy everything that is needed for the overarching operator and the node agent to run. The following versions are supported with helm charts:
- v2.0.0
- v2.1.0
- v2.2.0
- v2.3.0
- v2.3.1
- ocp-4.13-v2.3.1
When set up using the provided helm charts, the following will be deployed:
- The intel-power namespace
- The RBAC rules for the operator and node agent
- The operator deployment itself
- The operator's power config
- A shared power profile
To change any of the values the above are deployed with, edit the values.yaml file of the relevant helm chart.
To deploy the Kubernetes Power Manager using Helm, you must have Helm installed. For more information on installing Helm, see the installation guide here https://helm.sh/docs/intro/install/.
To install the latest version, use the following command:
make helm-install
To uninstall the latest version, use the following command:
make helm-uninstall
You can use the HELM_CHART and OCP parameters to deploy an older or OpenShift-specific version of the Kubernetes Power Manager:
HELM_CHART=v2.3.1 OCP=true make helm-install
HELM_CHART=v2.2.0 make helm-install
HELM_CHART=v2.1.0 make helm-install
Please note when installing older versions that certain features listed in this README may not be supported.
The Kubernetes Power Manager has been tested in different environments.
The table below lists the environments that have been tested and confirmed to function as desired:
OS | Kernel | Container runtime | Kubernetes |
---|---|---|---|
Rocky 8.6 | 6.0.9-1.el8.elrepo.x86_64 | Docker 20.10.18 | 1.25.0 |
Ubuntu 20.04 | 5.15.0-50-generic | Containerd 1.6.9 | 1.25.4 |
Ubuntu 20.04 | 5.15.0-50-generic | cri-o 1.23.4 | 1.25.4 |
Ubuntu 20.04 | 5.15.0-50-generic | Docker 22.6.0 | 1.25.4 |
Ubuntu 20.04 | 5.15.0-50-generic | Containerd 1.6.9 | 1.24.3 |
Ubuntu 20.04 | 5.15.0-50-generic | Containerd 1.6.9 | 1.24.2 |
Ubuntu 20.04 | 5.15.0-50-generic | Containerd 1.6.9 | 1.23.3 |
Ubuntu 20.04 | 5.15.0-50-generic | Docker 20.10.12 | 1.23.3 |
Ubuntu 20.04 | 5.10.0-132-generic | Docker 22.6.0 | 1.25.4 |
Ubuntu 20.04 | 5.4.0-132-generic | Containerd 1.6.9 | 1.25.4 |
Ubuntu 20.04 | 5.4.0-122-generic | Containerd 1.6.9 | 1.24.4 |
CentOS 8 | 4.18.0-372.19.1.el8_6.x86_64 | Containerd 1.6.9 | 1.24.3 |
Rocky 8.6 | 4.18.0-372.19.1.el8_6.x86_64 | Docker 20.10.18 | 1.25.0 |
Note: Environments not listed in this table have not been tested.
The Intel Power Optimization Library takes the desired configuration for the cores associated with Exclusive Pods and tunes them based on the requested Power Profile. The Power Optimization Library also facilitates the use of the Intel SST (Speed Select Technology) suite (SST-CP - Speed Select Technology - Core Power, and Frequency Tuning) and C-States functionality.
The Power Node Agent is also a containerized application deployed by the Kubernetes Power Manager in a DaemonSet. The primary function of the node agent is to communicate with the node's Kubelet PodResources endpoint to discover the exact cores that are allocated per container. The node agent watches for Pods that are created in your cluster and examines them to determine which Power Profile they have requested and then sets off the chain of events that tunes the frequencies of the cores designated to the Pod.
The Kubernetes Power Manager will wait for the PowerConfig to be created by the user, in which the desired PowerProfiles will be specified. The PowerConfig holds different values: what image is required, what Nodes the user wants to place the node agent on and what PowerProfiles are required.
- powerNodeSelector: This is a key/value map used for defining a list of node labels that a node must satisfy in order for the Power Node Agent to be deployed.
- powerProfiles: The list of PowerProfiles that the user wants available on the nodes.
Once the Config Controller sees that the PowerConfig is created, it reads the values and then deploys the node agent onto each of the specified Nodes. It then creates the PowerProfiles and extended resources. Extended resources are resources created in the cluster that can be requested in the PodSpec. The Kubelet can then keep track of these requests. Using extended resources is important because they can specify how many cores on the system can be run at a higher frequency before hitting the heat threshold.
Note: Only one PowerConfig can be present in a cluster. The Config Controller will ignore and delete any subsequent PowerConfigs created after the first.
apiVersion: "power.intel.com/v1"
kind: PowerConfig
metadata:
  name: power-config
  namespace: intel-power
spec:
  powerNodeSelector:
    feature.node.kubernetes.io/power-node: "true"
  powerProfiles:
    - "performance"
    - "balance-performance"
    - "balance-power"
The Workload Controller is responsible for the actual tuning of the cores. The Workload Controller uses the Intel Power Optimization Library and requests that it creates the Pools. The Pools hold the PowerProfile associated with the cores and the cores that need to be configured.
The PowerWorkload objects are created automatically by the PowerPod controller. This action is undertaken by the Kubernetes Power Manager when a Pod is created with a container requesting exclusive cores and a PowerProfile.
PowerWorkload objects can also be created directly by the user via the PowerWorkload spec. This is only recommended when creating the Shared PowerWorkload for a given Node, as this is the responsibility of the user. If no Shared PowerWorkload is created, the cores that remain in the ‘shared pool’ on the Node will remain at their core frequency values instead of being tuned to lower frequencies. PowerWorkloads are specific to a given node, so one is created for each Node with a Pod requesting a PowerProfile, based on the PowerProfile requested.
apiVersion: "power.intel.com/v1"
kind: PowerWorkload
metadata:
  name: performance-example-node-workload
  namespace: intel-power
spec:
  name: "performance-example-node-workload"
  nodeInfo:
    containers:
      - exclusiveCPUs:
          - 2
          - 3
          - 66
          - 67
        id: f1be89f7dda457a7bb8929d4da8d3b3092c9e2a35d91065f1b1c9e71d19bcd4f
        name: example-container
        pod: example-pod
        powerProfile: "performance-example-node"
    name: "example-node"
    cpuIds:
      - 2
      - 3
      - 66
      - 67
  powerProfile: "performance-example-node"
This workload assigns the “performance” PowerProfile to cores 2, 3, 66, and 67 on the node “example-node”.
The Workload controller identifies the Shared PowerWorkload created by the user through the allCores value in the Workload spec. The reserved CPUs on the Node must also be specified, as these will not be considered for frequency tuning by the controller because they are always being used by Kubernetes' processes. It is important that the reservedCPUs value directly corresponds to the reservedSystemCPUs value in the user's Kubelet config to keep them consistent. The user determines the Node for this PowerWorkload using the powerNodeSelector to match the labels on the Node. The user then specifies the requested PowerProfile to use.
A shared PowerWorkload must follow the naming convention of beginning with ‘shared-’. Any shared PowerWorkload that does not begin with ‘shared-’ is rejected and deleted by the PowerWorkload controller. The shared PowerWorkload powerNodeSelector must also select a unique node, so it is recommended that the ‘kubernetes.io/hostname’ label be used. A shared PowerProfile can be used for multiple shared PowerWorkloads.
apiVersion: "power.intel.com/v1"
kind: PowerWorkload
metadata:
  name: shared-example-node-workload
  namespace: intel-power
spec:
  name: "shared-example-node-workload"
  allCores: true
  reservedCPUs:
    - cores: [0, 1]
      powerProfile: "performance"
  powerNodeSelector:
    # Labels other than hostname can be used
    "kubernetes.io/hostname": "example-node"
  powerProfile: "shared-example-node"
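The rules for a Shared PowerWorkload described above can be checked before applying a manifest. The following sketch assumes the manifest has been loaded into a Python dictionary; `validate_shared_workload` is a hypothetical helper and does not reproduce the PowerWorkload controller's actual validation logic.

```python
def validate_shared_workload(workload):
    """Return a list of rule violations for a Shared PowerWorkload manifest."""
    problems = []
    name = workload.get("metadata", {}).get("name", "")
    spec = workload.get("spec", {})
    if not name.startswith("shared-"):
        problems.append("name must begin with 'shared-'")
    if spec.get("allCores") is not True:
        problems.append("allCores must be true for a Shared PowerWorkload")
    if not spec.get("reservedCPUs"):
        problems.append("reservedCPUs must be specified")
    if not spec.get("powerNodeSelector"):
        problems.append("powerNodeSelector must select a unique node")
    if not spec.get("powerProfile"):
        problems.append("powerProfile must be specified")
    return problems
```

An empty list means the manifest satisfies the rules listed in this section. Whether the selector really matches a single node still has to be verified against the cluster.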
Important: Version 2.4.0 of the Kubernetes Power Manager allows users to assign a specific power profile to a reserved pool. This in turn relies on a change in the PowerWorkload CRD that is not backwards compatible with older versions of the Kubernetes Power Manager (v2.3.1 and older). If affected by this problem, you will see an error similar to the following in the manager Pod's logs:
Failed to watch *v1.PowerWorkload: failed to list *v1.PowerWorkload: json: cannot unmarshal number into Go struct field PowerWorkloadSpec.items.spec.reservedCPUs of type v1.ReservedSpec
To mitigate this problem, we ask customers to update their PowerWorkload manifests as suggested below:
- - 0
- - 1
+ - cores: [0, 1]
We aim to fix this issue in the next release of the Kubernetes Power Manager.
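For users with many manifests, the change shown in the diff above can be applied programmatically. The sketch below is an illustration only: it assumes the reservedCPUs value has been loaded from a manifest into a Python list, and `migrate_reserved_cpus` is a hypothetical helper, not a tool shipped with the Kubernetes Power Manager.

```python
def migrate_reserved_cpus(reserved):
    """Convert the older reservedCPUs form ([0, 1]) to the newer
    form ([{"cores": [0, 1]}]); values already in the newer form are returned unchanged."""
    if reserved and all(isinstance(item, int) for item in reserved):
        return [{"cores": list(reserved)}]
    return reserved
```

Calling the function on a value that is already in the newer form is a no-op, so it can be run safely over a mixed set of manifests.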
The Profile Controller holds values for specific SST settings which are then applied to cores at host level by the Kubernetes Power Manager as requested. Power Profiles are advertised as extended resources and can be requested via the PodSpec. The Config controller creates the requested high-performance PowerProfiles depending on which are requested in the PowerConfig created by the user.
There are two kinds of PowerProfiles:
- Base PowerProfiles
- Extended PowerProfiles
A Base PowerProfile can be one of three values:
- performance
- balance-performance
- balance-power
These correspond to three of the EPP values associated with SST-CP. Base PowerProfiles are used to tell the Profile controller that the specified profile is being requested for the cluster. The Profile controller takes the created Profile and further creates an Extended PowerProfile. An Extended PowerProfile is Node-specific. The reason behind this is that different Nodes in your cluster may have different maximum frequency limitations. For example, one Node may have a maximum limitation of 3700MHz, while another may only be able to reach frequency levels of 3200MHz. An Extended PowerProfile queries the Node that it is running on to obtain this maximum limitation and sets the Max and Min values of the profile accordingly. An Extended PowerProfile’s name has the following form:
BASE_PROFILE_NAME-NODE_NAME - for example: “performance-example-node”.
Either the Base PowerProfile or the Extended PowerProfile can be requested in the PodSpec, as the Workload controller can determine the correct PowerProfile to use from the Base PowerProfile.
apiVersion: "power.intel.com/v1"
kind: PowerProfile
metadata:
  name: performance-example-node
spec:
  name: "performance-example-node"
  max: 3700
  min: 3300
  epp: "performance"
The Shared PowerProfile must be created by the user and does not require a Base PowerProfile. This allows the user to have a Shared PowerProfile per Node in their cluster, giving more room for different configurations. The Power controller determines that a PowerProfile is being designated as ‘Shared’ through the use of the ‘shared’ parameter. This flag must be enabled when using a shared pool.
apiVersion: "power.intel.com/v1"
kind: PowerProfile
metadata:
  name: shared-example-node1
spec:
  name: "shared-example-node1"
  max: 1500
  min: 1000
  shared: true
  epp: "power"
  governor: "powersave"

apiVersion: "power.intel.com/v1"
kind: PowerProfile
metadata:
  name: shared-example-node2
spec:
  name: "shared-example-node2"
  max: 2000
  min: 1500
  shared: true
  governor: "powersave"
The PowerNode controller provides a window into the cluster's operations. It exposes the workloads that are currently in use, the profiles that are in use, the cores that are in use, and the containers those cores are associated with. It also informs the user of which shared pool is in use, which can be either the Default Pool or the Shared Pool. If there is no Shared PowerProfile associated with the Node, the Default Pool holds all the cores in the "shared pool", none of which will have their frequencies set to a lower value. If a Shared PowerProfile is associated with the Node, the cores in the "shared pool", apart from those reserved for Kubernetes processes (reservedCPUs), are assigned to the Shared Pool and tuned by the Intel Power Optimization Library.
activeProfiles:
  performance-example-node: true
activeWorkloads:
  - cores:
      - 2
      - 3
      - 8
      - 9
    name: performance-example-node-workload
nodeName: example-node
powerContainers:
  - exclusiveCpus:
      - 2
      - 3
      - 8
      - 9
    id: c392f492e05fc245f77eba8a90bf466f70f19cb48767968f3bf44d7493e18e5b
    name: example-container
    pod: example-pod
    powerProfile: performance-example-node
    workload: performance-example-node-workload
sharedPools:
  - name: Default
    sharedPoolCpuIds:
      - 0
      - 1
      - 4
      - 5
      - 6
      - 7
      - 10

activeProfiles:
  performance-example-node: true
activeWorkloads:
  - cores:
      - 2
      - 3
      - 8
      - 9
    name: performance-example-node-workload
nodeName: example-node
powerContainers:
  - exclusiveCpus:
      - 2
      - 3
      - 8
      - 9
    id: c392f492e05fc245f77eba8a90bf466f70f19cb48767968f3bf44d7493e18e5b
    name: example-container
    pod: example-pod
    powerProfile: performance-example-node
    workload: performance-example-node-workload
sharedPools:
  - name: Default
    sharedPoolCpuIds:
      - 0
      - 1
  - name: Shared
    sharedPoolCpuIds:
      - 4
      - 5
      - 6
      - 7
      - 10
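The two PowerNode outputs above differ only in how the non-exclusive cores are split between pools. The following sketch reproduces that split under the rules described in this section. The function name and arguments are hypothetical, and this is not the controller's actual code.

```python
def partition_shared_pools(all_cpus, exclusive, reserved, has_shared_profile):
    """Split the non-exclusive cores into the Default and Shared pools."""
    remaining = sorted(set(all_cpus) - set(exclusive))
    if not has_shared_profile:
        # Without a Shared PowerProfile every remaining core stays untuned.
        return {"Default": remaining}
    reserved_set = set(reserved)
    return {
        "Default": [cpu for cpu in remaining if cpu in reserved_set],
        "Shared": [cpu for cpu in remaining if cpu not in reserved_set],
    }
```

With cores 0 to 10, exclusive cores 2, 3, 8 and 9, and reserved cores 0 and 1, the function yields the same pool contents as the two examples above.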
To save energy on a system, you can command the CPU to go into a low-power mode. Each CPU has several power modes, which are collectively called C-States. These work by cutting the clock signal and power from idle CPUs, or CPUs that are not executing commands. While you save more energy by sending CPUs into deeper C-State modes, it does take more time for the CPU to fully “wake up” from sleep mode, so there is a trade-off when it comes to deciding the depth of sleep.
The driver that is used for C-States is the intel_idle driver. Everything associated with C-States in Linux is stored in the /sys/devices/system/cpu/cpuN/cpuidle directory or the /sys/devices/system/cpu/cpuidle directory. To check the driver in use, the user simply has to check the /sys/devices/system/cpu/cpuidle/current_driver file.
Requested C-States have to be confirmed as actually active on the system. If a user requests any C-States, the system must be checked to see whether they are activated, and if they are not, the request is rejected. The C-States are found in /sys/devices/system/cpu/cpuN/cpuidle/stateN/.
C-State | Description |
---|---|
C0 | Operating State |
C1 | Halt |
C1E | Enhanced Halt |
C2 | Stop Grant |
C2E | Extended Stop Grant |
C3 | Deep Sleep |
C4 | Deeper Sleep |
C4E/C5 | Enhanced Deeper Sleep |
C6 | Deep Power Down |
apiVersion: power.intel.com/v1
kind: CStates
metadata:
  # Replace <NODE_NAME> with the name of the node to configure the C-States on that node
  name: <NODE_NAME>
spec:
  sharedPoolCStates:
    C1: true
  exclusivePoolCStates:
    performance:
      C1: false
  individualCoreCStates:
    "3":
      C1: true
      C6: false
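Before applying a CStates object such as the one above, it can help to confirm which C-States a node actually exposes. The sketch below reads the cpuidle sysfs entries mentioned earlier. The helper names and the `sysfs_root` parameter are hypothetical, and it relies on the `name` and `disable` files that the Linux cpuidle subsystem provides under each stateN directory.

```python
from pathlib import Path


def list_cstates(core=0, sysfs_root="/sys/devices/system/cpu"):
    """Map each C-State name exposed for a core to whether it is enabled."""
    cpuidle = Path(sysfs_root) / f"cpu{core}" / "cpuidle"
    states = {}
    for state_dir in cpuidle.glob("state[0-9]*"):
        name = (state_dir / "name").read_text().strip()
        # The "disable" file holds 0 when the state is enabled.
        enabled = (state_dir / "disable").read_text().strip() == "0"
        states[name] = enabled
    return states


def missing_cstates(requested, available):
    """Return the requested C-State names that the core does not expose."""
    return sorted(name for name in requested if name not in available)
```

If `missing_cstates(["C1", "C6"], list_cstates(3))` returns a non-empty list, the configuration for core 3 requests states that are not present on that machine.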
The intel_pstate is a part of the CPU performance scaling subsystem in the Linux kernel (CPUFreq).
In some situations it is desirable or even necessary to run the program as fast as possible and then there is no reason to use any P-states different from the highest one (i.e. the highest-performance frequency/voltage configuration available). In some other cases, however, it may not be necessary to execute instructions so quickly and maintaining the highest available CPU capacity for a relatively long time without utilizing it entirely may be regarded as wasteful. It also may not be physically possible to maintain maximum CPU capacity for too long for thermal or power supply capacity reasons or similar. To cover those cases, there are hardware interfaces allowing CPUs to be switched between different frequency/voltage configurations or (in the ACPI terminology) to be put into different P-states.
In order to offer dynamic frequency scaling, the cpufreq core must be able to tell these drivers of a "target frequency". So these specific drivers will be transformed to offer a "->target/target_index/fast_switch()" call instead of the "->setpolicy()" call. For set_policy drivers, all stays the same, though.
The cpufreq governors decide what frequency within the CPUfreq policy should be used. The P-state driver utilizes the "powersave" and "performance" governors.
The CPUfreq governor "powersave" sets the CPU statically to the lowest frequency within the borders of scaling_min_freq and scaling_max_freq.
The CPUfreq governor "performance" sets the CPU statically to the highest frequency within the borders of scaling_min_freq and scaling_max_freq.
An alternative to the P-state driver is the acpi-cpufreq driver. It operates in a similar fashion to the P-state driver but offers a different set of governors, which can be seen here. One notable difference between the P-state and acpi-cpufreq drivers is that, with acpi-cpufreq, the aforementioned scaling_max_freq value is limited to base clock frequencies rather than turbo frequencies. When turbo is enabled, the core frequency will still be capable of exceeding base clock frequencies and the value of scaling_max_freq.
The Time Of Day feature allows users to change the configuration of their system at a given time each day. This is done through the use of a timeofdaycronjob, which schedules itself for a specific time each day and gives users the option of tuning C-States, the shared pool profile, and the profile used by individual pods.
apiVersion: power.intel.com/v1
kind: TimeOfDay
metadata:
  # Replace <NODE_NAME> with the name of the node to use TOD on
  name: <NODE_NAME>
  namespace: intel-power
spec:
  timeZone: "Eire"
  schedule:
    - time: "14:56"
      # this sets the profile for the shared pool
      powerProfile: balance-power
      # this transitions exclusive pods matching a given label from one profile to another
      # please ensure that only pods to be used by power manager have this label
      pods:
        - labels:
            matchLabels:
              power: "true"
          target: balance-performance
        - labels:
            matchLabels:
              special: "false"
          target: balance-performance
      # this field simply takes a cstate spec
      cState:
        sharedPoolCStates:
          C1: false
          C6: true
    - time: "23:57"
      powerProfile: shared
      cState:
        sharedPoolCStates:
          C1: true
          C6: false
      pods:
        - labels:
            matchLabels:
              power: "true"
          target: performance
        - labels:
            matchLabels:
              special: "false"
          target: balance-power
    - time: "14:35"
      powerProfile: balance-power
  reservedCPUs: [ 0,1 ]
The TimeOfDay object is deployed on a per-node basis and should have the same name as the node it's deployed on. When applying changes to the shared pool, users must specify the CPUs reserved by the system. Additionally, the user must specify a timezone to schedule with.
The configuration for Time Of Day consists of a schedule list. Each item in the list consists of a time and any desired changes to the system.
The powerProfile field specifies the desired profile for the shared pool.
The pods field is used to change the profile associated with a specific pod. To change the profile of specific pods, users must provide a set of labels and profiles. When a pod matching a label is found, it will be placed in a workload that matches the requested profile.
Please note that all pods matching a provided label must be configured for use with the power manager by requesting an initial profile and dedicated cores.
Finally, the cState field accepts the spec values from a CStates configuration and applies them to the system.
Uncore frequency can be configured on a system-wide, per-package and per-die level. Die configuration takes precedence over package configuration, which in turn takes precedence over system-wide configuration.
Valid max and min uncore frequencies are determined by the hardware.
apiVersion: power.intel.com/v1
kind: Uncore
metadata:
  name: <NODE_NAME>
  namespace: intel-power
spec:
  sysMax: 2300000
  sysMin: 1300000
  dieSelector:
    - package: 0
      die: 0
      min: 1500000
      max: 2400000
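The precedence described above (die over package over system-wide) can be illustrated with a short sketch. It assumes that a package-level entry is a dieSelector item without a `die` key, which is an assumption for illustration and not confirmed by this document. `resolve_uncore` is a hypothetical helper.

```python
def resolve_uncore(package, die, sys_min, sys_max, selectors):
    """Return the (min, max) uncore frequencies that apply to one die."""
    package_match = None
    for selector in selectors:
        if selector.get("package") != package:
            continue
        if selector.get("die") == die:
            # A die-level entry is the most specific and wins immediately.
            return selector["min"], selector["max"]
        if "die" not in selector:
            package_match = selector
    if package_match is not None:
        return package_match["min"], package_match["max"]
    # Fall back to the system-wide values.
    return sys_min, sys_max
```

Using the values from the example above, die 0 of package 0 resolves to the die-level range, while any other die falls back to sysMin and sysMax.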
- PowerConfig CRD
- PowerWorkload CRD
- PowerProfile CRD
- PowerNode CRD
- C-State CRD
The Pod Controller watches for pods. When a pod comes along, the Pod Controller checks whether the pod is in the Guaranteed quality of service class (using exclusive cores, see documentation), which takes cores out of the shared pool (it is the only option in Kubernetes that can do this). It then examines the Pod to determine which PowerProfile has been requested and creates or updates the appropriate PowerWorkload.
Note: The requests and the limits must have a matching number of cores, and this is checked on a container-by-container basis. Currently the Kubernetes Power Manager only supports a single PowerProfile per Pod. If two profiles are requested in different containers, the pod will be created but the cores will not be tuned.
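The checks described above can be sketched as follows. This is a simplified illustration that assumes the Pod manifest has been loaded into a Python dictionary. The helper names are hypothetical. It only compares CPU and PowerProfile counts, so it does not cover every condition of the Guaranteed quality of service class (memory, for instance, is ignored).

```python
PROFILE_PREFIX = "power.intel.com/"


def profile_for_container(container):
    """Return the PowerProfile a container requests, or None if it is not tunable."""
    resources = container.get("resources", {})
    requests = resources.get("requests", {})
    limits = resources.get("limits", {})
    keys = [key for key in requests if key.startswith(PROFILE_PREFIX)]
    if len(keys) != 1:
        return None
    key = keys[0]
    cpu = str(requests.get("cpu", ""))
    if not cpu.isdigit():
        return None  # exclusive cores need a whole number of CPUs
    counts = (limits.get("cpu"), requests[key], limits.get(key))
    if any(str(value) != cpu for value in counts):
        return None  # requests, limits and profile count must all match
    return key[len(PROFILE_PREFIX):]


def profile_for_pod(pod):
    """Return the single profile requested by a Pod, or None if none or several."""
    profiles = {profile_for_container(c) for c in pod["spec"]["containers"]}
    profiles.discard(None)
    return profiles.pop() if len(profiles) == 1 else None
```

A Pod whose containers request two different profiles yields `None`, mirroring the note that such a Pod is created but its cores are not tuned.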
Intel Power Optimization Library
- Clone the Kubernetes Power Manager
git clone https://github.com/intel/kubernetes-power-manager
cd kubernetes-power-manager
- Set up the necessary Namespace, Service Account, and RBAC rules for the Kubernetes Power Manager:
kubectl apply -f config/rbac/namespace.yaml
kubectl apply -f config/rbac/rbac.yaml
- Generate the CRD templates, create the Custom Resource Definitions, and install the CRDs:
make
- Docker Images
Docker images can either be built locally by using the command:
make images
or pulled from Intel's public Docker Hub at:
- intel/power-operator:TAG
- intel/power-node-agent:TAG
- Applying the manager
The manager Deployment in config/manager/manager.yaml contains the following:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: controller-manager
  namespace: intel-power
  labels:
    control-plane: controller-manager
spec:
  selector:
    matchLabels:
      control-plane: controller-manager
  replicas: 1
  template:
    metadata:
      labels:
        control-plane: controller-manager
    spec:
      serviceAccountName: intel-power-operator
      containers:
        - command:
            - /manager
          args:
            - --enable-leader-election
          imagePullPolicy: IfNotPresent
          image: power-operator:v2.3.0
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop: [ "ALL" ]
          name: manager
          resources:
            limits:
              cpu: 100m
              memory: 30Mi
            requests:
              cpu: 100m
              memory: 20Mi
          volumeMounts:
            - mountPath: /sys/fs
              name: cgroup
              mountPropagation: HostToContainer
              readOnly: true
      terminationGracePeriodSeconds: 10
      volumes:
        - name: cgroup
          hostPath:
            path: /sys/fs
Apply the manager:
kubectl apply -f config/manager/manager.yaml
The controller-manager-xxxx-xxxx pod will be created.
- Power Config
The example PowerConfig in examples/example-powerconfig.yaml contains the following PowerConfig spec:
apiVersion: "power.intel.com/v1"
kind: PowerConfig
metadata:
  name: power-config
spec:
  powerNodeSelector:
    feature.node.kubernetes.io/power-node: "true"
  powerProfiles:
    - "performance"
Apply the Config:
kubectl apply -f examples/example-powerconfig.yaml
Once deployed the controller-manager pod will see it via the Config controller and create a Node Agent instance on nodes specified with the ‘feature.node.kubernetes.io/power-node: "true"’ label.
The power-node-agent DaemonSet will be created, managing the Power Node Agent Pods. The controller-manager will finally create the PowerProfiles that were requested on each Node.
- Shared Profile
The example Shared PowerProfile in examples/example-shared-profile.yaml contains the following PowerProfile spec:
apiVersion: power.intel.com/v1
kind: PowerProfile
metadata:
  name: shared
  namespace: intel-power
spec:
  name: "shared"
  max: 1000
  min: 1000
  epp: "power"
  shared: true
  governor: "powersave"
Apply the Profile:
kubectl apply -f examples/example-shared-profile.yaml
- Shared workload
The example Shared PowerWorkload in examples/example-shared-workload.yaml contains the following PowerWorkload spec:
apiVersion: power.intel.com/v1
kind: PowerWorkload
metadata:
  # Replace <NODE_NAME> with the Node associated with PowerWorkload
  name: shared-<NODE_NAME>-workload
  namespace: intel-power
spec:
  # Replace <NODE_NAME> with the Node associated with PowerWorkload
  name: "shared-<NODE_NAME>-workload"
  allCores: true
  reservedCPUs:
    # IMPORTANT: The CPUs in reservedCPUs should match the value of the reserved system CPUs in your Kubelet config file
    - cores: [0, 1]
  powerNodeSelector:
    # The label must be as below, as this workload will be specific to the Node
    kubernetes.io/hostname: <NODE_NAME>
  # Replace this value with the intended shared PowerProfile
  powerProfile: "shared"
Replace the necessary values with those that correspond to your cluster and apply the Workload:
kubectl apply -f examples/example-shared-workload.yaml
Once created the workload controller will see its creation and create the corresponding Pool. All of the cores on the system except the reservedCPUs will then be brought down to this lower frequency level. The reservedCPUs will be kept at the system default min and max frequency by default. If the user specifies a profile along with a set of reserved cores then a separate pool will be created for those cores and that profile. If an invalid profile is supplied the cores will instead be placed in the default reserved pool with system defaults. It should be noted that in most instances leaving these cores at system defaults is the best approach to prevent important k8s or kernel related processes from becoming starved.
- Performance Pod
The example Pod in examples/example-pod.yaml contains the following PodSpec:
apiVersion: v1
kind: Pod
metadata:
  name: example-power-pod
spec:
  containers:
    - name: example-power-container
      image: ubuntu
      command: [ "/bin/sh" ]
      args: [ "-c", "sleep 15000" ]
      resources:
        requests:
          memory: "200Mi"
          cpu: "2"
          # Replace <POWER_PROFILE> with the PowerProfile you wish to request
          # IMPORTANT: The number of requested PowerProfiles must match the number of requested CPUs
          # IMPORTANT: If they do not match, the Pod will be successfully scheduled, but the PowerWorkload for the Pod will not be created
          power.intel.com/<POWER_PROFILE>: "2"
        limits:
          memory: "200Mi"
          cpu: "2"
          # Replace <POWER_PROFILE> with the PowerProfile you wish to request
          # IMPORTANT: The number of requested PowerProfiles must match the number of requested CPUs
          # IMPORTANT: If they do not match, the Pod will be successfully scheduled, but the PowerWorkload for the Pod will not be created
          power.intel.com/<POWER_PROFILE>: "2"
Replace the placeholder values with the PowerProfile you require and apply the PodSpec:
kubectl apply -f examples/example-pod.yaml
At this point, if only the ‘performance’ PowerProfile was selected in the PowerConfig, the user’s cluster will contain three PowerProfiles and two PowerWorkloads:
kubectl get powerprofiles -n intel-power
NAME AGE
performance 59m
performance-<NODE_NAME> 58m
shared-<NODE_NAME> 60m
kubectl get powerworkloads -n intel-power
NAME AGE
performance-<NODE_NAME>-workload 63m
shared-<NODE_NAME>-workload 61m
- Delete Pods
kubectl delete pods <name>
When a Pod that was associated with a PowerWorkload is deleted, the cores associated with that Pod will be removed from the corresponding PowerWorkload. If that Pod was the last requesting the use of that PowerWorkload, the workload will be deleted. All cores removed from the PowerWorkload are added back to the Shared PowerWorkload for that Node and returned to the lower frequencies.