CMK operator manual

System requirements.

Kubernetes >= v1.5.0 (excluding v1.8.0, details below)

Kubernetes preparation

All template manifests provided with CMK use a service account that is defined in the cmk-serviceaccount manifest. Before running CMK for the first time, the operator should use that manifest to create cmk-serviceaccount. This step is not obligatory on Kubernetes 1.5, but it is strongly recommended. Kubernetes 1.6 requires it, because the RBAC authorization method uses the service account to grant API access from inside the CMK pod(s).

Kubernetes 1.6

From Kubernetes 1.6, RBAC has become the default authorization method. The operator needs to create an additional ClusterRole and ClusterRoleBindings in order to deploy CMK. These are provided in the cmk-rbac-rules manifest. In this case the operator must also use the provided serviceaccount manifest.

Kubernetes 1.7

From Kubernetes 1.7, Custom Resource Definitions (CRD) have replaced Third Party Resources (TPR). Kubernetes 1.7 is the only version in which both are available, so the operator must migrate from TPR to CRD. A ClusterRole and ClusterRoleBindings for CRD have been added to the cmk-rbac-rules manifest. CMK detects the Kubernetes version itself and uses Custom Resource Definitions on Kubernetes 1.7 and newer, and Third Party Resources otherwise, to create Nodereport and Reconcilereport.

Additionally, taints have moved from alpha to beta and are no longer present in the node metadata but directly in the node spec. Please note that if the pod manifest has a nodeName: <nodename> selector, taint tolerations are not needed.

Kubernetes 1.8

Kubernetes 1.8.0 is not supported due to an extended resources issue (it is impossible to create an extended resource). Use Kubernetes 1.8.1+ instead.
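
The version rules in this section can be summarised in a few lines of Python. This is only an illustrative sketch: the function names and the returned keys are hypothetical and are not part of CMK, and the thresholds are taken from the statements in this manual.

```python
import re


def parse_version(text):
    """Turn 'v1.8.1', '1.9' or '1.8.0-beta.1' into a tuple such as (1, 8, 1)."""
    match = re.match(r"v?(\d+)\.(\d+)(?:\.(\d+))?", text.strip())
    if match is None:
        raise ValueError("unrecognised Kubernetes version: %r" % text)
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)


def cmk_features(version_text):
    """Summarise the version-dependent behaviour described in this manual."""
    version = parse_version(version_text)
    return {
        # Kubernetes >= v1.5.0 is required, and v1.8.0 is excluded.
        "supported": version >= (1, 5, 0) and version != (1, 8, 0),
        # RBAC is the default authorization method from 1.6.
        "needs_rbac_rules": version >= (1, 6, 0),
        # Reports are stored as CRDs from 1.7, as TPRs before that.
        "report_kind": "CustomResourceDefinition"
        if version >= (1, 7, 0) else "ThirdPartyResource",
        # Extended Resources are used from 1.8.1, OIR before 1.8.
        "resource_kind": "ExtendedResource"
        if version >= (1, 8, 1) else "OpaqueIntegerResource",
        # The mutating admission webhook is available from 1.9.0.
        "webhook_available": version >= (1, 9, 0),
    }
```

For example, `cmk_features("v1.8.0")["supported"]` evaluates to `False`, while `cmk_features("1.8.1")["supported"]` evaluates to `True`.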

Kubernetes 1.9

From Kubernetes 1.9.0, a mutating admission controller is used to update any pod whose definition contains a container requesting CMK Extended Resources. The CMK webhook modifies the pod by injecting the environment variable CMK_NUM_CORES, with its value set to the number of cores specified in the Extended Resource request. This allows cmk isolate to assign multiple CPU cores to a given process. On top of that, the webhook applies additional changes to the pod, which are defined in the configuration file. By default, the configuration deployed during cmk cluster-init adds the CMK installation and host /proc filesystem volumes, the CMK service account, and the tolerations required for a pod to be scheduled on a CMK-enabled node, and annotates the pod appropriately. Container specifications are updated with volume mounts (referencing the volumes added to the pod) and the environment variable CMK_PROC_FS.
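
To make the effect of the default mutations concrete, the following Python sketch applies them to a pod manifest held as a plain dictionary. It is not the webhook's implementation: `mutate_pod` is a hypothetical helper, the service account and tolerations are left out for brevity, and the volume names, mount paths and annotation are the ones shown in the validation output later in this manual.

```python
import copy

ER_NAME = "cmk.intel.com/exclusive-cores"


def requested_cores(container):
    """Return the Extended Resource amount from requests or limits, if any."""
    resources = container.get("resources") or {}
    for section in ("requests", "limits"):
        value = (resources.get(section) or {}).get(ER_NAME)
        if value is not None:
            return value
    return None


def mutate_pod(pod):
    """Return a mutated copy of `pod`; the input dictionary is not changed."""
    pod = copy.deepcopy(pod)
    spec = pod.get("spec", {})
    touched = False
    for container in spec.get("containers", []):
        cores = requested_cores(container)
        if cores is None:
            continue  # this container does not request CMK cores
        touched = True
        env = container.get("env") or []
        env.append({"name": "CMK_PROC_FS", "value": "/host/proc"})
        env.append({"name": "CMK_NUM_CORES", "value": str(cores)})
        container["env"] = env
        mounts = container.get("volumeMounts") or []
        mounts.append({"name": "cmk-host-proc", "mountPath": "/host/proc",
                       "readOnly": True})
        mounts.append({"name": "cmk-install-dir", "mountPath": "/opt/bin"})
        container["volumeMounts"] = mounts
    if touched:
        volumes = spec.get("volumes") or []
        volumes.append({"name": "cmk-host-proc",
                        "hostPath": {"path": "/proc"}})
        volumes.append({"name": "cmk-install-dir",
                        "hostPath": {"path": "/opt/bin"}})
        spec["volumes"] = volumes
        annotations = pod.setdefault("metadata", {}).setdefault(
            "annotations", {})
        annotations["cmk.intel.com/resources-injected"] = "true"
    return pod
```

A pod whose containers do not request the Extended Resource is returned unchanged, which mirrors the statement that only pods requesting CMK Extended Resources are updated.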

The mutating admission controller is set up by default using mutual TLS, where the webhook service looks to authenticate the Kubernetes API server as well. This requires that the Kubernetes API server be set up to pass webhooks a specified certificate and key. By default the webhook looks to authenticate the certificate it gets passed with the CA file that the Kubernetes API server passes in to each pod when they are created. You can pass in the CA file location you want to use when running the webhook by using the --cafile argument. You can also set the argument --insecure to True and the webhook service will revert back to regular TLS. To set up the Kubernetes API server to pass webhook services certificates and keys, do the following:

When starting the Kubernetes API server, set the `--admission-control-config-file` 
to the location of your admission control configuration file, for example 
/var/lib/kubernetes/cmk_config.yaml.

In the admission control configuration file, specify where the WebhookAdmissionConfiguration
controller should read the credentials, which are stored in a kubeConfig file. This kubeConfig 
file contains the certificate and key data, base64 encoded, that the webhook service will 
use. This certificate should be the one used by your Kubernetes cluster or admin, as it 
needs to be validated against the Kubernetes CA.

The official Kubernetes documentation for setting up the Kubernetes API server to send webhook services certificates can be found here: https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/#authenticate-apiservers

Setting up the cluster.

This section describes the setup required to use the CMK software. For background on the RBAC role bindings mentioned in the Kubernetes preparation section, see https://kubernetes.io/docs/admin/authorization/rbac/#rolebinding-and-clusterrolebinding

Notes:

TL;DR

Prepare the nodes by running cmk cluster-init using these instructions.


Concepts

| Term | Meaning |
| --- | --- |
| CMK nodes | The operator can choose any number of nodes in the Kubernetes cluster to work with CMK. These participating nodes are referred to as CMK nodes. |
| Pod | A Pod is an abstraction in Kubernetes that represents one or more containers and their configuration. It is the smallest schedulable unit in Kubernetes. |
| OIR | Acronym for Opaque Integer Resource. In Kubernetes, OIRs allow cluster operators to advertise new node-level resources that would otherwise be unknown to the system. |
| Volume | A volume is a directory (on the host file system). In Kubernetes, a volume has the same lifetime as the Pod that uses it. Many types of volumes are supported in Kubernetes. |
| hostPath | hostPath is a volume type in Kubernetes. It mounts a file or directory from the host file system into the Pod. |

Prepare CMK nodes by running cmk cluster-init.

CMK nodes can be prepared by using cmk cluster-init subcommand. The subcommand is expected to be run as a pod. The cmk-cluster-init-pod template can be used to run cmk cluster-init on a Kubernetes cluster. When run on a Kubernetes cluster, the Pod spawns two Pods per node at most in order to prepare each node.

The only value that requires change in the cmk-cluster-init-pod template is the args field, which can be modified to pass different options.

Following are some example modifications to the args field:

  - args:
      # Change this value to pass different options to cluster-init.
      - "/cmk/cmk.py cluster-init --host-list=node1,node2,node3"

The above command prepares nodes "node1", "node2" and "node3" for the CMK software using default options.

  - args:
      # Change this value to pass different options to cluster-init.
      - "/cmk/cmk.py cluster-init --all-hosts"

The above command prepares all the nodes in the Kubernetes cluster for the CMK software using default options.

  - args:
      # Change this value to pass different options to cluster-init.
      - "/cmk/cmk.py cluster-init --host-list=node1,node2,node3 --cmk-cmd-list=init,discover"

The above command prepares nodes "node1", "node2" and "node3" but only runs the cmk init and cmk discover subcommands on each of those nodes.

  - args:
      # Change this value to pass different options to cluster-init.
      - "/cmk/cmk.py cluster-init --host-list=node1,node2,node3 --num-exclusive-cores=3 --num-shared-cores=1 --excl-non-isolcpus=11-15"

The above command prepares nodes "node1", "node2" and "node3" to have 3 cores placed in the exclusive pool, 1 core placed in the shared pool, and the cores 11-15 placed in the exclusive-non-isolcpus pool. The exclusive-non-isolcpus pool will isolate pods from other pods in the cluster, but will not use cores that are governed by isolcpus.
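
When the args value is generated rather than edited by hand, a small helper can assemble it. The Python sketch below is hypothetical (`cluster_init_command` is not part of CMK) and only uses the flags shown in the examples above. Treating --host-list and --all-hosts as mutually exclusive is an assumption made for this sketch.

```python
def cluster_init_command(hosts=None, all_hosts=False, cmk_cmd_list=None,
                         num_exclusive_cores=None, num_shared_cores=None,
                         excl_non_isolcpus=None):
    """Build the string placed in the args field of cmk-cluster-init-pod."""
    # Assumption: exactly one of --host-list and --all-hosts is given.
    if bool(hosts) == bool(all_hosts):
        raise ValueError("pass either hosts or all_hosts=True, but not both")
    parts = ["/cmk/cmk.py", "cluster-init"]
    if all_hosts:
        parts.append("--all-hosts")
    else:
        parts.append("--host-list=" + ",".join(hosts))
    optional = [
        ("--cmk-cmd-list", ",".join(cmk_cmd_list) if cmk_cmd_list else None),
        ("--num-exclusive-cores", num_exclusive_cores),
        ("--num-shared-cores", num_shared_cores),
        ("--excl-non-isolcpus", excl_non_isolcpus),
    ]
    for flag, value in optional:
        if value is not None:  # omitted options keep their defaults
            parts.append("%s=%s" % (flag, value))
    return " ".join(parts)
```

Calling `cluster_init_command(hosts=["node1", "node2", "node3"], num_exclusive_cores=3, num_shared_cores=1, excl_non_isolcpus="11-15")` reproduces the last example above.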

For more details on the options provided by cmk cluster-init, see this description.

Prepare CMK nodes by running each CMK subcommand as a Pod.

Notes:

  • The instructions provided in this section should be used only if running cmk cluster-init fails for some reason.
  • The subcommands described below should be run in the order in which they are listed.
  • The documentation in this section assumes that the cmk binary is installed on the host under /opt/bin.
  • In all the pod templates used in this section, the name of the container image used is cmk:v1.5.2. It is expected that the cmk container image is built and cached locally on the host. The image field will require modification if the container image is hosted remotely (e.g., in https://hub.docker.com/).

Run cmk init

The CMK nodes in the Kubernetes cluster must be initialized with cmk init before they can be used with the CMK software. To initialize the CMK nodes, the cmk-init-pod template can be used.

cmk init takes the --num-exclusive-cores and the --num-shared-cores flags. In the cmk-init-pod template, the values to these flags can be modified. The value for --num-exclusive-cores and --num-shared-cores can be set by changing the values for the NUM_EXCLUSIVE_CORES and NUM_SHARED_CORES environment variables, respectively.

Values that might require modification in the cmk-init-pod template are shown as snippets below:

    env:
    - name: NUM_EXCLUSIVE_CORES
      # Change this to modify the value passed to `--num-exclusive-cores` flag.
      value: '4'
    - name: NUM_SHARED_CORES
      # Change this to modify the value passed to `--num-shared-cores` flag.
      value: '1'

Advertising CMK Opaque Integer Resource (OIR) slots

All the CMK nodes in the Kubernetes cluster should be patched with CMK OIR slots using cmk discover. The OIR slots are advertised because the core lists in the exclusive pool need to be allocated exclusively. The number of slots advertised should be equal to the number of CPU lists under the exclusive pool, as determined by examining the CMK configuration configmap. The cmk-discover-pod template can be used to advertise the CMK OIR slots.

After running this Pod on a node, the node will be patched with the `pod.alpha.kubernetes.io/opaque-int-resource-cmk` OIR.
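
The slot count can be derived from the pool configuration as in the sketch below. It assumes the configuration has already been parsed into nested dictionaries of the form pool, then socket, then CPU list, then task list, which is the layout shown in the Dynamic Pool Reconfiguration section. `exclusive_slots` is a hypothetical helper, not a CMK function.

```python
def exclusive_slots(config):
    """Count the CPU lists under the exclusive pool across all sockets."""
    exclusive = config.get("exclusive", {})
    return sum(len(cpu_lists) for cpu_lists in exclusive.values())


# Example configuration with four CPU lists on socket 0 and none on socket 1.
example_config = {
    "exclusive": {
        0: {"3,11": [], "4,12": ["3001"], "5,13": [], "6,14": []},
        1: {},
    },
    "infra": {0: {"0-2,8-10": []}, 1: {}},
    "shared": {0: {"7,15": []}, 1: {}},
}
slots = exclusive_slots(example_config)
```

With this example configuration four slots would be advertised, one per exclusive CPU list, regardless of whether a list already has a task assigned.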

Run cmk reconcile

In order to reconcile from an outdated CMK configuration state, each CMK node should run cmk reconcile periodically. cmk reconcile can be run periodically using the cmk-reconcile-daemonset template.

In the cmk-reconcile-daemonset template, the time between each invocation of cmk reconcile can be adjusted by changing the value of the CMK_RECONCILE_SLEEP_TIME environment variable. The value specifies time in seconds.

Values that might require modification in the cmk-reconcile-daemonset template are shown as snippets below:

    env:
    - name: CMK_RECONCILE_SLEEP_TIME
      # Change this to modify the sleep interval between consecutive
      # cmk reconcile runs. The value is specified in seconds.
      value: '60'
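
The periodic behaviour amounts to a simple run-then-sleep loop. The Python sketch below illustrates it under stated assumptions: both function names are hypothetical, the default of 60 seconds simply mirrors the template snippet above, and the actual reconcile call is replaced by a caller-supplied function.

```python
import os
import time


def reconcile_sleep_time(environ=None, default=60):
    """Read CMK_RECONCILE_SLEEP_TIME (seconds) from the environment."""
    environ = os.environ if environ is None else environ
    raw = environ.get("CMK_RECONCILE_SLEEP_TIME")
    if raw is None:
        return default
    seconds = int(raw)
    if seconds <= 0:
        raise ValueError("CMK_RECONCILE_SLEEP_TIME must be positive")
    return seconds


def run_periodically(action, interval, iterations=None, sleep=time.sleep):
    """Call `action`, wait `interval` seconds, and repeat.

    `iterations=None` loops forever; a number bounds the loop, which is
    convenient for testing. `sleep` can be replaced by a stub.
    """
    count = 0
    while iterations is None or count < iterations:
        action()
        count += 1
        sleep(interval)
```

Passing a stub for `sleep` makes it possible to exercise the loop without waiting.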

Run cmk install

cmk install is used to create a zero-dependency binary of the CMK software and place it on the host filesystem. Subsequent containers can isolate themselves by mounting the install directory from the host and then calling cmk isolate. To run it on all the CMK nodes, the cmk-install-pod template can be used.

cmk install takes the --install-dir flag. In the cmk-install-pod template, the value for --install-dir can be configured by changing the path value of the hostPath for the cmk-install-dir.

Values that might require modification in the cmk-install-pod template are shown as snippets below:

  volumes:
  - hostPath:
      # Change this to modify the CMK installation dir in the host file system.
      path: "/opt/bin"
    name: cmk-install-dir

Run cmk webhook (Kubernetes v1.9.0+ only)

cmk webhook is used to run the mutating admission webhook server. Whenever there is a request to create a new pod, the webhook can capture that request, check whether any of the containers requests or limits a number of CMK Extended Resources, and update the pod and its container specifications accordingly. This simplifies the deployment of workloads that take advantage of CMK by reducing the number of requirements to a minimum: the pod only has to request the Extended Resource, as in the following fragment.

...
spec:
  containers:
  - resources:
      requests:
        cmk.intel.com/exclusive-cores: 2
...

In order to deploy the CMK mutating webhook, a number of resources need to be created on the cluster. Before that, the operator needs to generate an X509 private key and a TLS certificate in PEM format. Certificates can be self-signed, although using certificates signed by a proper CA or by the Kubernetes Certificates API is highly recommended. After meeting that requirement, the steps to deploy the webhook are as follows:

  1. Encode the certificates in PEM format to Base64 and place them in the Mutating Admission Configuration and Secret templates.
  2. Update the config map template. The config map contains two configuration files: server.yaml and mutations.yaml. Configuration options are described in the cmk command-line tool documentation.
  3. Create the secret, service and config map using the kubectl create -f ... command.
  4. Run the cmk webhook pod defined in the webhook pod template using the kubectl create -f ... command.
  5. If the cmk webhook pod is running correctly, create the Mutating Admission Configuration object.
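
Step 1 can be done with any Base64 tool. As a minimal Python sketch, the helper below encodes PEM data into the single-line form expected in the templates. `encode_pem` is a hypothetical name, and the PEM content in the example is a placeholder, not a real certificate.

```python
import base64


def encode_pem(pem_bytes):
    """Return the Base64 text (no line breaks) for a PEM-encoded blob."""
    return base64.b64encode(pem_bytes).decode("ascii")


# Placeholder content; in practice read the bytes from the PEM file, e.g.
#   with open(path_to_pem, "rb") as handle: pem = handle.read()
pem = b"-----BEGIN CERTIFICATE-----\nplaceholder\n-----END CERTIFICATE-----\n"
encoded = encode_pem(pem)
```

Decoding the result with `base64.b64decode` should return the original PEM bytes, which is a quick way to check the value before pasting it into a template.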

Multi socket support (experimental)

CMK is able to use multiple sockets. During cluster initialization, the init module distributes cores from all sockets across the pools. To prevent a situation where the exclusive pool or the shared pool is created on a single socket only, the operator can use one of two mode policies: packed and spread. These policies define how cores are assigned to a specific pool:

  • packed mode places cores in the following order:

CMK packed mode

Note: This policy is not topology aware, so a pool may end up on a single socket instead of being spread across multiple sockets.

  • spread mode places cores in the following order:

CMK spread mode

Note: This policy is topology aware, so CMK will try to spread pools across the sockets.

To select a mode, the operator passes the --shared-mode or --exclusive-mode parameter during initialization. These parameters can be used with cluster-init and init. If the operator uses two different modes, the policies are mixed. In that case the exclusive pool is resolved before the shared pool.
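
The difference between the two policies can be illustrated with a small Python sketch. This is not CMK's algorithm. It assumes that packed takes cores socket by socket, that spread takes cores round-robin across sockets, and that cores left over go to the infra pool. All function names are hypothetical.

```python
from itertools import chain, zip_longest


def packed_order(sockets):
    """All cores of the first socket, then the next socket, and so on."""
    return list(chain.from_iterable(sockets[s] for s in sorted(sockets)))


def spread_order(sockets):
    """One core from each socket in turn (round-robin)."""
    columns = zip_longest(*(sockets[s] for s in sorted(sockets)))
    return [core for column in columns for core in column
            if core is not None]


def assign_pools(sockets, num_exclusive, num_shared,
                 exclusive_mode="packed", shared_mode="packed"):
    """Resolve the exclusive pool first, then the shared pool."""
    order = {"packed": packed_order, "spread": spread_order}
    remaining = {s: list(cores) for s, cores in sockets.items()}

    def take(count, mode):
        picked = order[mode](remaining)[:count]
        for cores in remaining.values():
            cores[:] = [c for c in cores if c not in picked]
        return picked

    exclusive = take(num_exclusive, exclusive_mode)
    shared = take(num_shared, shared_mode)
    return {"exclusive": exclusive, "shared": shared,
            "infra": packed_order(remaining)}


pools = assign_pools({0: [0, 1, 2, 3], 1: [4, 5, 6, 7]}, 2, 2,
                     exclusive_mode="spread", shared_mode="packed")
```

In this example the exclusive pool receives one core from each socket, while the shared pool, resolved afterwards in packed mode, receives two cores from the first socket.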

Power Management Capabilities

CMK supports some power management capabilities on the latest Xeon processors, one of which is Speed Select Technology - Base Frequency (SST-BF). CMK discovers SST-BF configured nodes through the use of node labels, discovers the SST-BF configured cores and ensures these cores are placed in the exclusive pool. This enables users to run their containerized workloads on these special cores and get guaranteed performance.

  • More information on SST-BF can be found here
  • More information on configuring a Kubernetes cluster to take advantage of these Power Management capabilities can be found here

SST-CP

To utilize SST-CP cores with CMK, the cores need to be set up before CMK is initialised. More information about setting up the cores can be found here. The SST-CP capable node must also be labeled correctly.

The node gets labeled correctly using Node Feature Discovery, which will use a script provided in the CMK Github repository (located at resources/scripts/sst-cp.sh) to determine whether the node is configured to use SST-CP. This file needs to be moved to the correct place so NFD can find it.

After NFD has been set up in your Kubernetes cluster, the folders /etc/kubernetes/node-feature-discovery/source.d/ and /etc/kubernetes/node-feature-discovery/features.d/ should have been created. To move this SST-CP discovery script to the correct location, move into the directory where you cloned the CMK repository. Then copy the file:

cp resources/scripts/sst-cp.sh /etc/kubernetes/node-feature-discovery/source.d/

NFD will look in this location and execute the script, labeling the node if SST-CP is correctly configured. Then simply initialise CMK with the recommended script, providing the correct number of cores for the exclusive and shared pools, and the correct cores will be placed in the correct pools.

Running the cmk isolate Hello World Pod

After following the instructions in the previous section, the cluster is ready to run the Hello World Pod. The Hello World cmk-isolate-pod template describes a simple Pod with three containers requesting CPUs from the exclusive, shared and the infra pools, respectively, using cmk isolate. The pool is requested by passing the desired value to the --pool flag when using cmk isolate as described in the documentation.

cmk isolate accepts the --socket-id flag to select the socket on which the application should be spawned. This flag is optional and applies only to the exclusive pool. If it is not used, cmk isolate uses the first core that is not reserved.

cmk isolate also takes the --install-dir flag. In the cmk-isolate-pod template, the value for --install-dir can be modified by changing the path value of the hostPath.

Values that might require modification in the cmk-isolate-pod template are shown as snippets below:

  volumes:
  - hostPath:
      # Change this to modify the CMK installation dir in the host file system.
      path: "/opt/bin"
    name: cmk-install-dir

Notes:

  • The Hello World cmk-isolate-pod consumes the pod.alpha.kubernetes.io/opaque-int-resource-cmk Opaque Integer Resource (OIR) only in the container isolated using the exclusive pool. The CMK software assumes that only containers isolated using the exclusive pool request the OIR, and that each of these containers consumes exactly one OIR. This restricts the number of pods that can land on a Kubernetes node to the expected value.
  • The cmk isolate Hello World Pod should only be run after following the instructions provided in the Setting up the cluster section.

Validating the environment

Following is an example to validate the environment in one node.

  • Pick a node to test. For illustration, we will use <node-name> as the name of the node.
  • Check if node has appropriate label.
kubectl get node <node-name> -o json | jq .metadata.labels

Example output:

kubectl get node cmk-02-zzwt7w -o json | jq .metadata.labels
{
    "beta.kubernetes.io/arch": "amd64",
    "beta.kubernetes.io/os": "linux",
    "cmk.intel.com/cmk-node": "true",
    "kubernetes.io/hostname": "cmk-02-zzwt7w"
}
  • Check if node has appropriate taint. (kubernetes < v1.7)
kubectl get node <node-name> -o json | jq .metadata.annotations

Example output:

kubectl get node cmk-02-zzwt7w -o json | jq .metadata.annotations
{
      "scheduler.alpha.kubernetes.io/taints": "[{\"value\": \"true\", \"key\": \"cmk\", \"effect\": \"NoSchedule\"}]",
      "volumes.kubernetes.io/controller-managed-attach-detach": "true"
}
  • Check if node has appropriate taint. (kubernetes >= v1.7)
kubectl get node <node-name> -o json | jq .spec.taints

Example output:

kubectl get node cmk-02-zzwt7w -o json | jq .spec.taints
[
  {
    "effect": "NoSchedule",
    "key": "cmk",
    "timeAdded": null,
    "value": "true"
  }
]
  • Check if node has the appropriate OIR. (kubernetes < v1.8)
kubectl get node <node-name> -o json | jq .status.capacity

Example output:

kubectl get node cmk-02-zzwt7w -o json | jq .status.capacity
{
    "alpha.kubernetes.io/nvidia-gpu": "0",
    "cpu": "16",
    "memory": "14778328Ki",
    "pod.alpha.kubernetes.io/opaque-int-resource-cmk": "4",
    "pods": "110"
}
  • Check if node has the appropriate ER. (kubernetes >= v1.8.1)
kubectl get node <node-name> -o json | jq .status.capacity

Example output:

kubectl get node cmk-02-zzwt7w -o json | jq .status.capacity
{
    "alpha.kubernetes.io/nvidia-gpu": "0",
    "cpu": "16",
    "memory": "14778328Ki",
    "cmk.intel.com/exclusive-cores": "4",
    "pods": "110"
}
  • Log in to the node and check that the CMK configuration directory and binary exist. Assuming default options were used for cmk cluster-init, you would do the following:
ls /opt/bin/
  • Replace the nodeName in the Pod manifest below with the chosen node name and save it to a file.
apiVersion: v1
kind: Pod
metadata:
  labels:
    app: cmk-isolate-pod
  name: cmk-isolate-pod
spec:
  # Change this to the <node-name> you want to test.
  nodeName: NODENAME
  containers:
  - args:
    - "/opt/bin/cmk isolate --pool=infra sleep -- 10000"
    command:
    - "/bin/bash"
    - "-c"
    env:
    - name: CMK_PROC_FS
      value: "/host/proc"
    image: cmk:v1.5.2
    imagePullPolicy: "Never"
    name: cmk-isolate-infra
    volumeMounts:
    - mountPath: "/host/proc"
      name: host-proc
      readOnly: true
    - mountPath: "/opt/bin"
      name: cmk-install-dir
  restartPolicy: Never
  volumes:
  - hostPath:
      # Change this to modify the CMK installation dir in the host file system.
      path: "/opt/bin"
    name: cmk-install-dir
  - hostPath:
      path: "/proc"
    name: host-proc
  • Run kubectl create -f <file-name>, where <file-name> is name of the Pod manifest file with nodeName field substituted as mentioned in the previous step.
  • Check if any process is isolated in the infra pool using the NodeReport for that node.
    If you are using third party resources (Kubernetes 1.6.x and older versions):
kubectl get NodeReport <node-name> -o json | jq .report.description.pools.infra
    If you are using custom resource definitions (Kubernetes 1.7.x and newer versions):
kubectl get cmk-nodereport <node-name> -o json | jq .spec.report.description.pools.infra
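
The label, taint and capacity checks above can also be scripted. The Python sketch below inspects the JSON document returned by kubectl get node <node-name> -o json. It assumes the Kubernetes >= 1.8.1 layout (taints in the node spec, Extended Resource in the capacity), and `check_node` is a hypothetical helper, not part of CMK.

```python
import json


def check_node(node):
    """Return a dict of pass/fail results for a node JSON document."""
    labels = node.get("metadata", {}).get("labels", {})
    taints = node.get("spec", {}).get("taints") or []
    capacity = node.get("status", {}).get("capacity", {})
    return {
        "label": labels.get("cmk.intel.com/cmk-node") == "true",
        "taint": any(t.get("key") == "cmk"
                     and t.get("effect") == "NoSchedule" for t in taints),
        "exclusive_cores":
            int(capacity.get("cmk.intel.com/exclusive-cores", 0)) > 0,
    }


# Sample data assembled from the example outputs in this section.
sample = json.loads("""
{
  "metadata": {"labels": {"cmk.intel.com/cmk-node": "true"}},
  "spec": {"taints": [{"effect": "NoSchedule", "key": "cmk",
                       "timeAdded": null, "value": "true"}]},
  "status": {"capacity": {"cpu": "16",
                          "cmk.intel.com/exclusive-cores": "4"}}
}
""")
result = check_node(sample)
```

In practice the input could be read with `json.load(sys.stdin)` from the piped kubectl output. Every value in the returned dictionary should be `True` on a correctly prepared node.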

Validating CMK mutating webhook (Kubernetes v1.9.0+)

  • Follow all the above steps, but use simplified Pod manifest:
apiVersion: v1
kind: Pod
metadata:
  labels:
    app: cmk-isolate-pod
  name: cmk-isolate-pod
spec:
  # Change this to the <node-name> you want to test.
  nodeName: NODENAME
  containers:
  - args:
    - "/opt/bin/cmk isolate --pool=exclusive sleep -- 10000"
    command:
    - "/bin/bash"
    - "-c"
    env:
    image: cmk:v1.5.2
    imagePullPolicy: "Never"
    name: cmk-isolate-infra
    resources:
      requests:
        cmk.intel.com/exclusive-cores: 1
  restartPolicy: Never
  • Run kubectl create -f <file-name>, where <file-name> is the name of the Pod manifest file with nodeName field substituted as mentioned in the previous section.
  • Run kubectl get pod cmk-isolate-pod -o json | jq .metadata.annotations and verify that annotation has been added:
{
  "cmk.intel.com/resources-injected": "true"
}
  • Run kubectl get pod cmk-isolate-pod -o json | jq .spec.volumes and verify that extra volumes have been injected:
[
  {
    "name": "default-token-xfd8q",
    "secret": {
      "defaultMode": 420,
      "secretName": "default-token-xfd8q"
    }
  },
  {
    "hostPath": {
      "path": "/proc",
      "type": ""
    },
    "name": "cmk-host-proc"
  },
  {
    "hostPath": {
      "path": "/opt/bin",
      "type": ""
    },
    "name": "cmk-install-dir"
  }
]
  • Run kubectl get pod cmk-isolate-pod -o json | jq .spec.containers[0].env and verify that env variables have been added to the container spec:
[
  {
    "name": "CMK_PROC_FS",
    "value": "/host/proc"
  },
  {
    "name": "CMK_NUM_CORES",
    "value": "1"
  }
]
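
The three webhook checks above can be combined into one script. The Python sketch below takes the JSON document returned by kubectl get pod cmk-isolate-pod -o json and reports what is missing. `missing_mutations` is a hypothetical helper, and the expected names are the ones shown in the example outputs above.

```python
EXPECTED_VOLUMES = {"cmk-host-proc", "cmk-install-dir"}
EXPECTED_ENV = {"CMK_PROC_FS", "CMK_NUM_CORES"}


def missing_mutations(pod):
    """List the expected webhook changes that are absent from `pod`."""
    problems = []
    annotations = pod.get("metadata", {}).get("annotations") or {}
    if annotations.get("cmk.intel.com/resources-injected") != "true":
        problems.append("annotation cmk.intel.com/resources-injected")
    spec = pod.get("spec", {})
    volumes = {v.get("name") for v in spec.get("volumes") or []}
    for name in sorted(EXPECTED_VOLUMES - volumes):
        problems.append("volume " + name)
    containers = spec.get("containers") or [{}]
    env = {e.get("name") for e in containers[0].get("env") or []}
    for name in sorted(EXPECTED_ENV - env):
        problems.append("env " + name)
    return problems
```

An empty list means that the annotation, both volumes and both environment variables were found. Only the first container is inspected, matching the jq expression used above.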

Dynamic Pool Reconfiguration

Dynamic reconfiguration allows you to reconfigure the pool setup of the CMK nodes in your cluster without having to tear down CMK and clean up the configuration directories or configmap associated with CMK. The reconfigure command looks at every pod in every namespace on all of the CMK nodes in your cluster, but only reassigns those pods that have been assigned cores using CMK. This makes the operation considerably faster and simpler than a full teardown. It also means that you do not have to stop any running processes in order to reconfigure, as this method automatically reassigns them to the new cores in the new configuration. For example, consider the following CMK pool configuration, shown before and after a reconfiguration:

   data:
     config: |
       exclusive:
         0:
           3,11: []
           4,12:
           - '3001'
           5,13: []
           6,14: []
         1: {}
       infra:
         0:
           0-2,8-10: []
         1: {}
       shared:
         0:
           7,15:
           - '2000, 2001'
         1: {}

Reconfiguring to a smaller exclusive pool and a larger shared pool results in the following configuration:

   data:
     config: |
       exclusive:
         0:
           3,11: []
           4,12:
           - '3001'
         1: {}
       infra:
         0:
           0-2,8-10: []
         1: {}
       shared:
         0:
           6,14,7,15:
           - '2000, 2001'
         1: {}

The processes 2000 and 2001 in the shared pool will have their cpu affinity changed from the original ["7,15"] to the updated ["6,14,7,15"] when the reconfiguration has completed. In the case of the exclusive pool, you can see that the process 3001 remained in the Core List 4,12 instead of being reassigned the Core List 3,11. This is so there is no unnecessary interruption to the process running on those cores because they will be high-priority processes that require low latency and zero interrupts. If the Core List that a process is running in is not available in the updated configuration (for example if only one exclusive pool was requested in the new setup, meaning only Core List 3,11 would be assigned), then of course the exclusive process will have to be reassigned to the new Core List.
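
The placement rule for exclusive processes described above can be sketched in Python as follows. This is an illustration rather than CMK's code: `plan_exclusive_pool` is a hypothetical helper, the pool is represented as a mapping from Core List to the process IDs running on it, and only a single socket is considered.

```python
def plan_exclusive_pool(current, new_core_lists):
    """Map each Core List of the new configuration to its processes.

    Processes stay on their Core List when it still exists; otherwise
    they are moved to a free Core List of the new configuration.
    """
    plan = {core_list: [] for core_list in new_core_lists}
    displaced = []
    for core_list, pids in current.items():
        if not pids:
            continue
        if core_list in plan:
            plan[core_list] = list(pids)   # no interruption needed
        else:
            displaced.append(list(pids))   # Core List no longer exists
    free = [cl for cl in new_core_lists if not plan[cl]]
    if len(displaced) > len(free):
        raise ValueError("not enough exclusive Core Lists for all processes")
    for core_list, pids in zip(free, displaced):
        plan[core_list] = pids
    return plan


current = {"3,11": [], "4,12": ["3001"], "5,13": [], "6,14": []}
kept = plan_exclusive_pool(current, ["3,11", "4,12"])
moved = plan_exclusive_pool(current, ["3,11"])
```

In the first call process 3001 stays on Core List 4,12. In the second call, where only Core List 3,11 remains, it is reassigned there.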

How Do You Reconfigure with CMK?

To use this reconfigure method, run a pod that uses the reconfigure_setup option in cmk.py. The reconfigure option accepts the following parameters: num-exclusive-cores, num-shared-cores, excl-non-isolcpus, exclusive-mode, shared-mode, cmk-img, cmk-img-pol, install-dir, saname, namespace

An example PodSpec is provided in the resources/pods folder of the repository. An example command would look like the following:

	"/opt/bin/cmk isolate --pool=infra /opt/bin/cmk -- reconfigure_setup --num-exclusive-cores=2 --num-shared-cores=2 --namespace=cmk-namespace"

The parameters that are not listed in this example take their default value, which can be seen by running the cmk --help command.

What happens if there aren't enough cores to house all of the processes in the current configuration?

This scenario will happen when, for example, your CMK configuration has three cores assigned to the exclusive pool, all of which have a process running on them, and you try to reconfigure CMK to have only two cores assigned to the exclusive pool. The reconfigure command will recognise that one of the processes will not be able to get reassigned to an exclusive core and fail out of the operation before any changes have been made to the configuration files.
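
The fail-before-change behaviour can be pictured as a check that runs before anything is written. The Python sketch below is hypothetical (neither `ReconfigureError` nor `check_exclusive_capacity` is a CMK name) and represents the exclusive pool as a mapping from Core List to process IDs.

```python
class ReconfigureError(Exception):
    """Raised when the requested configuration cannot hold all processes."""


def check_exclusive_capacity(exclusive_pool, requested_core_lists):
    """Verify that every busy Core List can be kept or replaced.

    Returns the number of busy Core Lists. Nothing is modified, so a
    failure leaves the existing configuration untouched.
    """
    busy = [cl for cl, pids in exclusive_pool.items() if pids]
    if len(busy) > requested_core_lists:
        raise ReconfigureError(
            "%d exclusive processes are running but only %d Core Lists "
            "were requested" % (len(busy), requested_core_lists))
    return len(busy)


pool = {"3,11": ["3001"], "4,12": ["3002"], "5,13": ["3003"]}
```

With this pool, requesting three exclusive Core Lists passes the check, while requesting two raises `ReconfigureError` before any configuration file would be touched.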

The reconfigure operation automatically detects which nodes in your cluster are CMK nodes and reconfigures all of them without you having to specify them. It detects them by looking for the following label on the node: "cmk.intel.com/cmk-node" == "true". This label is added by the discover operation, which runs as part of cluster-init, so you do not have to add the label yourself.

Troubleshooting and recovery

If running cmk cluster-init using the cmk-cluster-init-pod template ends up in an error, the recommended way to start troubleshooting is to look at the logs using kubectl logs POD_NAME [CONTAINER_NAME] -f.

For example, assuming you ran the cmk-cluster-init-pod template with default options, it should create two pods on each node named cmk-init-install-discover-pod-<node-name> and cmk-reconcile-nodereport-<node-name>, where <node-name> should be replaced with the name of the node.

If you want to look at the logs from the container which ran the discover subcommand in the pod, you can use kubectl logs -f cmk-init-install-discover-pod-<node-name> discover

If you want to look at the logs from the container which ran the reconcile subcommand in the pod, you can use kubectl logs -f cmk-reconcile-nodereport-pod-<node-name> reconcile

If you want to remove CMK, use cmk-uninstall-pod.yaml. A nodeSelector can be used to restrict the removal to a specific node.