Support GPU on eksctl created AWS instances
- Document how to set up GPUs on AWS
- Temporarily add a GPU profile to the uwhackweeks
  hub, until we set up an account for the snowex hackweek

Ref 2i2c-org#1309
yuvipanda committed May 17, 2022
1 parent 42c6ec3 commit 5cb2abc
Showing 12 changed files with 263 additions and 2 deletions.
12 changes: 12 additions & 0 deletions config/clusters/uwhackweeks/common.values.yaml
@@ -96,6 +96,18 @@ basehub:
            mem_guarantee: 115G
            node_selector:
              node.kubernetes.io/instance-type: m5.8xlarge
        - display_name: "Large + GPU: p2.xlarge"
          description: "~4CPUs, 60G RAM, 1 NVIDIA K80 GPU"
          kubespawner_override:
            mem_limit: null
            mem_guarantee: 55G
            image: "pangeo/ml-notebook:master"
            environment:
              NVIDIA_DRIVER_CAPABILITIES: compute,utility
            extra_resource_limits:
              nvidia.com/gpu: "1"
            node_selector:
              node.kubernetes.io/instance-type: p2.xlarge
    scheduling:
      userPlaceholder:
        enabled: false
4 changes: 4 additions & 0 deletions config/clusters/uwhackweeks/support.values.yaml
@@ -1,6 +1,10 @@
prometheusIngressAuthSecret:
  enabled: true

nvidiaDevicePlugin:
  aws:
    enabled: true

prometheus:
  server:
    ingress:
138 changes: 138 additions & 0 deletions docs/howto/features/gpu.md
@@ -0,0 +1,138 @@
(howto:features:gpu)=
# Enable access to GPUs

GPUs are heavily used in machine learning workflows, and we support
GPUs on all major cloud providers.

## Setting up GPU nodes

### AWS

#### Requesting Quota Increase

On AWS, GPUs are provisioned by using P series nodes. Before they
can be accessed, you need to ask AWS for an increased quota of P
series nodes.

1. Log in to the AWS management console of the account the cluster is
in.
2. Make sure you are in the same region the cluster is in, by checking the
region selector on the top right.
3. Open the [EC2 Service Quotas](https://us-west-2.console.aws.amazon.com/servicequotas/home/services/ec2/quotas)
page.
4. Select the 'Running On-Demand P Instances' quota.
5. Select 'Request Quota Increase'.
6. Input the *number of vCPUs* needed. This translates into a total
number of GPU nodes based on how many vCPUs each node type has.
For example, if we are using [P2 nodes](https://aws.amazon.com/ec2/instance-types/p2/)
with NVIDIA K80 GPUs, each `p2.xlarge` node gives us 1 GPU and
4 vCPUs, so a quota of 8 vCPUs will allow us to spawn 2 GPU nodes.
We can fine-tune this calculation later, but for now, the
recommendation is to give each user a `p2.xlarge`, so the number
of vCPUs requested should be `4 * (max number of GPU nodes)`.
7. Ask for the increase, and wait. This can take *several working days*.

#### Set up a GPU nodegroup with eksctl

We use `eksctl` with `jsonnet` to provision our Kubernetes clusters on
AWS, and we can configure a node group there to provide GPUs.

1. In the `notebookNodes` definition in the appropriate `.jsonnet` file,
add a node definition for the appropriate GPU node type:


```
{
    instanceType: "p2.xlarge",
    tags+: {
        "k8s.io/cluster-autoscaler/node-template/resources/nvidia.com/gpu": "1"
    },
}
```

`p2.xlarge` gives us 1 K80 GPU and 4 vCPUs. The `tags` definition
is necessary to let the cluster autoscaler know that this nodegroup has
1 GPU per node. If you're using a different machine type with
more GPUs, adjust this definition accordingly, as sketched below.
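
For example, a hypothetical definition for a `p2.8xlarge` nodegroup
(not part of this commit; a `p2.8xlarge` has 8 K80 GPUs, so the
autoscaler tag changes to match) might look like:

```
{
    instanceType: "p2.8xlarge",
    tags+: {
        // 8 GPUs per node on this hypothetical instance type
        "k8s.io/cluster-autoscaler/node-template/resources/nvidia.com/gpu": "8"
    },
}
```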

2. Render the `.jsonnet` file into a `.yaml` file that `eksctl` can use:

```bash
jsonnet <your-cluster>.jsonnet > <your-cluster>.eksctl.yaml
```

3. Create the nodegroup

```bash
eksctl create nodegroup -f <your-cluster>.eksctl.yaml --install-nvidia-plugin=false
```

The `--install-nvidia-plugin=false` flag is required until
[this bug](https://github.com/weaveworks/eksctl/issues/5277)
is fixed.

This should create the nodegroup with 0 nodes in it, and the
autoscaler should recognize this!

#### Setting up a GPU user profile

Finally, we need to give users the option of using the GPU via
a profile. This should be placed in the hub configuration:

```yaml
jupyterhub:
  singleuser:
    profileList:
      - display_name: "Large + GPU: p2.xlarge"
        description: "~4CPUs, 60G RAM, 1 NVIDIA K80 GPU"
        kubespawner_override:
          mem_limit: null
          mem_guarantee: 55G
          image: "pangeo/ml-notebook:<tag>"
          environment:
            NVIDIA_DRIVER_CAPABILITIES: compute,utility
          extra_resource_limits:
            nvidia.com/gpu: "1"
          node_selector:
            node.kubernetes.io/instance-type: p2.xlarge
```

1. If using a `daskhub`, place this under the `basehub` key.
2. The image used should have ML tools (PyTorch, CUDA, etc.)
installed. The recommendation is to use Pangeo's
[ml-notebook](https://hub.docker.com/r/pangeo/ml-notebook)
for TensorFlow and [pytorch-notebook](https://hub.docker.com/r/pangeo/pytorch-notebook)
for PyTorch. **Do not** use the `latest` or `master` tags - find
a specific tag listed for the image you want, and use that.
3. The [NVIDIA_DRIVER_CAPABILITIES](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/user-guide.html#driver-capabilities)
environment variable tells the GPU driver what kind of libraries
and tools to inject into the container. Without setting this,
GPUs cannot be accessed.
4. The `node_selector` makes sure that these user pods end up on
the appropriate nodegroup we created earlier. Change the selector
and the `mem_guarantee` if you are using a different kind of node,
as in the sketch below.
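
For example, a hypothetical variant of this profile for a `g4dn.xlarge`
node (illustrative values, not part of this commit; a `g4dn.xlarge` has
4 vCPUs, 16G RAM, and 1 NVIDIA T4 GPU) might look like:

```yaml
- display_name: "Large + GPU: g4dn.xlarge"
  description: "~4CPUs, 16G RAM, 1 NVIDIA T4 GPU"
  kubespawner_override:
    mem_limit: null
    # Leave headroom below the node's 16G for system daemons
    mem_guarantee: 14G
    image: "pangeo/ml-notebook:<tag>"
    environment:
      NVIDIA_DRIVER_CAPABILITIES: compute,utility
    extra_resource_limits:
      nvidia.com/gpu: "1"
    node_selector:
      node.kubernetes.io/instance-type: g4dn.xlarge
```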

Do a deployment with this config, and then we can test to make sure
this works!

#### Testing

1. Log in to the hub, and start a server with the GPU profile you
just set up.
2. Open a terminal, and try running `nvidia-smi`. This should produce
output indicating that a GPU is present.
3. Open a notebook, and run the following Python code to see if
TensorFlow can access the GPUs:

```python
import tensorflow as tf
tf.config.list_physical_devices('GPU')
```

This should output something like:

```
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
```
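
If the image is Pangeo's `pytorch-notebook` instead, an equivalent
check (a minimal sketch, assuming `torch` is installed in the image):

```python
import torch

# True only if the NVIDIA driver and a GPU device are visible to PyTorch
print(torch.cuda.is_available())
```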

If any of these tests fail, something is wrong, and off you go debugging :)
1 change: 1 addition & 0 deletions docs/howto/features/index.md
@@ -8,6 +8,7 @@ See the sections below for more details:
:maxdepth: 2
cloud-access
gpu
github
../customize/docs-service
../customize/configure-login-page
3 changes: 2 additions & 1 deletion docs/howto/operate/new-cluster/aws.md
@@ -21,7 +21,8 @@ eksctl for everything.
for a quick configuration process.

3. Install the latest version of [eksctl](https://eksctl.io/introduction/#installation). Mac users
can get it from homebrew with `brew install eksctl`.
can get it from homebrew with `brew install eksctl`. Make sure the version is at least 0.97 -
you can check by running `eksctl version`.

(new-cluster:aws)=
## Create a new cluster
3 changes: 3 additions & 0 deletions docs/reference/tools.md
@@ -132,6 +132,9 @@ With just one tool to download and configure, you can control multiple AWS services
`eksctl` is a simple CLI tool for creating and managing clusters on EKS - Amazon's
managed Kubernetes service for EC2. See [the `eksctl` documentation for more information](https://docs.aws.amazon.com/eks/latest/userguide/getting-started-eksctl.html).

Make sure you are using at least version 0.97. You
can check the installed version with `eksctl version`.

### kops

`kops` will not only help you create, destroy, upgrade and maintain a production-grade,
8 changes: 8 additions & 0 deletions docs/topic/features.md
@@ -4,6 +4,14 @@ This document is a concise description of various features we can
optionally enable on a given JupyterHub. Explicit instructions on how to
do so should be provided in a linked how-to document.

## GPUs

GPUs are heavily used in machine learning workflows, and we support
provisioning GPUs for users on all major platforms.

See [the associated howto guide](howto:features:gpu) for more information
on enabling this.

## Cloud Permissions

Users of our hubs often need to be granted specific cloud permissions
2 changes: 1 addition & 1 deletion eksctl/libsonnet/nodegroup.jsonnet
@@ -23,7 +23,7 @@ local makeCloudTaints(taints) = {
    'node.kubernetes.io/instance-type': if std.objectHas($, 'instanceType') then $.instanceType else $.instancesDistribution.instanceTypes[0],
  },
  taints+: {},
  tags: makeCloudLabels(self.labels) + makeCloudTaints(self.taints),
  tags+: makeCloudLabels(self.labels) + makeCloudTaints(self.taints),
  iam: {
    withAddonPolicies: {
      autoScaler: true,
6 changes: 6 additions & 0 deletions eksctl/uwhackweeks.jsonnet
@@ -20,6 +20,12 @@ local notebookNodes = [
  { instanceType: "m5.xlarge", minSize: 0 },
  { instanceType: "m5.2xlarge", minSize: 0 },
  { instanceType: "m5.8xlarge", minSize: 0 },
  {
    instanceType: "p2.xlarge", minSize: 0,
    tags+: {
      "k8s.io/cluster-autoscaler/node-template/resources/nvidia.com/gpu": "1"
    },
  },
];

// Node definitions for dask worker nodes. Config here is merged
76 changes: 76 additions & 0 deletions helm-charts/support/templates/aws-nvidia-device-plugin.yaml
@@ -0,0 +1,76 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Sourced from $ kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.11.0/nvidia-device-plugin.yml
# Could be made automatic if https://github.com/weaveworks/eksctl/issues/5277 is fixed

{{- if .Values.nvidiaDevicePlugin.aws.enabled }}
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      # This annotation is deprecated. Kept here for backward compatibility
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ""
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
        # This toleration is deprecated. Kept here for backward compatibility
        # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
        - key: CriticalAddonsOnly
          operator: Exists
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
        # Custom tolerations required for our user pods
        - effect: NoSchedule
          key: hub.jupyter.org/dedicated
          operator: Equal
          value: user
        - effect: NoSchedule
          key: hub.jupyter.org_dedicated
          operator: Equal
          value: user
      # Mark this pod as a critical add-on; when enabled, the critical add-on
      # scheduler reserves resources for critical add-on pods so that they can
      # be rescheduled after a failure.
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      priorityClassName: "system-node-critical"
      containers:
        - image: nvcr.io/nvidia/k8s-device-plugin:v0.11.0
          name: nvidia-device-plugin-ctr
          args: ["--fail-on-init-error=false"]
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop: ["ALL"]
          volumeMounts:
            - name: device-plugin
              mountPath: /var/lib/kubelet/device-plugins
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins
{{- end -}}
9 changes: 9 additions & 0 deletions helm-charts/support/values.schema.yaml
@@ -62,6 +62,7 @@ properties:
    required:
      - azure
      - gke
      - aws
    properties:
      azure:
        type: object
@@ -71,6 +72,14 @@
        properties:
          enabled:
            type: boolean
      aws:
        type: object
        additionalProperties: false
        required:
          - enabled
        properties:
          enabled:
            type: boolean
      gke:
        type: object
        additionalProperties: false
3 changes: 3 additions & 0 deletions helm-charts/support/values.yaml
@@ -129,6 +129,9 @@ nvidiaDevicePlugin:
  # For GKE specific image, defaults to false
  gke:
    enabled: false
  # For eksctl / AWS specific daemonset, defaults to false
  aws:
    enabled: false

# A placeholder as global values that can be referenced from the same location
# of any chart should be possible to provide, but aren't necessarily provided or
