diff --git a/config/clusters/uwhackweeks/common.values.yaml b/config/clusters/uwhackweeks/common.values.yaml
index 8a28f151f8..847deff5d7 100644
--- a/config/clusters/uwhackweeks/common.values.yaml
+++ b/config/clusters/uwhackweeks/common.values.yaml
@@ -96,6 +96,18 @@ basehub:
             mem_guarantee: 115G
             node_selector:
               node.kubernetes.io/instance-type: m5.8xlarge
+        - display_name: "Large + GPU: p2.xlarge"
+          description: "~4CPUs, 60G RAM, 1 NVIDIA K80 GPU"
+          kubespawner_override:
+            mem_limit: null
+            mem_guarantee: 55G
+            image: "pangeo/ml-notebook:master"
+            environment:
+              NVIDIA_DRIVER_CAPABILITIES: compute,utility
+            extra_resource_limits:
+              nvidia.com/gpu: "1"
+            node_selector:
+              node.kubernetes.io/instance-type: p2.xlarge
     scheduling:
       userPlaceholder:
         enabled: false
diff --git a/config/clusters/uwhackweeks/support.values.yaml b/config/clusters/uwhackweeks/support.values.yaml
index 2ea1760805..adc81d13fc 100644
--- a/config/clusters/uwhackweeks/support.values.yaml
+++ b/config/clusters/uwhackweeks/support.values.yaml
@@ -1,6 +1,10 @@
 prometheusIngressAuthSecret:
   enabled: true
 
+nvidiaDevicePlugin:
+  aws:
+    enabled: true
+
 prometheus:
   server:
     ingress:
diff --git a/docs/howto/features/gpu.md b/docs/howto/features/gpu.md
new file mode 100644
index 0000000000..3eb7dd96d3
--- /dev/null
+++ b/docs/howto/features/gpu.md
@@ -0,0 +1,140 @@
+(howto:features:gpu)=
+# Enable access to GPUs
+
+GPUs are heavily used in machine learning workflows, and we support
+GPUs on all major cloud providers.
+
+## Setting up GPU nodes
+
+### AWS
+
+#### Requesting Quota Increase
+
+On AWS, GPUs are provisioned by using P series nodes. Before they
+can be accessed, you need to ask AWS for an increased quota of P
+series nodes.
+
+1. Log in to the AWS management console of the account the cluster is
+   in.
+2. Make sure you are in the same region the cluster is in by checking
+   the region selector on the top right.
+3. Open the [EC2 Service Quotas](https://us-west-2.console.aws.amazon.com/servicequotas/home/services/ec2/quotas)
+   page.
+4. Select the 'Running On-Demand P Instances' quota.
+5. Select 'Request Quota Increase'.
+6. Input the *number of vCPUs* needed. This translates to a total
+   number of GPU nodes based on how many vCPUs the chosen node type has.
+   For example, if we are using [P2 nodes](https://aws.amazon.com/ec2/instance-types/p2/)
+   with NVIDIA K80 GPUs, each `p2.xlarge` node gives us 1 GPU and
+   4 vCPUs, so a quota of 8 vCPUs will allow us to spawn 2 GPU nodes.
+   We can fine-tune this calculation later, but for now the
+   recommendation is to give each user a `p2.xlarge`, so the number
+   of vCPUs requested should be `4 * max number of GPU nodes`
+   (see the CLI sketch just after this list for a way to check the
+   current quota).
+7. Ask for the increase, and wait. This can take *several working days*.
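+
+The current quota can also be inspected (and the increase requested)
+from the AWS CLI. This is a minimal sketch, assuming your credentials
+point at the right account and region; the quota code is left as a
+placeholder, so look it up with the first command before requesting:
+
+```bash
+# Find the 'Running On-Demand P instances' quota and its current value
+aws service-quotas list-service-quotas \
+  --service-code ec2 \
+  --query "Quotas[?contains(QuotaName, 'P instances')]"
+
+# Request 8 vCPUs (enough for 2 p2.xlarge nodes), using the quota code found above
+aws service-quotas request-service-quota-increase \
+  --service-code ec2 \
+  --quota-code <quota-code> \
+  --desired-value 8
+```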
+
+#### Set up a GPU nodegroup with eksctl
+
+We use `eksctl` with `jsonnet` to provision our Kubernetes clusters on
+AWS, and we can configure a nodegroup there to provide us with GPUs.
+
+1. In the `notebookNodes` definition in the appropriate `.jsonnet` file,
+   add a node definition for the GPU node type you want:
+
+   ```
+   {
+     instanceType: "p2.xlarge",
+     tags+: {
+       "k8s.io/cluster-autoscaler/node-template/resources/nvidia.com/gpu": "1"
+     },
+   }
+   ```
+
+   `p2.xlarge` gives us 1 K80 GPU and 4 vCPUs. The `tags` definition
+   is necessary to let the autoscaler know that this nodegroup has
+   1 GPU per node. If you're using a different machine type with
+   more GPUs, adjust this definition accordingly.
+
+2. Render the `.jsonnet` file into a `.yaml` file that `eksctl` can use:
+
+   ```bash
+   jsonnet <cluster-name>.jsonnet > <cluster-name>.eksctl.yaml
+   ```
+
+3. Create the nodegroup:
+
+   ```bash
+   eksctl create nodegroup -f <cluster-name>.eksctl.yaml --install-nvidia-plugin=false
+   ```
+
+   The `--install-nvidia-plugin=false` flag is required until
+   [this bug](https://github.com/weaveworks/eksctl/issues/5277)
+   is fixed.
+
+   This should create the nodegroup with 0 nodes in it, and the
+   autoscaler should recognize it! A quick way to verify this is
+   sketched just after this list.
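+
+As a sanity check (a sketch, with the cluster and region names as
+placeholders), you can confirm that the new nodegroup exists and
+currently has zero nodes:
+
+```bash
+# List all nodegroups eksctl knows about; the p2.xlarge nodegroup should
+# show a desired capacity of 0 until the autoscaler scales it up.
+eksctl get nodegroup --cluster <cluster-name> --region <region>
+```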
+
+#### Setting up a GPU user profile
+
+Finally, we need to give users the option of using the GPU via
+a profile. This should be placed in the hub configuration:
+
+```yaml
+jupyterhub:
+  singleuser:
+    profileList:
+      - display_name: "Large + GPU: p2.xlarge"
+        description: "~4CPUs, 60G RAM, 1 NVIDIA K80 GPU"
+        kubespawner_override:
+          mem_limit: null
+          mem_guarantee: 55G
+          image: "pangeo/ml-notebook:<tag>"
+          environment:
+            NVIDIA_DRIVER_CAPABILITIES: compute,utility
+          extra_resource_limits:
+            nvidia.com/gpu: "1"
+          node_selector:
+            node.kubernetes.io/instance-type: p2.xlarge
+```
+
+1. If using a `daskhub`, place this under the `basehub` key.
+2. The image used should have ML tools (pytorch, cuda, etc.)
+   installed. The recommendation is to use Pangeo's
+   [ml-notebook](https://hub.docker.com/r/pangeo/ml-notebook)
+   for tensorflow and [pytorch-notebook](https://hub.docker.com/r/pangeo/pytorch-notebook)
+   for pytorch. **Do not** use the `latest` or `master` tags; find
+   a specific tag listed for the image you want, and use that.
+3. The [NVIDIA_DRIVER_CAPABILITIES](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/user-guide.html#driver-capabilities)
+   environment variable tells the GPU driver what kind of libraries
+   and tools to inject into the container. Without setting this,
+   GPUs cannot be accessed.
+4. The `node_selector` makes sure that these user pods end up on
+   the appropriate nodegroup we created earlier. Change the selector
+   and the `mem_guarantee` if you are using a different kind of node.
+
+Do a deployment with this config, and then we can test to make sure
+this works!
+
+#### Testing
+
+1. Log in to the hub, and start a server with the GPU profile you
+   just set up.
+2. Open a terminal, and try running `nvidia-smi`. This should produce
+   output indicating that a GPU is present.
+3. Open a notebook, and run the following Python code to see if
+   tensorflow can access the GPUs:
+
+   ```python
+   import tensorflow as tf
+   tf.config.list_physical_devices('GPU')
+   ```
+
+   This should output something like:
+
+   ```
+   [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
+   ```
+
+   A PyTorch equivalent is sketched just after this list.
+4. Remember to explicitly shut down your server after testing,
+   as GPU instances can get expensive!
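+
+If the profile uses a pytorch image instead, a similar check can be run.
+This is a sketch that assumes `torch` is installed in the image, as it is
+in `pangeo/pytorch-notebook`:
+
+```python
+import torch
+
+# True only if the NVIDIA driver and CUDA runtime are visible to PyTorch
+print(torch.cuda.is_available())
+
+# Name of the first GPU, e.g. a Tesla K80 on a p2.xlarge node
+print(torch.cuda.get_device_name(0))
+```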
+
+If any of these tests fail, something is wrong, and off you go debugging :)
\ No newline at end of file
diff --git a/docs/howto/features/index.md b/docs/howto/features/index.md
index 39ddb2a8f4..958265541f 100644
--- a/docs/howto/features/index.md
+++ b/docs/howto/features/index.md
@@ -8,6 +8,7 @@ See the sections below for more details:
 :maxdepth: 2
 
 cloud-access
+gpu
 github
 ../customize/docs-service
 ../customize/configure-login-page
diff --git a/docs/howto/operate/new-cluster/aws.md b/docs/howto/operate/new-cluster/aws.md
index 486ecb30e3..7d4cb1cbf3 100644
--- a/docs/howto/operate/new-cluster/aws.md
+++ b/docs/howto/operate/new-cluster/aws.md
@@ -21,7 +21,8 @@ eksctl for everything.
    for a quick configuration process.
 3. Install the latest version of [eksctl](https://eksctl.io/introduction/#installation). Mac users
-   can get it from homebrew with `brew install eksctl`.
+   can get it from homebrew with `brew install eksctl`. Make sure the version is at least 0.97;
+   you can check by running `eksctl version`.
 
 (new-cluster:aws)=
 ## Create a new cluster
diff --git a/docs/reference/tools.md b/docs/reference/tools.md
index 81360a0f4b..1e76db9767 100644
--- a/docs/reference/tools.md
+++ b/docs/reference/tools.md
@@ -132,6 +132,9 @@ With just one tool to download and configure, you can control multiple AWS servi
 `eksctl` is a simple CLI tool for creating and managing clusters on EKS - Amazon's managed Kubernetes service for EC2.
 See [the `eksctl` documentation for more information](https://docs.aws.amazon.com/eks/latest/userguide/getting-started-eksctl.html).
 
+Make sure you are using at least version 0.97. You
+can check the installed version with `eksctl version`.
+
 ### kops
 
 `kops` will not only help you create, destroy, upgrade and maintain a production-grade,
diff --git a/docs/topic/features.md b/docs/topic/features.md
index e04faa07eb..f40ece443b 100644
--- a/docs/topic/features.md
+++ b/docs/topic/features.md
@@ -4,6 +4,14 @@ This document is a concise description of various features we can
 optionally enable on a given JupyterHub. Explicit instructions on how to
 do so should be provided in a linked how-to document.
 
+## GPUs
+
+GPUs are heavily used in machine learning workflows, and we support
+provisioning GPUs for users on all major platforms.
+
+See [the associated howto guide](howto:features:gpu) for more information
+on enabling this.
+
 ## Cloud Permissions
 
 Users of our hubs often need to be granted specific cloud permissions
diff --git a/eksctl/libsonnet/nodegroup.jsonnet b/eksctl/libsonnet/nodegroup.jsonnet
index e54ff39e48..9ba49427d0 100644
--- a/eksctl/libsonnet/nodegroup.jsonnet
+++ b/eksctl/libsonnet/nodegroup.jsonnet
@@ -23,7 +23,7 @@ local makeCloudTaints(taints) = {
     'node.kubernetes.io/instance-type': if std.objectHas($, 'instanceType') then $.instanceType else $.instancesDistribution.instanceTypes[0],
   },
   taints+: {},
-  tags: makeCloudLabels(self.labels) + makeCloudTaints(self.taints),
+  tags+: makeCloudLabels(self.labels) + makeCloudTaints(self.taints),
   iam: {
     withAddonPolicies: {
       autoScaler: true,
diff --git a/eksctl/uwhackweeks.jsonnet b/eksctl/uwhackweeks.jsonnet
index c6cc96d71d..5bd576114b 100644
--- a/eksctl/uwhackweeks.jsonnet
+++ b/eksctl/uwhackweeks.jsonnet
@@ -20,6 +20,12 @@ local notebookNodes = [
     { instanceType: "m5.xlarge", minSize: 0 },
     { instanceType: "m5.2xlarge", minSize: 0 },
     { instanceType: "m5.8xlarge", minSize: 0 },
+    {
+        instanceType: "p2.xlarge", minSize: 0,
+        tags+: {
+            "k8s.io/cluster-autoscaler/node-template/resources/nvidia.com/gpu": "1"
+        },
+    },
 ];
 
 // Node definitions for dask worker nodes. Config here is merged
diff --git a/helm-charts/support/templates/aws-nvidia-device-plugin.yaml b/helm-charts/support/templates/aws-nvidia-device-plugin.yaml
new file mode 100644
index 0000000000..6579f1b72e
--- /dev/null
+++ b/helm-charts/support/templates/aws-nvidia-device-plugin.yaml
@@ -0,0 +1,76 @@
+# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# Sourced from $ kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.11.0/nvidia-device-plugin.yml
+# Could be made automatic if https://github.com/weaveworks/eksctl/issues/5277 is fixed
+
+{{- if .Values.nvidiaDevicePlugin.aws.enabled }}
+apiVersion: apps/v1
+kind: DaemonSet
+metadata:
+  name: nvidia-device-plugin-daemonset
+  namespace: kube-system
+spec:
+  selector:
+    matchLabels:
+      name: nvidia-device-plugin-ds
+  updateStrategy:
+    type: RollingUpdate
+  template:
+    metadata:
+      # This annotation is deprecated. Kept here for backward compatibility
+      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
+      annotations:
+        scheduler.alpha.kubernetes.io/critical-pod: ""
+      labels:
+        name: nvidia-device-plugin-ds
+    spec:
+      tolerations:
+        # This toleration is deprecated. Kept here for backward compatibility
+        # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
+        - key: CriticalAddonsOnly
+          operator: Exists
+        - key: nvidia.com/gpu
+          operator: Exists
+          effect: NoSchedule
+        # Custom tolerations required for our user pods
+        - effect: NoSchedule
+          key: hub.jupyter.org/dedicated
+          operator: Equal
+          value: user
+        - effect: NoSchedule
+          key: hub.jupyter.org_dedicated
+          operator: Equal
+          value: user
+      # Mark this pod as a critical add-on; when enabled, the critical add-on
+      # scheduler reserves resources for critical add-on pods so that they can
+      # be rescheduled after a failure.
+ # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/ + priorityClassName: "system-node-critical" + containers: + - image: nvcr.io/nvidia/k8s-device-plugin:v0.11.0 + name: nvidia-device-plugin-ctr + args: ["--fail-on-init-error=false"] + securityContext: + allowPrivilegeEscalation: false + capabilities: + drop: ["ALL"] + volumeMounts: + - name: device-plugin + mountPath: /var/lib/kubelet/device-plugins + volumes: + - name: device-plugin + hostPath: + path: /var/lib/kubelet/device-plugins +{{- end -}} \ No newline at end of file diff --git a/helm-charts/support/values.schema.yaml b/helm-charts/support/values.schema.yaml index 32377be291..a46e101716 100644 --- a/helm-charts/support/values.schema.yaml +++ b/helm-charts/support/values.schema.yaml @@ -62,6 +62,7 @@ properties: required: - azure - gke + - aws properties: azure: type: object @@ -71,6 +72,14 @@ properties: properties: enabled: type: boolean + aws: + type: object + additionalProperties: false + required: + - enabled + properties: + enabled: + type: boolean gke: type: object additionalProperties: false diff --git a/helm-charts/support/values.yaml b/helm-charts/support/values.yaml index 77117b155c..dbca3b7d1d 100644 --- a/helm-charts/support/values.yaml +++ b/helm-charts/support/values.yaml @@ -129,6 +129,9 @@ nvidiaDevicePlugin: # For GKE specific image, defaults to false gke: enabled: false + # For eksctl / AWS specific daemonset, defaults to false + aws: + enabled: false # A placeholder as global values that can be referenced from the same location # of any chart should be possible to provide, but aren't necessarily provided or