Support GPU on eksctl created AWS instances
- Document how to set up GPUs on AWS
- Temporarily add a GPU profile to the uwhackweeks
  hub, until we set up an account for the snowex hackweek

Ref 2i2c-org#1309
yuvipanda committed May 17, 2022
1 parent 42c6ec3 commit 5cb2abc
Showing 12 changed files with 263 additions and 2 deletions.
12 changes: 12 additions & 0 deletions config/clusters/uwhackweeks/common.values.yaml
@@ -96,6 +96,18 @@ basehub:
            mem_guarantee: 115G
            node_selector:
              node.kubernetes.io/instance-type: m5.8xlarge
        - display_name: "Large + GPU: p2.xlarge"
          description: "~4CPUs, 60G RAM, 1 NVIDIA K80 GPU"
          kubespawner_override:
            mem_limit: null
            mem_guarantee: 55G
            image: "pangeo/ml-notebook:master"
            environment:
              NVIDIA_DRIVER_CAPABILITIES: compute,utility
            extra_resource_limits:
              nvidia.com/gpu: "1"
            node_selector:
              node.kubernetes.io/instance-type: p2.xlarge
    scheduling:
      userPlaceholder:
        enabled: false
4 changes: 4 additions & 0 deletions config/clusters/uwhackweeks/support.values.yaml
@@ -1,6 +1,10 @@
prometheusIngressAuthSecret:
  enabled: true

nvidiaDevicePlugin:
  aws:
    enabled: true

prometheus:
  server:
    ingress:
138 changes: 138 additions & 0 deletions docs/howto/features/gpu.md
@@ -0,0 +1,138 @@
(howto:features:gpu)=
# Enable access to GPUs

GPUs are heavily used in machine learning workflows, and we support
GPUs on all major cloud providers.

## Setting up GPU nodes

### AWS

#### Requesting Quota Increase

On AWS, GPUs are provisioned by using P series nodes. Before they
can be accessed, you need to ask AWS for an increased quota of P
series nodes.

1. Log in to the AWS management console of the account the cluster is
in.
2. Make sure you are in the same region the cluster is in, by checking the
region selector on the top right.
3. Open the [EC2 Service Quotas](https://us-west-2.console.aws.amazon.com/servicequotas/home/services/ec2/quotas)
page.
4. Select the 'Running On-Demand P Instances' quota.
5. Select 'Request Quota Increase'.
6. Input the *number of vCPUs* needed. This translates into a total
number of GPU nodes based on how many vCPUs each node type has.
For example, if we are using [P2 nodes](https://aws.amazon.com/ec2/instance-types/p2/)
with NVIDIA K80 GPUs, each `p2.xlarge` node gives us 1 GPU and
4 vCPUs, so a quota of 8 vCPUs will allow us to spawn 2 GPU nodes.
We can fine-tune this calculation later, but for now, the
recommendation is to give each user a `p2.xlarge`, so the number
of vCPUs requested should be `4 * (max number of GPU nodes)`.
7. Ask for the increase, and wait. This can take *several working days*.

#### Set up a GPU nodegroup with eksctl

We use `eksctl` with `jsonnet` to provision our Kubernetes clusters on
AWS, and we can configure a node group there to provide GPUs.

1. In the `notebookNodes` definition in the appropriate `.jsonnet` file,
add a node definition for the appropriate GPU node type:


```
{
    instanceType: "p2.xlarge",
    tags+: {
        "k8s.io/cluster-autoscaler/node-template/resources/nvidia.com/gpu": "1"
    },
}
```

`p2.xlarge` gives us 1 K80 GPU and 4 vCPUs. The `tags` definition
is necessary to let the cluster autoscaler know that this nodegroup has
1 GPU per node. If you're using a different machine type with
more GPUs, adjust this definition accordingly, as sketched below.
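
For example, a hypothetical definition for a `p2.8xlarge` nodegroup
(not part of this commit; a `p2.8xlarge` has 8 K80 GPUs, so the
autoscaler tag changes to match) might look like:

```
{
    instanceType: "p2.8xlarge",
    tags+: {
        // 8 GPUs per node on this hypothetical instance type
        "k8s.io/cluster-autoscaler/node-template/resources/nvidia.com/gpu": "8"
    },
}
```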

2. Render the `.jsonnet` file into a `.yaml` file that `eksctl` can use:

```bash
jsonnet <your-cluster>.jsonnet > <your-cluster>.eksctl.yaml
```

3. Create the nodegroup

```bash
eksctl create nodegroup -f <your-cluster>.eksctl.yaml --install-nvidia-plugin=false
```

The `--install-nvidia-plugin=false` flag is required until
[this bug](https://github.com/weaveworks/eksctl/issues/5277)
is fixed.

This should create the nodegroup with 0 nodes in it, and the
autoscaler should recognize this!

#### Setting up a GPU user profile

Finally, we need to give users the option of using the GPU via
a profile. This should be placed in the hub configuration:

```yaml
jupyterhub:
  singleuser:
    profileList:
      - display_name: "Large + GPU: p2.xlarge"
        description: "~4CPUs, 60G RAM, 1 NVIDIA K80 GPU"
        kubespawner_override:
          mem_limit: null
          mem_guarantee: 55G
          image: "pangeo/ml-notebook:<tag>"
          environment:
            NVIDIA_DRIVER_CAPABILITIES: compute,utility
          extra_resource_limits:
            nvidia.com/gpu: "1"
          node_selector:
            node.kubernetes.io/instance-type: p2.xlarge
```

1. If using a `daskhub`, place this under the `basehub` key.
2. The image used should have ML tools (PyTorch, CUDA, etc.)
installed. The recommendation is to use Pangeo's
[ml-notebook](https://hub.docker.com/r/pangeo/ml-notebook)
for TensorFlow and [pytorch-notebook](https://hub.docker.com/r/pangeo/pytorch-notebook)
for PyTorch. **Do not** use the `latest` or `master` tags - find
a specific tag listed for the image you want, and use that.
3. The [NVIDIA_DRIVER_CAPABILITIES](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/user-guide.html#driver-capabilities)
environment variable tells the GPU driver what kind of libraries
and tools to inject into the container. Without setting this,
GPUs cannot be accessed.
4. The `node_selector` makes sure that these user pods end up on
the appropriate nodegroup we created earlier. Change the selector
and the `mem_guarantee` if you are using a different kind of node,
as in the sketch below.
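
For example, a hypothetical variant of this profile for a `g4dn.xlarge`
node (illustrative values, not part of this commit; a `g4dn.xlarge` has
4 vCPUs, 16G RAM, and 1 NVIDIA T4 GPU) might look like:

```yaml
- display_name: "Large + GPU: g4dn.xlarge"
  description: "~4CPUs, 16G RAM, 1 NVIDIA T4 GPU"
  kubespawner_override:
    mem_limit: null
    # Leave headroom below the node's 16G for system daemons
    mem_guarantee: 14G
    image: "pangeo/ml-notebook:<tag>"
    environment:
      NVIDIA_DRIVER_CAPABILITIES: compute,utility
    extra_resource_limits:
      nvidia.com/gpu: "1"
    node_selector:
      node.kubernetes.io/instance-type: g4dn.xlarge
```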

Do a deployment with this config, and then we can test to make sure
this works!

#### Testing

1. Log in to the hub, and start a server with the GPU profile you
just set up.
2. Open a terminal, and try running `nvidia-smi`. This should produce
output indicating that a GPU is present.
3. Open a notebook, and run the following Python code to see if
TensorFlow can access the GPUs:

```python
import tensorflow as tf
tf.config.list_physical_devices('GPU')
```

This should output something like:

```
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
```
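
If the image is Pangeo's `pytorch-notebook` instead, an equivalent
check (a minimal sketch, assuming `torch` is installed in the image):

```python
import torch

# True only if the NVIDIA driver and a GPU device are visible to PyTorch
print(torch.cuda.is_available())
```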

If any of these tests fail, something is wrong, and off you go debugging :)
1 change: 1 addition & 0 deletions docs/howto/features/index.md
@@ -8,6 +8,7 @@ See the sections below for more details:
:maxdepth: 2
cloud-access
gpu
github
../customize/docs-service
../customize/configure-login-page
3 changes: 2 additions & 1 deletion docs/howto/operate/new-cluster/aws.md
@@ -21,7 +21,8 @@ eksctl for everything.
for a quick configuration process.

3. Install the latest version of [eksctl](https://eksctl.io/introduction/#installation). Mac users
can get it from homebrew with `brew install eksctl`.
can get it from homebrew with `brew install eksctl`. Make sure the version is at least 0.97 -
you can check by running `eksctl version`.

(new-cluster:aws)=
## Create a new cluster
3 changes: 3 additions & 0 deletions docs/reference/tools.md
@@ -132,6 +132,9 @@ With just one tool to download and configure, you can control multiple AWS services
`eksctl` is a simple CLI tool for creating and managing clusters on EKS - Amazon's
managed Kubernetes service for EC2. See [the `eksctl` documentation for more information](https://docs.aws.amazon.com/eks/latest/userguide/getting-started-eksctl.html).

Make sure you are using at least version 0.97. You
can check the installed version with `eksctl version`.

### kops

`kops` will not only help you create, destroy, upgrade and maintain a production-grade,
8 changes: 8 additions & 0 deletions docs/topic/features.md
@@ -4,6 +4,14 @@ This document is a concise description of various features we can
optionally enable on a given JupyterHub. Explicit instructions on how to
do so should be provided in a linked how-to document.

## GPUs

GPUs are heavily used in machine learning workflows, and we support
provisioning GPUs for users on all major platforms.

See [the associated howto guide](howto:features:gpu) for more information
on enabling this.

## Cloud Permissions

Users of our hubs often need to be granted specific cloud permissions
2 changes: 1 addition & 1 deletion eksctl/libsonnet/nodegroup.jsonnet
@@ -23,7 +23,7 @@ local makeCloudTaints(taints) = {
    'node.kubernetes.io/instance-type': if std.objectHas($, 'instanceType') then $.instanceType else $.instancesDistribution.instanceTypes[0],
  },
  taints+: {},
  tags: makeCloudLabels(self.labels) + makeCloudTaints(self.taints),
  tags+: makeCloudLabels(self.labels) + makeCloudTaints(self.taints),
  iam: {
    withAddonPolicies: {
      autoScaler: true,
6 changes: 6 additions & 0 deletions eksctl/uwhackweeks.jsonnet
@@ -20,6 +20,12 @@ local notebookNodes = [
  { instanceType: "m5.xlarge", minSize: 0 },
  { instanceType: "m5.2xlarge", minSize: 0 },
  { instanceType: "m5.8xlarge", minSize: 0 },
  {
    instanceType: "p2.xlarge", minSize: 0,
    tags+: {
      "k8s.io/cluster-autoscaler/node-template/resources/nvidia.com/gpu": "1"
    },
  },
];

// Node definitions for dask worker nodes. Config here is merged
76 changes: 76 additions & 0 deletions helm-charts/support/templates/aws-nvidia-device-plugin.yaml
@@ -0,0 +1,76 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Sourced from $ kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.11.0/nvidia-device-plugin.yml
# Could be made automatic if https://github.com/weaveworks/eksctl/issues/5277 is fixed

{{- if .Values.nvidiaDevicePlugin.aws.enabled }}
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      # This annotation is deprecated. Kept here for backward compatibility
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ""
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
        # This toleration is deprecated. Kept here for backward compatibility
        # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
        - key: CriticalAddonsOnly
          operator: Exists
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
        # Custom tolerations required for our user pods
        - effect: NoSchedule
          key: hub.jupyter.org/dedicated
          operator: Equal
          value: user
        - effect: NoSchedule
          key: hub.jupyter.org_dedicated
          operator: Equal
          value: user
      # Mark this pod as a critical add-on; when enabled, the critical add-on
      # scheduler reserves resources for critical add-on pods so that they can
      # be rescheduled after a failure.
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      priorityClassName: "system-node-critical"
      containers:
        - image: nvcr.io/nvidia/k8s-device-plugin:v0.11.0
          name: nvidia-device-plugin-ctr
          args: ["--fail-on-init-error=false"]
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop: ["ALL"]
          volumeMounts:
            - name: device-plugin
              mountPath: /var/lib/kubelet/device-plugins
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins
{{- end -}}
9 changes: 9 additions & 0 deletions helm-charts/support/values.schema.yaml
@@ -62,6 +62,7 @@ properties:
    required:
      - azure
      - gke
      - aws
    properties:
      azure:
        type: object
@@ -71,6 +72,14 @@
        properties:
          enabled:
            type: boolean
      aws:
        type: object
        additionalProperties: false
        required:
          - enabled
        properties:
          enabled:
            type: boolean
      gke:
        type: object
        additionalProperties: false
3 changes: 3 additions & 0 deletions helm-charts/support/values.yaml
@@ -129,6 +129,9 @@ nvidiaDevicePlugin:
  # For GKE specific image, defaults to false
  gke:
    enabled: false
  # For eksctl / AWS specific daemonset, defaults to false
  aws:
    enabled: false

# A placeholder as global values that can be referenced from the same location
# of any chart should be possible to provide, but aren't necessarily provided or
