fix: Default to server side apply and update MPI operator for NVIDIA …
bryantbiggs authored Oct 28, 2024
1 parent d5ddd10 commit 40ed02f
Showing 8 changed files with 28 additions and 40 deletions.
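The recurring change below switches the documented `kubectl apply` commands to server-side apply. As a rough sketch of the pattern (the manifest path is a placeholder, and the note about annotation size is general Kubernetes behavior rather than something stated in this commit):

```sh
# Client-side apply stores the full object in the
# kubectl.kubernetes.io/last-applied-configuration annotation, which can
# exceed the 256 KiB metadata limit on large CRDs such as MPIJob.
# Server-side apply tracks field ownership on the API server instead.
kubectl apply --server-side -f <manifest>.yaml

# If another field manager already owns a field, the conflict can be taken
# over explicitly:
kubectl apply --server-side --force-conflicts -f <manifest>.yaml
```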
2 changes: 1 addition & 1 deletion .pre-commit-config.yaml
@@ -1,6 +1,6 @@
repos:
- repo: https://github.com/streetsidesoftware/cspell-cli
rev: v8.15.1
rev: v8.15.2
hooks:
- id: cspell
args: [--exclude, 'ADOPTERS.md', --exclude, '.pre-commit-config.yaml', --exclude, '.gitignore', --exclude, '*.drawio', --exclude, 'mkdocs.yml', --exclude, '.helmignore', --exclude, '.github/workflows/*', --exclude, 'patterns/istio-multi-cluster/*', --exclude, 'patterns/blue-green-upgrade/*', --exclude, '/patterns/vpc-lattice/cross-cluster-pod-communication/*', --exclude, 'patterns/bottlerocket/*', --exclude, 'patterns/nvidia-gpu-efa/generate-efa-nccl-test.sh']
4 changes: 2 additions & 2 deletions patterns/gitops/getting-started-argocd/README.md
@@ -117,7 +117,7 @@ The output looks like the following:
Bootstrap the addons using ArgoCD:

```shell
kubectl apply -f bootstrap/addons.yaml
kubectl apply --server-side -f bootstrap/addons.yaml
```
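To confirm the addon `Application` resources registered after the bootstrap, a quick check (assuming Argo CD runs in the `argocd` namespace, as elsewhere in this pattern):

```sh
# List the Argo CD Application objects created by bootstrap/addons.yaml
# along with their sync and health status
kubectl get applications -n argocd
```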

### Monitor GitOps Progress for Addons
@@ -188,7 +188,7 @@ echo "ArgoCD URL: https://$(kubectl get svc -n argocd argo-cd-argocd-server -o j
Deploy a sample application located in [k8s/game-2048.yaml](k8s/game-2048.yaml) using ArgoCD:

```shell
kubectl apply -f bootstrap/workloads.yaml
kubectl apply --server-side -f bootstrap/workloads.yaml
```
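Once the workload Application syncs, the sample pods should come up; the `game-2048` namespace below is an assumption based on the sample manifest and may differ:

```sh
# Watch for the sample workload pods to reach Running
kubectl get pods -n game-2048 -w
```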

### Monitor GitOps Progress for Workloads
6 changes: 3 additions & 3 deletions patterns/istio/README.md
@@ -36,7 +36,7 @@ cluster with deployed Istio.
for ADDON in kiali jaeger prometheus grafana
do
ADDON_URL="https://raw.githubusercontent.com/istio/istio/release-1.20/samples/addons/$ADDON.yaml"
kubectl apply -f $ADDON_URL
kubectl apply --server-side -f $ADDON_URL
done
```
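To wait for the addons to become ready, a sketch along these lines can help; it assumes each upstream manifest creates a Deployment of the same name in `istio-system`:

```sh
for ADDON in kiali jaeger prometheus grafana
do
  # Deployment names are assumed to match the addon names in the sample manifests
  kubectl rollout status deployment/$ADDON -n istio-system --timeout=120s
done
```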

@@ -177,7 +177,7 @@ kubectl port-forward svc/jaeger 16686:16686 -n istio-system
- containerPort: 5000
EOF
kubectl apply -f helloworld.yaml -n sample
kubectl apply --server-side -f helloworld.yaml -n sample
```
```text
@@ -239,7 +239,7 @@ kubectl port-forward svc/jaeger 16686:16686 -n istio-system
optional: true
EOF
kubectl apply -f sleep.yaml -n sample
kubectl apply --server-side -f sleep.yaml -n sample
```
```text
4 changes: 2 additions & 2 deletions patterns/karpenter-mng/README.md
@@ -54,13 +54,13 @@ See [here](https://aws-ia.github.io/terraform-aws-eks-blueprints/getting-started
2. Provision the Karpenter `EC2NodeClass` and `NodePool` resources which provide Karpenter the necessary configurations to provision EC2 resources:

```sh
kubectl apply -f karpenter.yaml
kubectl apply --server-side -f karpenter.yaml
```
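Before moving on, the new resources can be verified against the API server; the fully qualified resource names below assume the Karpenter v1beta1 or later API groups:

```sh
# Confirm the EC2NodeClass and NodePool were accepted
kubectl get ec2nodeclasses.karpenter.k8s.aws
kubectl get nodepools.karpenter.sh
```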

3. Once the Karpenter resources are in place, Karpenter will provision the necessary EC2 resources to satisfy any pending pods in the scheduler's queue. You can demonstrate this with the example deployment provided. First deploy the example deployment, which has its initial number of replicas set to 0:
```sh
kubectl apply -f example.yaml
kubectl apply --server-side -f example.yaml
```
4. When you scale the example deployment, you should see Karpenter respond by quickly provisioning EC2 resources to satisfy those pending pod requests:
4 changes: 2 additions & 2 deletions patterns/karpenter/README.md
@@ -47,13 +47,13 @@ See [here](https://aws-ia.github.io/terraform-aws-eks-blueprints/getting-started
2. Provision the Karpenter `EC2NodeClass` and `NodePool` resources which provide Karpenter the necessary configurations to provision EC2 resources:

```sh
kubectl apply -f karpenter.yaml
kubectl apply --server-side -f karpenter.yaml
```

3. Once the Karpenter resources are in place, Karpenter will provision the necessary EC2 resources to satisfy any pending pods in the scheduler's queue. You can demonstrate this with the example deployment provided. First deploy the example deployment, which has its initial number of replicas set to 0:
```sh
kubectl apply -f example.yaml
kubectl apply --server-side -f example.yaml
```
4. When you scale the example deployment, you should see Karpenter respond by quickly provisioning EC2 resources to satisfy those pending pod requests:
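A minimal way to trigger that scale-up, assuming the Deployment in `example.yaml` is named `inflate` (adjust to match the manifest) and that the installed Karpenter version exposes the `NodeClaim` API:

```sh
# Scale the example workload so its pods go Pending
kubectl scale deployment/inflate --replicas=5

# Watch Karpenter provision capacity for the pending pods
kubectl get nodeclaims -w
```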
4 changes: 2 additions & 2 deletions patterns/ml-container-cache/README.md
@@ -81,13 +81,13 @@ See [here](https://aws-ia.github.io/terraform-aws-eks-blueprints/getting-started
4. Once the EKS cluster and node group have been provisioned, you can deploy the provided example pod that will use a cached image to verify the time it takes for the pod to reach a ready state.
```sh
kubectl apply -f pod-cached.yaml
kubectl apply --server-side -f pod-cached.yaml
```
You can contrast this with the time it takes for a pod that is not cached on a node by using the provided `pod-uncached.yaml` file. This works by using a pod that does not have a toleration for the nodes that contain NVIDIA GPUs, which are the nodes where the cached images are provided in this example.
```sh
kubectl apply -f pod-uncached.yaml
kubectl apply --server-side -f pod-uncached.yaml
```
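One way to compare the two startup times from the recorded events; the pod names are assumed to match the manifest file names and may need adjusting:

```sh
# Pull the scheduling and readiness events for both pods to compare how long
# each takes to become Ready with and without the cached image
kubectl get events --sort-by=.lastTimestamp | grep -E 'pod-cached|pod-uncached'
```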
You can also perform the same steps above using the small utility CLI [ktime](https://github.com/clowdhaus/ktime), which can either collect pod events to measure how long a pod takes to reach a ready state, or deploy a pod manifest and report the same:
40 changes: 14 additions & 26 deletions patterns/nvidia-gpu-efa/README.md
@@ -36,8 +36,7 @@ See [here](https://aws-ia.github.io/terraform-aws-eks-blueprints/getting-started
## Validate

!!! note

Desired instance type can be specified in [eks.tf](eks.tf#L36).
Desired instance type can be specified in [eks.tf](https://github.com/aws-ia/terraform-aws-eks-blueprints/blob/d5ddd10afef9b4fd3e0cbba865645f0f522992ac/patterns/nvidia-gpu-efa/eks.tf#L38).
Values shown below will change based on the instance type selected (e.g., `p5.48xlarge` has 8 GPUs and 32 EFA interfaces).
A list of EFA-enabled instance types is available [here](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa.html#efa-instance-types).
If you are using an on-demand capacity reservation (ODCR) for your instance type, please uncomment the `capacity_reservation_specification` block in `eks.tf`
@@ -66,36 +65,25 @@ See [here](https://aws-ia.github.io/terraform-aws-eks-blueprints/getting-started
To deploy the MPI operator execute the following:

```sh
kubectl apply -f https://raw.githubusercontent.com/kubeflow/mpi-operator/v0.4.0/deploy/v2beta1/mpi-operator.yaml
```

```text
namespace/mpi-operator created
customresourcedefinition.apiextensions.k8s.io/mpijobs.kubeflow.org created
serviceaccount/mpi-operator created
clusterrole.rbac.authorization.k8s.io/kubeflow-mpijobs-admin created
clusterrole.rbac.authorization.k8s.io/kubeflow-mpijobs-edit created
clusterrole.rbac.authorization.k8s.io/kubeflow-mpijobs-view created
clusterrole.rbac.authorization.k8s.io/mpi-operator created
clusterrolebinding.rbac.authorization.k8s.io/mpi-operator created
deployment.apps/mpi-operator created
```

In addition to deploying the operator, please apply a patch to the mpi-operator clusterrole
to allow the mpi-operator service account access to `leases` resources in the `coordination.k8s.io` apiGroup.

```sh
kubectl apply -f https://raw.githubusercontent.com/aws-samples/aws-do-eks/main/Container-Root/eks/deployment/kubeflow/mpi-operator/clusterrole-mpi-operator.yaml
kubectl apply --server-side -f https://raw.githubusercontent.com/kubeflow/mpi-operator/v0.6.0/deploy/v2beta1/mpi-operator.yaml
```

```text
clusterrole.rbac.authorization.k8s.io/mpi-operator configured
namespace/mpi-operator serverside-applied
customresourcedefinition.apiextensions.k8s.io/mpijobs.kubeflow.org serverside-applied
serviceaccount/mpi-operator serverside-applied
clusterrole.rbac.authorization.k8s.io/kubeflow-mpijobs-admin serverside-applied
clusterrole.rbac.authorization.k8s.io/kubeflow-mpijobs-edit serverside-applied
clusterrole.rbac.authorization.k8s.io/kubeflow-mpijobs-view serverside-applied
clusterrole.rbac.authorization.k8s.io/mpi-operator serverside-applied
clusterrolebinding.rbac.authorization.k8s.io/mpi-operator serverside-applied
deployment.apps/mpi-operator serverside-applied
```
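To verify the operator is running before submitting jobs (names taken from the apply output above):

```sh
# The controller Deployment and the MPIJob CRD should both be present
kubectl get deployment mpi-operator -n mpi-operator
kubectl get crd mpijobs.kubeflow.org
```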

3. EFA info test

This test prints a list of available EFA interfaces by using the `/opt/amazon/efa/bin/fi_info` utility.
The script [generate-efa-info-test.sh](generate-efa-info-test.sh) creates an MPIJob manifest file named `efa-info-test.yaml`. It assumes that there are two cluster nodes with 8 GPUs per node and 32 EFA adapters. If you are not using `p5.48xlarge` instances in your cluster, you may adjust the settings in the script prior to running it.
The script [generate-efa-info-test.sh](https://github.com/aws-ia/terraform-aws-eks-blueprints/blob/main/patterns/nvidia-gpu-efa/generate-efa-info-test.sh) creates an MPIJob manifest file named `efa-info-test.yaml`. It assumes that there are two cluster nodes with 8 GPUs per node and 32 EFA adapters. If you are not using `p5.48xlarge` instances in your cluster, you may adjust the settings in the script prior to running it.
`NUM_WORKERS` - number of nodes you want to run the test on
`GPU_PER_WORKER` - number of GPUs available on each node
@@ -108,7 +96,7 @@ See [here](https://aws-ia.github.io/terraform-aws-eks-blueprints/getting-started
To start the test apply the generated manifest to the cluster:
```sh
kubectl apply -f ./efa-info-test.yaml
kubectl apply --server-side -f ./efa-info-test.yaml
```
```text
@@ -186,7 +174,7 @@ See [here](https://aws-ia.github.io/terraform-aws-eks-blueprints/getting-started
This script creates a file named `efa-nccl-test.yaml`. Apply the manifest to start the EFA nccl test.
```sh
kubectl apply -f ./efa-nccl-test.yaml
kubectl apply --server-side -f ./efa-nccl-test.yaml
```
```text
mpijob.kubeflow.org/efa-nccl-test created
4 changes: 2 additions & 2 deletions patterns/wireguard-with-cilium/README.md
@@ -20,7 +20,7 @@ See [here](https://aws-ia.github.io/terraform-aws-eks-blueprints/getting-started
1. Deploy the example pods:

```sh
kubectl apply -f example.yaml
kubectl apply --server-side -f example.yaml
```

```text
@@ -100,7 +100,7 @@ See [here](https://aws-ia.github.io/terraform-aws-eks-blueprints/getting-started

```sh
kubectl create ns cilium-test
kubectl apply -n cilium-test -f https://raw.githubusercontent.com/cilium/cilium/v1.14.1/examples/kubernetes/connectivity-check/connectivity-check.yaml
kubectl apply --server-side -n cilium-test -f https://raw.githubusercontent.com/cilium/cilium/v1.14.1/examples/kubernetes/connectivity-check/connectivity-check.yaml
```

```text
