diff --git a/docs/eks/gpumon.md b/docs/eks/gpumon.md
new file mode 100644
index 00000000..d6432116
--- /dev/null
+++ b/docs/eks/gpumon.md
@@ -0,0 +1,14 @@
+# Monitoring NVIDIA GPU Workloads
+
+GPUs play an integral part in data-intensive workloads. The base infrastructure module of the Observability Accelerator provides the ability to deploy the NVIDIA DCGM Exporter Dashboard.
+The dashboard uses metrics scraped from the `/metrics` endpoint, which is exposed when running the NVIDIA GPU Operator.
+
+!!!note
+    To use this dashboard, you need a GPU-backed EKS cluster with the [GPU Operator](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/amazon-eks.html) deployed.
+    The recommended way to deploy the GPU Operator is with the [Data on EKS Blueprint](https://github.com/aws-ia/terraform-aws-eks-data-addons/blob/main/nvidia-gpu-operator.tf).
+
+## Deployment
+
+The dashboard is enabled by default in the [base infrastructure module](https://aws-observability.github.io/terraform-aws-observability-accelerator/eks/).
+
+
diff --git a/mkdocs.yml b/mkdocs.yml
index 918978a0..2b426d65 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -28,6 +28,7 @@ nav:
   - Amazon EKS:
     - Infrastructure: eks/index.md
     - EKS API server: eks/eks-apiserver.md
+    - EKS GPU monitoring: eks/gpumon.md
     - Multicluster:
       - Single AWS account: eks/multicluster.md
       - Cross AWS account: eks/multiaccount.md