doc start
lewinkedrs committed Jan 17, 2024
1 parent f6889ec commit 12dac03
Showing 2 changed files with 15 additions and 0 deletions.
14 changes: 14 additions & 0 deletions docs/eks/gpumon.md
@@ -0,0 +1,14 @@
# Monitoring NVIDIA GPU Workloads

GPUs play an integral part in data-intensive workloads. The base infrastructure module of the Observability Accelerator provides the ability to deploy the NVIDIA DCGM Exporter Dashboard.
The dashboard uses metrics scraped from the `/metrics` endpoint that is exposed when running the NVIDIA GPU Operator.
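
As a rough illustration of what the dashboard consumes, the sketch below parses Prometheus text-format output such as a DCGM exporter serves on `/metrics` and extracts per-GPU utilization. The sample payload, the hostname, and the helper names `parse_metrics` and `gpu_utilization` are hypothetical and chosen for illustration only; the parser is deliberately simplified (no timestamps, no escaped characters in label values) and is not a full implementation of the exposition format.

```python
import re

# Hypothetical sample of the text a DCGM exporter might return from /metrics.
SAMPLE_METRICS = """\
# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",Hostname="node-a"} 87
DCGM_FI_DEV_GPU_UTIL{gpu="1",Hostname="node-a"} 12
DCGM_FI_DEV_GPU_TEMP{gpu="0",Hostname="node-a"} 64
"""

# metric_name{label="value",...} value   (simplified: no trailing timestamp)
LINE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)'
    r'(?:\{(?P<labels>[^}]*)\})?'
    r'\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text):
    """Return a list of (metric_name, labels_dict, float_value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and HELP/TYPE comment lines.
        if not line or line.startswith("#"):
            continue
        match = LINE_RE.match(line)
        if match is None:
            continue  # Ignore anything this simplified parser cannot read.
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


def gpu_utilization(samples):
    """Map each GPU index label to its DCGM_FI_DEV_GPU_UTIL value."""
    return {
        labels["gpu"]: value
        for name, labels, value in samples
        if name == "DCGM_FI_DEV_GPU_UTIL" and "gpu" in labels
    }


if __name__ == "__main__":
    for gpu, util in sorted(gpu_utilization(parse_metrics(SAMPLE_METRICS)).items()):
        print(f"GPU {gpu}: {util:.0f}% utilized")
```

In a real cluster the same text would come from the exporter's HTTP endpoint instead of an inline string, and Prometheus (or the collector deployed by the accelerator) would do the scraping and parsing for you.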

!!!note
    To use this dashboard, you need a GPU-backed EKS cluster with the [GPU operator](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/amazon-eks.html) deployed.
    The recommended way to deploy the GPU operator is the [Data on EKS Blueprint](https://github.com/aws-ia/terraform-aws-eks-data-addons/blob/main/nvidia-gpu-operator.tf).

## Deployment

The dashboard is enabled by default in the [base infrastructure module](https://aws-observability.github.io/terraform-aws-observability-accelerator/eks/).


1 change: 1 addition & 0 deletions mkdocs.yml
@@ -28,6 +28,7 @@ nav:
- Amazon EKS:
- Infrastructure: eks/index.md
- EKS API server: eks/eks-apiserver.md
- EKS GPU monitoring: eks/gpumon.md
- Multicluster:
- Single AWS account: eks/multicluster.md
- Cross AWS account: eks/multiaccount.md
