Add setup guide of gaudi prometheus exporter (#186)
* Add setup guide of gaudi prometheus exporter
joshuayao authored Nov 1, 2024
1 parent 021193f commit e9b8637
Showing 3 changed files with 121 additions and 2 deletions.
33 changes: 31 additions & 2 deletions evals/benchmark/grafana/README.md
@@ -38,14 +38,16 @@ You should now access `localhost:9090/targets?search=` to open the Prometheus UI

### 1.1 CPU Metrics (optional)

- The Prometheus Node Exporter is required for collecting CPU metrics. Install and run the Node Exporter via tarball by the [guide](https://prometheus.io/docs/guides/node-exporter/#installing-and-running-the-node-exporter).
+ The Prometheus Node Exporter is required for collecting CPU metrics. Deploy the Node Exporter via tarball by the [guide](https://prometheus.io/docs/guides/node-exporter/#installing-and-running-the-node-exporter).

Alternatively, install it in a K8S cluster. Make sure the `monitoring` namespace already exists in your K8S environment, then run the following commands:

```bash
git clone https://github.com/opea-project/GenAIEval.git
cd GenAIEval/evals/benchmark/grafana/
- kubectl apply -f grafana_node_exporter.yaml
+ kubectl apply -f prometheus_cpu_exporter.yaml
```

Add the following configuration to `prometheus.yml`:
@@ -58,6 +60,33 @@ scrape_configs:
      - targets: ["<NODE1_IP>:9100", "<NODE2_IP>:9100", ...]
```

### 1.2 Intel® Gaudi® Metrics (optional)

The Intel Gaudi Prometheus Metrics Exporter is required for collecting Intel® Gaudi® AI accelerator metrics.

Follow the [guide](https://docs.habana.ai/en/latest/Orchestration/Prometheus_Metric_Exporter.html#deploying-prometheus-metric-exporter-in-docker) to deploy the metrics exporter in Docker.

Alternatively, install it in a K8S cluster. Make sure the `monitoring` namespace already exists in your K8S environment, then run the following commands:

```bash
git clone https://github.com/opea-project/GenAIEval.git
cd GenAIEval/evals/benchmark/grafana/
kubectl apply -f prometheus_gaudi_exporter.yaml
```

Add the following configuration to `prometheus.yml`:

```yaml
scrape_configs:
  - job_name: "prometheus-gaudi-exporter"
    scrape_interval: 15s
    metrics_path: /metrics
    static_configs:
      - targets: ["<NODE1_IP>:41611", "<NODE2_IP>:41611", ...]
```

Restart Prometheus after saving the changes.
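
Before pointing Prometheus at the exporter, it can help to confirm that each node actually serves metrics on port 41611. The following is a minimal sketch, not part of this commit: the helper names `metric_names` and `fetch_metric_names` are hypothetical, and it assumes the exporter returns the standard Prometheus text exposition format at `/metrics`, as the scrape configuration above implies.

```python
from urllib.request import urlopen

# Port and path taken from the scrape configuration above.
GAUDI_EXPORTER_PORT = 41611
METRICS_PATH = "/metrics"


def metric_names(exposition_text):
    """Return the set of metric names found in Prometheus text-format output."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # A sample line looks like `name{labels} value` or `name value`.
        name = line.split("{", 1)[0].split(None, 1)[0]
        names.add(name)
    return names


def fetch_metric_names(host, port=GAUDI_EXPORTER_PORT, timeout=5):
    """Fetch /metrics from one exporter (hypothetical helper) and list its metric names."""
    url = f"http://{host}:{port}{METRICS_PATH}"
    with urlopen(url, timeout=timeout) as resp:
        return metric_names(resp.read().decode("utf-8"))
```

Calling `fetch_metric_names("<NODE1_IP>")` with a real node address should return a non-empty set if the exporter is reachable; a connection error usually means the DaemonSet pod is not running on that node or the port is blocked.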

## 2. Setup Grafana
90 changes: 90 additions & 0 deletions evals/benchmark/grafana/prometheus_gaudi_exporter.yaml
@@ -0,0 +1,90 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

apiVersion: apps/v1
kind: DaemonSet
metadata:
  labels:
    app.kubernetes.io/name: metric-exporter-ds
    app.kubernetes.io/version: v0.0.1
  name: metric-exporter-ds
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: metric-exporter-ds
  template:
    metadata:
      labels:
        app.kubernetes.io/name: metric-exporter-ds
        app.kubernetes.io/version: v0.0.1
    spec:
      priorityClassName: "system-node-critical"
      imagePullSecrets: []
      tolerations:
        - key: "habana.ai/gaudi"
          operator: "Exists"
          effect: "NoSchedule"
      # Required for network monitoring
      hostNetwork: true
      containers:
        - name: metric-exporter
          image: vault.habana.ai/gaudi-metric-exporter/metric-exporter:1.18.0-524
          imagePullPolicy: Always
          env:
            - name: LD_LIBRARY_PATH
              value: "/usr/lib/habanalabs"
          securityContext:
            privileged: true
          volumeMounts:
            - name: pod-resources
              mountPath: /var/lib/kubelet/pod-resources
          ports:
            - name: habana-metrics
              containerPort: 41611
              protocol: TCP
          resources:
            limits:
              cpu: 150m
              memory: 120Mi
            requests:
              cpu: 100m
              memory: 100Mi
      volumes:
        - name: pod-resources
          hostPath:
            path: /var/lib/kubelet/pod-resources

---
apiVersion: v1
kind: Service
metadata:
  labels:
    app.kubernetes.io/name: metric-exporter
    app.kubernetes.io/version: v0.0.1
  name: metric-exporter
  namespace: monitoring
spec:
  clusterIP: None
  ports:
    - name: habana-metrics
      port: 41611
  selector:
    app.kubernetes.io/name: metric-exporter-ds

# ---
# apiVersion: monitoring.coreos.com/v1
# kind: ServiceMonitor
# metadata:
#   labels:
#     app.kubernetes.io/name: metric-exporter
#     app.kubernetes.io/version: v0.0.1
#   name: metric-exporter
#   namespace: monitoring
# spec:
#   selector:
#     matchLabels:
#       app.kubernetes.io/name: metric-exporter
#   endpoints:
#     - port: habana-metrics
#       interval: 30s
