
Update observability README + fix typos #556

Merged
merged 3 commits into opea-project:main from grafana-readme
Nov 15, 2024

Conversation

Contributor

@eero-t eero-t commented Nov 13, 2024

Description

This is a continuation of #541:

  • Updates the observability README + fixes typos
  • Gives its image files more descriptive names

Issues

n/a.

Type of change

  • Bug fix (non-breaking change which fixes an issue)

Dependencies

n/a.

Tests

CI + PR file view functionality.

Further work

Additional improvements that could be done, e.g. after the v1.1 release:

  • Remove serviceMonitor files (made obsolete by this and an earlier Helm support PR) from chatqna/
  • Generalize dashboards under chatqna/dashboards/ and move them to a more appropriate place
  • Optimize image file sizes

Scaling them down and converting them to 8-bit would be a good next step, to bring their file sizes down to something more reasonable.
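
A minimal sketch of what that optimization could look like with ImageMagick (assuming it is available; the file names below are only placeholders):

$ # downscale to half size and convert to an 8-bit palette PNG (placeholder file names)
$ convert dashboard_screenshot.png -resize 50% -strip -colors 256 PNG8:dashboard_screenshot_small.png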

Signed-off-by: Eero Tamminen <[email protected]>
@eero-t eero-t requested a review from daisy-ycguo as a code owner November 13, 2024 18:20
@lianhao lianhao added this to the v1.1 milestone Nov 14, 2024
@@ -40,7 +40,7 @@ kubectl port-forward service/grafana 3000:80

Open your browser and navigate to http://localhost:3000. Use "admin/prom-operator" as the username and the password to login.

-## 2. Metric for Gaudi Hardware(v1.16.2)
+## 2. Metrics for Gaudi Hardware (v1.16.2)
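
A quick way to sanity-check that the port-forward shown in the snippet above works is Grafana's unauthenticated health endpoint (a minimal sketch, assuming the service is named grafana as in the README; background the port-forward or run curl from another terminal):

$ kubectl port-forward service/grafana 3000:80 &
$ curl -s http://localhost:3000/api/health    # should report "database": "ok"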

Collaborator

The 1.1 release is using tgi-gaudi 2.0.6, which is validated with SW stack v1.18. Shall we update/verify the Metrics with 1.18?

Contributor Author

@eero-t eero-t Nov 14, 2024

IMHO not worth the trouble for that dashboard. I do not see why Habana would change metric names between version upgrades, and there's a much better Gaudi HW panel in the Eval repo [1], which is more worth checking.

In the long run, I think it would be better to separate dashboards for different purposes into different repos, instead of duplicating them:

  • Drop the Gaudi HW dashboard from here [1],
  • Move the PCM one to the Eval repo, and
  • Add k8s-specific ones here that make some assumptions about how deployments are named with the current Helm charts, to allow the user to select from multiple apps running in different namespaces

=> I can do that after v1.1.


[1] The Gaudi HW dashboard here is not very good. It's lacking most metrics, does not allow selecting a node or device for them, and, as can be seen from its screenshot in the README, the metric legends are awful:

$ cat Dashboard-Gaudi-HW.json  | grep expr | cut -d'"' -f4- | sed 's/",$//'
habanalabs_temperature_onboard
habanalabs_kube_info
habanalabs_memory_free_bytes
habanalabs_power_mW
habanalabs_utilization

vs the one in the Eval repo:

$ cat gaudi_grafana.json | grep expr | cut -d'"' -f4- | sed 's/",$//'
habanalabs_device_config{instance=\"$node\", UUID=\"$hpu\"}
habanalabs_utilization{instance=\"$node\", UUID=\"$hpu\"}/100
habanalabs_power_mW{instance=\"$node\", UUID=\"$hpu\"}
habanalabs_pcie_receive_throughput{UUID=\"$hpu\", instance=\"$node\"}
habanalabs_pcie_transmit_throughput{UUID=\"$hpu\", instance=\"$node\"}
habanalabs_temperature_onchip{UUID=\"$hpu\", instance=\"$node\"}
habanalabs_temperature_onboard{UUID=\"$hpu\", instance=\"$node\"}
habanalabs_device_config{instance=\"$node\", UUID=\"$hpu\"}
habanalabs_device_config{instance=\"$node\", UUID=\"$hpu\"}
habanalabs_utilization{instance=\"$node\", UUID=\"$hpu\"}/100
habanalabs_temperature_onboard{UUID=\"$hpu\", instance=\"$node\"}
habanalabs_temperature_onchip{UUID=\"$hpu\", instance=\"$node\"}
habanalabs_clock_soc_mhz{UUID=\"$hpu\", instance=\"$node\"}
habanalabs_memory_used_bytes{UUID=\"$hpu\", instance=\"$node\"}
habanalabs_power_mW{instance=\"$node\", UUID=\"$hpu\"}
habanalabs_memory_used_bytes{UUID=\"$hpu\", instance=\"$node\"} / habanalabs_memory_total_bytes{UUID=\"$hpu\", instance=\"$node\"}
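
For comparison, the same expr fields could be extracted a bit more robustly with jq instead of grep/cut/sed (a sketch, assuming jq is available):

$ jq -r '.. | .expr? // empty' gaudi_grafana.json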

@lianhao lianhao merged commit 1d77b81 into opea-project:main Nov 15, 2024
7 checks passed
@eero-t eero-t deleted the grafana-readme branch December 5, 2024 10:35