Update observability README + fix typos #556
Signed-off-by: Eero Tamminen <[email protected]>
Scaling them down and converting them to 8-bit would be a good next step, to also bring their sizes down to something more reasonable. Signed-off-by: Eero Tamminen <[email protected]>
for more information, see https://pre-commit.ci
@@ -40,7 +40,7 @@ kubectl port-forward service/grafana 3000:80

 Open your browser and navigate to http://localhost:3000. Use "admin/prom-operator" as the username and the password to login.

-## 2. Metric for Gaudi Hardware(v1.16.2)
+## 2. Metrics for Gaudi Hardware (v1.16.2)
The 1.1 release uses tgi-gaudi 2.0.6, which is validated with SW stack v1.18. Shall we update/verify the metrics with 1.18?
IMHO not worth the trouble for that dashboard. I do not see why Habana would change metric names between version upgrades, and there's a much better Gaudi HW panel in the Eval repo [1], which is more worth checking.
In the long run, I think it would be better to split dashboards for different purposes into different repos, instead of duplicating them:
- Drop Gaudi HW dashboard from here [1],
- Move PCM one to Eval repo, and
- Add k8s-specific ones here that make some assumptions about how deployments are named by the current Helm charts, to allow the user to select from multiple apps running in different namespaces
=> I can do that after v1.1.
[1] The Gaudi HW one here is not very good. It lacks most metrics, does not allow selecting a node or device for them, and as can be seen from its screenshot in the README, the metric legends are awful:
$ cat Dashboard-Gaudi-HW.json | grep expr | cut -d'"' -f4- | sed 's/",$//'
habanalabs_temperature_onboard
habanalabs_kube_info
habanalabs_memory_free_bytes
habanalabs_power_mW
habanalabs_utilization
vs. the one in the Eval repo:
$ cat gaudi_grafana.json | grep expr | cut -d'"' -f4- | sed 's/",$//'
habanalabs_device_config{instance=\"$node\", UUID=\"$hpu\"}
habanalabs_utilization{instance=\"$node\", UUID=\"$hpu\"}/100
habanalabs_power_mW{instance=\"$node\", UUID=\"$hpu\"}
habanalabs_pcie_receive_throughput{UUID=\"$hpu\", instance=\"$node\"}
habanalabs_pcie_transmit_throughput{UUID=\"$hpu\", instance=\"$node\"}
habanalabs_temperature_onchip{UUID=\"$hpu\", instance=\"$node\"}
habanalabs_temperature_onboard{UUID=\"$hpu\", instance=\"$node\"}
habanalabs_device_config{instance=\"$node\", UUID=\"$hpu\"}
habanalabs_device_config{instance=\"$node\", UUID=\"$hpu\"}
habanalabs_utilization{instance=\"$node\", UUID=\"$hpu\"}/100
habanalabs_temperature_onboard{UUID=\"$hpu\", instance=\"$node\"}
habanalabs_temperature_onchip{UUID=\"$hpu\", instance=\"$node\"}
habanalabs_clock_soc_mhz{UUID=\"$hpu\", instance=\"$node\"}
habanalabs_memory_used_bytes{UUID=\"$hpu\", instance=\"$node\"}
habanalabs_power_mW{instance=\"$node\", UUID=\"$hpu\"}
habanalabs_memory_used_bytes{UUID=\"$hpu\", instance=\"$node\"} / habanalabs_memory_total_bytes{UUID=\"$hpu\", instance=\"$node\"}
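To make the comparison above easier to repeat, here is a minimal sketch that reduces each dashboard to its set of distinct metric names and lists what one dashboard uses that the other does not. The function names `dashboard_metrics` and `metrics_missing_from` are made up for illustration, and the sketch assumes (as the `grep expr` commands above do) that each query sits on its own `"expr"` line in the JSON and that all relevant metrics start with `habanalabs_`.

```shell
#!/bin/sh
# Hypothetical helpers, not part of either repo.

# Print the distinct habanalabs_* metric names used in the "expr"
# lines of a Grafana dashboard JSON file, one per line, sorted.
dashboard_metrics() {
    grep '"expr"' "$1" | grep -o 'habanalabs_[A-Za-z0-9_]*' | LC_ALL=C sort -u
}

# Print the metrics used by dashboard $2 that dashboard $1 does not use.
metrics_missing_from() {
    first=$(mktemp)
    second=$(mktemp)
    dashboard_metrics "$1" > "$first"
    dashboard_metrics "$2" > "$second"
    # comm -13 keeps only the lines unique to the second file.
    LC_ALL=C comm -13 "$first" "$second"
    rm -f "$first" "$second"
}

# Example use with the two files named above:
#   metrics_missing_from Dashboard-Gaudi-HW.json gaudi_grafana.json
```

Both sorting and `comm` run under `LC_ALL=C` so that they agree on ordering regardless of the user's locale.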
Description
This is a continuation of #541:
Issues
n/a.
Type of change
Dependencies
n/a.
Tests
CI + PR file view functionality.
Further work
Additional improvements that could be done, e.g. after the v1.1 release:
serviceMonitor files (made obsolete by this and the earlier Helm support PR) from chatqna/
chatqna/dashboards/ and move them to a more appropriate place