
Update observability README + fix typos #556

Merged
merged 3 commits into opea-project:main from grafana-readme
Nov 15, 2024

Conversation

Contributor

@eero-t eero-t commented Nov 13, 2024

Description

This is a continuation of #541:

  • Updates the observability README + fixes typos
  • Gives its image files more descriptive names

Issues

n/a.

Type of change

  • Bug fix (non-breaking change which fixes an issue)

Dependencies

n/a.

Tests

CI + PR file view functionality.

Further work

Additional improvements that could be done, e.g. after the v1.1 release:

  • Remove serviceMonitor files (made obsolete by this and an earlier Helm support PR) from chatqna/
  • Generalize dashboards under chatqna/dashboards/ and move them to a more appropriate place
  • Optimize image file sizes

Scaling them down and converting them to 8-bit would be a good next step, to bring their file sizes down to something more reasonable.
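
A minimal sketch of what that optimization could look like with ImageMagick (assuming it is available; the file names below are only placeholders):

$ # downscale to half size and convert to an 8-bit palette PNG (placeholder file names)
$ convert dashboard_screenshot.png -resize 50% -strip -colors 256 PNG8:dashboard_screenshot_small.png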

Signed-off-by: Eero Tamminen <[email protected]>
@eero-t eero-t requested a review from daisy-ycguo as a code owner November 13, 2024 18:20
@lianhao lianhao added this to the v1.1 milestone Nov 14, 2024
@@ -40,7 +40,7 @@ kubectl port-forward service/grafana 3000:80

Open your browser and navigate to http://localhost:3000. Use "admin/prom-operator" as the username and the password to login.

-## 2. Metric for Gaudi Hardware(v1.16.2)
+## 2. Metrics for Gaudi Hardware (v1.16.2)
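
A quick way to sanity-check that the port-forward shown in the snippet above works is Grafana's unauthenticated health endpoint (a minimal sketch, assuming the service is named grafana as in the README; background the port-forward or run curl from another terminal):

$ kubectl port-forward service/grafana 3000:80 &
$ curl -s http://localhost:3000/api/health    # should report "database": "ok"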

Collaborator

The 1.1 release is using tgi-gaudi 2.0.6, which is validated with SW stack v1.18. Shall we update/verify the Metrics with 1.18?

Contributor Author

@eero-t eero-t Nov 14, 2024

IMHO not worth the trouble for that dashboard. I do not see why Habana would change metric names between version upgrades, and there's a much better Gaudi HW panel in the Eval repo [1], which is more worth checking.

In the long run, I think it would be better to separate dashboards for different purposes into different repos, instead of duplicating them:

  • Drop the Gaudi HW dashboard from here [1],
  • Move the PCM one to the Eval repo, and
  • Add k8s-specific ones here that make some assumptions about how deployments are named with the current Helm charts, to allow the user to select from multiple apps running in different namespaces

=> I can do that after v1.1.


[1] The Gaudi HW dashboard here is not very good. It's lacking most metrics, does not allow selecting a node or device for them, and, as can be seen from its screenshot in the README, the metric legends are awful:

$ cat Dashboard-Gaudi-HW.json  | grep expr | cut -d'"' -f4- | sed 's/",$//'
habanalabs_temperature_onboard
habanalabs_kube_info
habanalabs_memory_free_bytes
habanalabs_power_mW
habanalabs_utilization

vs the one in the Eval repo:

$ cat gaudi_grafana.json | grep expr | cut -d'"' -f4- | sed 's/",$//'
habanalabs_device_config{instance=\"$node\", UUID=\"$hpu\"}
habanalabs_utilization{instance=\"$node\", UUID=\"$hpu\"}/100
habanalabs_power_mW{instance=\"$node\", UUID=\"$hpu\"}
habanalabs_pcie_receive_throughput{UUID=\"$hpu\", instance=\"$node\"}
habanalabs_pcie_transmit_throughput{UUID=\"$hpu\", instance=\"$node\"}
habanalabs_temperature_onchip{UUID=\"$hpu\", instance=\"$node\"}
habanalabs_temperature_onboard{UUID=\"$hpu\", instance=\"$node\"}
habanalabs_device_config{instance=\"$node\", UUID=\"$hpu\"}
habanalabs_device_config{instance=\"$node\", UUID=\"$hpu\"}
habanalabs_utilization{instance=\"$node\", UUID=\"$hpu\"}/100
habanalabs_temperature_onboard{UUID=\"$hpu\", instance=\"$node\"}
habanalabs_temperature_onchip{UUID=\"$hpu\", instance=\"$node\"}
habanalabs_clock_soc_mhz{UUID=\"$hpu\", instance=\"$node\"}
habanalabs_memory_used_bytes{UUID=\"$hpu\", instance=\"$node\"}
habanalabs_power_mW{instance=\"$node\", UUID=\"$hpu\"}
habanalabs_memory_used_bytes{UUID=\"$hpu\", instance=\"$node\"} / habanalabs_memory_total_bytes{UUID=\"$hpu\", instance=\"$node\"}
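
For comparison, the same expr fields could be extracted a bit more robustly with jq instead of grep/cut/sed (a sketch, assuming jq is available):

$ jq -r '.. | .expr? // empty' gaudi_grafana.json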

@lianhao lianhao merged commit 1d77b81 into opea-project:main Nov 15, 2024
7 checks passed
@eero-t eero-t deleted the grafana-readme branch December 5, 2024 10:35