Add a collector to fetch energy values from /sys/cray/pm_counters
on Cray systems
#239
Comments
Do you have an HPE system with these counters available to you? I can help facilitate examples/testing/... if you need it.
Cheers @jhansonhpe for the interest and the offer of help. Yes, I managed to get some example files from the Adastra machine. If you work for HPE, I have a few questions about these counters that you may be able to answer:
Thanks a lot!
Yes, I work for HPE. I manage the team that writes monitoring software for the systems. The main feature of pm_counters is easy node-level access to the values, so that things like Slurm (there is an included plugin from SchedMD) can read them without privileged access to Redfish or the system-level data collectors. On a randomly chosen node (same node type as Frontier at ORNL) the counters are
There is no IPMI in the Cray EX node controllers, so getting all the sensor data through direct access is a challenge for CEEMS. System admins could certainly allow queries to the monitoring databases, or consumption from Kafka, if there are more metrics that CEEMS should collect. There will be a slight difference in timestamps because of the small in-flight delay in getting metrics to Kafka.
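For readers following along, here is a minimal sketch of how a node-level reader might parse one of these counter files and derive average power from two energy samples. It assumes each file holds a single line of the form `<value> <unit> <timestamp> us` (for example `1000 J 1000000 us`) and that an `energy` file exists under `/sys/cray/pm_counters`; neither is confirmed in this thread, and the names `CounterSample`, `parse_counter`, `read_counter` and `average_power_watts` are hypothetical.

```python
from pathlib import Path
from typing import NamedTuple, Optional

# Directory named in the issue title; the file names inside it are assumptions.
PM_COUNTERS_ROOT = Path("/sys/cray/pm_counters")


class CounterSample(NamedTuple):
    value: int
    unit: str
    timestamp_us: Optional[int]  # None when the line carries no timestamp


def parse_counter(text: str) -> CounterSample:
    """Parse one counter line, ASSUMED to look like '<value> <unit> <timestamp> us'."""
    fields = text.split()
    if len(fields) < 2:
        raise ValueError(f"unexpected counter format: {text!r}")
    timestamp_us = None
    if len(fields) >= 4 and fields[3] == "us":
        timestamp_us = int(fields[2])
    return CounterSample(int(fields[0]), fields[1], timestamp_us)


def read_counter(name: str, root: Path = PM_COUNTERS_ROOT) -> CounterSample:
    """Read and parse a single counter file such as 'energy' (assumed name)."""
    return parse_counter((root / name).read_text())


def average_power_watts(first: CounterSample, second: CounterSample) -> float:
    """Average power between two energy samples: delta joules / delta seconds."""
    if first.unit != "J" or second.unit != "J":
        raise ValueError("both samples must be energy readings in joules")
    if first.timestamp_us is None or second.timestamp_us is None:
        raise ValueError("both samples need a timestamp")
    elapsed_s = (second.timestamp_us - first.timestamp_us) / 1_000_000
    if elapsed_s <= 0:
        raise ValueError("second sample must be newer than the first")
    return (second.value - first.value) / elapsed_s
```

Using the timestamp stored next to the value, rather than the time of the read, keeps the power estimate independent of how long the read itself takes.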
Is a Redfish Mockup Creator of a node controller for the blade type above.
Cheers @jhansonhpe for the very detailed responses. Appreciate it.
This is awesome. It would be amazing if the community could standardize the kernel module that you are using on Cray nodes (like the one from the OpenIPMI driver) to give node-level access to power/energy counters from generic BMCs. I assume the node controller that the kernel module uses now is specific to Cray nodes.
Actually, I have added a Redfish collector to get power metrics from the Redfish API server. The small inconvenience is that the BMC network is seldom reachable from the compute node, so we need to send the Redfish requests through a reverse proxy deployed on a management node where the BMC network is reachable. As you rightly pointed out, this will introduce slight differences in timestamps due to network latency. We also have an HPE machine in our center, an SGI system where in-band IPMI access is configured, so I was curious whether HPE manages Cray nodes in the same way. Thanks for the Redfish mockup responses. Very helpful!!
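As a rough illustration of what a Redfish-based reader has to do once a response arrives (whether directly or through the reverse proxy), the sketch below extracts power readings from a JSON payload. It assumes the payload follows the legacy DMTF `Power` resource layout, with a `PowerControl` array whose entries carry `PowerConsumedWatts`; the Cray node controller may expose different resources, and `power_consumed_watts` is a hypothetical helper, not the collector's actual code.

```python
import json
from typing import List


def power_consumed_watts(payload: str) -> List[float]:
    """Return every PowerConsumedWatts reading found in a Redfish Power payload.

    ASSUMPTION: the payload uses the legacy Power schema, i.e. a top-level
    'PowerControl' array. Entries without a reading (missing or null) are skipped.
    """
    document = json.loads(payload)
    readings = []
    for control in document.get("PowerControl", []):
        watts = control.get("PowerConsumedWatts")
        if watts is not None:
            readings.append(float(watts))
    return readings
```

No network call is made here on purpose: the proxy URL, credentials and resource path are site-specific and are not given in this thread.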
Most hardware is (slowly) moving away from IPMI (an insecure, legacy standard) to Redfish, which comes with this exact challenge. I would expect most systems not to permit direct queries to Redfish (network isolation as you point out, plus the need to provide access credentials AND the possibility of denial of service if the queries are too fast/heavy). On HPE systems with CSM or HPCM, sensor data can be made available in Kafka for consumption.
https://cray-hpe.github.io/docs-csm/en-10/operations/power_management/user_access_to_compute_node_power_data/