Expected Behavior
When /anfhome becomes full, I would expect the admin of the cluster to be notified via email, and users to be notified on the command line when they submit new jobs.
Actual Behavior
When /anfhome is >90% full [1], a node health check fails and new jobs are stuck in "configure" stage forever.
Users without access to the CycleCloud dashboard have no way of knowing why this is happening.
[message="ERROR : Node Health Checks failed - hcl-pg0-22 - BLZ221031015025 - ERROR: nhc: Health check failed: check_fs_used: /anfhome is 90% full (3822928128kB), threshold is 90%";priority="high";level="error"]
Steps to Reproduce the Problem
Fill /anfhome to 90% and try submitting jobs
[1] By the way, this 90% value appears to be hardcoded (?): at least, it does not reflect the value of alerting.local_volume_threshold: 80 or anf.alert_threshold: 80 from my config.yml file.
Thanks Xavier for the pointers on how to enable alerts to the admin.
Actually, both settings are enabled in my case and the Log Analytics workspace exists, but no alerts are set up.
Perhaps my deployment recipes are outdated and this is fixed in later versions; I currently can't touch the system.
Even if proper admin alerts were set up, I still wonder whether preventing new nodes from starting when the disk is >90% full is the right approach... I guess the idea is that you should never actually reach that point?
If there were a way to forward this information to the user, that would be very helpful. It can always happen that an admin cannot react for some time.
The purpose of the alert is certainly to avoid reaching that point. It would be best to send it to an email alias rather than to a single admin. The Grafana dashboard also provides a way to monitor the disk space of mounts and infra VMs, but without alerts.
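As an illustration of the user-facing notification requested above, here is a minimal sketch of a check that could run from a login script or a job-submission wrapper. It is not part of the project: the function names `used_percent` and `fullness_warning` are hypothetical, and the 90% threshold is simply the value reported in the health check message.

```python
import shutil


def used_percent(path):
    """Return the used share (in percent) of the filesystem holding `path`.

    Computed as used / total, so it may differ slightly from the
    percentage that `df` or the node health check reports.
    """
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def fullness_warning(path, threshold, percent=None):
    """Return a warning string if usage is at or above `threshold`, else None.

    `percent` can be passed explicitly (e.g. for testing); otherwise it is
    measured with used_percent(path).
    """
    if percent is None:
        percent = used_percent(path)
    if percent >= threshold:
        return (
            f"WARNING: {path} is {percent:.0f}% full "
            f"(threshold {threshold}%); new jobs may not start."
        )
    return None
```

Calling `fullness_warning("/anfhome", 90)` and printing any returned message to stderr before submitting a job would at least tell users why their jobs may stay in the "configure" stage.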
Version
v1.0.35
In what area(s)?
/area monitoring