Disk usage warning prevents new nodes from starting #1723

ltalirz · 2023-10-10T10:03:13Z

Version

v1.0.35

In what area(s)?

/area monitoring

Expected Behavior

When /anfhome becomes full, I would expect the admin of the cluster to be notified via email, and users to be notified on the command line when they submit new jobs.

Actual Behavior

When /anfhome is >=90% full [1], a node health check fails and new jobs stay stuck in the "configure" stage indefinitely.
Users without access to the CycleCloud dashboard have no way of knowing why this is happening.

[message="ERROR : Node Health Checks failed - hcl-pg0-22 - BLZ221031015025 - ERROR:  nhc:  Health check failed:  check_fs_used:  /anfhome is 90% full (3822928128kB), threshold is 90%";priority="high";level="error"]
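For anyone who wants to surface this message outside the CycleCloud dashboard, here is a minimal sketch of pulling the mount point, usage and threshold out of the log line. The regex is written only against the single message quoted above, and the function and field names (`parse_fs_used`, `reported_kb`, ...) are my own, not part of NHC or az-hop; other NHC versions or checks may format their output differently.

```python
import re

# Pattern derived from the one log line quoted in this issue; it is an
# assumption that other check_fs_used failures follow the same format.
NHC_FS_USED = re.compile(
    r"check_fs_used:\s+(?P<mount>\S+) is (?P<used>\d+)% full"
    r" \((?P<kb>\d+)kB\), threshold is (?P<threshold>\d+)%"
)


def parse_fs_used(message):
    """Return the fields of a check_fs_used failure, or None if no match."""
    m = NHC_FS_USED.search(message)
    if m is None:
        return None
    return {
        "mount": m.group("mount"),
        "used_pct": int(m.group("used")),
        "reported_kb": int(m.group("kb")),  # size figure as printed by NHC
        "threshold_pct": int(m.group("threshold")),
    }


line = (
    "ERROR : Node Health Checks failed - hcl-pg0-22 - BLZ221031015025 - "
    "ERROR:  nhc:  Health check failed:  check_fs_used:  /anfhome is 90% full "
    "(3822928128kB), threshold is 90%"
)
print(parse_fs_used(line))
```

A result of `None` means the line was not a `check_fs_used` failure in this format, so a caller can fall through to other handling.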

Steps to Reproduce the Problem

Fill /anfhome to 90% and try submitting jobs

[1] By the way, this value of 90% appears to be hardcoded (?); at least it does not reflect the value of alerting.local_volume_threshold: 80 or anf.alert_threshold: 80 from my config.yml file.

xpillons · 2023-10-10T10:08:17Z

Please see how to configure monitoring and alerts: https://azure.github.io/az-hop/operate/alerting.html
This is only available in the Terraform deployment; help is needed to port it to Bicep.

You need to:

- enable the log analytics workspace or use an existing one
- enable alerting with alerting.enabled=true and set the admin_email

ltalirz · 2023-10-10T10:43:58Z

Thanks Xavier for the pointers on how to enable alerts to the admin.
Actually, both settings are enabled in my case and the log analytics workspace exists, but no alerts are set up.

Perhaps my deployment recipes are outdated and this is fixed in later versions; I currently can't touch the system.

Even if proper admin alerts were set up, I still wonder whether preventing new nodes from starting when the disk is >90% full is the right approach... I guess the idea is that you should never actually reach that point?
If there were a way to forward this information to the user, that would be very helpful. It can always happen that an admin cannot react for some time.
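To make the idea concrete, here is a rough sketch of a user-facing check that could run from a login profile or a job-submission wrapper. Nothing like this exists in az-hop as far as I know; the function names and the default of 90 are assumptions that simply mirror the threshold in the error message above, and the percentage may differ slightly from what NHC reports.

```python
import shutil


def usage_percent(used, total):
    """Percentage of the filesystem in use, rounded down."""
    if total <= 0:
        return 0
    return used * 100 // total


def usage_warning(path, threshold_pct=90, usage=None):
    """Return a warning string if `path` is at or above the threshold, else None.

    `usage` may be any object with `total` and `used` attributes (in bytes);
    if omitted, it is read from the filesystem with shutil.disk_usage.
    """
    if usage is None:
        usage = shutil.disk_usage(path)  # named tuple: total, used, free
    pct = usage_percent(usage.used, usage.total)
    if pct >= threshold_pct:
        return (
            f"WARNING: {path} is {pct}% full (threshold {threshold_pct}%); "
            "new nodes may fail their health check and jobs may not start."
        )
    return None


# Example call against the current directory; on the cluster the path
# would be /anfhome.
message = usage_warning(".")
if message is not None:
    print(message)
```

This only tells the user that the limit has been reached; it does not replace alerting the admin before that point.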

xpillons · 2023-10-11T10:24:30Z

The purpose of the alert is precisely to avoid reaching that point. Ideally the alert would go to an email alias rather than a single admin. The Grafana dashboard also provides a way to monitor the disk space of mounts and infra VMs, but without alerts.

ltalirz added the kind/bug Something isn't working label Oct 10, 2023
