Disk usage warning prevents new nodes from starting #1723

Open
ltalirz opened this issue Oct 10, 2023 · 3 comments
Labels
kind/bug Something isn't working

Comments

@ltalirz
Contributor

ltalirz commented Oct 10, 2023

Version

v1.0.35

In what area(s)?

/area monitoring

Expected Behavior

When /anfhome becomes full, I would expect the admin of the cluster to be notified via email, and users to be notified on the command line when they submit new jobs.

Actual Behavior

When /anfhome reaches 90% full [1], a node health check fails and new jobs stay stuck in the "configure" stage indefinitely.
Users without access to the CycleCloud dashboard have no way of knowing why this is happening.

[message="ERROR : Node Health Checks failed - hcl-pg0-22 - BLZ221031015025 - ERROR:  nhc:  Health check failed:  check_fs_used:  /anfhome is 90% full (3822928128kB), threshold is 90%";priority="high";level="error"]
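
To make the failure easier to surface outside the CycleCloud dashboard, the mount point, usage and threshold can be pulled out of the message text. The following is only a minimal sketch: the regular expression is modeled on the single log line quoted above, and the function name parse_fs_used_failure is hypothetical. Other NHC checks or versions may format their messages differently.

```python
import re

# Pattern modeled on the one log line quoted in this issue (an assumption,
# not taken from the NHC source code).
NHC_FS_USED = re.compile(
    r"check_fs_used:\s+(?P<mount>\S+) is (?P<used>\d+)% full "
    r"\((?P<reported_kb>\d+)kB\), threshold is (?P<threshold>\d+)%"
)


def parse_fs_used_failure(message):
    """Return mount, used % and threshold from a check_fs_used failure, or None."""
    match = NHC_FS_USED.search(message)
    if match is None:
        return None
    return {
        "mount": match.group("mount"),
        "used_pct": int(match.group("used")),
        "threshold_pct": int(match.group("threshold")),
    }


message = (
    "ERROR : Node Health Checks failed - hcl-pg0-22 - BLZ221031015025 - "
    "ERROR:  nhc:  Health check failed:  check_fs_used:  /anfhome is 90% full "
    "(3822928128kB), threshold is 90%"
)
info = parse_fs_used_failure(message)
print(info)
```

Note that in this message the reported usage equals the threshold, so the check seems to fail as soon as usage reaches the threshold rather than only above it.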

Steps to Reproduce the Problem

Fill /anfhome to 90% and try submitting jobs

[1] By the way, this 90% threshold appears to be hardcoded (?). At least, it does not reflect the value of alerting.local_volume_threshold: 80 or anf.alert_threshold: 80 from my config.yml file.

ltalirz added the kind/bug (Something isn't working) label Oct 10, 2023
@xpillons
Collaborator

xpillons commented Oct 10, 2023

Please see how to configure monitoring and alerts: https://azure.github.io/az-hop/operate/alerting.html
This is only available in the Terraform deployment; we need help porting it to Bicep.

You need to:

  • enable a Log Analytics workspace or use an existing one
  • enable alerting with alerting.enabled=true and set the admin_email
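
As a rough illustration of the second prerequisite, the check below inspects an already-parsed config.yml (represented here as a plain dict) and reports what is missing. It is a sketch under assumptions: the helper name missing_alert_settings is hypothetical, and nesting admin_email under the alerting key is a guess based on the wording above. The Log Analytics workspace setting is not checked because its key is not named in this thread.

```python
def missing_alert_settings(config):
    """List the alerting prerequisites that a parsed config dict lacks."""
    # Assumption: admin_email lives under the same "alerting" key as "enabled".
    alerting = config.get("alerting", {})
    problems = []
    if alerting.get("enabled") is not True:
        problems.append("alerting.enabled is not true")
    if not alerting.get("admin_email"):
        problems.append("admin_email is not set")
    return problems


# Example config with alerting enabled but no admin email configured.
example = {"alerting": {"enabled": True, "local_volume_threshold": 80}}
print(missing_alert_settings(example))
```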

@ltalirz
Contributor Author

ltalirz commented Oct 10, 2023

Thanks Xavier for the pointers on how to enable alerts to the admin.
Actually, both settings are enabled in my case and the log analytics workspace exists, but no alerts are set up.


Perhaps my deployment recipes are outdated and this is fixed in later versions; I currently can't touch the system.

Even if proper admin alerts were set up, I still wonder whether preventing new nodes from starting when disk usage is >90% is the right approach... I guess the idea is that you should never actually reach that point?
If there were a way to forward this information to the user, that would be very helpful, since it can always happen that an admin cannot react for some time.
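
One way this could look on the user side is a small check run before job submission that prints a warning when the mount is at or above the threshold. This is only a sketch, not part of az-hop: the function name usage_warning is hypothetical, and how it would be hooked into the scheduler's submit command is left open. The percentage is computed from used and total bytes, which may differ slightly from what df or NHC report.

```python
import shutil


def usage_warning(path, threshold_pct=90):
    """Return a warning string if `path` is at or above the threshold, else None."""
    usage = shutil.disk_usage(path)
    used_pct = 100 * usage.used / usage.total
    if used_pct >= threshold_pct:
        return (
            f"WARNING: {path} is {used_pct:.0f}% full "
            f"(threshold {threshold_pct}%); new nodes may fail their health "
            "checks and jobs may stay in the configure stage."
        )
    return None


# "/" is used so the example runs anywhere; on the cluster this would be /anfhome.
warning = usage_warning("/")
if warning:
    print(warning)
```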

@xpillons
Collaborator

The purpose of the alert is certainly to avoid reaching that point. It would be best to send it to an email alias rather than a single admin. The Grafana dashboard also provides a way to monitor the disk space of mounts and infra VMs, but without alerts.

2 participants