-
Notifications
You must be signed in to change notification settings - Fork 430
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add HealthChecker Callback #2002
Conversation
Can you please update |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like its on your todo, but can you show a screenshot of a slack alert getting triggered?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks Hanlin!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM mod 2 minor nits. Great v1! Excited to test it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer a separate target since it doesn't really fit under dev -- the rest are pretty clearly more dev things and this is not. But no strong opinions
Co-authored-by: Mihir Patel <[email protected]>
What does this PR do?
Adds a
HealthChecker
callback to monitor node health. Will log a warning if GPUs on a node appear to be in poor health. Provides an optional interface to send a message via Slack webhook.Note: this adds the
pynvml
andslack_sdk
to the dependencies. Unsure if this should be in thebase
install or a separate tag.TODO:
Before submitting
pre-commit
on your change? (see thepre-commit
section of prerequisites)