This repository has been archived by the owner on Jun 6, 2024. It is now read-only.

GPU utilization monitoring & handling #4789

Closed
10 tasks done
suiguoxin opened this issue Aug 6, 2020 · 0 comments

suiguoxin commented Aug 6, 2020

  • Alert-manager: kill jobs with low GPU utilization and tag abnormal jobs
    • add virtual cluster info to job-exporter
    • configure monitoring rules in Prometheus
    • send action requests through a webhook
    • job-handler: handle webhook requests and redirect them to RestServer
    • implement a customized SMTP service in alert-handler, send alert emails to users when possible, and change the email template to EJS
  • Job tags:
    • DB: job-tag table
    • RestServer:
      • getJobList: filter by tag
      • getJobDetails: include tag info
      • tag: put / delete
  • Cordon the node through the k8s API when a GPU ECC error occurs
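
For illustration only, here is a rough sketch of how a handler could turn an Alertmanager webhook payload into the actions listed above (tag and stop a job, cordon a node). The payload shape (`alerts`, `status`, `labels`, `alertname`) follows the standard Alertmanager webhook format. Everything else is an assumption and not taken from this issue: the alert names, the `job_name` / `node_name` labels, the tag value, and the `Action` type are hypothetical. The sketch only plans actions; the actual calls to RestServer or the k8s API are left out.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any

# Hypothetical alert names; the issue does not name the actual Prometheus rules.
LOW_GPU_UTIL_ALERT = "JobGpuUtilizationLow"
GPU_ERROR_ALERT = "GpuError"


@dataclass(frozen=True)
class Action:
    kind: str         # "tag_job", "stop_job", or "cordon_node"
    target: str       # job name or node name
    detail: str = ""  # e.g. the tag to attach


def plan_actions(payload: dict[str, Any]) -> list[Action]:
    """Map an Alertmanager webhook payload to a list of follow-up actions."""
    actions: list[Action] = []
    for alert in payload.get("alerts", []):
        # Resolved alerts need no follow-up.
        if alert.get("status") != "firing":
            continue
        labels = alert.get("labels", {})
        name = labels.get("alertname")
        # "job_name" and "node_name" are assumed label names.
        if name == LOW_GPU_UTIL_ALERT and "job_name" in labels:
            job = labels["job_name"]
            actions.append(Action("tag_job", job, "low-gpu-utilization"))
            actions.append(Action("stop_job", job))
        elif name == GPU_ERROR_ALERT and "node_name" in labels:
            actions.append(Action("cordon_node", labels["node_name"]))
    return actions


if __name__ == "__main__":
    example = {
        "status": "firing",
        "alerts": [
            {
                "status": "firing",
                "labels": {"alertname": LOW_GPU_UTIL_ALERT, "job_name": "job-a"},
            },
            {
                "status": "firing",
                "labels": {"alertname": GPU_ERROR_ALERT, "node_name": "node-1"},
            },
        ],
    }
    for action in plan_actions(example):
        print(action)
```

Keeping the planning step a pure function separates the decision (which alert triggers which action) from the side effects, so the rules can be checked without a cluster.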