OpenPAI has a built-in alert system. The alert system ships with a set of existing alert rules and actions, and it also lets the admin customize both. This document gives a detailed introduction to this topic.
OpenPAI uses Prometheus to monitor system metrics, e.g. memory usage, disk usage, GPU usage, and so on. Using these metrics, we can set up several alert rules. An alert rule defines an alert condition and is also configured in Prometheus. When the condition is fulfilled, Prometheus will send the corresponding alert.
For example, the following configuration is the pre-defined `GpuUsedByExternalProcess` alert. It uses the metric `gpu_used_by_external_process_count`. If an external process uses GPU resources in OpenPAI for more than 5 minutes, Prometheus will fire a `GpuUsedByExternalProcess` alert.
```yaml
alert: GpuUsedByExternalProcess
expr: gpu_used_by_external_process_count > 0
for: 5m
annotations:
  summary: found NVIDIA used by external process in {{$labels.instance}}
```
For the detailed syntax of alert rules, please refer to the Prometheus alerting rules documentation.
All alerts fired by the alert rules, including the pre-defined rules and the customized rules, are shown on the webportal home page (in the top-right corner).
By default, OpenPAI provides you with many metrics and some pre-defined alert rules. You can go to `http(s)://<your master IP>/prometheus/graph` to explore different metrics. Some frequently-used metrics include:
- `task_gpu_percent`: GPU usage percent for a single task in OpenPAI jobs
- `task_cpu_percent`: CPU usage percent for a single task in OpenPAI jobs
- `node_memory_MemTotal_bytes`: Total memory amount in bytes for nodes
- `node_memory_MemAvailable_bytes`: Available memory amount in bytes for nodes
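To illustrate how such metrics feed into alert rules, here is a minimal sketch of a rule built on the node memory metrics above. The rule name and threshold are made up for illustration; this is not one of the pre-defined rules:

```yaml
# Hypothetical rule for illustration only -- not one of OpenPAI's pre-defined alerts.
- alert: NodeMemoryLow
  expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.1
  for: 10m
  annotations:
    summary: "available memory on {{$labels.instance}} is below 10%"
```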
To view the existing alert rules based on these metrics, you can go to `http(s)://<your master IP>/prometheus/alerts`, which shows their definitions and status.
You can define customized alerts in the `prometheus` field in `services-configuration.yml`. For example, we can add a customized alert `PAIJobGpuPercentLowerThan0_3For1h` by adding:
```yaml
prometheus:
  customized-alerts: |
    groups:
      - name: customized-alerts
        rules:
          - alert: PAIJobGpuPercentLowerThan0_3For1h
            expr: avg(task_gpu_percent{virtual_cluster=~"default"}) by (job_name, username) < 0.3
            for: 1h
            labels:
              severity: warn
            annotations:
              summary: "{{$labels.job_name}} has a job gpu percent lower than 30% for 1 hour"
              description: Monitor job level gpu utilization in certain virtual clusters.
```
The `PAIJobGpuPercentLowerThan0_3For1h` alert will be fired when a job on the virtual cluster `default` has a task-level average GPU percent lower than 30% for more than 1 hour.
The alert severity can be defined as `info`, `warn`, `error`, or `fatal` by adding a label; here we use `warn`.
The metric `task_gpu_percent` is used here, which describes the GPU utilization at the task level.
Remember to push the service config to the cluster and restart the `prometheus` service after your modification, using the following commands in the dev-box container:
```bash
./paictl.py service stop -n prometheus
./paictl.py config push -p /cluster-configuration -m service
./paictl.py service start -n prometheus
```
Please refer to Prometheus Alerting Rules for alerting rule syntax.
Admins can choose how to handle alerts through different alert actions. We provide some basic alert actions, and you can also customize your own. In this section, we first introduce the existing actions and the matching rules between these actions and alerts, and then explain how to add new alert actions. Both the actions and the matching rules are handled by `alert-manager`.
The alert actions and the matching rules are realized in the `alert-manager` service. To define them, you should modify the `alert-manager` field in `services-configuration.yml`. The full spec of the configuration is as follows:
```yaml
alert-manager:
  port: 9093 # optional, do not change this if you do not want to change the port alert-manager is listening on
  pai-bearer-token: 'your-application-token-for-pai-rest-server'
  alert-handler:
    port: 9095 # optional, do not change this if you do not want to change the port alert-handler is listening on
    email-configs: # email-notification will only be enabled when this field is not empty
      admin-receiver: admin@example.com
      smtp-host: smtp.office365.com
      smtp-port: 587
      smtp-from: alert-sender@example.com
      smtp-auth-username: alert-sender@example.com
      smtp-auth-password: password-for-alert-sender
  customized-routes: # routes are the matching rules between alerts and receivers
    routes:
      - receiver: pai-email-admin-user-and-stop-job
        match:
          alertname: PAIJobGpuPercentLowerThan0_3For1h
  customized-receivers: # receivers are combination of several actions
    - name: "pai-email-admin-user-and-stop-job"
      actions:
        # the email template for `email-admin` and `email-user` can be chosen from ['general-template', 'kill-low-efficiency-job-alert']
        # if no template is specified, 'general-template' will be used.
        email-admin:
        email-user:
          template: 'kill-low-efficiency-job-alert'
        stop-jobs: # no parameters required for stop-jobs action
        tag-jobs:
          tags:
            - 'stopped-by-alert-manager'
```
So far, we provide the following actions:
- `email-admin`: Send emails to the assigned admin.
- `email-user`: Send emails to the owners of jobs. Currently, this action uses the same email template as `email-admin`.
- `stop-jobs`: Stop jobs by calling the OpenPAI REST API. Be careful with this action because it stops jobs without notifying the related users.
- `tag-jobs`: Add a tag to jobs by calling the OpenPAI REST API.
- `cordon-nodes`: Call the Kubernetes API to cordon the corresponding nodes.
- `fix-nvidia-gpu-low-perf`: Start a privileged container to fix the NVIDIA GPU Low Performance State issue.
But before you use them, you have to add the proper configuration in the `alert-handler` field. For example, `email-admin` requires you to set up an SMTP account to send the email and an admin email address to receive it. Also, the `tag-jobs` and `stop-jobs` actions call the OpenPAI REST API, so you should set a REST server token for them. To get the token, go to your profile page (in the top-right corner of the webportal, click `View my profile`) and use `Create application token` to create one. Generally speaking, there are two parts of the configuration in the `alert-handler` field: one is `email-configs`, the other is `pai-bearer-token`. The requirements for different actions are shown in the following table:
| Action | email-configs | pai-bearer-token |
| --- | --- | --- |
| cordon-nodes | - | - |
| email-admin | required | - |
| email-user | required | required |
| stop-jobs | - | required |
| tag-jobs | - | required |
| fix-nvidia-gpu-low-perf | - | - |
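For instance, if you only plan to use the `stop-jobs` and `tag-jobs` actions, a minimal sketch of the required configuration is just the token; the email-related fields can be omitted since these actions do not need them:

```yaml
# Minimal sketch: only the field required by stop-jobs / tag-jobs.
alert-manager:
  pai-bearer-token: 'your-application-token-for-pai-rest-server'
```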
In addition, some actions depend on certain fields in the `labels` of alert instances. The labels of an alert instance are generated from the expression in the alert rule. For example, the expression of the `PAIJobGpuPercentLowerThan0_3For1h` alert mentioned in the previous section is `avg(task_gpu_percent{virtual_cluster=~"default"}) by (job_name, username) < 0.3`. This expression returns a list whose elements contain `job_name` and `username` fields, so the labels of the alert instance will also contain a `job_name` field and a `username` field. The `stop-jobs` action depends on the `job_name` field and will stop the corresponding job based on it. To inspect the labels of an alert, you can visit `http(s)://<your master IP>/prometheus/alerts`; if the alert is firing, you can see its labels on this page. The label fields that each pre-defined action depends on are listed in the following table:
| Action | Depended label field |
| --- | --- |
| cordon-nodes | node_name |
| email-admin | - |
| email-user | - |
| stop-jobs | job_name |
| tag-jobs | job_name |
| fix-nvidia-gpu-low-perf | node_name, minor_number |
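For example, a firing instance of the `PAIJobGpuPercentLowerThan0_3For1h` alert would carry labels roughly like the following (the values here are illustrative), which is what allows `stop-jobs` and `tag-jobs` to locate the job:

```yaml
# Illustrative label set of one firing alert instance; actual values depend on your jobs.
alertname: PAIJobGpuPercentLowerThan0_3For1h
job_name: some-user~low-gpu-usage-job   # from the `by (job_name, username)` clause
username: some-user                     # from the `by (job_name, username)` clause
severity: warn                          # from the rule's `labels` section
```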
The matching rules between alerts and actions are defined using `receivers` and `routes`.
A `receiver` is simply a group of actions, and a `route` matches alerts to a specific `receiver`.
With the default configuration, all alerts match the default receiver, which triggers only the `email-admin` action (but if you don't set the email configuration, the action won't work).
You can add new receivers with related matching rules to assign actions to alerts in the `alert-manager` field in `services-configuration.yml`.
For example:
```yaml
alert-manager:
  ......
  customized-routes: # routes are the matching rules between alerts and receivers
    routes:
      - receiver: pai-email-admin-user-and-stop-job
        match:
          alertname: PAIJobGpuPercentLowerThan0_3For1h
  customized-receivers: # receivers are combination of several actions
    - name: "pai-email-admin-user-and-stop-job"
      actions:
        # the email template for `email-admin` and `email-user` can be chosen from ['general-template', 'kill-low-efficiency-job-alert']
        # if no template is specified, 'general-template' will be used.
        email-admin:
        email-user:
          template: 'kill-low-efficiency-job-alert'
        stop-jobs: # no parameters required for stop-jobs action
        tag-jobs:
          tags:
            - 'stopped-by-alert-manager'
  ......
```
Here we define:

- a receiver `pai-email-admin-user-and-stop-job`, which contains the actions `email-admin`, `email-user`, `stop-jobs`, and `tag-jobs`;
- a route, which matches the alert `PAIJobGpuPercentLowerThan0_3For1h` to the receiver `pai-email-admin-user-and-stop-job`.
As a consequence, when the alert `PAIJobGpuPercentLowerThan0_3For1h` is fired, all four of these actions will be triggered.
For the `routes` definition, we adopt the syntax of Prometheus Alertmanager.
For the `receivers` definition, you can simply:
- name the receiver in the `name` field;
- list the actions to use in `actions` and fill in the corresponding parameters for each action (see the sketch after this list):
  - `email-admin`:
    - template: optional, can be chosen from ['general-template', 'cluster-usage', 'kill-low-efficiency-job-alert', 'job-status-change']; 'general-template' by default.
  - `email-user`:
    - template: optional, can be chosen from ['general-template', 'kill-low-efficiency-job-alert', 'job-status-change']; 'general-template' by default.
  - `cordon-nodes`: no parameters required
  - `stop-jobs`: no parameters required
  - `tag-jobs`:
    - tags: required, a list of tags
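As another minimal sketch, a receiver that only emails the admin with the default template and cordons the affected nodes could look like the following. The receiver name and the matched alert name are made up for illustration; the matched alert is assumed to carry a `node_name` label so that `cordon-nodes` can act on it:

```yaml
# Hypothetical example; the receiver name and alert name are made up for illustration.
alert-manager:
  ......
  customized-routes:
    routes:
      - receiver: pai-cordon-and-email-admin
        match:
          alertname: NodeGpuLowPerf      # assumes an alert whose labels carry node_name
  customized-receivers:
    - name: "pai-cordon-and-email-admin"
      actions:
        email-admin:      # no template specified, so 'general-template' is used
        cordon-nodes:     # no parameters required
  ......
```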
You can also add customized email templates by adding a template folder in `pai/src/alert-manager/deploy/alert-templates`.
Two files need to be present: an email body template file named `html.ejs` and an email subject template file named `subject.ejs`.
The folder name will automatically be used as the template name.
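For instance, assuming you add a hypothetical folder named `my-custom-template`, the layout would look like this:

```
pai/src/alert-manager/deploy/alert-templates/
├── general-template/
│   ├── html.ejs
│   └── subject.ejs
└── my-custom-template/     # hypothetical folder; the folder name becomes the template name
    ├── html.ejs            # email body template
    └── subject.ejs         # email subject template
```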
Remember to push the service config to the cluster and restart the `alert-manager` service after your modification, using the following commands in the dev-box container:
```bash
./paictl.py service stop -n alert-manager
./paictl.py config push -p /cluster-configuration -m service
./paictl.py service start -n alert-manager
```
For OpenPAI service management, please refer to here.
If you want to add new customized actions, follow these steps:
We provide `alert-handler` as a lightweight `express` application, where you can add customized APIs easily.
For example, the `stop-jobs` action is realized by calling the `localhost:9095/alert-handler/stop-jobs` API through a webhook; the request is then forwarded to the OpenPAI REST Server to stop the job.
You can add new APIs in `alert-handler` and adapt the request to realize the required action.
The source code of `alert-handler` is available here.
As stated before, to make an action available, administrators need to provide the necessary configuration. Check this folder and define the dependency rules for your customized actions.
When customized receivers are defined in `services-configuration.yml`, the `actions` will then be rendered as `webhook_configs` here.
The actions we provide, `email-admin`, `email-user`, `stop-jobs`, `tag-jobs`, and `cordon-nodes`, can be called within `alert-manager` by sending POST requests to `alert-handler`:
- `localhost:{your_alert_handler_port}/alert-handler/send-email-to-admin`
- `localhost:{your_alert_handler_port}/alert-handler/send-email-to-user`
- `localhost:{your_alert_handler_port}/alert-handler/stop-jobs`
- `localhost:{your_alert_handler_port}/alert-handler/tag-jobs/:tag`
- `localhost:{your_alert_handler_port}/alert-handler/cordon-nodes`
The request body will be filled automatically by `alert-manager` through the webhook, and `alert-handler` will adapt the requests to the various actions.
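For instance, here is a rough sketch of what the `pai-email-admin-user-and-stop-job` receiver from the earlier example could be rendered into; the exact generated configuration may differ, and the URLs simply point at the endpoints listed above:

```yaml
# Rough sketch of the rendered Alertmanager receiver; the actual generated config may differ.
receivers:
  - name: pai-email-admin-user-and-stop-job
    webhook_configs:
      - url: 'http://localhost:9095/alert-handler/send-email-to-admin'
      - url: 'http://localhost:9095/alert-handler/send-email-to-user'
      - url: 'http://localhost:9095/alert-handler/stop-jobs'
      - url: 'http://localhost:9095/alert-handler/tag-jobs/stopped-by-alert-manager'
```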
Please define how to render your customized action to the `alert-handler` API request here.
Remember to re-build and push the Docker image, and restart the `alert-manager` service after your modification, using the following commands in the dev-box container:
```bash
./build/pai_build.py build -c /cluster-configuration/ -s alert-manager
./build/pai_build.py push -c /cluster-configuration/ -i alert-handler cluster-utilization
./paictl.py service stop -n alert-manager
./paictl.py config push -p /cluster-configuration -m service
./paictl.py service start -n alert-manager
```
We provide the functionality to send a cluster GPU utilization report to admin users on a regular basis.
The report includes statistics for:
- Cluster GPU utilization
- User GPU utilization
- Job GPU utilization
To enable this feature, you should configure the `alert-manager` field in `services-configuration.yml`.
`pai-bearer-token` and `cluster-utilization` -> `schedule` are necessary fields for this feature.
For the syntax of `schedule`, please refer to Cron Schedule Syntax. For example, `"0 0 * * *"` means a daily report at UTC 00:00.
Please also make sure that the `email-admin` action is enabled.
```yaml
alert-manager:
  pai-bearer-token: 'your-application-token-for-pai-rest-server'
  cluster-utilization: # cluster-utilization is a k8s CronJob which reports the GPU utilization of the cluster
    # for schedule syntax, refer to https://kubernetes.io/docs/concepts/workloads/controllers/cron-jobs/#cron-schedule-syntax
    schedule: "0 0 * * *" # daily report at UTC 00:00
```
To make your configuration take effect, restart the `alert-manager` service after your modification with the following commands in the dev-box container:
```bash
./paictl.py service stop -n alert-manager
./paictl.py config push -p /cluster-configuration -m service
./paictl.py service start -n alert-manager
```
We provide the functionality to check the Kubernetes cert expiration date and send a warning to admin users.
You can configure the `alert-manager` -> `cert-expiration-checker` field in `services-configuration.yml`.
`schedule`, `alert-residual-days`, and `cert-path` are necessary fields for this feature, and default values are provided for them.
For the syntax of `schedule`, please refer to Cron Schedule Syntax. For example, `"0 0 * * *"` means a daily check at UTC 00:00.
An alert will be sent to the admin if the `email-admin` action is enabled.
```yaml
alert-manager:
  cert-expiration-checker: # cert-expiration-checker is a k8s CronJob which checks the cert expiration date
    # for schedule syntax, refer to https://kubernetes.io/docs/concepts/workloads/controllers/cron-jobs/#cron-schedule-syntax
    schedule: '0 0 * * *' # daily check at UTC 00:00
    alert-residual-days: 30 # send an alert if the expiration date is coming soon
    cert-path: '/etc/kubernetes/ssl' # the k8s cert path on the master node
```
To make your configuration take effect, restart the `alert-manager` service after your modification with the following commands in the dev-box container:
```bash
./paictl.py service stop -n alert-manager
./paictl.py config push -p /cluster-configuration -m service
./paictl.py service start -n alert-manager
```
We provide the functionality to send job status change notifications to users. If enabled, users will be notified by email of status changes.
Users can also customize which status changes they want to be notified about in the job config; refer to here for details.
To enable this feature, you should configure the `alert-manager` field in `services-configuration.yml`.
`pai-bearer-token` and `job-status-change-notification` -> `enable` are necessary fields for this feature.
Please make sure that the `email-user` action is enabled.
```yaml
alert-manager:
  pai-bearer-token: 'your-application-token-for-pai-rest-server'
  job-status-change-notification: # send job status change notification to users when enabled
    enable: true
```
To make your configuration take effect, restart the `alert-manager` service after your modification with the following commands in the dev-box container:
```bash
./paictl.py service stop -n alert-manager
./paictl.py config push -p /cluster-configuration -m service
./paictl.py service start -n alert-manager
```