This repository has been archived by the owner on Jun 6, 2024. It is now read-only.

[Alert manager] k8s cert expiration checker (#5409)
* add cert-expiration-checker cronjob

* update

* update

* update

* update

* update

* update

* update

* update

* update

* test

* fix

* update

* update

* update

* update

* update

* fix

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* test

* update

* update

* update

* add doc

* update

* fix lint
yiyione authored Apr 9, 2021
1 parent 6b441c4 commit 4cc1e90
Showing 11 changed files with 176 additions and 16 deletions.
56 changes: 42 additions & 14 deletions docs/manual/cluster-admin/how-to-use-alert-system.md
@@ -55,7 +55,7 @@ prometheus:
The `PAIJobGpuPercentLowerThan0_3For1h` alert will be fired when the job on virtual cluster `default` has a task level average GPU percent lower than `30%` for more than `1 hour`.
The alert severity can be defined as `info`, `warn`, `error`, or `fatal` by adding a label.
Here we use `warn`.
Here the metric `task_gpu_percent` is used, which describes the GPU utilization at the task level.
Here the metric `task_gpu_percent` is used, which describes the GPU utilization at the task level.

Remember to push service config to the cluster and restart the `prometheus` service after your modification with the following commands [in the dev-box container](./basic-management-operations.md#pai-service-management-and-paictl):
```bash
@@ -94,20 +94,20 @@ alert-manager:
alertname: PAIJobGpuPercentLowerThan0_3For1h
customized-receivers: # receivers are combination of several actions
- name: "pai-email-admin-user-and-stop-job"
actions:
actions:
# the email template for `email-admin` and `email-user` can be chosen from ['general-template', 'kill-low-efficiency-job-alert']
# if no template specified, 'general-template' will be used.
email-admin:
email-user:
email-user:
template: 'kill-low-efficiency-job-alert'
stop-jobs: # no parameters required for stop-jobs action
tag-jobs:
tags:
tags:
- 'stopped-by-alert-manager'

```

The following actions are currently provided:

- `email-admin`: Send emails to the assigned admin.
- `email-user`: Send emails to the owners of jobs. Currently, this action uses the same email template as `email-admin`.
@@ -132,15 +132,15 @@ In addition, some actions may depend on certain fields in the `labels` of alert
| action                      | required label fields   |
| :-------------------------: | :---------------------: |
| cordon-nodes                | node_name               |
| email-admin                 | -                       |
| email-user                  | -                       |
| stop-jobs                   | job_name                |
| tag-jobs                    | job_name                |
| fix-nvidia-gpu-low-perf     | node_name, minor_number |
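
The table above can be read as a simple lookup from action to required label fields. The following Python sketch is illustrative only: `REQUIRED_LABELS` and `missing_labels` are hypothetical names, not part of the alert-handler code. It shows one way to check that an alert carries the labels a given action needs.

```python
# Illustrative sketch only: these names are hypothetical, not the alert-handler's API.
# The mapping mirrors the dependency table in the documentation.
REQUIRED_LABELS = {
    "cordon-nodes": ["node_name"],
    "email-admin": [],
    "email-user": [],
    "stop-jobs": ["job_name"],
    "tag-jobs": ["job_name"],
    "fix-nvidia-gpu-low-perf": ["node_name", "minor_number"],
}


def missing_labels(action: str, labels: dict) -> list:
    """Return the label fields the action needs but the alert does not carry."""
    return [field for field in REQUIRED_LABELS[action] if field not in labels]


if __name__ == "__main__":
    alert_labels = {
        "alertname": "PAIJobGpuPercentLowerThan0_3For1h",
        "job_name": "demo-job",  # made-up job name for the demo
    }
    print(missing_labels("stop-jobs", alert_labels))
    print(missing_labels("cordon-nodes", alert_labels))
```

An action whose list comes back non-empty would presumably be unable to act on that alert, so a check like this could be run before assigning the action to a route.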


The matching rules between alerts and actions are defined using `receivers` and `routes`.
A `receiver` is simply a group of actions; a `route` matches alerts to a specific `receiver`.

With the default configuration, all alerts match the default alert receiver, which triggers only the `email-admin` action (this action has no effect unless the email configuration is set).
You can add new receivers with related matching rules to assign actions to alerts in the `alert-manager` field in [`service-configuration.yml`](./basic-management-operations.md#pai-service-management-and-paictl).
@@ -158,15 +158,15 @@ alert-manager:
alertname: PAIJobGpuPercentLowerThan0_3For1h
customized-receivers: # receivers are combination of several actions
- name: "pai-email-admin-user-and-stop-job"
actions:
actions:
# the email template for `email-admin` and `email-user` can be chosen from ['general-template', 'kill-low-efficiency-job-alert']
# if no template specified, 'general-template' will be used.
email-admin:
email-user:
email-user:
template: 'kill-low-efficiency-job-alert'
stop-jobs: # no parameters required for stop-jobs action
tag-jobs:
tags:
tags:
- 'stopped-by-alert-manager'
......
```
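
To make the relationship between routes and receivers concrete, here is a simplified Python model of label-based matching. It is a sketch under assumptions, not the Alertmanager implementation: `pick_receiver` and `DEFAULT_RECEIVER` are hypothetical names, and each route is assumed to be a dict with a `receiver` name and a `match` mapping of label names to exact values.

```python
# Illustrative sketch only: a simplified model of route matching,
# not how Alertmanager itself is implemented.
DEFAULT_RECEIVER = "default"  # hypothetical name for the default receiver


def pick_receiver(alert_labels: dict, routes: list) -> str:
    """Return the receiver of the first route whose `match` labels all equal the alert's labels."""
    for route in routes:
        match = route.get("match", {})
        if all(alert_labels.get(key) == value for key, value in match.items()):
            return route["receiver"]
    # No route matched: fall back to the default receiver.
    return DEFAULT_RECEIVER


if __name__ == "__main__":
    routes = [
        {
            "receiver": "pai-email-admin-user-and-stop-job",
            "match": {"alertname": "PAIJobGpuPercentLowerThan0_3For1h"},
        }
    ]
    print(pick_receiver({"alertname": "PAIJobGpuPercentLowerThan0_3For1h"}, routes))
    print(pick_receiver({"alertname": "SomeOtherAlert"}, routes))
```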
@@ -183,16 +183,16 @@ For `receivers` definition, you can simply:

- name the receiver in `name` field;
- list the actions to use in `actions` and fill in the corresponding parameters for each action:
  - `email-admin`:
    - template: Optional, can be chosen from ['general-template', 'kill-low-efficiency-job-alert'], by default 'general-template'.
  - `email-user`:
    - template: Optional, can be chosen from ['general-template', 'kill-low-efficiency-job-alert'], by default 'general-template'.
  - `cordon-nodes`: No parameters required
  - `stop-jobs`: No parameters required
  - `tag-jobs`:
    - tags: required, list of tags

You can also add customized email templates by adding a template folder in `pai/src/alert-manager/deploy/alert-templates`.
You can also add customized email templates by adding a template folder in `pai/src/alert-manager/deploy/alert-templates`.
Two files need to be present: one email body template file named `html.ejs` and one email subject template file named `subject.ejs`.
The folder name will be automatically passed as the template name.
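
Before deploying a custom template, it may help to verify that the folder contains both required files. The following Python sketch does only that; `missing_template_files` is a hypothetical helper, not something shipped with the service, and the folder name `my-template` is made up for the demo.

```python
# Illustrative sketch only: a hypothetical pre-deployment check for a template folder.
import tempfile
from pathlib import Path

# File names required by the documentation for each template folder.
REQUIRED_FILES = ("html.ejs", "subject.ejs")


def missing_template_files(template_dir: Path) -> list:
    """Return the required template files that are absent from template_dir."""
    return [name for name in REQUIRED_FILES if not (template_dir / name).is_file()]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        folder = Path(tmp) / "my-template"  # made-up template name
        folder.mkdir()
        (folder / "html.ejs").write_text("<p>body</p>")
        # subject.ejs has not been created yet, so it is reported as missing
        print(missing_template_files(folder))
```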

@@ -258,7 +258,7 @@ We provide the functionality to send cluster GPU utilization report regularly to

The report includes the statistics for:
- Cluster GPU utilization
- User GPU utilization
- User GPU utilization
- Job GPU utilization

To enable this feature, you should configure the `alert-manager` field in `services-configuration.yml`.
@@ -282,3 +282,31 @@ To make your configuration take effect, restart the `alert-manager` service afte
./paictl.py config push -p /cluster-configuration -m service
./paictl.py service start -n alert-manager
```

## Cluster k8s cert expiration checker

We provide a checker for the k8s cert expiration date that sends a warning to admin users when the cert is close to expiring.

This feature is enabled by default, provided that the `email-admin` action is enabled.
You can configure the `alert-manager`->`cert-expiration-checker` field in `services-configuration.yml`.
`schedule`, `alert-residual-days` and `cert-path` are required fields for this feature, and each of them has a default value.
For the syntax of `schedule`, please refer to [Cron Schedule Syntax](https://kubernetes.io/docs/concepts/workloads/controllers/cron-jobs/#cron-schedule-syntax).
For example, `"0 0 * * *"` runs the check daily at 00:00 UTC.
Please also make sure that the [`email-admin`](#Existing-Actions-and-Matching-Rules) action is enabled.

```yaml
alert-manager:
  cert-expiration-checker: # cert-expiration-checker is a k8s CronJob that checks the cert expiration date
    # for schedule syntax, refer to https://kubernetes.io/docs/concepts/workloads/controllers/cron-jobs/#cron-schedule-syntax
    schedule: '0 0 * * *' # daily check at UTC 00:00
    alert-residual-days: 30 # send an alert when the cert expires within this many days
    cert-path: '/etc/kubernetes/ssl' # the k8s cert path on the master node
```
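
To illustrate what `alert-residual-days` means in practice, the following Python sketch compares the time left until expiry against the configured threshold. `should_alert` is a hypothetical helper written for this example, and the dates are made up; it is a minimal model of the comparison, not the checker's actual code.

```python
# Illustrative sketch only: a minimal model of the threshold comparison.
from datetime import datetime, timedelta, timezone


def should_alert(not_after: datetime, now: datetime, residual_days: int) -> bool:
    """Return True when the cert expires in fewer than residual_days days."""
    return not_after - now < timedelta(days=residual_days)


if __name__ == "__main__":
    now = datetime(2021, 4, 9, tzinfo=timezone.utc)      # made-up "current" time
    expiry = datetime(2021, 5, 1, tzinfo=timezone.utc)   # made-up expiration time, 22 days later
    print(should_alert(expiry, now, residual_days=30))
    print(should_alert(expiry, now, residual_days=14))
```

With the default value of 30, a daily schedule would keep raising the alert every day during the last 30 days before expiry, so a smaller value trades early warning for fewer emails.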

To make your configuration take effect, restart the `alert-manager` service after your modification with the following commands in the dev-box container:

```bash
./paictl.py service stop -n alert-manager
./paictl.py config push -p /cluster-configuration -m service
./paictl.py service start -n alert-manager
```
10 changes: 10 additions & 0 deletions src/alert-manager/build/cert-expiration-checker.common.dockerfile
@@ -0,0 +1,10 @@
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.

FROM python:3.7

COPY ./src/cert-expiration-checker .

RUN pip3 install -r requirements.txt

ENTRYPOINT ["python3", "send_alert.py"]
4 changes: 4 additions & 0 deletions src/alert-manager/config/alert-manager.yaml
@@ -31,3 +31,7 @@ cluster-utilization:
configured: False
use-pylon: False
repeat-interval: '24h'
cert-expiration-checker:
  schedule: '0 0 * * *'
  alert-residual-days: 30
  cert-path: '/etc/kubernetes/ssl'
@@ -0,0 +1,38 @@
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: cert-expiration-checker
spec:
  schedule: "{{ cluster_cfg["alert-manager"]["cert-expiration-checker"]["schedule"] }}"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: cert-expiration-checker
            image: {{ cluster_cfg['cluster']['docker-registry']['prefix'] }}cert-expiration-checker:{{ cluster_cfg['cluster']['docker-registry']['tag'] }}
            imagePullPolicy: Always
            env:
            - name: PAI_URI
            {%- if "ssl" in cluster_cfg["pylon"] and cluster_cfg["pylon"]["ssl"] %}
              value: "{{ cluster_cfg['pylon']['uri-https']}}"
            {%- else %}
              value: "{{ cluster_cfg['pylon']['uri']}}"
            {%- endif %}
            - name: ALERT_RESIDUAL_DAYS
              value: "{{ cluster_cfg["alert-manager"]["cert-expiration-checker"]["alert-residual-days"] }}"
            volumeMounts:
            - mountPath: /etc/kubernetes/ssl
              name: kubernetes-ssl
          volumes:
          - name: kubernetes-ssl
            hostPath:
              path: {{ cluster_cfg["alert-manager"]["cert-expiration-checker"]["cert-path"] }}
          imagePullSecrets:
          - name: {{ cluster_cfg["cluster"]["docker-registry"]["secret-name"] }}
          restartPolicy: OnFailure
          nodeSelector:
            pai-master: "true"
3 changes: 2 additions & 1 deletion src/alert-manager/deploy/service.yaml
@@ -27,6 +27,7 @@ template-list:
- alert-manager-deployment.yaml
- alert-manager-configmap.yaml
- alert-manager-cronjob.yaml
- alert-manager-cert-expiration-check-cronjob.yaml
- start.sh

start-script: start.sh
@@ -37,4 +38,4 @@ upgraded-script: upgraded.sh


deploy-rules:
- in: pai-master
- in: pai-master
3 changes: 2 additions & 1 deletion src/alert-manager/deploy/start.sh.template
@@ -19,7 +19,7 @@

pushd $(dirname "$0") > /dev/null

# crate configmap for alert-templates
# create configmap for alert-templates
{% if cluster_cfg["alert-manager"]["alert-handler"]["configured"] -%}
{% if 'email-admin' in cluster_cfg["alert-manager"]["actions-available"] -%}
kubectl create configmap alert-templates \
@@ -34,6 +34,7 @@ kubectl create configmap alert-templates \
kubectl apply --overwrite=true -f rbac.yaml || exit $?
kubectl apply --overwrite=true -f alert-manager-configmap.yaml || exit $?
kubectl apply --overwrite=true -f alert-manager-deployment.yaml || exit $?
kubectl apply --overwrite=true -f alert-manager-cert-expiration-check-cronjob.yaml || exit $?
{% if cluster_cfg["alert-manager"]["cluster-utilization"]["configured"] -%}
kubectl apply --overwrite=true -f alert-manager-cronjob.yaml || exit $?
{% endif -%}
1 change: 1 addition & 0 deletions src/alert-manager/deploy/stop.sh
@@ -22,6 +22,7 @@ kubectl delete --ignore-not-found --now configmap/alert-templates
kubectl delete --ignore-not-found --now configmap/alertmanager
kubectl delete --ignore-not-found --now deployment/alertmanager
kubectl delete --ignore-not-found --now cronjob/cluster-utilization
kubectl delete --ignore-not-found --now cronjob/cert-expiration-checker

if kubectl get clusterrolebinding | grep -q "alert-manager-role-binding"; then
kubectl delete clusterrolebinding alert-manager-role-binding || exit $?
Empty file.
10 changes: 10 additions & 0 deletions src/alert-manager/src/cert-expiration-checker/pylintrc
@@ -0,0 +1,10 @@
[SETTINGS]

max-line-length=140

disable =
    missing-docstring,
    invalid-name,
    cell-var-from-loop,
    undefined-loop-variable,
    too-many-locals,
@@ -0,0 +1,2 @@
requests==2.23.0
pyOpenSSL==20.0.1
65 changes: 65 additions & 0 deletions src/alert-manager/src/cert-expiration-checker/send_alert.py
@@ -0,0 +1,65 @@
from datetime import timezone, datetime, timedelta
import logging
import os
import requests
from OpenSSL import crypto

ALERT_PREFIX = "/alert-manager/api/v1/alerts"
APISERVER_CERT_PATH = '/etc/kubernetes/ssl/apiserver.crt'
# Alert threshold in days; falls back to the documented default (30) when the variable is unset.
alertResidualDays = int(os.environ.get('ALERT_RESIDUAL_DAYS', '30'))

def enable_request_debug_log(func):
    """Temporarily raise urllib3 logging to DEBUG while the wrapped function runs."""
    def wrapper(*args, **kwargs):
        requests_log = logging.getLogger("urllib3")
        level = requests_log.level
        requests_log.setLevel(logging.DEBUG)
        requests_log.propagate = True

        try:
            return func(*args, **kwargs)
        finally:
            requests_log.setLevel(level)
            requests_log.propagate = False

    return wrapper

@enable_request_debug_log
def send_alert(pai_url: str, residualTime: int, certExpirationInfo: str):
    trigger_time = str(datetime.now(timezone.utc).date())
    post_url = pai_url.rstrip("/") + ALERT_PREFIX
    alerts = []
    alert = {
        "labels": {
            "alertname": "k8s cert expiration",
            "severity": "warn",
            "trigger_time": trigger_time,
        },
        "annotations": {
            "summary": f"The k8s cert will expire in {residualTime} days.",
            "message": f"{certExpirationInfo}",
        },
        "generatorURL": "alert/script",
    }
    alerts.append(alert)
    logging.info("Sending alerts to alert-manager...")
    resp = requests.post(post_url, json=alerts)
    resp.raise_for_status()
    logging.info("Alerts sent to alert-manager.")

def main():
    PAI_URI = os.environ.get("PAI_URI")
    # Read the PEM file as bytes and close it promptly.
    with open(APISERVER_CERT_PATH, 'rb') as f:
        certfile = f.read()
    cert = crypto.load_certificate(crypto.FILETYPE_PEM, certfile)
    # notAfter is expressed in UTC, so compare it against the current UTC time
    # (a naive local-time "now" would be off by the host's UTC offset).
    expirationTime = datetime.strptime(cert.get_notAfter().decode('ascii'), r'%Y%m%d%H%M%SZ')
    delta = expirationTime - datetime.utcnow()
    if delta < timedelta(days=alertResidualDays):
        send_alert(PAI_URI, delta.days, f'Not after {expirationTime}')

if __name__ == "__main__":
    logging.basicConfig(
        format=
        "%(asctime)s - %(levelname)s - %(filename)s:%(lineno)s - %(message)s",
        level=logging.INFO,
    )
    main()
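
The value returned by `cert.get_notAfter()` is a byte string of the form `YYYYMMDDhhmmssZ`, where the trailing `Z` marks UTC. The sketch below shows one way to turn it into a timezone-aware `datetime` so that it can only be compared against other aware values. `parse_not_after` and `days_until` are hypothetical helpers for illustration, and the sample timestamp is made up.

```python
# Illustrative sketch only: hypothetical helpers for handling the notAfter timestamp.
from datetime import datetime, timezone


def parse_not_after(raw: bytes) -> datetime:
    """Parse a timestamp such as b'20220409000000Z' into an aware UTC datetime."""
    naive = datetime.strptime(raw.decode("ascii"), "%Y%m%d%H%M%SZ")
    return naive.replace(tzinfo=timezone.utc)


def days_until(expiry: datetime, now: datetime) -> int:
    """Return the whole days remaining until expiry (negative once expired)."""
    return (expiry - now).days


if __name__ == "__main__":
    expiry = parse_not_after(b"20220409000000Z")  # made-up sample value
    print(expiry.isoformat())
    print(days_until(expiry, datetime.now(timezone.utc)))
```

Using aware datetimes on both sides makes Python raise a `TypeError` if a naive local time is mixed in by mistake, which is one way to guard against timezone-offset errors in the comparison.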
