- An OpenShift 3.7 cluster with Prometheus
- Podified ManageIQ installed on the cluster (preferably from docker.io/containermgmt/manageiq-pods)
- A control host to run the Ansible playbooks from (this can be the cluster master or your laptop)
You'll need Ansible 2.4 or newer (for the manageiq modules) and the ManageIQ Python API client (manageiq-client) installed on the "control host".
The "jq" command is also useful to have.
Use the following command to install them (assuming your control host has EPEL enabled):
sudo yum install -y python-pip ansible jq wget; sudo pip install manageiq-client
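As a quick optional sanity check, confirm the control host has everything the playbooks need (the manageiq_client module name assumes the standard manageiq-client package layout):
ansible --version | head -1          # should report 2.4 or newer
python -c "import manageiq_client"   # exits silently if the API client is installed
jq --version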
- Route to CFME/ManageIQ
- Username and password to CFME/ManageIQ (default: admin:smartvm)
- Route to Prometheus
- Master Hostname
- CA certificate for the master
$ wget https://raw.githubusercontent.com/container-mgmt/cm-ops-flow/master/make_env.sh
$ bash make_env.sh | tee cm_ops_vars.sh
$ source cm_ops_vars.sh
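For reference, the generated cm_ops_vars.sh exports variables along these lines; the values below are placeholders for illustration, not real output:
export OPENSHIFT_MASTER_HOST="master.example.com"
export OPENSHIFT_CFME_ROUTE="manageiq-cfme.apps.example.com"
export OPENSHIFT_CFME_USER="admin"
export OPENSHIFT_CFME_PASS="smartvm"
export OPENSHIFT_CFME_AUTH="${OPENSHIFT_CFME_USER}:${OPENSHIFT_CFME_PASS}"
export OPENSHIFT_PROMETHEUS_METRICS_ROUTE="prometheus.apps.example.com"
export OPENSHIFT_PROMETHEUS_ALERTS_ROUTE="alerts.apps.example.com"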
oadm policy add-cluster-role-to-user cluster-admin $USER
(replace $USER with your LDAP username)
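To confirm the role took effect, run a command that requires cluster scope, for example:
oc get nodes   # only succeeds with cluster-level permissions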
This step "enables" the two built-in alert profiles (note: there's no ansible module for this step yet).
- Find the href of the MiqEnterprise object (its id is usually 1, but not always):
export ENTERPRISE_HREF="$(curl -k -u ${OPENSHIFT_CFME_AUTH} https://${OPENSHIFT_CFME_ROUTE}/api/enterprises/ | jq -r ".resources[0].href")"
- Find the hrefs for the two built-in profiles:
export PROMETHEUS_PROVIDER_PROFILE="$(curl -k -u ${OPENSHIFT_CFME_AUTH} "https://${OPENSHIFT_CFME_ROUTE}/api/alert_definition_profiles?filter\[\]=guid=a16fcf51-e2ae-492d-af37-19de881476ad" | jq -r ".resources[0].href")"
export PROMETHEUS_NODE_PROFILE="$(curl -k -u ${OPENSHIFT_CFME_AUTH} "https://${OPENSHIFT_CFME_ROUTE}/api/alert_definition_profiles?filter\[\]=guid=ff0fb114-be03-4685-bebb-b6ae8f13d7ad" | jq -r ".resources[0].href")"
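Sanity check: all three variables should now hold full hrefs:
echo "${ENTERPRISE_HREF}"
echo "${PROMETHEUS_PROVIDER_PROFILE}"
echo "${PROMETHEUS_NODE_PROFILE}"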
- Assign them to the enterprise (this requires ManageIQ/manageiq-api PR #177):
curl -k -u ${OPENSHIFT_CFME_AUTH} -H "Content-Type: application/json" -d "{\"action\": \"assign\", \"objects\": [\"${ENTERPRISE_HREF}\"]}" ${PROMETHEUS_PROVIDER_PROFILE}
curl -k -u ${OPENSHIFT_CFME_AUTH} -H "Content-Type: application/json" -d "{\"action\": \"assign\", \"objects\": [\"${ENTERPRISE_HREF}\"]}" ${PROMETHEUS_NODE_PROFILE}
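You can fetch either profile again to eyeball the result (the exact fields in the response vary across ManageIQ API versions):
curl -k -u ${OPENSHIFT_CFME_AUTH} "${PROMETHEUS_PROVIDER_PROFILE}" | jq .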
Download the playbook from this GitHub repository:
curl https://raw.githubusercontent.com/container-mgmt/cm-ops-flow/master/miq_add_provider.yml > miq_add_provider.yml
Run ansible:
ansible-playbook --extra-vars \
"provider_name=${OPENSHIFT_PROVIDER_NAME}\
management_admin_token=${OPENSHIFT_MANAGEMENT_ADMIN_TOKEN} \
ca_crt=\"${OPENSHIFT_CA_CRT}\" \
openshift_master_host=${OPENSHIFT_MASTER_HOST} \
cfme_route=${OPENSHIFT_CFME_ROUTE} \
prometheus_metrics_route=${OPENSHIFT_PROMETHEUS_METRICS_ROUTE} \
prometheus_alerts_route=${OPENSHIFT_PROMETHEUS_ALERTS_ROUTE} \
cfme_user=${OPENSHIFT_CFME_USER} \
cfme_pass=${OPENSHIFT_CFME_PASS}" \
miq_add_provider.yml
If this step fails, your Ansible version might be older than 2.4, or the manageiq-client Python package might not be installed.
There's no API for this stage (yet).
Using the UI, click the top-right menu, then click Configuration.
Under the Server Control->Server Roles heading, toggle all "Capacity & Utilization" switches to "on".
Click "Save" on the buttom-right corner.
(See ManageIQ/manageiq issue #14238 for the original documentation)
Edit the configmap:
oc edit configmap -n openshift-metrics prometheus
Then add the alert rules under the prometheus.rules key:
# Supported annotations:
#   miqTarget: ContainerNode|ExtManagementSystem, defaults to ContainerNode.
#   miqIgnore: "true|false", controls whether ManageIQ picks up this alert; by default it is picked up.
#   description: a string that will be displayed on screen.
#   labels:
#     severity: ERROR|WARNING|INFO, defaults to ERROR.
prometheus.rules: |
  groups:
  - name: example-rules
    interval: 30s # defaults to the global interval
    rules:
    #
    # ------------- Copy below this line -------------
    #
    - alert: "Node Down"
      expr: up{job="kubernetes-nodes"} == 0
      annotations:
        miqTarget: "ContainerNode"
        url: "https://www.example.com/node_down_fixing_instructions"
        description: "Node {{$labels.instance}} is down"
      labels:
        severity: "ERROR"
    - alert: "Too Many Pods"
      expr: sum(kubelet_running_pod_count) > 30
      annotations:
        miqTarget: "ExtManagementSystem"
        url: "https://www.example.com/too_many_pods_fixing_instructions"
        description: "Too many running pods"
      labels:
        severity: "ERROR"
To reload the configuration, either delete the Prometheus pod or send a HUP signal to the Prometheus process:
oc rsh -n openshift-metrics -c prometheus prometheus-0
kill -HUP 1 # run this inside the pod; PID 1 is the Prometheus process
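Alternatively, deleting the pod achieves the same reload, since its controller recreates it:
oc delete pod -n openshift-metrics prometheus-0
Either way, the new alert names should then show up on the Prometheus web UI's Rules page.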
wget https://raw.githubusercontent.com/container-mgmt/cm-ops-flow/master/expose_alertmanager.sh
bash expose_alertmanager.sh
# Add the new route to environment
bash make_env.sh | tee cm_ops_vars.sh
source cm_ops_vars.sh
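To confirm the new Alertmanager route was created, list the routes in the metrics project:
oc get routes -n openshift-metrics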
Trigger the "Too Many Pods" test scenario and measure the different intervals related to alerting:
# trigger "Too Many Pods" test scenario
oc create namespace "${OPENSHIFT_MONITORING_DEMO_NS}"
# Create a replication controller and scale it
cat <<EOF | oc create -n "${OPENSHIFT_MONITORING_DEMO_NS}" -f - 2>&1
apiVersion: v1
kind: ReplicationController
metadata:
  name: dummy
spec:
  replicas: 1
  selector:
    app: dummy
  template:
    metadata:
      name: dummy
      labels:
        app: dummy
    spec:
      containers:
      - name: dummy
        image: registry.hub.docker.com/mtayer/request-dumper
        ports:
        - containerPort: 80
EOF
oc scale rc -n ${OPENSHIFT_MONITORING_DEMO_NS} dummy --replicas=10
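Optionally, watch the metric the alert is built on via the standard Prometheus query API (if the metrics route sits behind an OAuth proxy you will need to authenticate first):
curl -ks "https://${OPENSHIFT_PROMETHEUS_METRICS_ROUTE}/api/v1/query?query=sum(kubelet_running_pod_count)" | jq .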
# Measure intervals
wget https://raw.githubusercontent.com/container-mgmt/cm-ops-flow/master/measure_alerts.sh
bash measure_alerts.sh
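The built-in ALERTS series is another way to see pending and firing alerts from Prometheus's point of view (same authentication caveat as above):
curl -ks "https://${OPENSHIFT_PROMETHEUS_METRICS_ROUTE}/api/v1/query?query=ALERTS" | jq .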
# Resolve & Measure
oc scale rc -n ${OPENSHIFT_MONITORING_DEMO_NS} dummy --replicas=0
bash measure_alerts.sh
# Clean Up
oc delete namespace ${OPENSHIFT_MONITORING_DEMO_NS}