title | description | menu | menu_name | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Incident Concepts |
Incident Concepts |
|
product_searchlight_7.0.0-rc.0 |
A Incident
is a Kubernetes Custom Resource Definition
(CRD).
It provides information on notifications sent by Searchlight for alerts.
As with all other Kubernetes objects, a Incident has apiVersion
, kind
, and metadata
fields. It also has a .spec
section.
Below is an example Incident object maintained by Searchlight.
apiVersion: monitoring.appscode.com/v1alpha1
kind: Incident
metadata:
name: cluster.pod-exists-demo-0.20180428-1109
namespace: demo
labels:
monitoring.appscode.com/alert: pod-exists-demo-0
monitoring.appscode.com/alert-type: cluster
monitoring.appscode.com/recovered: true
status:
lastNotificationType: Recovery
notifications:
- type: Problem
checkOutput: Found 10 pod(s) instead of 11
author: searchlight-user
firstTimestamp: 20180428-1109
lastTimestamp: 20180428-1119
state: Critical
- type: Acknowledgement
checkOutput: Found 10 pod(s) instead of 11
author: admin
comment: working on fix
firstTimestamp: 20180428-1142
lastTimestamp: 20180428-1142
state: Critical
- type: Recovery
checkOutput: Found all pods
author: searchlight-user
firstTimestamp: 20180428-1237
lastTimestamp: 20180428-1237
state: Ok
Here,
metadata.name
represents Incident name with format{host-type}.{alert-name}.{time}
metadata.namespace
represents Namespace where ClusterAlert is created.metadata.labels
provides additional information on Alert and Notificationstatus
provides information on notificationsstatus.lastNotificationType
represents last type of notification that was sentstatus.notifications
provides list of notifications that were sent
Searchlight plugin hyperalert
adds notification information in status.notifications
when a notification is occurred in Icinga.
Lets see some examples to understand this scenario.
We have an Incident with following status
part
status:
lastNotificationType: Problem
notifications:
- type: Problem
checkOutput: Found 10 pod(s) instead of 11
author: searchlight-user
firstTimestamp: 20180428-1109
lastTimestamp: 20180428-1109
state: Critical
When there is no existing Incident available, Searchlight creates one. From above, we can see that, a Icinga notification of type Problem has invoked for Critical state.
Service output
is updated with last Service check output.
If ClusterAlert is set to send notifications on every 5m, more notifications will be invoked with 5m interval. Now every time a notification of type Problem is invoked, Searchlight updates this existing notification information. Following fields are updated:
- checkOutput
- lastTimestamp
- state
Now, suppose, user acknowledges this problem. So another notification of type Acknowledgement is invoked. When notification type is Acknowledgement, a new notification is added in list.
Following is the latest status
of this Incident
status:
lastNotificationType: Acknowledgement
notifications:
- type: Problem
checkOutput: Found 10 pod(s) instead of 11
author: searchlight-user
firstTimestamp: 20180428-1109
lastTimestamp: 20180428-1119
state: Critical
- type: Acknowledgement
checkOutput: Found 10 pod(s) instead of 11
author: admin
comment: working on fix
firstTimestamp: 20180428-1142
lastTimestamp: 20180428-1142
state: Critical
User admin
acknowledges this issues with comment working on fix
. status.lastNotificationType
is updated with last notification type.
To know how to acknowledge, see here.
When this problem is fixed, another notification of type Recovery is invoked. With this notification, this Incident is locked. For furthers incidents, another new Incident object will be created.
This is the final status
of this Incident
status:
lastNotificationType: Recovery
notifications:
- type: Problem
checkOutput: Found 10 pod(s) instead of 11
author: searchlight-user
firstTimestamp: 20180428-1109
lastTimestamp: 20180428-1119
state: Critical
- type: Acknowledgement
checkOutput: Found 10 pod(s) instead of 11
author: admin
comment: working on fix
firstTimestamp: 20180428-1142
lastTimestamp: 20180428-1142
state: Critical
- type: Recovery
checkOutput: Found all pods
author: searchlight-user
firstTimestamp: 20180428-1237
lastTimestamp: 20180428-1237
state: Ok
And also, label monitoring.appscode.com/recovered: true
is added in label. This represents that, This Incident is recovered.