Update incident documentation (#60)
Co-authored-by: Karl-Johan Grahn <[email protected]>
karl-johan-grahn and Karl-Johan Grahn authored May 13, 2022
1 parent 51d2a91 commit f76f7f6
Showing 2 changed files with 19 additions and 3 deletions.
4 changes: 4 additions & 0 deletions CHANGELOG.md
@@ -4,6 +4,10 @@ All notable changes to this project will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [0.14.14] - 2022-05-13
### Update
- Update incident documentation

## [0.14.13] - 2022-04-01
### Fix
- Fix spelling
18 changes: 15 additions & 3 deletions docs/src/incidents/incidents.md
@@ -1,4 +1,16 @@
# Incident declaration automation
# Incident management
Practice incident management skills until they become second nature, so that you don't struggle to follow the process during a real incident.

## Implementing an incident management process
The steps for implementing an incident management process can be summarized as follows:
1. Define exit criteria for types of incidents
1. Avoid groupthink and formalize the assessment of operations with a risk assessment matrix. Risk assessment can be combined with the Analytic Hierarchy Process (AHP).
1. Appoint delegates for critical functions to avoid single points of failure
1. Agree organization-wide on the effort required for each level of severity and priority
1. Define and automate response plans, and make sure the communication section of each plan includes backup communication methods
1. Work with developers to create playbooks for all services; each playbook needs a verification and approval process
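
The risk assessment matrix mentioned in the steps above could be sketched roughly as follows. This is a minimal illustration, not part of `devopsbot`: the three-point scales, the `risk_score` and `risk_level` names, and the score thresholds are all assumptions that an organization would replace with its own agreed values.

```python
from enum import IntEnum


class Likelihood(IntEnum):
    # Hypothetical three-point likelihood scale
    RARE = 1
    POSSIBLE = 2
    LIKELY = 3


class Impact(IntEnum):
    # Hypothetical three-point impact scale
    MINOR = 1
    MAJOR = 2
    CRITICAL = 3


def risk_score(likelihood: Likelihood, impact: Impact) -> int:
    """Score one cell of the matrix as likelihood times impact (1 to 9)."""
    return int(likelihood) * int(impact)


def risk_level(likelihood: Likelihood, impact: Impact) -> str:
    """Map a matrix cell to a level; the thresholds are assumed, not prescribed."""
    score = risk_score(likelihood, impact)
    if score >= 6:
        return "high"
    if score >= 3:
        return "medium"
    return "low"


if __name__ == "__main__":
    # Print the full matrix, one row per likelihood
    for likelihood in Likelihood:
        row = [risk_level(likelihood, impact) for impact in Impact]
        print(f"{likelihood.name:<9}", row)
```

Writing the matrix down as a fixed lookup like this means every assessor applies the same rule to the same inputs, which is one way to reduce the influence of groupthink on the outcome.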

## Incident declaration automation
`devopsbot` automates incident declaration:
- Incident responders and commanders should focus on the essential tasks of resolving the incident
- Automate the repetitive steps in the incident management process
@@ -15,12 +27,12 @@ An incident is defined as something that:
- Was not planned for
- Cannot be resolved within 1 hour

## During incidents
### During incidents
The workflow during an incident is as follows:

![incident declaration flow using the bot](./devopsbot.drawio.png)

## After incidents
### After incidents
Once an incident has been declared resolved, communicate the resolution and
learn from the experience.
