Incident Management Protocol

Effective incident management is vital to limit the disruption caused by an incident and to return to normal business operation as fast as possible.

When an incident happens, the priorities are:

  1. Stop the bleeding
  2. Restore the service
  3. Preserve the evidence in order to find the root cause

Key elements:

  • Continuous communication to keep stakeholders and users up to date.
  • A role to coordinate communication between the different parties.
  • A role to think about the big picture and longer-term tasks, to offload those duties from the people working on the incident resolution.
  • A predefined communication channel that all communication goes through.

Clear roles

It's very important that every team member working on the incident resolution knows their role. The role separation helps team members know what they can do and what they should not do because it belongs to another role's duties.

Every role should have full autonomy inside its boundaries and complete trust from the rest of the team members.

Every team member should ask their Planning Lead for help when the workload starts to feel excessive or overwhelming, or when panic sets in.

Every lead can delegate components of work to other colleagues as they see fit.

Every team member should be comfortable with every role, so shuffle the roles around when possible.

The Incident Manager (IM) Role

The duties of the Incident Manager are:

  • Keep the high-level state of the incident up to date at all times via the Living Incident State Document.
  • Structure the task force and assign/delegate roles according to needs and priorities.
  • Hold all roles that are not explicitly delegated.
  • Remove roadblocks that interfere with Ops duties, if needed.

They should be capable of answering the following questions:

  • What is the impact to the users?
  • What are the users seeing?
  • How many/which users are affected (all, logged-in users, beta)?
  • When did it all start?
  • How many related issues have users opened?
  • Is there any security implication? Is there any data loss?

The Living Incident State Document starts to be filled in during this assessment phase.

Their challenges are:

  • Keep the team communication effective.
  • Stay up to date on the current theories about the incident, the observations and the team's lines of work.
  • Clearly assign the roles. Escalate as needed.
  • Are effective decisions being made?
  • Ensure the changes to the system are made carefully and intentionally.
  • Is the team exhausted? Can we hand off the incident management?

Ops Lead Role

Works with the IM on the incident response.

The Ops team should be the only one making changes to the system during the incident.

The Ops team should:

  1. Observe what's happening in the system. Share and confirm observations.
  2. Develop theories about what's happening.
  3. Develop experiments that prove or disprove those theories and carry them out.
  4. Repeat

Observations and decisions made by the Ops team should be written into the designated internal communication channel (see the sketch after this list). The goals are to:

  1. Focus the incident response in a single place to minimize confusion.
  2. Be of great value later on, when rebuilding the incident timeline during the postmortem.
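
A minimal sketch of what this could look like, assuming an incoming webhook integration has been set up for the channel (the webhook URL, the helper name and the observation text below are placeholders, not an existing integration):

```python
# Minimal sketch: post an Ops observation to the internal Rocket.Chat channel
# through an incoming webhook. Assumes such a webhook has been configured by an
# admin; the URL and the observation text are placeholders.
import json
import urllib.request

WEBHOOK_URL = "https://chat.suse.de/hooks/<INTEGRATION_ID>/<TOKEN>"  # placeholder


def post_observation(text: str) -> None:
    """Send an observation to the designated internal communication channel."""
    payload = json.dumps({"text": text}).encode("utf-8")
    request = urllib.request.Request(
        WEBHOOK_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(request).close()


post_observation("Observation: 500 errors on /search spiking since 14:32 UTC (~120/min).")
```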

Communications Role

The Communications role is the spokesperson of the task force during the incident. Its duty is to ensure the rest of the team and the users are kept up to date via the designated external communication channel.

The Communications role can also be in charge of maintaining the Living Incident State Document if the IM agrees.

Planning Lead Role

Works as support for Ops, taking care of longer-term tasks like:

  • File bugs, create Trello cards, GitHub issues, etc.
  • Prepare handoffs to other teams (if necessary).
  • Keep track of changes made to the system, like monkey patches and hotfixes, so they can be reverted once the incident is resolved.

To keep track of the changes, they need to keep track of:

  1. who changed what
  2. when
  3. how
  4. why
  5. how to revert it.

Trello cards and/or GitHub issues are good places to do this.
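
A hypothetical example of such a tracking issue (every name and detail below is made up for illustration):

Title: [Incident] Revert search indexer hotfix
Who: Jane Doe (Ops)
What: Monkey-patched the search indexer to skip the failing queue
When: 2020-09-09 14:45 UTC
How: Hotfix applied directly on the production host
Why: To stop the error flood while the root cause is investigated
How to revert: Remove the patch and redeploy the service once the incident is closed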

Living Incident State Document

We need a place to track the incident response. The Living Incident State Document is the place to do it.

It is the duty of the IM to keep this document live and up to date.

It should be a functional document, with the most important information at the top.

It should be a document editable by everyone in the team, ideally in real time, and readable by everyone interested in how the incident is evolving.

TBD: We could host it in the GitHub wiki, as it's open for reading to everyone but editable only by team members.

Document example: https://landing.google.com/sre/sre-book/chapters/incident-document/

Clear, Live handoff

TBD: The handoff is the process of transferring all roles to other teams (maybe in other time zones, at the end of the work day). We don't have teams in other time zones, and if we don't work on the incident resolution, nobody else will. So a handoff probably makes no sense for our team.

Nevertheless, when a team member involved in the incident resolution logs off, the IM should be explicitly notified, so they can appoint someone else to fulfill the role, or keep it themselves, and all relevant knowledge can be transferred back.

When to declare an Incident

It's better to declare an incident early and call it off later than to spin up an incident response team once everything has been messed up by unorganized tinkering.

In case of doubt, we follow this guideline:

  • Do we have a service disruption, like the main page not being reachable or users not being able to log in?
  • Do we have 500 errors piling up in our monitoring tools?
  • Do we have unresponsiveness, page slowdown?
  • Is the issue visible to the users?
  • Is the problem still unresolved after an hour of focused analysis and work on the issue?

When to close an Incident

The incident is closed when the involved services are back to normal operation. This does not include those long-term tasks created during the incident response.

Incident post-mortem

Once the incident is resolved, a postmortem is needed to find the root cause of the incident and define the next steps to ensure it does not happen again.

This does not mean the root cause can't happen again. It means that the next time, we detect it earlier and react before it spins out of control.

Training

TBD: In order to react to an incident quickly and smoothly, we could periodically war-game incident management with the team: pick something already resolved and role-play the response.

Tools

Internal communication channel

We use Rocket.Chat daily; our internal [build-solutions|https://chat.suse.de/group/build-solutions] channel will be used for internal communication.

TBD: As an alternative, we could create a dedicated incident channel to avoid polluting our channel.

External communication channel

The external communication channel will be the existing public mailing list:

TBD: We could also send updates via Twitter, and/or use OBS Announcements.

Incident document template

The [Incident document template|https://github.com/openSUSE/open-build-service/wiki/Incident-Document-Template] is taken from the Google SRE Book.

TBD: As an alternative we can use https://etherpad.opensuse.org/, as it offers the team real-time editing capabilities.

Communications templates

Having to come up with sentences to use in communication updates is not something we should do during an incident. Having predefined and previously agreed-on communication templates removes that burden and allows the Communications role to focus on the what instead of the how.

Service disruption

Title: Open Build Service Service Disruption
We are currently experiencing a service disruption.
Our team is working to identify the root cause and implement a solution. 
**ADD_GENERAL_IMPACT** users may be affected.
We will send an additional update in **NEXT_UPDATE_TIME** minutes.

General unresponsiveness

Title: Open Build Service Page Unresponsiveness
The site is currently experiencing a higher than normal amount of load, which may cause pages to be slow or unresponsive.
**ADD_GENERAL_IMPACT** users may be affected.
We’re investigating the cause and will provide an update in **NEXT_UPDATE_TIME** minutes.
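
A minimal sketch, in Python, of how the Communications role could fill in these placeholders before publishing an update (the helper name and the example values are illustrative, not part of any existing tooling):

```python
# Minimal sketch: the service disruption template from above, rewritten with
# Python format placeholders for the **ADD_GENERAL_IMPACT** and
# **NEXT_UPDATE_TIME** fields. Helper name and example values are illustrative.
SERVICE_DISRUPTION_TEMPLATE = (
    "We are currently experiencing a service disruption.\n"
    "Our team is working to identify the root cause and implement a solution.\n"
    "{general_impact} users may be affected.\n"
    "We will send an additional update in {next_update_time} minutes."
)


def render_update(general_impact: str, next_update_time: int) -> str:
    """Fill in the placeholders of the service disruption template."""
    return SERVICE_DISRUPTION_TEMPLATE.format(
        general_impact=general_impact,
        next_update_time=next_update_time,
    )


print(render_update("Logged-in", 30))
```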

More examples:

https://support.atlassian.com/statuspage/docs/incident-template-library/

Status page

TBD: We could use the status page to communicate updates to interested parties.

https://status.opensuse.org/
https://docs.cachethq.io/docs/incident-statuses
