
Incident Management Protocol

Henne Vogelsang edited this page Sep 15, 2021 · 31 revisions

Effective incident management is vital for returning to normal operation as fast as possible. To be effective, everybody working on the incident resolution must know their role. Separating the roles makes clear what each role should and should not do, which avoids confusion and chaos about who is responsible for what.

Here is who needs to do what whenever an alert is not a false positive, the service is down, or there is another major disruption.

Role: The Incident Manager (IM)

By default the DemoBugSquad is the incident manager.

However, anybody can declare themselves incident manager if they notice that the DemoBugSquad is not responding.

The duties of the Incident Manager are:

  • Create an incident state document (template) on etherpad.opensuse.org
  • Announce the incident in our team channel (:warning: We have an incident going on, follow it here: https://etherpad.opensuse.org/p/KaqRIWahiQOthDdf1OeR)
  • Fulfill all other roles (Communications & Ops) until they are delegated
  • Delegate roles to anyone in the team as needed

Role: Communications

The duties of Communications are:

  • Continuously update the incident state document
  • Continuously communicate to keep stakeholders up to date (IRC/Slack/Mail/Status Messages etc.)
  • Declare the incident as resolved
  • Write a post mortem on our blog

Role: Ops

The duties of Ops are:

  • Stop the bleeding and restore the service
  • Find the root cause
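
The role handling described above (the Incident Manager holds every role until it is delegated) can be sketched as a small data structure. This is only an illustration under assumptions: the `Incident` class, its method names, and the person name used in the usage note are hypothetical and not part of any existing tooling.

```python
from dataclasses import dataclass, field

# Roles other than the Incident Manager, as listed on this page.
ROLES = ("Communications", "Ops")


@dataclass
class Incident:
    """Hypothetical model of who holds which role during an incident."""

    incident_manager: str
    # Maps a role name to the person it was delegated to.
    delegated: dict = field(default_factory=dict)

    def delegate(self, role: str, person: str) -> None:
        """Hand a role over to somebody else in the team."""
        if role not in ROLES:
            raise ValueError(f"unknown role: {role}")
        self.delegated[role] = person

    def holder(self, role: str) -> str:
        """Return who currently fulfills the role."""
        if role not in ROLES:
            raise ValueError(f"unknown role: {role}")
        # A role that was never delegated stays with the Incident Manager.
        return self.delegated.get(role, self.incident_manager)
```

For example, after `incident = Incident("DemoBugSquad")` and `incident.delegate("Ops", "alice")`, `incident.holder("Ops")` returns `"alice"`, while `incident.holder("Communications")` still returns `"DemoBugSquad"`.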

FAQ

When to Declare an Incident

It's better to declare an incident early and call it off later than to spin up an incident response team after unorganized tinkering has already messed everything up.

When to Close an Incident

The incident is closed when the involved services are back to normal operation. Closing does not depend on the long-term tasks created during the incident response.

What to communicate to people?

Coming up with wording is hard during an incident. Some templates are below; you can find more on the internet.

Service Disruption

Title: Open Build Service Service Disruption
We are currently experiencing a service disruption.
Our team is working to identify the root cause and implement a solution. 
All build.opensuse.org users may be affected.
You can follow the current state on our incident document: https://etherpad.opensuse.org/p/KaqRIWahiQOthDdf1OeR
We will come back to you here once we have resolved the incident.

General Unresponsiveness

Title: Open Build Service Page Unresponsiveness
The site is currently experiencing a higher than normal amount of load, which may cause pages to be slow or unresponsive.
**ADD_GENERAL_IMPACT** users may be affected.
You can follow the current state on our incident document: https://etherpad.opensuse.org/p/KaqRIWahiQOthDdf1OeR
We will come back to you here once we have resolved the incident.
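
To avoid hand-editing a template under pressure, the placeholders can be filled in programmatically. The following is a minimal sketch using Python's standard `string.Template`; the function name `render_unresponsiveness` and the placeholder names `$impact` and `$incident_url` are assumptions made for this example, not existing tooling.

```python
from string import Template

# Wording follows the "General Unresponsiveness" template above.
# $impact stands in for **ADD_GENERAL_IMPACT**; $incident_url is the
# link to the incident state document. Both names are hypothetical.
UNRESPONSIVENESS = Template(
    "Title: Open Build Service Page Unresponsiveness\n"
    "The site is currently experiencing a higher than normal amount of load, "
    "which may cause pages to be slow or unresponsive.\n"
    "$impact users may be affected.\n"
    "You can follow the current state on our incident document: $incident_url\n"
    "We will come back to you here once we have resolved the incident."
)


def render_unresponsiveness(impact: str, incident_url: str) -> str:
    """Return the status message with both placeholders filled in."""
    return UNRESPONSIVENESS.substitute(impact=impact, incident_url=incident_url)
```

Calling `render_unresponsiveness("All build.opensuse.org", "https://etherpad.opensuse.org/p/KaqRIWahiQOthDdf1OeR")` produces a message that is ready to paste into any of the communication channels.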

Where to communicate with people?

Our communication channels include:

  • Our mailing list [email protected]
  • IRC (irc://irc.libera.chat/openSUSE-buildservice)
  • OBS Status Messages
  • Slack (#help-obs & #team-build-solutions)