Incident Management Protocol
Effective incident management is vital for returning to normal business operation as fast as possible. For it to work, everybody involved in resolving the incident must know their role. Separating the roles makes clear what each role should and should not do, which avoids confusion and chaos about who is responsible for what.
Here is who needs to do what whenever an alert is not a false positive, the service is down, or there is another major disruption.
By default the DemoBugSquad is the incident manager.
However, anybody can declare themselves incident manager if they notice that the DemoBugSquad is not responding.
The duties of the Incident Manager are:
- Create an incident state document (template) on etherpad.opensuse.org
- Declare the incident to our team channel (`:warning: We have an incident going on, follow it here: https://etherpad.opensuse.org/p/KaqRIWahiQOthDdf1OeR`)
- Fulfill all other roles (Communications & Ops)
- Delegate roles to anyone in the team as needed
The duties of Communications are:
- Continuous update of the incident state document
- Continuous communication to keep stakeholders up to date (IRC/Slack/Mail/Status Messages etc.)
- Declare the incident as resolved
- Write a post-mortem on our blog
The duties of Ops are:
- Stop the bleeding and restore the service
- Find the root cause
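The role rules above can be sketched as a small data structure. This is only an illustration, not tooling that exists in the project: the `Incident` class and its method names are hypothetical, and the role names simply mirror this page.

```python
from dataclasses import dataclass, field
from typing import Dict

# Roles other than the incident manager, as named on this page.
ROLES = ("communications", "ops")


@dataclass
class Incident:
    # By default the DemoBugSquad is the incident manager.
    manager: str = "DemoBugSquad"
    # Hypothetical bookkeeping: role name -> person it was delegated to.
    delegated: Dict[str, str] = field(default_factory=dict)

    def take_over(self, person: str) -> None:
        """Anybody can declare themselves incident manager."""
        self.manager = person

    def delegate(self, role: str, person: str) -> None:
        """The incident manager hands a role to someone in the team."""
        if role not in ROLES:
            raise ValueError(f"unknown role: {role}")
        self.delegated[role] = person

    def holder(self, role: str) -> str:
        """The incident manager fulfills every role that is not delegated."""
        if role not in ROLES:
            raise ValueError(f"unknown role: {role}")
        return self.delegated.get(role, self.manager)
```

The point of the sketch is that every role always has exactly one holder: until a role is delegated, it falls back to the incident manager.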
It's better to declare an incident early and call it off later than to spin up an incident response team after everything has been messed up by disorganized tinkering.
The incident is closed when the involved services are back to normal operation. Closing the incident does not depend on the long-term tasks created during the incident response.
Coming up with wording for communication is hard during an incident. Some templates follow; you can find more on the internet.
Title: Open Build Service Service Disruption
We are currently experiencing a service disruption.
Our team is working to identify the root cause and implement a solution.
All build.opensuse.org users may be affected.
You can follow the current state on our incident document: https://etherpad.opensuse.org/p/KaqRIWahiQOthDdf1OeR
We will come back to you here once we have resolved the incident.
Title: Open Build Service Page Unresponsiveness
The site is currently experiencing a higher than normal amount of load, which may cause pages to be slow or unresponsive.
**ADD_GENERAL_IMPACT** users may be affected.
You can follow the current state on our incident document: https://etherpad.opensuse.org/p/KaqRIWahiQOthDdf1OeR
We will come back to you here once we have resolved the incident.
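If you prefer to fill in a template programmatically, a minimal sketch could look like the following. It assumes the "Page Unresponsiveness" wording above; the function name `render_status_message` and the placeholder names `impact` and `doc_url` are made up for illustration and are not part of any existing OBS tooling.

```python
from string import Template

# Wording taken from the "Page Unresponsiveness" template on this page.
# $impact stands in for ADD_GENERAL_IMPACT, $doc_url for the incident document.
UNRESPONSIVE = Template(
    "Title: Open Build Service Page Unresponsiveness\n"
    "The site is currently experiencing a higher than normal amount of load, "
    "which may cause pages to be slow or unresponsive.\n"
    "$impact users may be affected.\n"
    "You can follow the current state on our incident document: $doc_url\n"
    "We will come back to you here once we have resolved the incident."
)


def render_status_message(impact: str, doc_url: str) -> str:
    """Fill in the template; substitute() raises KeyError if a value is missing."""
    return UNRESPONSIVE.substitute(impact=impact, doc_url=doc_url)


if __name__ == "__main__":
    print(
        render_status_message(
            impact="All build.opensuse.org",
            doc_url="https://etherpad.opensuse.org/p/KaqRIWahiQOthDdf1OeR",
        )
    )
```

Using `substitute()` rather than `safe_substitute()` is deliberate here: a forgotten placeholder fails loudly instead of ending up in a message sent to users.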
Our communication channels include:
- Our mailing list [email protected]
- IRC (irc://irc.libera.chat/openSUSE-buildservice)
- OBS Status Messages
- Slack (#help-obs & #team-build-solutions)