Incident Management Protocol
Effective Incident Management is vital to limit the disruption caused by an incident and return to normal business operation as fast as possible.
When an incident happens, the priorities are:
- Stop the bleeding
- Restore the service
- Preserve the evidence in order to find the root cause
Key elements:
- Continuous communication to keep stakeholders and users up to date.
- A role to coordinate communication between the different parties.
- A role to think about the big picture and longer-term tasks, to offload those duties from the people working on the incident resolution.
- A predefined communication channel where all the communication goes through.
It's very important that every team member working on the incident resolution knows their role. The role separation helps each team member know what they can do and what they should not do, because it belongs to another role's duties.
Every role should have full autonomy inside its boundaries and complete trust from the rest of the team members.
Every team member should ask their Planning Lead for help when the workload starts to feel excessive or overwhelming, or when panic sets in.
Every lead can delegate components of work to other colleagues as they see fit.
Every team member should be comfortable with every role, so shuffle the roles around when possible.
The duties of the Incident Manager are:
- Keep the high level state of the incident at all times via the Living Incident State Document.
- Structure the task force and assign/delegate roles according to needs and priorities.
- Hold all roles that are not explicitly delegated.
- Remove roadblocks that interfere with Ops duties, if needed.
The Incident Manager should be capable of answering the following questions:
- What is the impact to the users?
- What are the users seeing?
- How many/which users are affected (all, logged-in users, beta)?
- When did it all start?
- How many related issues have users opened?
- Is there any security implication? Is there any data loss?
This assessment phase is when the Living Incident State Document starts to be filled in.
The Incident Manager's challenges are:
- Keep the team communication effective.
- Stay up to date on the current theories about the incident, the observations, and the team's lines of work.
- Clearly assign the roles. Escalate as needed.
- Are effective decisions being made?
- Ensure changes to the system are made carefully and intentionally.
- Is the team exhausted? Can we handoff the incident management?
The Ops team works with the IM on the incident response.
The Ops team should be the only one making changes to the system during the incident.
The Ops team should:
- Observe what's happening in the system. Share and confirm observations.
- Develop theories about what's happening.
- Develop experiments that prove or disprove those theories and carry them out.
- Repeat
Observations and decisions made by the Ops team should be written into the designated internal communication channel. The goal is to:
- Focus the incident response in only one place to minimize confusion.
- Be of great value later on, when rebuilding the incident timeline during the postmortem.
The Communications role is the spokesperson of the task force during the incident. Its duty is to keep the rest of the team and the users up to date via the designated external communication channel.
The Communications role can also be in charge of maintaining the Living Incident State Document if the IM agrees.
The Planning role works as support to Ops, taking care of other long-term tasks like:
- File bugs, create Trello cards, GitHub issues, etc.
- Prepare Handoffs to other teams (if necessary).
- Keep track of changes in the system to revert them once the incident is resolved. Things like Monkey Patches, Hotfixes, etc.
To keep track of the changes, they need to record:
- who changed what
- when
- how
- why
- how to revert it.
Trello cards and/or GitHub issues are good places to do this.
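As a sketch of what such a tracking entry could look like, here is a hypothetical issue created with the GitHub CLI; the `gh` tool and all field values are illustrative assumptions, not part of this protocol:

```sh
# Hypothetical sketch of a change-tracking issue, created with the GitHub CLI.
# Assumes `gh` is installed and authenticated; all field values are placeholders
# to be filled in by the Planning role.
gh issue create \
  --repo openSUSE/open-build-service \
  --title "Incident cleanup: revert <short description of the change>" \
  --body "Who: <who changed what>
When: <timestamp, UTC>
How: <hotfix, monkey patch, config change, ...>
Why: <reason the change was needed during the incident>
How to revert: <exact steps to undo the change once the incident is resolved>"
```

Whether the entry lives in a Trello card or a GitHub issue matters less than having all of these fields filled in before the person who made the change logs off.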
We need a place to track the incident response. The Living Incident State Document is the place to do it.
It is the duty of the IM to keep this document live and up to date.
It should be a functional document, with the most important information at the top.
It should be a document editable by everyone in the team, ideally editable in real time, and readable by everyone interested in how the incident is evolving.
TBD: We could host it in the GitHub wiki, as it's open for reading to everyone but editable only by team members.
Document example: https://landing.google.com/sre/sre-book/chapters/incident-document/
TBD: The handoff is the process of transferring all roles to other teams (maybe in other time zones, at the end of the work day). We don't have other teams in other time zones, and if we don't work on the incident resolution, nobody else will. So the handoff probably makes no sense in our team.
Nevertheless, when a team member involved in the incident resolution logs off, the IM should be clearly and explicitly notified, so they can appoint someone else to fulfill the role, or keep it themselves, and have all relevant knowledge transferred back.
It's better to declare an incident early and call it off later than to spin up an incident response team when everything is already messed up by unorganized tinkering.
In case of doubt, we follow this guideline:
- Do we have a service disruption, like main page not reachable, users not able to log in?
- Do we have 500 errors piling up in our monitoring tools?
- Do we have unresponsiveness, page slowdown?
- Is the issue visible to the users?
- Is the problem still unresolved after an hour of focused analysis and work on the issue?
The incident is closed when the involved services are back to normal operation. This does not include those long-term tasks created during the incident response.
Once the incident is resolved, a Postmortem is needed to find the root cause of the incident and take next steps to ensure the incident does not happen again.
This does not mean the root cause can't happen again. It means that next time we detect it earlier and react before it spins out of control.
TBD: In order to have a fast and smooth reaction to an incident, we could war-game the incident management with the team periodically: pick something already resolved and role-play the response.
We use Rocket Chat daily; our internal [build-solutions](https://chat.suse.de/group/build-solutions) channel will be used for internal communication.
TBD: As an alternative, we could create a dedicated incident channel to avoid polluting our channel.
The external communication channel will be the existing public mailing list:
TBD: We could also send updates via Twitter, and/or use OBS Announcements.
The [Incident document template](https://github.com/openSUSE/open-build-service/wiki/Incident-Document-Template) is taken from the Google SRE Book.
TBD: As an alternative we can use https://etherpad.opensuse.org/, as it gives the team real-time editing capabilities.
Having to come up with sentences to use in communication updates is not something we should do during an incident. Predefined, previously agreed-on communication templates remove that burden and allow the Communications role to focus on the what instead of the how.
Title: Open Build Service Service Disruption
We are currently experiencing a service disruption.
Our team is working to identify the root cause and implement a solution.
**ADD_GENERAL_IMPACT** users may be affected.
We will send an additional update in **NEXT_UPDATE_TIME** minutes.
Title: Open Build Service Page Unresponsiveness
The site is currently experiencing a higher than normal amount of load, which may be causing pages to be slow or unresponsive.
**ADD_GENERAL_IMPACT** users may be affected.
We’re investigating the cause and will provide an update in **NEXT_UPDATE_TIME** minutes.
More examples:
https://support.atlassian.com/statuspage/docs/incident-template-library/
TBD: We could use the status page to communicate updates to interested parties.
https://status.opensuse.org/ https://docs.cachethq.io/docs/incident-statuses
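If we go this route, posting updates to the status page could be scripted so the Communications role does not have to click through an admin UI mid-incident. A minimal sketch, assuming status.opensuse.org exposes the standard Cachet v1 API and that we hold a valid API token (the endpoint, token variable, and wording are assumptions):

```sh
# Hypothetical sketch: open an "Investigating" incident on a Cachet status page.
# Assumes the standard Cachet v1 API and a valid API token in $CACHET_API_TOKEN.
curl -X POST "https://status.opensuse.org/api/v1/incidents" \
  -H "X-Cachet-Token: $CACHET_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
        "name": "Open Build Service Service Disruption",
        "message": "We are investigating a service disruption. Next update in 30 minutes.",
        "status": 1,
        "visible": 1
      }'
# Cachet incident statuses: 1 Investigating, 2 Identified, 3 Watching, 4 Fixed.
```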