Incident Response Guide
We don't just develop software for our customers. We're also responsible for deploying it to production and keeping it stable. Naturally, accidents and various incidents happen sometimes. Below is the standard procedure for how we respond to incidents, which is strongly recommended for all development teams at ivelum.
By incident, we mean a critical situation that requires an immediate response. Examples of incidents are downtimes of any project in production or a serious malfunction of key functionalities. Non-critical problems that can wait at least one working day before they must be resolved are not counted as incidents.
Responding to incidents is the direct responsibility of all engineers at ivelum. If a critical problem in production is detected, the person closest to it should deal with it: whoever is online at that moment or has received an alert from the automated monitoring system. If no one is online and the problem is detected by the client themselves, they will call the mobile numbers of the people closest to the project. Here is the order of actions:
- In a public chat channel, immediately confirm to the client that you are online and have begun to investigate the issue;
- Make an initial diagnosis: understand whether the problem is reproducible and figure out which layer it is on, e.g., the application layer, infrastructure, a third-party provider, content, etc;
- Identify who can solve this problem. If you can fix it yourself, confirm to the client that you have already started working on a solution. If this is not your project, or you are having difficulty deciding how to fix it, try to contact someone from the team that supports this project and involve them in the decision. If this is a problem with a third-party provider, then send a request to that provider's technical support (if applicable) and think about what workarounds can be done internally;
- In any case, you should maintain regular communication with the client through a public chat channel, and maintain this communication until the problem is resolved, or until you and the client have agreed that the problem is not critical and its resolution can wait;
- If the incident was a serious one, and included a long interruption of work on production, data loss, or any other noticeable loss for the client (including reputational), then inform the client that they will receive a detailed postmortem on the incident on the next business day.
IMPORTANT: even if the case seems obvious to you, do not rush to provide a detailed report immediately. It is better to postpone writing it until the next day. Experience shows that critical incidents are stressful situations, and analysis that comes immediately after the incident has been mitigated is usually poor, as emotions are still running high. In stressful situations, people tend to blame themselves and may not see the bigger picture. For example, they may miss the shortcomings in the existing processes that created the conditions giving rise to the problem. Postponing a detailed report for one day gives you the opportunity to take a fresh look at the situation and discuss it with the team.
The postmortem is written by someone from the team supporting the project, and it must be checked and approved by the team lead. The report must answer the following questions:
- What happened? This part describes the sequence of events that led to the incident and provides information about the exact damage caused. If it was a service interruption, specify which service did not work and whether the breakdown was complete or affected only certain key functionalities. If possible, specify the exact time interval when the breakdown occurred. If the incident involves the loss or compromise of data, specify which data was lost or compromised.
- How was the problem fixed? This part describes the actions that the team took to solve the problem and their results.
- What caused the problem? This is where we analyze the conditions that gave rise to the problem. There may have been a variety of contributing factors: the code logic, the system architecture, a deficiency in the dev process or code delivery, the choice of unreliable third-party services, the human factor, etc.
- What have we done, or are going to do, to minimize these risks in the future? This section describes the specific measures that have been approved by the team lead to reduce the risk of similar problems in the future and to minimize their damage if they do occur again.
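The four questions above can be collected into a simple postmortem skeleton. The template below is only a sketch; the headings and file layout are illustrative, not a mandated format:

```markdown
# Postmortem: <short incident title> (<date>)

## What happened?
Sequence of events, exact damage caused, which service or key
functionality was affected, the time interval of the breakdown,
and any data that was lost or compromised.

## How was the problem fixed?
Actions the team took to solve the problem and their results.

## What caused the problem?
Contributing factors: code logic, system architecture, deficiencies
in the dev process or code delivery, unreliable third-party
services, the human factor, etc.

## What have we done, or are going to do, to minimize these risks?
Specific measures approved by the team lead, each with a clear
assignee and deadline.
```

Keeping the template in the project repository makes it easy to start the report on the next business day while the sequence of events is still documented in the chat history.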
You must implement the measures that you have decided upon to prevent similar future incidents. This seems obvious, but unfortunately it's too often ignored in practice. Teams are always busy with work and there are always other "more important" things to do. There's always the temptation to postpone taking measures until the time is right. Don't do that. If you can't implement remediation measures immediately, then set a task with a clear deadline and an assignee, and put it on the high priority list.