Replace GitHub with PagerDuty in our Incident Response process #508

Merged
merged 16 commits into from
Oct 4, 2022
Changes from 4 commits
112 changes: 92 additions & 20 deletions projects/managed-hubs/incidents.md
@@ -75,34 +75,106 @@ Here is the process that we follow for incidents:
Incident first response template
```

2. **Open an incident issue**.
For each {term}`Incident` we create a dedicated issue to track its progress. [{bdg-primary}`open an incident issue`](https://github.com/2i2c-org/infrastructure/issues/new?assignees=&labels=type%3A+Hub+Incident%2Csupport&template=3_incident-report.md&title=%5BIncident%5D+%7B%7B+TITLE+%7D%7D) and notify our engineering team via Slack.
3. **Try resolving the issue** and take notes while you gather information about it.
4. **If after 30 minutes the issue is not solved or you know you cannot resolve it**
- Ping our engineering team and our Project Manager in the {guilabel}`#support-freshdesk` channel so that they are aware of the incident.
- Add the incident issue to [our team backlog](https://github.com/orgs/2i2c-org/projects/22/).
5. **Designate an {term}`Incident Commander`**. Do this in the Incident issue. By default, this is the Support Steward.
- Confirm that the Incident Commander has the bandwidth and ability to serve in this role.
- If not, delegate this to another team member.[^note-on-delegation]
6. **Designate an {term}`External Liason`**. Do this in the Incident issue. By default, this is the Incident Commander, though they may delegate this to others.[^note-on-delegation]
7. **Investigate and resolve the incident**. The Incident Commander should follow the structure of the incident issue opened in the step above.
8. **Delegate to Subject Matter Experts as-needed**. The Incident Commander is empowered to delegate actions to Subject Matter Experts in order to investigate and resolve the incident quickly.[^note-on-delegation]
9. **Communicate our status every few hours**. The {term}`External Liason` is expected to communicate incident status and plan with the {term}`Community Representative`s. They should provide periodic updates that describe the current state of the incident, what we have tried, and our intended next steps. Here is a canned response to get started:
2. **Trigger an incident in PagerDuty**, using the 2i2c Slack so we have a central location to discuss the incident.
Use `/pd trigger` in the {guilabel}`#pagerduty-notifications` channel on the 2i2c Slack to trigger the incident.
After you type the command and hit `enter`, you should get a dialog box with options.

For "Impacted Service", select "Managed JupyterHubs". We can have more fine-grained services here later if we wish.

Assign it to whoever is the **Incident Commander**. This is by default one of the support stewards or whoever is
triggering the event, but not necessarily[^note-on-delegation]!

Provide a descriptive but short Title, but don't sweat it too much!

If there is a freshdesk ticket for this, provide a link to that in the description.

Check the box for "Create a dedicated Public Slack channel for this incident" to create a *new slack channel*
for discussing the incident. This helps keep chatter off other channels *and* provides an easy location to gather
information for the incident report after the fact.

This officially marks the beginning of an incident, and will help make sure we don't accidentally miss steps during
or after the incident.

3. **Try resolving the issue** and communicate on the incident specific channel while you gather information and perform
actions - even if only to mark these as notes to yourself.
4. **Designate an {term}`External Liason`**. Do this in the Incident issue. By default, this is the Incident Commander, though they may delegate this to others.[^note-on-delegation]
5. **Delegate to Subject Matter Experts as-needed**. The Incident Commander is empowered to delegate actions to Subject Matter Experts in order to investigate and resolve the incident quickly.[^note-on-delegation]
6. **Communicate our status every few hours**. The {term}`External Liason` is expected to communicate incident status and plan with the {term}`Community Representative`s. They should provide periodic updates that describe the current state of the incident, what we have tried, and our intended next steps. Here is a canned response to get started:

```{button-link} https://2i2c.freshdesk.com/a/admin/canned_responses/folders/80000143608/responses/80000247492/edit
:color: primary

Incident update template
```

9. **Communicate when the incident is resolved**. When we believe the incident is resolved, communicate with the Community Representative that things should be back to normal. Mark the FreshDesk ticket as {guilabel}`Resolved`.
10. **Fill in the {term}`Incident Report`**. The Incident Commander should do this in partnership with the Incident Response Team.
11. **Close the incident ticket**. Once we have confirmation from the community (or no response after 48 working hours), and have filled in the incident {term}`Incident Report`, then close the incident by:
- Closing the incident issue on GitHub
- Marking the FreshDesk ticket as {guilabel}`Closed`
7. **Communicate when the incident is resolved**. When we believe the incident
is resolved, communicate with the Community Representative that things should be
back to normal. Then close out the incident by:
- Marking the incident as "Resolved" in PagerDuty.
- Marking the FreshDesk ticket as {guilabel}`Closed`

[^note-on-delegation]: If you cannot find somebody to take on this work, or feel uncomfortable delegating, the {term}`Project Manager` should help you, and is empowered to delegate on your behalf.
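
The dialog fields described in the "Trigger an incident in PagerDuty" step above can be summarized as a small pre-flight checklist. The following is only a sketch: `IncidentTrigger` and `problems` are hypothetical names made up for illustration, and nothing here calls PagerDuty or Slack. It just records what the process asks you to fill in.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class IncidentTrigger:
    """Values entered in the `/pd trigger` dialog (hypothetical helper, not a PagerDuty API)."""

    title: str
    incident_commander: str
    impacted_service: str = "Managed JupyterHubs"
    freshdesk_url: Optional[str] = None  # only present if a ticket exists
    dedicated_slack_channel: bool = True

    def problems(self) -> List[str]:
        """Return reminders for anything the process asks for that is missing."""
        found = []
        if not self.title.strip():
            found.append("Provide a short, descriptive title.")
        if not self.incident_commander.strip():
            found.append("Assign the incident to an Incident Commander.")
        if self.impacted_service != "Managed JupyterHubs":
            found.append('Select "Managed JupyterHubs" as the impacted service.')
        if not self.dedicated_slack_channel:
            found.append("Check the box to create a dedicated public Slack channel.")
        return found


# Illustrative draft with two omissions; the values are invented.
draft = IncidentTrigger(
    title="Hub unreachable",
    incident_commander="",
    dedicated_slack_channel=False,
)
for reminder in draft.problems():
    print("-", reminder)
```

The Freshdesk link is deliberately optional in this sketch, since the process only asks for it when a ticket exists.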

## Creating the Incident Report

Once the incident is resolved, we must create an {term}`Incident Report`. This helps us understand what went wrong,
and how we can improve our systems to prevent a recurrence. This is a *very important* part of making our infrastructure
and human processes more stable and stress free over time, so we should try to do this after each incident. The
**Incident Commander** is responsible for making sure the Incident Report is done, even though they may not be the
person doing it.

Note that we *must* practice a [blameless culture](https://www.blameless.com/sre/what-are-blameless-postmortems-do-they-work-how)
around incident reports. Incidents are *always* caused by systemic issues, and hence solutions must be systemic
too. Go out of your way to make sure there is no finger-pointing.

We use PagerDuty's [postmortem](https://support.pagerduty.com/docs/postmortems) feature to create the Incident Report.
This lets us easily use notes and status updates from PagerDuty, as well as messages from Slack, in the incident report!

1. Open the incident in the PagerDuty web interface, and click the "New Postmortem Report" button at the top. The incident
needs to be already resolved before this feature is available.

2. The "Owner of the Review Process" should be set to the Incident Commander, or someone else they delegate to explicitly.

3. Fill out the "Impact Start Time" to be our best guess for when the incident started (*not* when the report came in), and
the "Impact End Time" to be when service was restored. Best guesses will do!

4. Add the slack channel we created for this incident as a "Data Source", filled in with an appropriate time to cover all
the messages there. You can add other channels too if there was conversation there about the incident. Click "Save Data Sources"
to populate the timeline below with messages from the slack channels.

5. Fill out the timeline! The goal is to be concise but make it possible for someone reading it to answer "what happened, and when?".
The timeline should include:

1. The beginning of the impact.
2. When the incident was brought to our attention, with a link to the source (Freshdesk ticket, slack message, etc).
3. When we responded to the incident. This would coincide with the creation of the PagerDuty incident.
4. Various debugging actions performed to ascertain the cause of the issue. Talking to yourself as you do this on the
slack channel helps a lot here, as it communicates your methods to others on the team and makes it easier to improve
processes in the future.

Examples here would be things like `Looked at hub logs with "kubectl logs -n temple -l component=hub" and found <this>` or
`Opened the cloud console and discovered notifications about quota`. Pasting in commands is very helpful! This is an
important way for team members to learn from each other - what you take for granted is perhaps news to someone else, or
you might learn alternate ways of doing things!
5. Actions taken to attempt to fix the issue, and their outcome. Paste commands executed if possible, as well as any
GitHub PRs made. Putting this in Slack again helps.
6. Any extra communication from the community affected that helped.
7. Whenever the impact was fixed, and how that was verified.
8. Whatever else you think would be helpful to someone who finds this incident report a few months from now, trying to fix a
similar incident.

6. Fill out the "Analysis" section to the extent possible. In particular, the "Action Items" should be a list with items
linked out to GitHub issues created for follow-up. Perfection is the enemy of the good here. Save as you go.

7. Click "Save & View Report" when you are done, and ask other members of the incident response team to review the incident report.
They might add missing context, additional action items / summary details, or redact information. The person listed as
the "Owner of the Review Process" is still responsible for making sure the rest of the process is completed.

8. After sufficient review, and if the Incident Commander is happy with its completeness, mark the Status dropdown up top as "Reviewed".

9. Download the PDF, and add it to the `2i2c/infrastructure` repository under the `incidents/` directory. This makes sure our
incidents are all *public*, so others can learn from them as well.
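
To make the "what happened, and when?" goal of the timeline step concrete, here is a minimal sketch of drafting a timeline before pasting it into the postmortem. `TimelineEntry`, `render_timeline`, and `impact_duration` are hypothetical helpers, and the timestamps and events below are invented for illustration. PagerDuty builds its own timeline from the data sources, so this only shows the shape of a useful entry list.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class TimelineEntry:
    when: datetime
    what: str


def render_timeline(entries: List[TimelineEntry]) -> str:
    """Sort entries chronologically and format one line per entry."""
    ordered = sorted(entries, key=lambda e: e.when)
    return "\n".join(f"{e.when:%Y-%m-%d %H:%M} {e.what}" for e in ordered)


def impact_duration(start: datetime, end: datetime) -> timedelta:
    """Length of the impact window ("Impact Start Time" to "Impact End Time")."""
    if end < start:
        raise ValueError("Impact End Time must not be before Impact Start Time")
    return end - start


# Invented example entries, deliberately out of order to show the sorting.
entries = [
    TimelineEntry(datetime(2022, 1, 1, 9, 25), "PagerDuty incident triggered"),
    TimelineEntry(datetime(2022, 1, 1, 9, 0), "Impact began (best guess)"),
    TimelineEntry(
        datetime(2022, 1, 1, 9, 40),
        'Looked at hub logs with "kubectl logs -n temple -l component=hub"',
    ),
    TimelineEntry(datetime(2022, 1, 1, 9, 20), "Report received via Freshdesk ticket"),
    TimelineEntry(datetime(2022, 1, 1, 10, 30), "Service restored, verified by logging in"),
]

print(render_timeline(entries))
print("Impact lasted:", impact_duration(entries[1].when, entries[4].when))
```

Keeping the commands you ran inside the entry text, as in the `kubectl` line, is what lets a later reader reproduce the debugging steps.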

## Handing off Incident Commander status

During an incident, it may be necessary to designate another person to be the Incident Commander.
@@ -112,8 +184,8 @@ This is encouraged and expected, especially for more complex or longer incidents
To designate another team member as the Incident Commander, follow these steps:

1. **Confirm with them** that they are able and willing to serve as the Incident Commander.
2. **Update the Incident Report issue** by updating the Incident Commander name in the top comment.
3. **Notify the team** with a comment in the Incident Report issue.
2. **Reassign the incident on PagerDuty** to the new commander. This should produce a message in the slack channel for this event,
thus communicating this change to the rest of the team.

## Key terms
