Replace GitHub with PagerDuty in our Incident Response process #508
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -53,11 +53,12 @@ Subject Matter Experts | |
- They may **delegate** this responsibility to another team member if they wish (e.g., to the {term}`Support Steward` team.) | ||
- We may interact with external stakeholders via comments in Incident Response issues if it helps resolve the incident more quickly. | ||
|
||
(incidents:communications)= | ||
### Internal communication | ||
|
||
- The Slack channel [{guilabel}`#support-freshdesk`](https://2i2c.slack.com/archives/C028WU9PFBN) contains real-time communication about support issues. Use this to signal-boost support requests related to {term}`Incidents`. | ||
- [Issues with the {guilabel}`incident` label](https://github.com/2i2c-org/infrastructure/issues?q=is%3Aopen+label%3A%22type%3A+Hub+Incident%22+sort%3Aupdated-desc) are where we track progress when [resolving incidents](support:incident-response). | ||
|
||
- [`2i2c-org.pagerduty.com`](https://2i2c-org.pagerduty.com/) is a dashboard for managing incidents. | ||
This is the "source of truth" for any active or historical incidents. | ||
- [The `#pagerduty-notifications` Slack channel](https://2i2c.slack.com/archives/C041E05LVHB) is where we control PagerDuty and discuss incidents. This gives us an easily accessible communication channel for incidents. In general, most interactions with PagerDuty should happen via this channel. | ||
|
||
(support:incident-response)= | ||
## Incident response process | ||
|
@@ -75,34 +76,104 @@ Here is the process that we follow for incidents: | |
Incident first response template | ||
``` | ||
|
||
2. **Open an incident issue**. | ||
For each {term}`Incident` we create a dedicated issue to track its progress. [{bdg-primary}`open an incident issue`](https://github.com/2i2c-org/infrastructure/issues/new?assignees=&labels=type%3A+Hub+Incident%2Csupport&template=3_incident-report.md&title=%5BIncident%5D+%7B%7B+TITLE+%7D%7D) and notify our engineering team via Slack. | ||
3. **Try resolving the issue** and take notes while you gather information about it. | ||
4. **If after 30 minutes the issue is not solved or you know you cannot resolve it** | ||
- Ping our engineering team and our Project Manager in the {guilabel}`#support-freshdesk` channel so that they are aware of the incident. | ||
- Add the incident issue to [our team backlog](https://github.com/orgs/2i2c-org/projects/22/). | ||
5. **Designate an {term}`Incident Commander`**. Do this in the Incident issue. By default, this is the Support Steward. | ||
- Confirm that the Incident Commander has the bandwidth and ability to serve in this role. | ||
- If not, delegate this to another team member.[^note-on-delegation] | ||
6. **Designate an {term}`External Liason`**. Do this in the Incident issue. By default, this is the Incident Commander, though they may delegate this to others.[^note-on-delegation] | ||
7. **Investigate and resolve the incident**. The Incident Commander should follow the structure of the incident issue opened in the step above. | ||
8. **Delegate to Subject Matter Experts as-needed**. The Incident Commander is empowered to delegate actions to Subject Matter Experts in order to investigate and resolve the incident quickly.[^note-on-delegation] | ||
9. **Communicate our status every few hours**. The {term}`External Liason` is expected to communicate incident status and plan with the {term}`Community Representative`s. They should provide periodic updates that describe the current state of the incident, what we have tried, and our intended next steps. Here is a canned response to get started: | ||
2. **Trigger an incident in PagerDuty**. Below are instructions for doing so via [the 2i2c Slack](incidents:communications). | ||
Review comment: We should have a team learning session for how to use the PagerDuty interface. Could somebody record themselves going through the full PagerDuty workflow as part of these instructions? (maybe we could use something like https://loom.com/ ?) |
||
- **Type `/pd trigger` and hit `enter`** to trigger the incident. | ||
After you hit `enter`, you should get a dialog box with options. | ||
- For "Impacted Service", **select `Managed JupyterHubs`**. | ||
- **Assign it to the Incident Commander**. By default this is one of the {term}`Support Stewards` or the person triggering the event, but may be delegated to others[^note-on-delegation]! | ||
- **Provide a descriptive but short title**, but don't sweat it too much! | ||
- **Add a link to the FreshDesk ticket** in the description (if there is one). | ||
- **Create a new Slack channel** by checking the box for `Create a dedicated Public Slack channel for this incident`. | ||
Use this channel for all conversations about the incident. | ||
|
||
This officially marks the beginning of an incident, and will help make sure we don't accidentally miss steps during or after the incident. | ||
|
||
3. **Try resolving the issue** and communicate on the incident-specific channel while you gather information and perform actions - even if only to mark these as notes to yourself. | ||
4. **Delegate to Subject Matter Experts as-needed**. The Incident Commander is empowered to delegate actions to Subject Matter Experts in order to investigate and resolve the incident quickly.[^note-on-delegation] | ||
5. **Communicate our status every few hours**. The {term}`External Liason` is expected to communicate incident status and plan with the {term}`Community Representative`s. They should provide periodic updates that describe the current state of the incident, what we have tried, and our intended next steps. Here is a canned response to get started: | ||
Review comment: How do we define the External Liason? Before we said "use the issue" for this but I don't think there's a way to explicitly say this in PagerDuty.
Reply: @choldgraf I've updated with some suggestions. |
||
|
||
```{button-link} https://2i2c.freshdesk.com/a/admin/canned_responses/folders/80000143608/responses/80000247492/edit | ||
:color: primary | ||
|
||
Incident update template | ||
``` | ||
|
||
9. **Communicate when the incident is resolved**. When we believe the incident is resolved, communicate with the Community Representative that things should be back to normal. Mark the FreshDesk ticket as {guilabel}`Resolved`. | ||
10. **Fill in the {term}`Incident Report`**. The Incident Commander should do this in partnership with the Incident Response Team. | ||
11. **Close the incident ticket**. Once we have confirmation from the community (or no response after 48 working hours), and have filled in the incident {term}`Incident Report`, then close the incident by: | ||
- Closing the incident issue on GitHub | ||
- Marking the FreshDesk ticket as {guilabel}`Closed` | ||
6. **Communicate when the incident is resolved**. When we believe the incident is resolved, communicate with the Community Representative that things should be back to normal. | ||
- Mark the incident as "Resolved" in PagerDuty. | ||
- Mark the FreshDesk ticket as {guilabel}`Closed`. | ||
7. **Create an incident report**. | ||
See [](incidents:create-report) for more information. | ||
Review comment: I moved all of this into a dedicated section so it didn't clutter these to-do lists too much. |
||
|
||
[^note-on-delegation]: If you cannot find somebody to take on this work, or feel uncomfortable delegating, the {term}`Project Manager` should help you, and is empowered to delegate on your behalf. | ||
|
||
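The steps above trigger an incident by hand from Slack. If automated alerts are added later, the same kind of trigger could be sent programmatically. Below is a minimal Python sketch, assuming PagerDuty's Events API v2 is used for this. The routing key, the `source` value, the example ticket URL, and the helper names `build_trigger_event` and `send_event` are hypothetical placeholders and are not part of the process described here.

```python
import json
import urllib.request

# Public endpoint of the PagerDuty Events API v2.
EVENTS_API_URL = "https://events.pagerduty.com/v2/enqueue"


def build_trigger_event(routing_key, title, freshdesk_url=None):
    """Build an Events API v2 "trigger" event as a plain dict."""
    event = {
        "routing_key": routing_key,  # integration key of the PagerDuty service
        "event_action": "trigger",
        "payload": {
            "summary": title,  # short, descriptive title
            "source": "managed-jupyterhubs",  # hypothetical source name
            "severity": "critical",  # one of: critical, error, warning, info
        },
    }
    if freshdesk_url:
        # Attach the support ticket as a link, if there is one.
        event["links"] = [{"href": freshdesk_url, "text": "FreshDesk ticket"}]
    return event


def send_event(event):
    """POST the event to PagerDuty and return the HTTP status code."""
    request = urllib.request.Request(
        EVENTS_API_URL,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status


if __name__ == "__main__":
    # Placeholder values only; nothing is sent to PagerDuty here.
    example = build_trigger_event(
        "HYPOTHETICAL_ROUTING_KEY",
        "Hub is unreachable",
        "https://example.org/ticket/1",
    )
    print(json.dumps(example, indent=2))
```

Note that an event sent this way does not assign an Incident Commander or create a dedicated Slack channel, so those steps would still need to be done as described above.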
(incidents:create-report)= | ||
## Create an Incident Report | ||
|
||
Once the incident is resolved, we must create an {term}`Incident Report`. | ||
The **Incident Commander** is responsible for making sure the Incident Report is completed, even though they may not be the person doing it. | ||
Review comment: When can the incident commander assign reporting duties to somebody else? I think if we don't define this then they will always be the ones that end up doing it. If we have this documented elsewhere we should cross-link it.
Reply: @choldgraf tried to clarify this.
Reply: This looks good to me, but I think an unanswered question is when and why would a person other than the incident commander be the one to fill this out? Should we encourage the engineer most-involved in resolving the incident to fix this? Should we try to rotate between members? Should we encourage the incident commander to do this most of the time but only ask somebody else if they really can't do it in a timely fashion?
Reply: @choldgraf I think our current experiences have been that all the 3 roles have been basically the same person, and I think our wide timezone spread has made it difficult for delegation to happen as well. I think this is an important question to answer, but heavily constrained by our timezone setup. Do you think this should block this PR? |
||
|
||
We practice a [blameless culture](https://www.blameless.com/sre/what-are-blameless-postmortems-do-they-work-how) around incident reports. | ||
Incidents are **always** caused by systemic issues, and hence solutions must be systemic too. | ||
Go out of your way to make sure there is no finger-pointing. | ||
|
||
We use [PagerDuty's postmortem feature](https://support.pagerduty.com/docs/postmortems) to create the Incident Report. | ||
Review comment: This documentation suggests that this requires a paid plan. Is that true for us? If so we should make an explicit decision that we wish to pay for this: https://support.pagerduty.com/docs/postmortems
Reply: Nice! One question: what if we began by using PagerDuty for the incident itself, and GitHub to define the post-mortem content? The reason I mention this is because (aside from the cost questions), it might solve two other problems:
If over time it seems like this doesn't work for us, we could always upgrade to a business plan.
Reply: I would push back on this being "easy" a little bit. We do have to sift them out from all of the other issues going on in the infrastructure repo, and it's a very busy repo. We could transfer those issues to https://github.com/2i2c-org/incident-reports though.
Reply: When I do this I just filter by the Hub Incident label. At least this restricts to only the "incident" issues...agree it would be nicer if it were like a little Jupyter Book site or something.
Reply: I agree with @sgibson91! IMO, it gets completely lost in GitHub and it's really difficult to find - especially as it is an issue comment that gets edited. It is also an incredibly manual process right now, and after trying out the process once in PagerDuty I think using it is going to make postmortems far more likely. Also you can go to PagerDuty and see a list of all incident reports - https://2i2c-org.pagerduty.com/postmortems. In the future, I think we can use PagerDuty for escalations and automated alerts as well. So regardless of what we do, I really want to move away from 'edit a github issue comment' as the model for how we do incident reports. |
||
This lets us easily use notes and status updates from PagerDuty, as well as messages from Slack, in the incident report! | ||
|
||
1. **Ensure that the incident is resolved**. | ||
If not, refer to the proper step in [](support:incident-response). | ||
The incident needs to be resolved before a report can be generated. | ||
2. **Open the incident** in the PagerDuty web interface, and click the `New Postmortem Report` button on top. | ||
3. `Owner of the Review Process` should be set to the Incident Commander, or someone else they delegate to explicitly. | ||
4. `Impact Start Time` is our best guess for when the incident started (*not* when the report came in). | ||
`Impact End Time` is when service was restored. | ||
Best guesses will do! | ||
5. **Add Data Sources** that we will use to keep track of the actions that happened around the incident. | ||
- Link to the Slack channel we created for this incident as a "Data Source", filled in with a time range that covers all the messages there. | ||
- Add any other channels where there was conversation about the incident (e.g., GitHub Issues or Pull Requests). | ||
|
||
Click `Save Data Sources` to populate the timeline below with messages from the Slack channels. | ||
6. **Fill out the timeline**. The goal is to be concise but make it possible for someone reading it to answer "what happened, and when?". | ||
See [](incidents:postmortem-timeline) for more information. | ||
7. **Fill out the "Analysis" section** to the extent possible. | ||
In particular, the "Action Items" should be a list with items linked out to GitHub issues created for follow-up. | ||
Perfection is the enemy of the good here. Save as you go. | ||
8. **Click "Save & View Report"** when you are done, and ask other members of the incident response team to review the incident report. | ||
They might add missing context, additional action items / summary details, or redact information. The person listed as | ||
the "Owner of the Review Process" is still responsible for making sure the rest of the process is completed. | ||
9. After sufficient review, and if the Incident Commander is happy with its completeness, **mark the Status dropdown as "Reviewed"**. | ||
10. Download the PDF, and add it to the `2i2c-org/infrastructure` repository under the `incidents/` directory. This makes sure our incidents are all *public*, so others can learn from them as well. | ||
|
||
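The `Impact Start Time`, `Impact End Time`, and data source time range from the steps above can be sanity-checked with a little date arithmetic. The following is a minimal Python sketch; the helper names and the 30 minute padding are assumptions made for illustration, and the timestamps are invented.

```python
from datetime import datetime, timedelta, timezone


def impact_duration(impact_start, impact_end):
    """Return how long service was impacted."""
    if impact_end < impact_start:
        raise ValueError("Impact End Time must not be before Impact Start Time")
    return impact_end - impact_start


def data_source_window(impact_start, impact_end, padding=timedelta(minutes=30)):
    """Widen the impact window so a data source covers the surrounding messages.

    The 30 minute default padding is an assumption, not a team convention.
    """
    return impact_start - padding, impact_end + padding


if __name__ == "__main__":
    # Illustrative times only; best guesses are fine for a real report.
    start = datetime(2022, 9, 1, 9, 45, tzinfo=timezone.utc)
    end = datetime(2022, 9, 1, 12, 0, tzinfo=timezone.utc)
    print(impact_duration(start, end))
    print(data_source_window(start, end))
```

Padding the window on both sides is one way to make sure the Slack messages from just before the first report and just after resolution end up in the timeline.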
% Is there a way to share incidents in a way that doesn't require adding a binary blob to our repository? I think this generates extra toil in a process that already has a lot of toil, and also adds some clunkiness to git-based workflows. For example, could we have a public Google Drive folder where we drag/drop incident reports? | ||
Review comment: Signal-boosting this question from when I was making edits. What do you think about a Google Drive folder that exists just for the purpose of sharing incident reports? If not that, maybe we could have a dedicated documentation site for this instead of using
Reply: I'm ok with it being in a different repo. Let's just do that, I don't want to use Google Drive for this - it's far less public.
Reply: @choldgraf I have:
|
||
|
||
|
||
(incidents:postmortem-timeline)= | ||
### Writing an incident timeline | ||
|
||
Below are tips and the crucial information needed for a useful and thorough incident timeline. | ||
|
||
|
||
The timeline should include: | ||
Review comment: I think the easiest thing we can do is link to pre-existing timelines that are well-written. Then people can just riff off of the structure of those.
Reply: @choldgraf i agree! once we have a few more incidents we can incorporate that here. |
||
|
||
1. The beginning of the impact. | ||
2. When the incident was brought to our attention, with a link to the source (FreshDesk ticket, Slack message, etc). | ||
3. When we responded to the incident. This would coincide with the creation of the PagerDuty incident. | ||
4. Various debugging actions performed to ascertain the cause of the issue. | ||
Talking to yourself in the Slack channel as you do this helps a lot here: it communicates your methods to others on the team and makes it easier to | ||
improve our processes in the future. | ||
|
||
For example: | ||
|
||
- `Looked at hub logs with "kubectl logs -n temple -l component=hub" and found <this>` | ||
- `Opened the cloud console and discovered notifications about quota`. | ||
|
||
Pasting in commands is very helpful! | ||
This is an important way for team members to learn from each other - what you take for granted is perhaps news to someone else, or you might learn alternate ways of doing things! | ||
Review comment: ❤️ |
||
5. Actions taken to attempt to fix the issue, and their outcome. | ||
Paste commands executed if possible, as well as any GitHub PRs made. | ||
If you've already done this in the incident Slack channel you may simply copy/paste text here. | ||
6. Any extra communication from the community affected that helped. | ||
7. Whenever the incident was fixed, and how that was verified. | ||
8. Whatever else you think would be helpful to someone who finds this incident report a few months from now, trying to fix a similar incident. | ||
|
||
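The timeline items above can be captured as small structured notes while you work and then sorted into order for the report. Here is a minimal Python sketch of that idea; the `TimelineEntry` class, the `render_timeline` helper, and the example timestamps are hypothetical, and PagerDuty's own timeline editor does not require anything like this.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class TimelineEntry:
    when: datetime  # timezone-aware time of the event
    what: str  # what happened, or what action was taken
    source: str = ""  # optional command, link, or ticket reference


def render_timeline(entries):
    """Return the entries as a chronological Markdown bullet list in UTC."""
    lines = []
    for entry in sorted(entries, key=lambda e: e.when):
        stamp = entry.when.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
        line = f"- {stamp}: {entry.what}"
        if entry.source:
            line += f" ({entry.source})"
        lines.append(line)
    return "\n".join(lines)


if __name__ == "__main__":
    # Invented example notes, recorded out of order on purpose.
    notes = [
        TimelineEntry(
            datetime(2022, 9, 1, 10, 30, tzinfo=timezone.utc),
            "Looked at hub logs",
            "kubectl logs -n temple -l component=hub",
        ),
        TimelineEntry(
            datetime(2022, 9, 1, 10, 0, tzinfo=timezone.utc),
            "Incident reported",
            "FreshDesk ticket",
        ),
    ]
    print(render_timeline(notes))
```

Keeping the command or link next to each entry mirrors the advice above about pasting commands, so the rendered list can be copied into the report with little extra editing.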
## Handing off Incident Commander status | ||
|
||
During an incident, it may be necessary to designate another person to be the Incident Commander. | ||
|
@@ -112,16 +183,19 @@ This is encouraged and expected, especially for more complex or longer incidents | |
To designate another team member as the Incident Commander, follow these steps: | ||
|
||
1. **Confirm with them** that they are able and willing to serve as the Incident Commander. | ||
2. **Update the Incident Report issue** by updating the Incident Commander name in the top comment. | ||
3. **Notify the team** with a comment in the Incident Report issue. | ||
2. **Reassign the incident on PagerDuty** to the new commander. This should produce a message in the Slack channel for this event, | ||
thus communicating this change to the rest of the team. | ||
|
||
## Key terms | ||
|
||
```{glossary} | ||
Incident Report | ||
Incident Reports | ||
A document that describes what went wrong during an incident and what we'll do to avoid it in the future. When we have an {term}`Incident`, we create an Incident Report issue. | ||
This helps us explain what went wrong, and directs actions to avoid the incident in the future. Its goal is to identify improvements to process, technology, and team dynamics that can avoid incidents like this in the future. It is **not** meant to point fingers at anybody and care should be taken to avoid making it seem like any one person is at fault[^post-mortems]. | ||
|
||
This helps us understand what went wrong, and how we can improve our systems to prevent a recurrence. Its goal is to identify improvements to process, technology, and team dynamics that can avoid incidents like this in the future. It is **not** meant to point fingers at anybody and care should be taken to avoid making it seem like any one person is at fault. | ||
|
||
This is a *very important* part of making our infrastructure and human processes more stable and stress-free over time, so we should do this after each incident[^post-mortems]. | ||
``` | ||
|
||
[^post-mortems]: See the [Google SRE post-mortem culture](https://sre.google/sre-book/postmortem-culture/) and the [Blameless guide to post-mortems](https://www.blameless.com/sre/what-are-blameless-postmortems-do-they-work-how) for some guidelines. |
Review comment: Ah we should also add a bullet-point for incident-specific channels since we create those below.
Reply: Done, and reworded this a little.