
Replace GitHub with PagerDuty in our Incident Response process #508

Merged
merged 16 commits into `main` from `pagerduty-team`
Oct 4, 2022

Conversation

yuvipanda
Member

@yuvipanda yuvipanda commented Sep 8, 2022

PagerDuty is specifically tailored towards handling incidents, so let's use it rather
than try to rig an incident response process on top of GitHub ourselves. I also want the
incident response team to access PagerDuty primarily via its Slack integration,
rather than having to switch over to GitHub.

This PR only seeks to replace GitHub with PagerDuty / Slack, and makes no other
changes for now.

A summary of the changes:

  1. New incidents are triggered by creating an incident in PagerDuty.
  2. During an incident, the incident response team interacts with the incident
    primarily via Slack, although they can do so via the web interface too.
  3. After the incident, we use the PagerDuty postmortems
    feature to collect information and write a postmortem. This has several features that
    make the process much easier than editing comments on GitHub.

Ref 2i2c-org/infrastructure#1118
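For illustration of what "triggering an incident in PagerDuty" can look like programmatically (not part of this PR - the routing key, summary, and source values below are placeholder assumptions), PagerDuty's Events API v2 accepts a JSON "trigger" event. A minimal sketch:

```python
import json
import urllib.request

# PagerDuty Events API v2 endpoint for triggering/acknowledging/resolving incidents.
EVENTS_API_URL = "https://events.pagerduty.com/v2/enqueue"


def build_trigger_event(routing_key: str, summary: str, source: str,
                        severity: str = "critical") -> dict:
    """Build an Events API v2 'trigger' event payload."""
    return {
        "routing_key": routing_key,   # integration key of the PagerDuty service
        "event_action": "trigger",    # "trigger" opens a new incident
        "payload": {
            "summary": summary,       # short human-readable description
            "source": source,         # system where the problem was observed
            "severity": severity,     # one of: critical, error, warning, info
        },
    }


def send_event(event: dict) -> dict:
    """POST the event to PagerDuty and return the parsed JSON response."""
    request = urllib.request.Request(
        EVENTS_API_URL,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())


if __name__ == "__main__":
    # Placeholder values - a real call needs a service's integration key.
    event = build_trigger_event(
        routing_key="YOUR_INTEGRATION_KEY",
        summary="UToronto hub is down",
        source="2i2c-managed-hubs",
    )
    print(json.dumps(event, indent=2))  # inspect the payload; send_event(event) would fire it
```

In practice the team would trigger incidents via Slack or the web UI as described above; this just shows the shape of the underlying event.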

@yuvipanda
Member Author

yuvipanda commented Sep 8, 2022

Update: Moved to 2i2c-org/infrastructure#1118 (comment) as it is unrelated to this PR

@yuvipanda
Member Author

This made me realize that I could reduce the scope of this PR to just replacing GitHub with PagerDuty - that alone would've caught that the incident from last week wasn't actually 'over', and would have prevented today's UToronto outage!

@yuvipanda
Member Author

2i2c-org/infrastructure#1118 (comment) has follow-up tasks

@yuvipanda
Member Author

I've tried to follow the incident response suggestions here, and created one for the UToronto outage yesterday: https://2i2c-org.pagerduty.com/postmortems/171317d6-5f19-7511-7d3a-117b13f62584

@yuvipanda yuvipanda requested review from choldgraf and a team September 9, 2022 19:49
Member

@choldgraf choldgraf left a comment


This seems like a really helpful service to use in order to provide some structure to our incident response process. In general, it looks good to me. I had a few comments and suggestions throughout.

One concern I have is that this is a fairly involved process. How can we ensure that we reliably and diligently follow this process?

Argh - I took a pass and added some suggested edits to break up the lists into different sections, but accidentally pushed to your branch instead of making a PR. Happy to discuss, and I'll provide some comments below to focus discussion.

7. **Investigate and resolve the incident**. The Incident Commander should follow the structure of the incident issue opened in the step above.
8. **Delegate to Subject Matter Experts as-needed**. The Incident Commander is empowered to delegate actions to Subject Matter Experts in order to investigate and resolve the incident quickly.[^note-on-delegation]
9. **Communicate our status every few hours**. The {term}`External Liaison` is expected to communicate incident status and plan with the {term}`Community Representative`s. They should provide periodic updates that describe the current state of the incident, what we have tried, and our intended next steps. Here is a canned response to get started:
2. **Trigger an incident in PagerDuty**. Below are instructions for doing so via [the 2i2c Slack](incidents:communications).
Member

We should have a team learning session for how to use the PagerDuty interface. Could somebody record themselves going through the full PagerDuty workflow as part of these instructions? (maybe we could use something like https://loom.com/ ?)


- [`2i2c-org.pagerduty.com`](https://2i2c-org.pagerduty.com/) is a dashboard for managing incidents.
This is the "source of truth" for any active or historical incidents.
- [The `#pagerduty-notifications` Slack channel](https://2i2c.slack.com/archives/C041E05LVHB) is where we control PagerDuty and have discussion about an incident. This allows us to have an easily-accessible communication channel for incidents. In general, most interactions with PagerDuty should be via this channel.
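For reference (assuming the standard PagerDuty Slack integration - verify against PagerDuty's own documentation), most interactions in that channel go through the `/pd` slash command:

```text
/pd trigger   # open a dialog to trigger a new incident
/pd oncall    # show who is currently on call
/pd help      # list available PagerDuty commands
```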
Member

Ah we should also add a bullet-point for incident-specific channels since we create those below.

Member Author

Done, and reworded this a little


3. **Try resolving the issue** and communicate on the incident-specific channel while you gather information and perform actions - even if only to mark these as notes to yourself.
4. **Delegate to Subject Matter Experts as-needed**. The Incident Commander is empowered to delegate actions to Subject Matter Experts in order to investigate and resolve the incident quickly.[^note-on-delegation]
5. **Communicate our status every few hours**. The {term}`External Liaison` is expected to communicate incident status and plan with the {term}`Community Representative`s. They should provide periodic updates that describe the current state of the incident, what we have tried, and our intended next steps. Here is a canned response to get started:
Member

How do we define the External Liaison? Before, we said "use the issue" for this, but I don't think there's a way to explicitly say this in PagerDuty.

Member Author

@choldgraf I've updated with some suggestions

- Mark the incident as "Resolved" in PagerDuty.
- Mark the FreshDesk ticket as {guilabel}`Closed`.
7. **Create an incident report**.
See [](incidents:create-report) for more information.
Member

I moved all of this into a dedicated section so it didn't clutter these to-do lists too much.

## Create an Incident Report

Once the incident is resolved, we must create an {term}`Incident Report`.
The **Incident Commander** is responsible for making sure the Incident Report is completed, even though they may not be the person doing it.
Member

When can the incident commander assign reporting duties to somebody else? I think if we don't define this then they will always be the ones that end up doing it. If we have this documented elsewhere we should cross-link it

Member Author

@choldgraf tried to clarify this.

Member

This looks good to me, but I think an unanswered question is when and why would a person other than the incident commander be the one to fill this out? Should we encourage the engineer most-involved in resolving the incident to fix this? Should we try to rotate between members? Should we encourage the incident commander to do this most of the time but only ask somebody else if they really can't do it in a timely fashion?

Member Author

@choldgraf I think our experience so far has been that all three roles are basically the same person, and our wide timezone spread has made delegation difficult as well. I think this is an important question to answer, but it is heavily constrained by our timezone setup. Do you think this should block this PR?

Incidents are **always** caused by systemic issues, and hence solutions must be systemic too.
Go out of your way to make sure there is no finger-pointing.

We use [PagerDuty's postmortem feature](https://support.pagerduty.com/docs/postmortems) to create the Incident Report.
Member

This documentation suggests that this requires a paid plan. Is that true for us? If so we should make an explicit decision that we wish to pay for this: https://support.pagerduty.com/docs/postmortems

Member

Nice! One question: what if we began by using PagerDuty for the incident itself, and GitHub to define the post-mortem content?

The reason I mention this is because (aside from the cost questions), it might solve two other problems:

  1. Making the post-mortem easily accessible. If we used GitHub issues for this (as we do now), it is quite easy to search through all of our historical records of post-mortems.
  2. Tying the post-mortem to the issues we create. The post-mortem almost always involves creating a bunch of new GitHub issues for follow-up, and if we used GitHub for the post-mortem content itself, it would allow us to cross-reference issues / PRs more easily.

If over time it seems like this doesn't work for us, we could always upgrade to a business plan.

Member

> If we used GitHub issues for this (as we do now), it is quite easy to search through all of our historical records of post-mortems.

I would push back on this being "easy" a little bit. We do have to sift them out from all of the other issues going on in the infrastructure repo, and it's a very busy repo. We could transfer those issues to https://github.com/2i2c-org/incident-reports though.

Member

When I do this I just filter by the Hub Incident label. At least this restricts the view to only the "incident" issues. Agreed, it would be nicer if it were like a little Jupyter Book site or something.

Member Author

I agree with @sgibson91! IMO, it gets completely lost in GitHub and it's really difficult to find - especially as it is an issue comment that gets edited. It is also an incredibly manual process right now, and after trying out the process once in PagerDuty I think using it is going to make postmortems far more likely. Also, you can go to PagerDuty and see a list of all incident reports: https://2i2c-org.pagerduty.com/postmortems.

In the future, I think we can use PagerDuty for escalations and automated alerts as well. So regardless of what we do, I really want to move away from 'edit a GitHub issue comment' as the model for how we do incident reports.

9. After sufficient review, and if the Incident Commander is happy with its completeness, **mark the Status dropdown as "Reviewed"**.
10. Download the PDF, and add it to the `2i2c-org/infrastructure` repository under the `incidents/` directory. This makes sure our incidents are all *public*, so others can learn from them as well.

% Is there a way to share incidents in a way that doesn't require adding a binary blob to our repository? I think this generates extra toil in a process that already has a lot of toil, and also adds some clunkiness to git-based workflows. For example, could we have a public Google Drive folder where we drag/drop incident reports?
Member

Signal-boosting this question from when I was making edits. What do you think about a Google Drive folder that exists just for the purpose of sharing incident reports?

If not that, maybe we could have a dedicated documentation site for this instead of using infrastructure/?

Member Author

I'm ok with it being in a different repo. Let's just do that, I don't want to use Google Drive for this - it's far less public.

Member Author

@choldgraf I have:

  1. created https://github.com/2i2c-org/incident-reports
  2. created infrastructure#1703 ("Remove incident reports") to move the one existing incident report from the infrastructure repo over there
  3. modified this PR to point to that.


Below are some tips and crucial information that is needed for a useful and thorough incident timeline.

The timeline should include:
Member

I think the easiest thing we can do is link to pre-existing timelines that are well-written. Then people can just riff off of the structure of those

Member Author

@choldgraf I agree! Once we have a few more incidents we can incorporate that here.

@yuvipanda
Member Author

Thanks a lot @sgibson91 and @choldgraf!

@choldgraf I think I've addressed all your concerns.

Re: learning to use PagerDuty, there is a lot of documentation they maintain at https://university.pagerduty.com/ - would that be enough? I'm also happy to try to make a video, but I don't want to block this PR on that.

This process is IMO actually simpler than what we have now, because:

  1. during an incident, all interactions happen via Slack (which we already use!)
  2. constructing a timeline becomes much easier - it's a fully manual process now
  3. it's easier to see when an incident can be marked as 'completed', which is a bit more difficult with GitHub now.

We will adjust this as we go along! However, I wrote this document, so obviously I'll feel this way :D I'd love to hear from others if they think this is too complex!

Member

@GeorgianaElena GeorgianaElena left a comment


> We should have a team learning session for how to use the PagerDuty interface. Could somebody record themselves going through the full PagerDuty workflow as part of these instructions? (maybe we could use something like https://loom.com/ ?)

Maybe a demo during the next monthly team meeting?

- `Opened the cloud console and discovered notifications about quota`.

Pasting in commands is very helpful!
This is an important way for team members to learn from each other - what you take for granted is perhaps news to someone else, or you might learn alternate ways of doing things!
Member

❤️

Member

@choldgraf choldgraf left a comment


I took another pass and left some short-ish comments. I think this looks pretty good to me, provided that the @2i2c-org/tech-team is on board!

My main question is whether we could try using PagerDuty just for the incident handling, and stick with our current GitHub Issues-based postmortem process - see below for some rationale for this, I'm curious what folks think.


@yuvipanda
Member Author

@choldgraf IMO, the postmortem process is one of the main reasons to use PagerDuty, and I'd really like to try it out this way! In the future, I think automated alerts (and alerting / escalation) should also go through here.

@choldgraf
Member

Sounds good - I defer to the @2i2c-org/tech-team's wishes on this one. If y'all think this is the right system then we should go for it.

@yuvipanda
Member Author

@choldgraf I've added a link to examples too

Member

@choldgraf choldgraf left a comment


I think that we are all in agreement that this is a good direction to move towards. We've also already paid for a years-worth of accounts for PagerDuty. I think that we should merge this PR in and begin iterating from there.

@yuvipanda yuvipanda merged commit e1dba00 into main Oct 4, 2022
@yuvipanda
Member Author

I agree @choldgraf! Merged!

@damianavila damianavila deleted the pagerduty-team branch October 11, 2022 21:24