
Replace GitHub with PagerDuty in our Incident Response process #508

Merged
merged 16 commits into `main` from `pagerduty-team`
Oct 4, 2022

Conversation

yuvipanda
Member

@yuvipanda yuvipanda commented Sep 8, 2022

PagerDuty is specifically tailored towards handling incidents, so let's use it rather
than try to rig an incident response process on top of GitHub ourselves. I also want the
incident response team to access PagerDuty primarily via its Slack integration,
rather than having to switch over to GitHub.

This PR only seeks to replace GitHub with PagerDuty / Slack, and makes no other
changes for now.

A summary of the changes:

  1. New incidents are triggered by creating an incident in PagerDuty.
  2. During an incident, the incident response team interacts with the incident
    primarily via Slack, although they can do so via the web interface too.
  3. After the incident, we use the PagerDuty postmortems
    feature to collect information and write a postmortem. This has several features that
    make the process much easier than editing comments on GitHub.

Ref 2i2c-org/infrastructure#1118
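For illustration of what "triggering an incident in PagerDuty" can look like programmatically (not part of this PR - the routing key, summary, and source values below are placeholder assumptions), PagerDuty's Events API v2 accepts a JSON "trigger" event. A minimal sketch:

```python
import json
import urllib.request

# PagerDuty Events API v2 endpoint for triggering/acknowledging/resolving incidents.
EVENTS_API_URL = "https://events.pagerduty.com/v2/enqueue"


def build_trigger_event(routing_key: str, summary: str, source: str,
                        severity: str = "critical") -> dict:
    """Build an Events API v2 'trigger' event payload."""
    return {
        "routing_key": routing_key,   # integration key of the PagerDuty service
        "event_action": "trigger",    # "trigger" opens a new incident
        "payload": {
            "summary": summary,       # short human-readable description
            "source": source,         # system where the problem was observed
            "severity": severity,     # one of: critical, error, warning, info
        },
    }


def send_event(event: dict) -> dict:
    """POST the event to PagerDuty and return the parsed JSON response."""
    request = urllib.request.Request(
        EVENTS_API_URL,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())


if __name__ == "__main__":
    # Placeholder values - a real call needs a service's integration key.
    event = build_trigger_event(
        routing_key="YOUR_INTEGRATION_KEY",
        summary="UToronto hub is down",
        source="2i2c-managed-hubs",
    )
    print(json.dumps(event, indent=2))  # inspect the payload; send_event(event) would fire it
```

In practice the team would trigger incidents via Slack or the web UI as described above; this just shows the shape of the underlying event.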

@yuvipanda
Member Author

yuvipanda commented Sep 8, 2022

Update: Moved to 2i2c-org/infrastructure#1118 (comment) as it is unrelated to this PR

@yuvipanda
Member Author

This made me realize that I could reduce the scope of this PR to just replacing GitHub with PagerDuty - that alone would've caught that the incident from last week wasn't actually 'over', and would have prevented today's UToronto outage!

@yuvipanda
Member Author

2i2c-org/infrastructure#1118 (comment) has follow-up tasks

@yuvipanda
Member Author

I've tried to follow the incident response suggestions here, and created one for the UToronto outage yesterday: https://2i2c-org.pagerduty.com/postmortems/171317d6-5f19-7511-7d3a-117b13f62584

@yuvipanda yuvipanda requested review from choldgraf and a team September 9, 2022 19:49
Member

@choldgraf choldgraf left a comment


This seems like a really helpful service to use in order to provide some structure to our incident response process. In general, it looks good to me. I had a few comments and suggestions throughout.

One concern I have is that this is a fairly involved process. How can we ensure that we reliably and diligently follow this process?

Argh - I took a pass and added some suggested edits to break up the lists into different sections, but accidentally pushed to your branch instead of making a PR. Happy to discuss, and I'll provide some comments below to focus discussion.

7. **Investigate and resolve the incident**. The Incident Commander should follow the structure of the incident issue opened in the step above.
8. **Delegate to Subject Matter Experts as-needed**. The Incident Commander is empowered to delegate actions to Subject Matter Experts in order to investigate and resolve the incident quickly.[^note-on-delegation]
9. **Communicate our status every few hours**. The {term}`External Liaison` is expected to communicate incident status and plan with the {term}`Community Representative`s. They should provide periodic updates that describe the current state of the incident, what we have tried, and our intended next steps. Here is a canned response to get started:
2. **Trigger an incident in PagerDuty**. Below are instructions for doing so via [the 2i2c Slack](incidents:communications).
Member

We should have a team learning session for how to use the PagerDuty interface. Could somebody record themselves going through the full PagerDuty workflow as part of these instructions? (maybe we could use something like https://loom.com/ ?)


- [`2i2c-org.pagerduty.com`](https://2i2c-org.pagerduty.com/) is a dashboard for managing incidents.
This is the "source of truth" for any active or historical incidents.
- [The `#pagerduty-notifications` Slack channel](https://2i2c.slack.com/archives/C041E05LVHB) is where we control PagerDuty and have discussion about an incident. This allows us to have an easily-accessible communication channel for incidents. In general, most interactions with PagerDuty should be via this channel.
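For reference (assuming the standard PagerDuty Slack integration - verify against PagerDuty's own documentation), most interactions in that channel go through the `/pd` slash command:

```text
/pd trigger   # open a dialog to trigger a new incident
/pd oncall    # show who is currently on call
/pd help      # list available PagerDuty commands
```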
Member

Ah we should also add a bullet-point for incident-specific channels since we create those below.

Member Author

Done, and reworded this a little


3. **Try resolving the issue** and communicate on the incident-specific channel while you gather information and perform actions - even if only to mark these as notes to yourself.
4. **Delegate to Subject Matter Experts as-needed**. The Incident Commander is empowered to delegate actions to Subject Matter Experts in order to investigate and resolve the incident quickly.[^note-on-delegation]
5. **Communicate our status every few hours**. The {term}`External Liaison` is expected to communicate incident status and plan with the {term}`Community Representative`s. They should provide periodic updates that describe the current state of the incident, what we have tried, and our intended next steps. Here is a canned response to get started:
Member

How do we define the External Liaison? Before, we said "use the issue" for this, but I don't think there's a way to explicitly say this in PagerDuty.

Member Author

@choldgraf I've updated with some suggestions

- Mark the incident as "Resolved" in PagerDuty.
- Mark the FreshDesk ticket as {guilabel}`Closed`.
7. **Create an incident report**.
See [](incidents:create-report) for more information.
Member

I moved all of this into a dedicated section so it didn't clutter these to-do lists too much.

## Create an Incident Report

Once the incident is resolved, we must create an {term}`Incident Report`.
The **Incident Commander** is responsible for making sure the Incident Report is completed, even though they may not be the person doing it.
Member

When can the incident commander assign reporting duties to somebody else? I think if we don't define this then they will always be the ones that end up doing it. If we have this documented elsewhere we should cross-link it

Member Author

@choldgraf tried to clarify this.

Member

This looks good to me, but I think an unanswered question is when and why would a person other than the incident commander be the one to fill this out? Should we encourage the engineer most-involved in resolving the incident to fix this? Should we try to rotate between members? Should we encourage the incident commander to do this most of the time but only ask somebody else if they really can't do it in a timely fashion?

Member Author

@choldgraf I think our experience so far has been that all three roles are basically the same person, and our wide timezone spread has made delegation difficult as well. I think this is an important question to answer, but it is heavily constrained by our timezone setup. Do you think this should block this PR?

Incidents are **always** caused by systemic issues, and hence solutions must be systemic too.
Go out of your way to make sure there is no finger-pointing.

We use [PagerDuty's postmortem feature](https://support.pagerduty.com/docs/postmortems) to create the Incident Report.
Member

This documentation suggests that this requires a paid plan. Is that true for us? If so we should make an explicit decision that we wish to pay for this: https://support.pagerduty.com/docs/postmortems

Member

Nice! One question: what if we began by using PagerDuty for the incident itself, and GitHub to define the post-mortem content?

The reason I mention this is because (aside from the cost questions), it might solve two other problems:

  1. Making the post-mortem easily accessible. If we used GitHub issues for this (as we do now), it is quite easy to search through all of our historical records of post-mortems.
  2. Tying the post-mortem to the issues we create. The post-mortem almost always involves creating a bunch of new GitHub issues for follow-up, and if we used GitHub for the post-mortem content itself, it would allow us to cross-reference issues / PRs more easily.

If over time it seems like this doesn't work for us, we could always upgrade to a business plan.

Member

> If we used GitHub issues for this (as we do now), it is quite easy to search through all of our historical records of post-mortems.

I would push back on this being "easy" a little bit. We do have to sift them out from all of the other issues going on in the infrastructure repo, and it's a very busy repo. We could transfer those issues to https://github.com/2i2c-org/incident-reports though.

Member

When I do this I just filter by the Hub Incident label. At least this restricts the view to only the "incident" issues. Agreed, it would be nicer if it were like a little Jupyter Book site or something.

Member Author

I agree with @sgibson91! IMO, it gets completely lost in GitHub and it's really difficult to find - especially as it is an issue comment that gets edited. It is also an incredibly manual process right now, and after trying out the process once in PagerDuty I think using it is going to make postmortems far more likely. Also, you can go to PagerDuty and see a list of all incident reports: https://2i2c-org.pagerduty.com/postmortems.

In the future, I think we can use PagerDuty for escalations and automated alerts as well. So regardless of what we do, I really want to move away from 'edit a GitHub issue comment' as the model for how we do incident reports.

9. After sufficient review, and if the Incident Commander is happy with its completeness, **mark the Status dropdown as "Reviewed"**.
10. Download the PDF, and add it to the `2i2c-org/infrastructure` repository under the `incidents/` directory. This makes sure our incidents are all *public*, so others can learn from them as well.

% Is there a way to share incidents in a way that doesn't require adding a binary blob to our repository? I think this generates extra toil in a process that already has a lot of toil, and also adds some clunkiness to git-based workflows. For example, could we have a public Google Drive folder where we drag/drop incident reports?
Member

Signal-boosting this question from when I was making edits. What do you think about a Google Drive folder that exists just for the purpose of sharing incident reports?

If not that, maybe we could have a dedicated documentation site for this instead of using infrastructure/?

Member Author

I'm ok with it being in a different repo. Let's just do that, I don't want to use Google Drive for this - it's far less public.

Member Author

@choldgraf I have:

  1. created https://github.com/2i2c-org/incident-reports
  2. created infrastructure#1703 ("Remove incident reports") to move the one existing incident report from the infrastructure repo over there
  3. modified this PR to point to that.


Below are some tips and crucial information that is needed for a useful and thorough incident timeline.

The timeline should include:
Member

I think the easiest thing we can do is link to pre-existing timelines that are well-written. Then people can just riff off of the structure of those

Member Author

@choldgraf I agree! Once we have a few more incidents we can incorporate that here.

@yuvipanda
Member Author

Thanks a lot @sgibson91 and @choldgraf!

@choldgraf I think I've addressed all your concerns.

Re: learning to use PagerDuty, there is a lot of documentation they maintain at https://university.pagerduty.com/ - would that be enough? I'm also happy to try to make a video, but I don't want to block this PR on that.

This process is IMO actually simpler than what we have now, because:

  1. during an incident, all interactions happen via Slack (which we already use!)
  2. constructing a timeline becomes much easier - it's a fully manual process now
  3. it's easier to see when an incident can be marked as 'completed', which is a bit more difficult with GitHub now.

We will adjust this as we go along! However, I wrote this document, so obviously I'll feel this way :D I'd love to hear from others if they think this is too complex!

Member

@GeorgianaElena GeorgianaElena left a comment


> We should have a team learning session for how to use the PagerDuty interface. Could somebody record themselves going through the full PagerDuty workflow as part of these instructions? (maybe we could use something like https://loom.com/ ?)

Maybe a demo during the next monthly team meeting?

- `Opened the cloud console and discovered notifications about quota`.

Pasting in commands is very helpful!
This is an important way for team members to learn from each other - what you take for granted is perhaps news to someone else, or you might learn alternate ways of doing things!
Member

❤️

Member

@choldgraf choldgraf left a comment


I took another pass and left some short-ish comments. I think this looks pretty good to me, provided that the @2i2c-org/tech-team is on board!

My main question is whether we could try using PagerDuty just for the incident handling, and stick with our current GitHub Issues-based postmortem process - see below for some rationale for this, I'm curious what folks think.


@yuvipanda
Member Author

@choldgraf IMO, the postmortem process is one of the main reasons to use PagerDuty, and I'd really like to try it out this way! In the future, I think automated alerts (and alerting / escalation) should also go through here.

@choldgraf
Member

Sounds good - I defer to the @2i2c-org/tech-team's wishes on this one. If y'all think this is the right system then we should go for it.

@yuvipanda
Member Author

@choldgraf I've added a link to examples too

Member

@choldgraf choldgraf left a comment


I think that we are all in agreement that this is a good direction to move towards. We've also already paid for a years-worth of accounts for PagerDuty. I think that we should merge this PR in and begin iterating from there.

@yuvipanda yuvipanda merged commit e1dba00 into main Oct 4, 2022
@yuvipanda
Member Author

I agree @choldgraf! Merged!

@damianavila damianavila deleted the pagerduty-team branch October 11, 2022 21:24