Replace GitHub with PagerDuty in our Incident Response process #508
Update: Moved to 2i2c-org/infrastructure#1118 (comment) as it is unrelated to this PR
This made me realize that I could reduce the scope of this PR to just replacing GitHub with PagerDuty - that alone would've caught that the incident from last week wasn't actually 'over', and would have prevented today's UToronto outage!
2i2c-org/infrastructure#1118 (comment) has follow-up tasks
I've tried to follow the incident response suggestions here, and created one for the UToronto outage yesterday: https://2i2c-org.pagerduty.com/postmortems/171317d6-5f19-7511-7d3a-117b13f62584
Co-authored-by: Sarah Gibson <[email protected]>
This seems like a really helpful service to use in order to provide some structure to our incident response process. In general, it looks good to me. I had a few comments and suggestions throughout.
One concern I have is that this is a fairly involved process. How can we ensure that we reliably and diligently follow this process?
Argh, I took a pass and added some suggested edits to break the lists up into different sections, but accidentally pushed to your branch instead of making a PR. Happy to discuss, and I'll provide some comments below to focus discussion.
7. **Investigate and resolve the incident**. The Incident Commander should follow the structure of the incident issue opened in the step above.
8. **Delegate to Subject Matter Experts as-needed**. The Incident Commander is empowered to delegate actions to Subject Matter Experts in order to investigate and resolve the incident quickly.[^note-on-delegation]
9. **Communicate our status every few hours**. The {term}`External Liason` is expected to communicate incident status and plan with the {term}`Community Representative`s. They should provide periodic updates that describe the current state of the incident, what we have tried, and our intended next steps. Here is a canned response to get started:
2. **Trigger an incident in PagerDuty**. Below are instructions for doing so via [the 2i2c slack](incidents:communications).
We should have a team learning session for how to use the PagerDuty interface. Could somebody record themselves going through the full PagerDuty workflow as part of these instructions? (maybe we could use something like https://loom.com/ ?)
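As an illustrative aside (this is not part of the Slack-based flow the PR documents): an incident can also be triggered programmatically through PagerDuty's public Events API v2, which may be handy for scripting or for testing the setup. A minimal Python sketch, assuming the `requests` library and a placeholder routing key taken from a service's Events API v2 integration:

```python
# Minimal sketch: trigger a PagerDuty incident via the public Events API v2.
# The routing key below is a hypothetical placeholder; a real one comes from
# a service's "Events API v2" integration in PagerDuty.
import requests

EVENTS_API_URL = "https://events.pagerduty.com/v2/enqueue"
ROUTING_KEY = "YOUR_EVENTS_API_V2_ROUTING_KEY"  # placeholder

payload = {
    "routing_key": ROUTING_KEY,
    "event_action": "trigger",
    "payload": {
        "summary": "Hub outage: users cannot start servers",
        "source": "manual-report",
        "severity": "critical",
    },
}

response = requests.post(EVENTS_API_URL, json=payload, timeout=30)
response.raise_for_status()
print(response.json())  # includes a dedup_key identifying the triggered alert
```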
projects/managed-hubs/incidents.md
- [`2i2c-org.pagerduty.com`](https://2i2c-org.pagerduty.com/) is a dashboard for managing incidents.
  This is the "source of truth" for any active or historical incidents.
- [The `#pagerduty-notifications` Slack channel](https://2i2c.slack.com/archives/C041E05LVHB) is where we control PagerDuty and have discussion about an incident. This allows us to have an easily-accessible communication channel for incidents. In general, most interactions with PagerDuty should be via this channel.
Ah, we should also add a bullet point for incident-specific channels since we create those below.
Done, and reworded this a little
projects/managed-hubs/incidents.md
3. **Try resolving the issue** and communicate on the incident-specific channel while you gather information and perform actions - even if only to mark these as notes to yourself.
4. **Delegate to Subject Matter Experts as-needed**. The Incident Commander is empowered to delegate actions to Subject Matter Experts in order to investigate and resolve the incident quickly.[^note-on-delegation]
5. **Communicate our status every few hours**. The {term}`External Liason` is expected to communicate incident status and plan with the {term}`Community Representative`s. They should provide periodic updates that describe the current state of the incident, what we have tried, and our intended next steps. Here is a canned response to get started:
How do we define the External Liason? Before we said "use the issue" for this but I don't think there's a way to explicitly say this in PagerDuty
@choldgraf I've updated with some suggestions
- Mark the incident as "Resolved" in pagerduty.
- Mark the FreshDesk ticket as {guilabel}`Closed`.
7. **Create an incident report**.
   See [](incidents:create-report) for more information.
I moved all of this into a dedicated section so it didn't clutter these to-do lists too much.
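A side note on the "mark the incident as Resolved" step above (an assumption on my part, not something this PR prescribes): resolution can also be done through PagerDuty's REST API rather than the web UI or Slack. A rough Python sketch, with placeholder token, incident ID, and user email:

```python
# Rough sketch: resolve an existing PagerDuty incident via the REST API v2.
# The token, incident ID, and email below are placeholders, not real values.
import requests

API_TOKEN = "YOUR_PAGERDUTY_API_TOKEN"   # placeholder
INCIDENT_ID = "PABC123"                  # placeholder incident ID
FROM_EMAIL = "responder@example.org"     # must be a valid PagerDuty user email

resp = requests.put(
    f"https://api.pagerduty.com/incidents/{INCIDENT_ID}",
    headers={
        "Authorization": f"Token token={API_TOKEN}",
        "Content-Type": "application/json",
        "From": FROM_EMAIL,
    },
    json={"incident": {"type": "incident_reference", "status": "resolved"}},
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["incident"]["status"])  # expected: "resolved"
```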
projects/managed-hubs/incidents.md
## Create an Incident Report

Once the incident is resolved, we must create an {term}`Incident Report`.
The **Incident Commander** is responsible for making sure the Incident Report is completed, even though they may not be the person doing it.
When can the incident commander assign reporting duties to somebody else? I think if we don't define this then they will always be the ones that end up doing it. If we have this documented elsewhere we should cross-link it
@choldgraf tried to clarify this.
This looks good to me, but I think an unanswered question is when and why a person other than the Incident Commander would be the one to fill this out. Should we encourage the engineer most involved in resolving the incident to fill it out? Should we try to rotate between members? Should we encourage the Incident Commander to do this most of the time, and only ask somebody else if they really can't do it in a timely fashion?
@choldgraf I think in our experience so far, all three roles have basically been the same person, and I think our wide timezone spread has made it difficult for delegation to happen as well. I think this is an important question to answer, but it is heavily constrained by our timezone setup. Do you think this should block this PR?
Incidents are **always** caused by systemic issues, and hence solutions must be systemic too.
Go out of your way to make sure there is no finger-pointing.

We use [PagerDuty's postmortem feature](https://support.pagerduty.com/docs/postmortems) to create the Incident Report.
This documentation suggests that this requires a paid plan. Is that true for us? If so we should make an explicit decision that we wish to pay for this: https://support.pagerduty.com/docs/postmortems
Nice! One question: what if we began by using PagerDuty for the incident itself, and GitHub to define the post-mortem content?
The reason I mention this is because (aside from the cost questions), it might solve two other problems:
- Making the post-mortem easily accessible. If we used GitHub issues for this (as we do now), it is quite easy to search through all of our historical records of post-mortems.
- Tying the post-mortem to the issues we create. The post-mortem almost always involves the creation of a bunch of new github issues etc for follow-up, and if we used GitHub for the post-mortem content itself, it would allow us to cross-reference to issues / PRs more easily.
If over time it seems like this doesn't work for us, we could always upgrade to a business plan.
> If we used GitHub issues for this (as we do now), it is quite easy to search through all of our historical records of post-mortems.
I would push back on this being "easy" a little bit. We do have to sift them out from all of the other issues going on in the infrastructure repo, and it's a very busy repo. We could transfer those issues to https://github.com/2i2c-org/incident-reports though.
When I do this I just filter by the Hub Incident label. At least this restricts the search to only the "incident" issues... I agree it would be nicer if it were like a little Jupyter Book site or something.
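For what it's worth, that label filter can also be scripted against GitHub's public search API; a small Python sketch (the repository and label come from this thread, the rest is illustrative):

```python
# Small sketch: list issues carrying the "Hub Incident" label via GitHub's search API.
import requests

query = 'repo:2i2c-org/infrastructure is:issue label:"Hub Incident"'
resp = requests.get(
    "https://api.github.com/search/issues",
    params={"q": query, "per_page": 100},
    headers={"Accept": "application/vnd.github+json"},
    timeout=30,
)
resp.raise_for_status()
for issue in resp.json()["items"]:
    print(issue["number"], issue["title"], issue["html_url"])
```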
I agree with @sgibson91! IMO, it gets completely lost in GitHub and it's really difficult to find - especially as it is an issue comment that gets edited. It is also an incredibly manual process right now, and after trying out the process once in pagerduty I think using it is going to make postmortems far more likely. Also you can go to pagerduty and see a list of all incident reports - https://2i2c-org.pagerduty.com/postmortems.
In the future, I think we can use pagerduty for escalations and automated alerts as well. So regardless of what we do, I really want to move away from 'edit a github issue comment' as the model for how we do incident reports.
projects/managed-hubs/incidents.md
9. After sufficient review, and if the Incident Commander is happy with its completeness, **mark the Status dropdown as "Reviewed"**.
10. Download the PDF, and add it to the `2i2c/infrastructure` repository under the `incidents/` directory. This makes sure our incidents are all *public*, so others can learn from them as well.

% Is there a way to share incidents in a way that doesn't require adding a binary blob to our repository? I think this generates extra toil in a process that already has a lot of toil, and also adds some clunkiness to git-based workflows. For example, could we have a public Google Drive folder where we drag/drop incident reports?
Signal-boosting this question from when I was making edits. What do you think about a Google Drive folder that exists just for the purpose of sharing incident reports?
If not that, maybe we could have a dedicated documentation site for this instead of using `infrastructure/`?
I'm ok with it being in a different repo. Let's just do that, I don't want to use Google Drive for this - it's far less public.
@choldgraf I have:
- created https://github.com/2i2c-org/incident-reports
- created infrastructure#1703 ("Remove incident reports") to move the one existing incident report from the infrastructure repo over there
- modified this PR to point to that.
Below are some tips and crucial information that is needed for a useful and thorough incident timeline.

The timeline should include:
I think the easiest thing we can do is link to pre-existing timelines that are well-written. Then people can just riff off of the structure of those
@choldgraf I agree! Once we have a few more incidents we can incorporate that here.
Thanks a lot @sgibson91 and @choldgraf! @choldgraf I think I've addressed all your concerns. Re: learning to use PagerDuty, there is a lot of documentation they maintain at https://university.pagerduty.com/ - would that be enough? I'm also happy to try to make a video, but do not want to block this PR on that. This process IMO is actually simpler than what we have now, because:
We will adjust this as we go along! However, I wrote this document so obviously I'll feel this way :D Would love to hear from others if they think this is too complex!
> We should have a team learning session for how to use the PagerDuty interface. Could somebody record themselves going through the full PagerDuty workflow as part of these instructions? (maybe we could use something like https://loom.com/ ?)
Maybe a demo during the next monthly team meeting?
- `Opened the cloud console and discovered notifications about quota`.

Pasting in commands is very helpful!
This is an important way for team members to learn from each other - what you take for granted is perhaps news to someone else, or you might learn alternate ways of doing things!
❤️
Co-authored-by: Georgiana Elena <[email protected]>
I took another pass and left some short-ish comments. I think this looks pretty good to me, provided that the @2i2c-org/tech-team is on board!
My main question is whether we could try using PagerDuty just for the incident handling, and stick with our current GitHub Issues-based postmortem process - see my comment above for some rationale. I'm curious what folks think.
@choldgraf IMO, the postmortem process is one of the main reasons to use PagerDuty, and I'd really like to try it out this way! In the future, I think automated alerts (and alerting / escalation) should also go through here.
Co-authored-by: Chris Holdgraf <[email protected]>
Sounds good - I defer to the @2i2c-org/tech-team's wishes on this one. If y'all think this is the right system then we should go for it.
@choldgraf I've added a link to examples too
I think that we are all in agreement that this is a good direction to move towards. We've also already paid for a year's worth of accounts for PagerDuty. I think that we should merge this PR in and begin iterating from there.
I agree @choldgraf! Merged!
PagerDuty is specifically tailored towards handling incidents, so let's use that rather than try to rig an incident response process on top of GitHub ourselves. I also want us to focus on the incident response team primarily accessing PagerDuty via their Slack integration, rather than having to swap out to GitHub.

This PR only seeks to replace GitHub with PagerDuty / Slack, and makes no other changes for now.
A summary of the changes:
- Incident responders interact with PagerDuty primarily via Slack, although they can do so via the web interface too
- We use PagerDuty's postmortem feature to collect information and make a postmortem. This has several features that make this process much easier than editing comments on GitHub.
Ref 2i2c-org/infrastructure#1118