From 3d0eb1189d4476bf9eecdffebaf0c3ce136b945f Mon Sep 17 00:00:00 2001
From: YuviPanda
Date: Wed, 7 Sep 2022 18:26:06 -0700
Subject: [PATCH 01/16] Inital rewriting of the incident process to use pagerduty

---
 projects/managed-hubs/incidents.md | 53 +++++++++++++++++++-----------
 1 file changed, 34 insertions(+), 19 deletions(-)

diff --git a/projects/managed-hubs/incidents.md b/projects/managed-hubs/incidents.md
index 34293ef0..8041b83d 100644
--- a/projects/managed-hubs/incidents.md
+++ b/projects/managed-hubs/incidents.md
@@ -75,19 +75,31 @@ Here is the process that we follow for incidents:
    Incident first response template
    ```

-2. **Open an incident issue**.
-   For each {term}`Incident` we create a dedicated issue to track its progress. [{bdg-primary}`open an incident issue`](https://github.com/2i2c-org/infrastructure/issues/new?assignees=&labels=type%3A+Hub+Incident%2Csupport&template=3_incident-report.md&title=%5BIncident%5D+%7B%7B+TITLE+%7D%7D) and notify our engineering team via Slack.
-3. **Try resolving the issue** and take notes while you gather information about it.
-4. **If after 30 minutes the issue is not solved or you know you cannot resolve it**
-   - Ping our engineering team and our Project Manager in the {guilabel}`#support-freshdesk` channel so that they are aware of the incident.
-   - Add the incident issue to [our team backlog](https://github.com/orgs/2i2c-org/projects/22/).
-5. **Designate an {term}`Incident Commander`**. Do this in the Incident issue. By default, this is the Support Steward.
-   - Confirm that the Incident Commander has the bandwidth and ability to serve in this role.
-   - If not, delegate this to another team member.[^note-on-delegation]
-6. **Designate an {term}`External Liason`**. Do this in the Incident issue. By default, this is the Incident Commander, though they may delegate this to others.[^note-on-delegation]
-7. **Investigate and resolve the incident**. The Incident Commander should follow the structure of the incident issue opened in the step above.
-8. **Delegate to Subject Matter Experts as-needed**. The Incident Commander is empowered to delegate actions to Subject Matter Experts in order to investigate and resolve the incident quickly.[^note-on-delegation]
-9. **Communicate our status every few hours**. The {term}`External Liason` is expected to communicate incident status and plan with the {term}`Community Representative`s. They should provide periodic updates that describe the current state of the incident, what we have tried, and our intended next steps. Here is a canned response to get started:
+2. **Trigger an incident in PagerDuty**, using the 2i2c slack so we have a central location to discuss the incident.
+   Use `/pd trigger` in the {guilabel}`#pagerduty-notifications` channel on the 2i2c slack to trigger the incident -
+   after you type the command and hit `enter`, you should get a dialog box with options.
+
+   For "Impacted Service", select "Managed JupyterHubs". We can have more fine-grained services here later if we wish.
+
+   Assign it to whoever is the **Incident Commander**. This is by default one of the support stewards or whoever is
+   triggering the event, but not necessarily[^note-on-delegation]!
+
+   Provide a descriptive but short Title, but don't sweat it too much!
+
+   If there is a freshdesk ticket for this, provide a link to that in the description.
+
+   Check the box for "Create a dedicated Public Slack channel for this incident" to create a *new slack channel*
+   for discussing the incident. This helps keep chatter off other channels *and* provides an easy location to gather
+   information for the incident report afte the fact.
+
+   This officially marks the beginning of an incident, and will help make sure we don't accidentally miss steps during
+   or after the incident.
+
+3. **Try resolving the issue** and communicate on the incident specific channel while you gather information and perform
+   actions - even if only to mark these as notes to yourself.
+4. **Designate an {term}`External Liason`**. Do this in the Incident issue. By default, this is the Incident Commander, though they may delegate this to others.[^note-on-delegation]
+5. **Delegate to Subject Matter Experts as-needed**. The Incident Commander is empowered to delegate actions to Subject Matter Experts in order to investigate and resolve the incident quickly.[^note-on-delegation]
+6. **Communicate our status every few hours**. The {term}`External Liason` is expected to communicate incident status and plan with the {term}`Community Representative`s. They should provide periodic updates that describe the current state of the incident, what we have tried, and our intended next steps. Here is a canned response to get started:

    ```{button-link} https://2i2c.freshdesk.com/a/admin/canned_responses/folders/80000143608/responses/80000247492/edit
    :color: primary
@@ -95,10 +107,13 @@ Here is the process that we follow for incidents:
    Incident update template
    ```

-9. **Communicate when the incident is resolved**. When we believe the incident is resolved, communicate with the Community Representative that things should be back to normal. Mark the FreshDesk ticket as {guilabel}`Resolved`.
-10. **Fill in the {term}`Incident Report`**. The Incident Commander should do this in partnership with the Incident Response Team.
-11. **Close the incident ticket**. Once we have confirmation from the community (or no response after 48 working hours), and have filled in the incident {term}`Incident Report`, then close the incident by:
-   - Closing the incident issue on GitHub
+7. **Communicate when the incident is resolved**. When we believe the incident
+   is resolved, communicate with the Community Representative that things should be
+   back to normal. Mark the FreshDesk ticket as {guilabel}`Resolved`.
+8. **Fill in the {term}`Incident Report`**. The Incident Commander should do this in partnership with the Incident Response Team. Use the
+   messages in the slack channel to gather information about what happened when.
+9. **Mark the incident as resolved**. Once we have confirmation from the community (or no response after 48 working hours), and have filled in the incident {term}`Incident Report`, then close the incident by:
+   - Marking the incident as "Resolved" in pagerduty.
    - Marking the FreshDesk ticket as {guilabel}`Closed`

 [^note-on-delegation]: If you cannot find somebody to take on this work, or feel uncomfortable delegating, the {term}`Project Manager` should help you, and is empowered to delegate on your behalf.
@@ -112,8 +127,8 @@ This is encouraged and expected, especially for more complex or longer incidents
 To designate another team member as the Incident Commander, follow these steps:

 1. **Confirm with them** that they are able and willing to serve as the Incident Commander.
-2. **Update the Incident Report issue** by updating the Incident Commander name in the top comment.
-3. **Notify the team** with a comment in the Incident Report issue.
+2. **Reassign the incident on PagerDuty** to the new commander. This should produce a message in the slack channel for this event,
+   thus communicating this change to the rest of the team.

 ## Key terms

From 957d4096390b55fcb34e228e1c88846cf26bda1f Mon Sep 17 00:00:00 2001
From: YuviPanda
Date: Thu, 8 Sep 2022 00:30:15 -0700
Subject: [PATCH 02/16] Use PagerDuty for incident reports

---
 projects/managed-hubs/incidents.md | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/projects/managed-hubs/incidents.md b/projects/managed-hubs/incidents.md
index 8041b83d..20bd7d06 100644
--- a/projects/managed-hubs/incidents.md
+++ b/projects/managed-hubs/incidents.md
@@ -110,8 +110,10 @@ Here is the process that we follow for incidents:
 7. **Communicate when the incident is resolved**. When we believe the incident
    is resolved, communicate with the Community Representative that things should be
    back to normal. Mark the FreshDesk ticket as {guilabel}`Resolved`.
-8. **Fill in the {term}`Incident Report`**. The Incident Commander should do this in partnership with the Incident Response Team. Use the
-   messages in the slack channel to gather information about what happened when.
+8. **Fill in the {term}`Incident Report`**. The Incident Commander should do this in partnership with the Incident Response Team. We
+   use PagerDuty's [postmortem](https://support.pagerduty.com/docs/postmortems) functionality to create the incident report. This
+   allows us to easily incorporate notes and slack messages sent to pagerduty during the course of the incident, drastically reducing
+   the amount of effort required to create the incident report.
 9. **Mark the incident as resolved**. Once we have confirmation from the community (or no response after 48 working hours), and have filled in the incident {term}`Incident Report`, then close the incident by:
    - Marking the incident as "Resolved" in pagerduty.
    - Marking the FreshDesk ticket as {guilabel}`Closed`
From dc5a9bb05844df51b34a74509fa67ff10379811d Mon Sep 17 00:00:00 2001
From: YuviPanda
Date: Thu, 8 Sep 2022 12:07:52 -0700
Subject: [PATCH 03/16] Add more details about making incident reports

---
 projects/managed-hubs/incidents.md | 69 ++++++++++++++++++++++++++----
 1 file changed, 61 insertions(+), 8 deletions(-)

diff --git a/projects/managed-hubs/incidents.md b/projects/managed-hubs/incidents.md
index 20bd7d06..5dab80ee 100644
--- a/projects/managed-hubs/incidents.md
+++ b/projects/managed-hubs/incidents.md
@@ -109,17 +109,70 @@ Here is the process that we follow for incidents:

 7. **Communicate when the incident is resolved**. When we believe the incident
    is resolved, communicate with the Community Representative that things should be
-   back to normal. Mark the FreshDesk ticket as {guilabel}`Resolved`.
-8. **Fill in the {term}`Incident Report`**. The Incident Commander should do this in partnership with the Incident Response Team. We
-   use PagerDuty's [postmortem](https://support.pagerduty.com/docs/postmortems) functionality to create the incident report. This
-   allows us to easily incorporate notes and slack messages sent to pagerduty during the course of the incident, drastically reducing
-   the amount of effort required to create the incident report.
-9. **Mark the incident as resolved**. Once we have confirmation from the community (or no response after 48 working hours), and have filled in the incident {term}`Incident Report`, then close the incident by:
-   - Marking the incident as "Resolved" in pagerduty.
-   - Marking the FreshDesk ticket as {guilabel}`Closed`
+   back to normal.
+   - Marking the incident as "Resolved" in pagerduty.
+   - Marking the FreshDesk ticket as {guilabel}`Closed`

 [^note-on-delegation]: If you cannot find somebody to take on this work, or feel uncomfortable delegating, the {term}`Project Manager` should help you, and is empowered to delegate on your behalf.

+## Creating the Incident Report
+
+Once the incident is resolved, we must create an {term}`Incident Report`. This helps us understand what went wrong,
+and how we can improve our systems to prevent a recurrance. This is a *very important* part of making our infrastructure
+and human processes more stable and stress free over time, so we should try to do this after each incident. The
+**Incident Commander** is responsible for making sure the Incident Report is done, even though they may not be the
+person doing it.
+
+Note that we *must* practice a [blameless culture](https://www.blameless.com/sre/what-are-blameless-postmortems-do-they-work-how)
+around incident reports - Incidents are *always* caused by systemic issues, and hence solutions must be systemic
+too. Go out of your way to make sure there is no finger-pointing.
+
+We use PagerDuty's [postmortem](https://support.pagerduty.com/docs/postmortems) feature to create the Incident Report.
+This lets us use notes, status updates from pagerduty as well as messages from Slack easily in the incident report!
+
+1. Open the incident in the PagerDuty web interface, and Click the "New Postmortem Report" button on top. The incident
+   needs to be already resolved before this feature is available.
+
+2. The "Owner of the Review Process" should be set to the Incident Commander, or someone else they delegate to explicitly.
+
+3. Fill out the "Impact Start Time" to be our best guess for when the incident started (*not* when the report came in), and
+   the "Impact End Time" to be when service was restored. Best guesses will do!
+
+4. Add the slack channel we created for this incident as a "Data Source", filled in with an appropriate time to cover all
+   the messages there. You can add other channels too if there was conversation there about the incident. Click "Save Data Sources"
+   to populate the timeline below with messages from the slack channels.
+
+5. Fill out the timeline! The goal is to be concise but make it possible for someone reading it to answer "what happened, and when?".
+   The timeline should include:
+
+   1. The beginning of the impact.
+   2. When the incident was brought to our attention, with a link to the source (Freshdesk ticket, slack message, etc).
+   3. When we responded to the incident. This would coincide with the creation of the PagerDuty incident.
+   4. Various debugging actions performed to ascertain the cause of the issue. Talking to yourself as you do this on the
+      slack channel helps a lot here, as it helps communicate your methods to others on the team as well as help improve
+      processes in the future more easily.
+
+      Examples here would be things like `Looked at hub logs with "kubectl logs -n temple -l component=hub" and found ` or
+      `Opened the cloud console and discovered notifications about quota". Pasting in commands is very helpful! This is an
+      important way for team members to learn from each other - what you take for granted is perhaps news to someone else, or
+      you might learn alternate ways of doing things!
+   5. Actions taken to attempt to fix the issue, and their outcome. Paste commands executed if possible, as well as any
+      GitHub PRs made. Putting this in Slack again helps.
+   6. Any extra communication from the community affected that helped.
+   7. Whenever the impact was fixed, and how that was verified.
+   8. Whatever else you think would be helpful to someone who finds this incident report a few months from now, trying to fix a
+      similar incident.
+
+6. Fill out the "Analysis" section to the extent possible. In particular, the "Action Items" should be a list with items
+   linked out to GitHub issues created for follow-up. Perfection is the enemy of the good here.
+
+7. Review the report, and if the Incident Commander is happy with its completeness, mark the Status dropdown up top as "Reviewed".
+
+8. Click "Save & View Report" button.
+
+9. Download the PDF, and add it to the `2i2c/infrastrtucture` repository under the `incidents/` directory. This make sure our
+   incidents are all *public*, so others can learn from them as well.
+
 ## Handing off Incident Commander status

 During an incident, it may be necessary to designate another person to be the Incident Commander.
From a7d90a3407f37f4aaf1d88ab7999b6a316f7a005 Mon Sep 17 00:00:00 2001
From: YuviPanda
Date: Fri, 9 Sep 2022 11:11:12 -0700
Subject: [PATCH 04/16] Add a process item for making sure review happens

---
 projects/managed-hubs/incidents.md | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/projects/managed-hubs/incidents.md b/projects/managed-hubs/incidents.md
index 5dab80ee..cac2310b 100644
--- a/projects/managed-hubs/incidents.md
+++ b/projects/managed-hubs/incidents.md
@@ -164,11 +164,13 @@ This lets us use notes, status updates from pagerduty as well as messages from S
       similar incident.

 6. Fill out the "Analysis" section to the extent possible. In particular, the "Action Items" should be a list with items
-   linked out to GitHub issues created for follow-up. Perfection is the enemy of the good here.
+   linked out to GitHub issues created for follow-up. Perfection is the enemy of the good here. Save as you go.

-7. Review the report, and if the Incident Commander is happy with its completeness, mark the Status dropdown up top as "Reviewed".
+7. Click "Save & View Report* when you are done, and ask other members of the incident response team to review the incident report.
+   They might add missing context, additional action items / summary details, or redact information. The person listed as
+   the "Owner of the Review Process" is still responsible for making sure the rest of the process is completed.

-8. Click "Save & View Report" button.
+8. After sufficient review, and if the Incident Commander is happy with its completeness, mark the Status dropdown up top as "Reviewed".

 9. Download the PDF, and add it to the `2i2c/infrastrtucture` repository under the `incidents/` directory. This make sure our
    incidents are all *public*, so others can learn from them as well.
From e27bda697d821685d8bb1db7acfce75c54ae2bac Mon Sep 17 00:00:00 2001
From: Chris Holdgraf
Date: Mon, 12 Sep 2022 05:30:06 -0700
Subject: [PATCH 05/16] Apply suggestions from code review

Co-authored-by: Sarah Gibson <44771837+sgibson91@users.noreply.github.com>
---
 projects/managed-hubs/incidents.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/projects/managed-hubs/incidents.md b/projects/managed-hubs/incidents.md
index cac2310b..396d0ddd 100644
--- a/projects/managed-hubs/incidents.md
+++ b/projects/managed-hubs/incidents.md
@@ -90,7 +90,7 @@ Here is the process that we follow for incidents:

    Check the box for "Create a dedicated Public Slack channel for this incident" to create a *new slack channel*
    for discussing the incident. This helps keep chatter off other channels *and* provides an easy location to gather
-   information for the incident report afte the fact.
+   information for the incident report after the fact.

    This officially marks the beginning of an incident, and will help make sure we don't accidentally miss steps during
    or after the incident.
@@ -153,7 +153,7 @@ This lets us use notes, status updates from pagerduty as well as messages from S
       processes in the future more easily.

       Examples here would be things like `Looked at hub logs with "kubectl logs -n temple -l component=hub" and found ` or
-      `Opened the cloud console and discovered notifications about quota". Pasting in commands is very helpful! This is an
+      `Opened the cloud console and discovered notifications about quota`. Pasting in commands is very helpful! This is an
       important way for team members to learn from each other - what you take for granted is perhaps news to someone else, or
       you might learn alternate ways of doing things!
    5. Actions taken to attempt to fix the issue, and their outcome. Paste commands executed if possible, as well as any
@@ -166,7 +166,7 @@ This lets us use notes, status updates from pagerduty as well as messages from S
 6. Fill out the "Analysis" section to the extent possible. In particular, the "Action Items" should be a list with items
    linked out to GitHub issues created for follow-up. Perfection is the enemy of the good here. Save as you go.

-7. Click "Save & View Report* when you are done, and ask other members of the incident response team to review the incident report.
+7. Click "Save & View Report" when you are done, and ask other members of the incident response team to review the incident report.
    They might add missing context, additional action items / summary details, or redact information. The person listed as
    the "Owner of the Review Process" is still responsible for making sure the rest of the process is completed.

From 39140acf4b2442f3503f592e282b4a036002a67b Mon Sep 17 00:00:00 2001
From: Chris Holdgraf
Date: Sat, 10 Sep 2022 20:25:25 +0200
Subject: [PATCH 06/16] Edits to incident response

---
 projects/managed-hubs/incidents.md | 160 +++++++++++++++--------------
 1 file changed, 81 insertions(+), 79 deletions(-)

diff --git a/projects/managed-hubs/incidents.md b/projects/managed-hubs/incidents.md
index 396d0ddd..264571a4 100644
--- a/projects/managed-hubs/incidents.md
+++ b/projects/managed-hubs/incidents.md
@@ -53,11 +53,12 @@ Subject Matter Experts
 - They may **delegate** this responsibilitiy to another team member if they wish (e.g., to the {term}`Support Steward` team.)
 - We may interact with external stakeholders via comments in Incident Response issues if it helps resolve the incident more quickly.

+(incidents:communications)=
 ### Internal communication

-- The Slack channel [{guilabel}`#support-freshdesk`](https://2i2c.slack.com/archives/C028WU9PFBN) contains real-time communication about support issues. Use this to signal-boost support requests related to {term}`Incidents`.
-- [Issues with the {guilabel}`incident` label](https://github.com/2i2c-org/infrastructure/issues?q=is%3Aopen+label%3A%22type%3A+Hub+Incident%22+sort%3Aupdated-desc) are where we track progress when [resolving incidents](support:incident-response).
-
+- [`2i2c-org.pagerduty.com`](https://2i2c-org.pagerduty.com/) is a dashboard for managing incidents.
+  This is the "source of truth" for any active or historical incidents.
+- [The `#pagerduty-notifications` Slack channel](https://2i2c.slack.com/archives/C041E05LVHB) is where we control PagerDuty and have discussion about an incident. This allows us to have an easily-accessible communication channel for incidents. In general, most interactions with PagerDuty should be via this channel.
 (support:incident-response)=
 ## Incident response process

@@ -75,31 +76,21 @@ Here is the process that we follow for incidents:
    Incident first response template
    ```

-2. **Trigger an incident in PagerDuty**, using the 2i2c slack so we have a central location to discuss the incident.
-   Use `/pd trigger` in the {guilabel}`#pagerduty-notifications` channel on the 2i2c slack to trigger the incident -
-   after you type the command and hit `enter`, you should get a dialog box with options.
-
-   For "Impacted Service", select "Managed JupyterHubs". We can have more fine-grained services here later if we wish.
-
-   Assign it to whoever is the **Incident Commander**. This is by default one of the support stewards or whoever is
-   triggering the event, but not necessarily[^note-on-delegation]!
-
-   Provide a descriptive but short Title, but don't sweat it too much!
+2. **Trigger an incident in PagerDuty**. Below are instructions for doing so via [the 2i2c slack](incidents:communications).
+   - **Type `/pd trigger` and hit `enter`** to trigger the incident.
+     After you hit `enter`, you should get a dialog box with options.
+   - For "Impacted Service", **select `Managed JupyterHubs`**.
+   - **Assign it to the Incident Commander**. By default this is one of the {term}`Support Stewards` or the person triggering the event, but may be delegated to others[^note-on-delegation]!
+   - **Provide a descriptive but short title**, but don't sweat it too much!
+   - **Add a link to the FreshDesk ticket** in the description (if there is one).
+   - **Create a new Slack channel** by checking the box for `Create a dedicated Public Slack channel for this incident`.
+     Use this channel for all conversations about the incident.

-   If there is a freshdesk ticket for this, provide a link to that in the description.
+   This officially marks the beginning of an incident, and will help make sure we don't accidentally miss steps during or after the incident.

-   Check the box for "Create a dedicated Public Slack channel for this incident" to create a *new slack channel*
-   for discussing the incident. This helps keep chatter off other channels *and* provides an easy location to gather
-   information for the incident report after the fact.
-
-   This officially marks the beginning of an incident, and will help make sure we don't accidentally miss steps during
-   or after the incident.
-
-3. **Try resolving the issue** and communicate on the incident specific channel while you gather information and perform
-   actions - even if only to mark these as notes to yourself.
-4. **Designate an {term}`External Liason`**. Do this in the Incident issue. By default, this is the Incident Commander, though they may delegate this to others.[^note-on-delegation]
-5. **Delegate to Subject Matter Experts as-needed**. The Incident Commander is empowered to delegate actions to Subject Matter Experts in order to investigate and resolve the incident quickly.[^note-on-delegation]
-6. **Communicate our status every few hours**. The {term}`External Liason` is expected to communicate incident status and plan with the {term}`Community Representative`s. They should provide periodic updates that describe the current state of the incident, what we have tried, and our intended next steps. Here is a canned response to get started:
+3. **Try resolving the issue** and communicate on the incident-specific channel while you gather information and perform actions - even if only to mark these as notes to yourself.
+4. **Delegate to Subject Matter Experts as-needed**. The Incident Commander is empowered to delegate actions to Subject Matter Experts in order to investigate and resolve the incident quickly.[^note-on-delegation]
+5. **Communicate our status every few hours**. The {term}`External Liason` is expected to communicate incident status and plan with the {term}`Community Representative`s. They should provide periodic updates that describe the current state of the incident, what we have tried, and our intended next steps. Here is a canned response to get started:

    ```{button-link} https://2i2c.freshdesk.com/a/admin/canned_responses/folders/80000143608/responses/80000247492/edit
    :color: primary
@@ -107,73 +98,81 @@ Here is the process that we follow for incidents:
    Incident update template
    ```

-7. **Communicate when the incident is resolved**. When we believe the incident
-   is resolved, communicate with the Community Representative that things should be
-   back to normal.
-   - Marking the incident as "Resolved" in pagerduty.
-   - Marking the FreshDesk ticket as {guilabel}`Closed`
+6. **Communicate when the incident is resolved**. When we believe the incident is resolved, communicate with the Community Representative that things should be back to normal.
+   - Mark the incident as "Resolved" in pagerduty.
+   - Mark the FreshDesk ticket as {guilabel}`Closed`.
+7. **Create an incident report**.
+   See [](incidents:create-report) for more information.

 [^note-on-delegation]: If you cannot find somebody to take on this work, or feel uncomfortable delegating, the {term}`Project Manager` should help you, and is empowered to delegate on your behalf.

-## Creating the Incident Report
+(incidents:create-report)=
+## Create an Incident Report

-Once the incident is resolved, we must create an {term}`Incident Report`. This helps us understand what went wrong,
-and how we can improve our systems to prevent a recurrance. This is a *very important* part of making our infrastructure
-and human processes more stable and stress free over time, so we should try to do this after each incident. The
-**Incident Commander** is responsible for making sure the Incident Report is done, even though they may not be the
-person doing it.
+Once the incident is resolved, we must create an {term}`Incident Report`.
+The **Incident Commander** is responsible for making sure the Incident Report is completed, even though they may not be the person doing it.

-Note that we *must* practice a [blameless culture](https://www.blameless.com/sre/what-are-blameless-postmortems-do-they-work-how)
-around incident reports - Incidents are *always* caused by systemic issues, and hence solutions must be systemic
-too. Go out of your way to make sure there is no finger-pointing.
+We practice a [blameless culture](https://www.blameless.com/sre/what-are-blameless-postmortems-do-they-work-how) around incident reports.
+Incidents are **always** caused by systemic issues, and hence solutions must be systemic too.
+Go out of your way to make sure there is no finger-pointing.

-We use PagerDuty's [postmortem](https://support.pagerduty.com/docs/postmortems) feature to create the Incident Report.
+We use [PagerDuty's postmortem feature](https://support.pagerduty.com/docs/postmortems) to create the Incident Report.
 This lets us use notes, status updates from pagerduty as well as messages from Slack easily in the incident report!

-1. Open the incident in the PagerDuty web interface, and Click the "New Postmortem Report" button on top. The incident
-   needs to be already resolved before this feature is available.
-
-2. The "Owner of the Review Process" should be set to the Incident Commander, or someone else they delegate to explicitly.
+1. **Ensure that the incident is resolved**.
+   If not, refer to the proper step in [](support:incident-response).
+   The incident needs to be resolved before a report can be generated.
+2. **Open the incident** in the PagerDuty web interface, and click the `New Postmortem Report` button on top.
+3. `Owner of the Review Process` should be set to the Incident Commander, or someone else they delegate to explicitly.
+4. `Impact Start Time` is our best guess for when the incident started (*not* when the report came in).
+   `Impact End Time` is when service was restored.
+   Best guesses will do!
+5. **Add Data Sources** that we will use to keep track of the actions that happened around the incident.
+   - Link to the slack channel we created for this incident as a "Data Source", filled in with an appropriate time to cover all the messages there.
+   - Add any other channels where there was conversation there about the incident (e.g., GitHub Issues or Pull Requests).
+
+   Click `Save Data Sources` to populate the timeline below with messages from the slack channels.
+6. **Fill out the timeline**. The goal is to be concise but make it possible for someone reading it to answer "what happened, and when?".
+   See [](incidents:postmortem-timeline) for more information.
+7. **Fill out the "Analysis" section** to the extent possible.
+   In particular, the "Action Items" should be a list with items linked out to GitHub issues created for follow-up.
+   Perfection is the enemy of the good here. Save as you go.
+8. **Click "Save & View Report"** when you are done, and ask other members of the incident response team to review the incident report.
+   They might add missing context, additional action items / summary details, or redact information. The person listed as
+   the "Owner of the Review Process" is still responsible for making sure the rest of the process is completed.
+9. After sufficient review, and if the Incident Commander is happy with its completeness, **mark the Status dropdown as "Reviewed"**.
+10. Download the PDF, and add it to the `2i2c/infrastrtucture` repository under the `incidents/` directory. This make sure our incidents are all *public*, so others can learn from them as well.

-3. Fill out the "Impact Start Time" to be our best guess for when the incident started (*not* when the report came in), and
-   the "Impact End Time" to be when service was restored. Best guesses will do!
+% Is there a way to share incidents in a way that doesn't require adding a binary blob to our repository? I think this generates extra toil in a process that already has a lot of toil, and also adds some clunkiness to git-based workflows. For example, could we have a public Google Drive folder where we drag/drop incident reports?

-4. Add the slack channel we created for this incident as a "Data Source", filled in with an appropriate time to cover all
-   the messages there. You can add other channels too if there was conversation there about the incident. Click "Save Data Sources"
-   to populate the timeline below with messages from the slack channels.

-5. Fill out the timeline! The goal is to be concise but make it possible for someone reading it to answer "what happened, and when?".
-   The timeline should include:
+(incidents:postmortem-timeline)=
+### Writing an incident timeline

-   1. The beginning of the impact.
-   2. When the incident was brought to our attention, with a link to the source (Freshdesk ticket, slack message, etc).
-   3. When we responded to the incident. This would coincide with the creation of the PagerDuty incident.
-   4. Various debugging actions performed to ascertain the cause of the issue. Talking to yourself as you do this on the
-      slack channel helps a lot here, as it helps communicate your methods to others on the team as well as help improve
-      processes in the future more easily.
+Below are some tips and crucial information that is needed for a useful and thorough incident timeline.

-      Examples here would be things like `Looked at hub logs with "kubectl logs -n temple -l component=hub" and found ` or
-      `Opened the cloud console and discovered notifications about quota`. Pasting in commands is very helpful! This is an
-      important way for team members to learn from each other - what you take for granted is perhaps news to someone else, or
-      you might learn alternate ways of doing things!
-   5. Actions taken to attempt to fix the issue, and their outcome. Paste commands executed if possible, as well as any
-      GitHub PRs made. Putting this in Slack again helps.
-   6. Any extra communication from the community affected that helped.
-   7. Whenever the impact was fixed, and how that was verified.
-   8. Whatever else you think would be helpful to someone who finds this incident report a few months from now, trying to fix a
-      similar incident.
+The timeline should include:

-6. Fill out the "Analysis" section to the extent possible. In particular, the "Action Items" should be a list with items
-   linked out to GitHub issues created for follow-up. Perfection is the enemy of the good here. Save as you go.
+1. The beginning of the impact.
+2. When the incident was brought to our attention, with a link to the source (Freshdesk ticket, slack message, etc).
+3. When we responded to the incident. This would coincide with the creation of the PagerDuty incident.
+4. Various debugging actions performed to ascertain the cause of the issue.
+   Talking to yourself as you do this on the slack channel helps a lot here, as it helps communicate your methods to others on the team as well as help improve
+   processes in the future more easily.

-7. Click "Save & View Report" when you are done, and ask other members of the incident response team to review the incident report.
-   They might add missing context, additional action items / summary details, or redact information. The person listed as
-   the "Owner of the Review Process" is still responsible for making sure the rest of the process is completed.
-
-8. After sufficient review, and if the Incident Commander is happy with its completeness, mark the Status dropdown up top as "Reviewed".
-
-9. Download the PDF, and add it to the `2i2c/infrastrtucture` repository under the `incidents/` directory. This make sure our
-   incidents are all *public*, so others can learn from them as well.
+   For example:
+
+   - `Looked at hub logs with "kubectl logs -n temple -l component=hub" and found `
+   - `Opened the cloud console and discovered notifications about quota`.
+
+   Pasting in commands is very helpful!
+   This is an important way for team members to learn from each other - what you take for granted is perhaps news to someone else, or you might learn alternate ways of doing things!
+5. Actions taken to attempt to fix the issue, and their outcome.
+   Paste commands executed if possible, as well as any GitHub PRs made.
+   If you've already done this in the incident Slack channel you may simply copy/paste text here.
+6. Any extra communication from the community affected that helped.
+7. Whenever the incident was fixed, and how that was verified.
+8. Whatever else you think would be helpful to someone who finds this incident report a few months from now, trying to fix a similar incident.

 ## Handing off Incident Commander status

@@ -193,7 +192,10 @@ To designate another team member as the Incident Commander, follow these steps:
 Incident Report
 Incident Reports
   A document that describes what went wrong during an incident and what we'll do to avoid it in the future. When we have an {term}`Incident`, we create an Incident Report issue.
- This helps us explain what went wrong, and directs actions to avoid the incident in the future. Its goal is to identify improvements to process, technology, and team dynamics that can avoid incidents like this in the future. It is **not** meant to point fingers at anybody and care should be taken to avoid making it seem like any one person is at fault[^post-mortems]. + + This helps us understand what went wrong, and how we can improve our systems to prevent a recurrance. Its goal is to identify improvements to process, technology, and team dynamics that can avoid incidents like this in the future. It is **not** meant to point fingers at anybody and care should be taken to avoid making it seem like any one person is at fault. + + This is a *very important* part of making our infrastructure and human processes more stable and stress-free over time, so we should do this after each incident.[^post-mortems]. ``` [^post-mortems]: See the [Google SRE post-mortem culture](https://sre.google/sre-book/postmortem-culture/) and the [Blameless guide to post-mortems](https://www.blameless.com/sre/what-are-blameless-postmortems-do-they-work-how) for some guidelines. From a49622810e2b9001f2a3c1e312a0df9e9a5b1488 Mon Sep 17 00:00:00 2001 From: YuviPanda Date: Mon, 12 Sep 2022 15:35:55 -0700 Subject: [PATCH 07/16] Use newly created repo for incident reports --- projects/managed-hubs/incidents.md | 15 ++++++++------- 1 file changed, 8 insertions(+), 7 deletions(-) diff --git a/projects/managed-hubs/incidents.md b/projects/managed-hubs/incidents.md index 264571a4..3a9fe33e 100644 --- a/projects/managed-hubs/incidents.md +++ b/projects/managed-hubs/incidents.md @@ -130,7 +130,7 @@ This lets us use notes, status updates from pagerduty as well as messages from S 5. **Add Data Sources** that we will use to keep track of the actions that happened around the incident. 
- Link to the slack channel we created for this incident as a "Data Source", filled in with an appropriate time to cover all the messages there. - Add any other channels where there was conversation there about the incident (e.g., GitHub Issues or Pull Requests). - + Click `Save Data Sources` to populate the timeline below with messages from the slack channels. 6. **Fill out the timeline**. The goal is to be concise but make it possible for someone reading it to answer "what happened, and when?". See [](incidents:postmortem-timeline) for more information. @@ -140,10 +140,11 @@ This lets us use notes, status updates from pagerduty as well as messages from S 8. **Click "Save & View Report"** when you are done, and ask other members of the incident response team to review the incident report. They might add missing context, additional action items / summary details, or redact information. The person listed as the "Owner of the Review Process" is still responsible for making sure the rest of the process is completed. -9. After sufficient review, and if the Incident Commander is happy with its completeness, **mark the Status dropdown as "Reviewed"**. -10. Download the PDF, and add it to the `2i2c/infrastrtucture` repository under the `incidents/` directory. This make sure our incidents are all *public*, so others can learn from them as well. +9. After sufficient review, and if the Incident Commander is happy with its completeness, **mark the Status dropdown as "Reviewed"**. +10. Download the PDF, and add it to the [`2i2c/infrastrtucture`](https://github.com/2i2c-org/incident-reports) repository under + the `reports/` directory. This make sure our incidents are all *public*, so + others can learn from them as well. -% Is there a way to share incidents in a way that doesn't require adding a binary blob to our repository? I think this generates extra toil in a process that already has a lot of toil, and also adds some clunkiness to git-based workflows. 
For example, could we have a public Google Drive folder where we drag/drop incident reports? (incidents:postmortem-timeline)= @@ -161,10 +162,10 @@ The timeline should include: processes in the future more easily. For example: - + - `Looked at hub logs with "kubectl logs -n temple -l component=hub" and found ` - `Opened the cloud console and discovered notifications about quota`. - + Pasting in commands is very helpful! This is an important way for team members to learn from each other - what you take for granted is perhaps news to someone else, or you might learn alternate ways of doing things! 5. Actions taken to attempt to fix the issue, and their outcome. @@ -194,7 +195,7 @@ Incident Reports A document that describes what went wrong during an incident and what we'll do to avoid it in the future. When we have an {term}`Incident`, we create an Incident Report issue. This helps us understand what went wrong, and how we can improve our systems to prevent a recurrance. Its goal is to identify improvements to process, technology, and team dynamics that can avoid incidents like this in the future. It is **not** meant to point fingers at anybody and care should be taken to avoid making it seem like any one person is at fault. - + This is a *very important* part of making our infrastructure and human processes more stable and stress-free over time, so we should do this after each incident.[^post-mortems]. 
``` From 5249fb71a72ed5df6b155e52d9e83919b8137acf Mon Sep 17 00:00:00 2001 From: YuviPanda Date: Tue, 13 Sep 2022 00:15:23 -0700 Subject: [PATCH 08/16] Note when and how incident commander can assign reporting duties --- projects/managed-hubs/incidents.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/projects/managed-hubs/incidents.md b/projects/managed-hubs/incidents.md index 3a9fe33e..d126dc5c 100644 --- a/projects/managed-hubs/incidents.md +++ b/projects/managed-hubs/incidents.md @@ -111,6 +111,8 @@ Here is the process that we follow for incidents: Once the incident is resolved, we must create an {term}`Incident Report`. The **Incident Commander** is responsible for making sure the Incident Report is completed, even though they may not be the person doing it. +If they are *not* the person doing it, they should still creat the incident report, but assign `Owner of the Review Process` +to be someone else (after checking with the other person). See more detailed steps below. We practice a [blameless culture](https://www.blameless.com/sre/what-are-blameless-postmortems-do-they-work-how) around incident reports. Incidents are **always** caused by systemic issues, and hence solutions must be systemic too. From 220bbe7cc12e6fd3aa34fa52658c0c56cef9032b Mon Sep 17 00:00:00 2001 From: YuviPanda Date: Tue, 13 Sep 2022 00:23:56 -0700 Subject: [PATCH 09/16] Update how EL can be delegated --- projects/managed-hubs/incidents.md | 14 +++++++++++++- 1 file changed, 13 insertions(+), 1 deletion(-) diff --git a/projects/managed-hubs/incidents.md b/projects/managed-hubs/incidents.md index d126dc5c..67895f40 100644 --- a/projects/managed-hubs/incidents.md +++ b/projects/managed-hubs/incidents.md @@ -90,7 +90,19 @@ Here is the process that we follow for incidents: 3. **Try resolving the issue** and communicate on the incident-specific channel while you gather information and perform actions - even if only to mark these as notes to yourself. 4. 
**Delegate to Subject Matter Experts as-needed**. The Incident Commander is empowered to delegate actions to Subject Matter Experts in order to investigate and resolve the incident quickly.[^note-on-delegation] -5. **Communicate our status every few hours**. The {term}`External Liason` is expected to communicate incident status and plan with the {term}`Community Representative`s. They should provide periodic updates that describe the current state of the incident, what we have tried, and our intended next steps. Here is a canned response to get started: +5. **Communicate our status every few hours**. The {term}`External Liason` is + expected to communicate incident status and plan with the {term}`Community + Representative`s. If the incident commander wants to delegate External Liason duties + to someone else, they should: + + 1. Assign the *Freshdesk* ticket to the external liason, as that is the default point of + communication with community representatives. + 2. Make a note on the PagerDuty incident as well. + + + The externl liason should provide periodic updates that describe the current + state of the incident, what we have tried, and our intended next steps. 
Here is + a canned response to get started: ```{button-link} https://2i2c.freshdesk.com/a/admin/canned_responses/folders/80000143608/responses/80000247492/edit :color: primary From e122e09594948c8f3316060f965d0be341d93a15 Mon Sep 17 00:00:00 2001 From: YuviPanda Date: Tue, 13 Sep 2022 00:25:52 -0700 Subject: [PATCH 10/16] Update internal comms slack channels --- projects/managed-hubs/incidents.md | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/projects/managed-hubs/incidents.md b/projects/managed-hubs/incidents.md index 67895f40..5e43c939 100644 --- a/projects/managed-hubs/incidents.md +++ b/projects/managed-hubs/incidents.md @@ -56,9 +56,12 @@ Subject Matter Experts (incidents:communications)= ### Internal communication +- A channel *dedicated* to each incident will be created by pagerduty once an incident is created. This is where most of the + discussion about the incident should happen. - [`2i2c-org.pagerduty.com`](https://2i2c-org.pagerduty.com/) is a dashboard for managing incidents. This is the "source of truth" for any active or historical incidents. -- [The `#pagerduty-notifications` Slack channel](https://2i2c.slack.com/archives/C041E05LVHB) is where we control PagerDuty and have discussion about an incident. This allows us to have an easily-accessible communication channel for incidents. In general, most interactions with PagerDuty should be via this channel. +- [The `#pagerduty-notifications` Slack channel](https://2i2c.slack.com/archives/C041E05LVHB) is *primarily* used to trigger + new incidents and control pagerduty in other ways. Discussion of *specific* incidents should not happen here. 
 (support:incident-response)=
 ## Incident response process

From 7ca2a8f961592a3dff764fd8c023a8f0dc1a4f68 Mon Sep 17 00:00:00 2001
From: Chris Holdgraf
Date: Tue, 13 Sep 2022 04:05:57 -0700
Subject: [PATCH 11/16] Apply suggestions from code review

Co-authored-by: Georgiana Elena
---
 projects/managed-hubs/incidents.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/projects/managed-hubs/incidents.md b/projects/managed-hubs/incidents.md
index 5e43c939..99177a2f 100644
--- a/projects/managed-hubs/incidents.md
+++ b/projects/managed-hubs/incidents.md
@@ -126,7 +126,7 @@ Here is the process that we follow for incidents:

 Once the incident is resolved, we must create an {term}`Incident Report`.
 The **Incident Commander** is responsible for making sure the Incident Report is completed, even though they may not be the person doing it.
-If they are *not* the person doing it, they should still creat the incident report, but assign `Owner of the Review Process`
+If they are *not* the person doing it, they should still create the incident report, but assign `Owner of the Review Process`
 to be someone else (after checking with the other person). See more detailed steps below.

 We practice a [blameless culture](https://www.blameless.com/sre/what-are-blameless-postmortems-do-they-work-how) around incident reports.

From 5648f4310cf7efaf5852385cf2b04415392279cc Mon Sep 17 00:00:00 2001
From: Yuvi Panda
Date: Wed, 14 Sep 2022 08:59:47 -0700
Subject: [PATCH 12/16] Apply suggested edits

Co-authored-by: Chris Holdgraf
---
 projects/managed-hubs/incidents.md | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/projects/managed-hubs/incidents.md b/projects/managed-hubs/incidents.md
index 99177a2f..38687ca6 100644
--- a/projects/managed-hubs/incidents.md
+++ b/projects/managed-hubs/incidents.md
@@ -102,7 +102,6 @@ Here is the process that we follow for incidents:
       communication with community representatives.
    2.
Make a note on the PagerDuty incident as well. - The externl liason should provide periodic updates that describe the current state of the incident, what we have tried, and our intended next steps. Here is a canned response to get started: @@ -125,9 +124,10 @@ Here is the process that we follow for incidents: ## Create an Incident Report Once the incident is resolved, we must create an {term}`Incident Report`. -The **Incident Commander** is responsible for making sure the Incident Report is completed, even though they may not be the person doing it. -If they are *not* the person doing it, they should still create the incident report, but assign `Owner of the Review Process` -to be someone else (after checking with the other person). See more detailed steps below. +The **Incident Commander** is responsible for **starting the incident report process**, and **making sure the Incident Report is completed**. +They are not required to fill out all of the information in the report, though they may do so if they wish. +If another person will fill out the report, check with them first and then assign them as `Owner of the Review Process`. +See more detailed steps below. We practice a [blameless culture](https://www.blameless.com/sre/what-are-blameless-postmortems-do-they-work-how) around incident reports. Incidents are **always** caused by systemic issues, and hence solutions must be systemic too. @@ -169,6 +169,7 @@ This lets us use notes, status updates from pagerduty as well as messages from S Below are some tips and crucial information that is needed for a useful and thorough incident timeline. +% TODO: Add example incident reports for reference when they exist The timeline should include: 1. The beginning of the impact. 
From 767bfbfb4b6ba7613b84d8df4683312f1ef7d729 Mon Sep 17 00:00:00 2001
From: YuviPanda
Date: Wed, 14 Sep 2022 18:50:46 -0700
Subject: [PATCH 13/16] Add example of incident reports

---
 projects/managed-hubs/incidents.md | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/projects/managed-hubs/incidents.md b/projects/managed-hubs/incidents.md
index 38687ca6..1ae7b075 100644
--- a/projects/managed-hubs/incidents.md
+++ b/projects/managed-hubs/incidents.md
@@ -167,9 +167,10 @@ This lets us use notes, status updates from pagerduty as well as messages from S
 (incidents:postmortem-timeline)=
 ### Writing an incident timeline

-Below are some tips and crucial information that is needed for a useful and thorough incident timeline.
+Below are some tips and crucial information that is needed for a useful and thorough incident timeline. You can see
+examples of previous incident reports at the [2i2c-org/incident-reports](https://github.com/2i2c-org/incident-reports/tree/main/reports)
+repository.

-% TODO: Add example incident reports for reference when they exist
 The timeline should include:

 1. The beginning of the impact.

From edf52ca3aba52677b8f11a43460322e5b8655346 Mon Sep 17 00:00:00 2001
From: YuviPanda
Date: Fri, 16 Sep 2022 00:55:58 -0700
Subject: [PATCH 14/16] Describe what counts as an incident

---
 projects/managed-hubs/incidents.md | 22 ++++++++++++++++++++++
 projects/managed-hubs/support.md | 14 ++++++--------
 2 files changed, 28 insertions(+), 8 deletions(-)

diff --git a/projects/managed-hubs/incidents.md b/projects/managed-hubs/incidents.md
index 1ae7b075..747ee6f6 100644
--- a/projects/managed-hubs/incidents.md
+++ b/projects/managed-hubs/incidents.md
@@ -45,6 +45,28 @@ Subject Matter Experts
   A member on the {term}`Incident Response Team` with expertise in an area of relevance to an Incident. SMEs have a variety of backgrounds and abilities, and they should be pulled in to the Response Team as-needed by the {term}`Incident Commander`.
Their goal is to take actions as-directed by the {term}`Incident Commander` to resolve an incident. ``` +(incidents:what)= +## What counts as an incident? + +Eventually, we will have more nuanced and complete ways to track different kinds of incidents. However, for +now, we define an incident as one of: + +1. The hub is inaccessible to a number of users (N>=2). Specifically, this manifests in three ways: + a. They can not log in + b. They can not start their servers + c. They can not execute code (no kernels can be started) +3. A number of users (N>=2) cannot create or use Dask Gateway clusters. + +Everything else is considered a support ticket only, not an incident. This *will* change in +the future as our process matures. + +We do not have a limit on the support time we provide related to incidents (as +opposed to Change and Guidance requests, which have a {term}`Support Budget`). + +```{note} +PagerDuty has a 'Severity' field for incidents. We do not use this field currently. +``` + ## Communication channels ### External communication diff --git a/projects/managed-hubs/support.md b/projects/managed-hubs/support.md index af38233b..d0c85900 100644 --- a/projects/managed-hubs/support.md +++ b/projects/managed-hubs/support.md @@ -46,11 +46,9 @@ Support Requests Incident Incidents - An event that significantly degrades the JupyterHub service. Support requests that are related to incidents should be prioritized over all other work items. Here are a few common examples of incidents: + An event that significantly degrades the JupyterHub service. Support requests that are related to incidents should be prioritized over all other work items. - 1. The hub is inaccessible to a number of users. - 2. A number of users are unable to start their servers. - 3. A number of users cannot create Dask Gateway clusters. + [](incidents:what) defines the kind of incidents we respond to via PagerDuty and consider immediate issues to be resolved. 
We do not have a limit on the support time we provide related to incidents (as opposed to Change and Guidance requests, which have a {term}`Support Budget`). @@ -61,12 +59,12 @@ Incidents Change Request Change Requests A request for a desired change to a hub's infrastructure that is not related to an incident. For example: - + - Changing the user's software environment. - Changing the resources available to users. - Updating and deploying changes from upstream tools for a community. - Making an improvement to open source tools to benefit a community. - + Change Requests are generally non-urgent and should not be associated with significant diminished service. They are often things that communities _could_ carry out themselves with the proper guidance and infrastructure setup. We aim to make our hubs as configurable as possible _by the community_ so that we are not on the critical path for things like environment updates. Guidance Request @@ -166,7 +164,7 @@ This process is carried out in an ongoing basis by the {term}`Support Stewards`. (support:non-incident-response)= ### Non-incident response process -1. **Respond within 24 working hours**. Acknowledge receipt of the support request and let the {term}`Community Representative` know about any investigation we have done thus far. +1. **Respond within 24 working hours**. Acknowledge receipt of the support request and let the {term}`Community Representative` know about any investigation we have done thus far. 2. **Spend 30 minutes trying to resolve**. If you believe you can resolve the issue within 30 minutes, try resolving it yourself. 1. If you resolve the issue, then jump to the "Communicate resolution" step. 2. If you don't believe you can resolve the issue in 30 minutes, jump to the next step. @@ -220,7 +218,7 @@ Support Budget :::{note} We currently keep this term intentionally vague, and ask that communities are respectful of our time when making change requests. 
- We are investigating the support budget that we should give to each community, and will update here when we have specific numbers in mind. + We are investigating the support budget that we should give to each community, and will update here when we have specific numbers in mind. Here is a rough idea of the rationale we follow as we identify more specific numbers for support budget: From f299b58db99ed242e0a1908f06fe5ebc1a423393 Mon Sep 17 00:00:00 2001 From: YuviPanda Date: Fri, 16 Sep 2022 13:26:07 -0700 Subject: [PATCH 15/16] Add note about not requiring review when adding incident-reports --- projects/managed-hubs/incidents.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/projects/managed-hubs/incidents.md b/projects/managed-hubs/incidents.md index 747ee6f6..021db492 100644 --- a/projects/managed-hubs/incidents.md +++ b/projects/managed-hubs/incidents.md @@ -182,7 +182,8 @@ This lets us use notes, status updates from pagerduty as well as messages from S 9. After sufficient review, and if the Incident Commander is happy with its completeness, **mark the Status dropdown as "Reviewed"**. 10. Download the PDF, and add it to the [`2i2c/infrastrtucture`](https://github.com/2i2c-org/incident-reports) repository under the `reports/` directory. This make sure our incidents are all *public*, so - others can learn from them as well. + others can learn from them as well. Given review is already completed in the pagerduty interface, you don't need to wait + for review to add the report here. 
From 7d6af899eda2b3fd73b31b7f7bfbb22665fed9e4 Mon Sep 17 00:00:00 2001
From: YuviPanda
Date: Fri, 16 Sep 2022 13:27:42 -0700
Subject: [PATCH 16/16] Add note about emailing the incident report to community rep

---
 projects/managed-hubs/incidents.md | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/projects/managed-hubs/incidents.md b/projects/managed-hubs/incidents.md
index 021db492..2f20161a 100644
--- a/projects/managed-hubs/incidents.md
+++ b/projects/managed-hubs/incidents.md
@@ -184,6 +184,8 @@ This lets us use notes, status updates from pagerduty as well as messages from S
     the `reports/` directory. This make sure our incidents are all *public*, so
     others can learn from them as well. Given review is already completed in the pagerduty interface, you don't need to wait
     for review to add the report here.
+11. Email a link to the incident report to the community representative, ideally via the Freshdesk ticket used to communicate with
+    them during the incident itself.