2i2c-org · yuvipanda · Oct 4, 2022 · Sep 8, 2022
diff --git a/projects/managed-hubs/incidents.md b/projects/managed-hubs/incidents.md
@@ -75,34 +75,106 @@ Here is the process that we follow for incidents:
    Incident first response template
    ```
 
-2. **Open an incident issue**.
-   For each {term}`Incident` we create a dedicated issue to track its progress. [{bdg-primary}`open an incident issue`](https://github.com/2i2c-org/infrastructure/issues/new?assignees=&labels=type%3A+Hub+Incident%2Csupport&template=3_incident-report.md&title=%5BIncident%5D+%7B%7B+TITLE+%7D%7D) and notify our engineering team via Slack.
-3. **Try resolving the issue** and take notes while you gather information about it.
-4. **If after 30 minutes the issue is not solved or you know you cannot resolve it**
-  - Ping our engineering team and our Project Manager in the {guilabel}`#support-freshdesk` channel so that they are aware of the incident.
-  - Add the incident issue to [our team backlog](https://github.com/orgs/2i2c-org/projects/22/).
-5. **Designate an {term}`Incident Commander`**. Do this in the Incident issue. By default, this is the Support Steward.
-  - Confirm that the Incident Commander has the bandwidth and ability to serve in this role.
-  - If not, delegate this to another team member.[^note-on-delegation]
-6. **Designate an {term}`External Liason`**. Do this in the Incident issue. By default, this is the Incident Commander, though they may delegate this to others.[^note-on-delegation]
-7. **Investigate and resolve the incident**. The Incident Commander should follow the structure of the incident issue opened in the step above.
-8. **Delegate to Subject Matter Experts as-needed**. The Incident Commander is empowered to delegate actions to Subject Matter Experts in order to investigate and resolve the incident quickly.[^note-on-delegation]
-9. **Communicate our status every few hours**. The {term}`External Liason` is expected to communicate incident status and plan with the {term}`Community Representative`s. They should provide periodic updates that describe the current state of the incident, what we have tried, and our intended next steps. Here is a canned response to get started:
+2. **Trigger an incident in PagerDuty** from the 2i2c Slack, so we have a central location to discuss the incident.
+   Run `/pd trigger` in the {guilabel}`#pagerduty-notifications` channel on the 2i2c Slack. After you type the command
+   and press `Enter`, a dialog box with options should appear.
+
+   For "Impacted Service", select "Managed JupyterHubs". We can have more fine-grained services here later if we wish.
+
+   Assign it to whoever is the **Incident Commander**. By default this is one of the support stewards or whoever is
+   triggering the incident, but it does not have to be.[^note-on-delegation]
+
+   Provide a short, descriptive Title, but don't sweat it too much!
+
+   If there is a FreshDesk ticket for this incident, link to it in the description.
+
+   Check the box for "Create a dedicated Public Slack channel for this incident" to create a *new Slack channel*
+   for discussing the incident. This keeps chatter off other channels *and* provides an easy place to gather
+   information for the incident report after the fact.
+
+   Triggering the PagerDuty incident officially marks the beginning of an incident, and helps make sure we don't
+   accidentally miss steps during or after the incident.
+
+3. **Try resolving the issue** and communicate on the incident-specific channel while you gather information and take
+   actions, even if only as notes to yourself.
+4. **Designate an {term}`External Liason`**. Do this in the Incident issue. By default, this is the Incident Commander, though they may delegate this to others.[^note-on-delegation]
+5. **Delegate to Subject Matter Experts as-needed**. The Incident Commander is empowered to delegate actions to Subject Matter Experts in order to investigate and resolve the incident quickly.[^note-on-delegation]
+6. **Communicate our status every few hours**. The {term}`External Liason` is expected to communicate incident status and plan with the {term}`Community Representative`s. They should provide periodic updates that describe the current state of the incident, what we have tried, and our intended next steps. Here is a canned response to get started:
 
    ```{button-link} https://2i2c.freshdesk.com/a/admin/canned_responses/folders/80000143608/responses/80000247492/edit
    :color: primary
 
    Incident update template
    ```
 
-9. **Communicate when the incident is resolved**. When we believe the incident is resolved, communicate with the Community Representative that things should be back to normal. Mark the FreshDesk ticket as {guilabel}`Resolved`.
-10. **Fill in the {term}`Incident Report`**. The Incident Commander should do this in partnership with the Incident Response Team.
-11. **Close the incident ticket**. Once we have confirmation from the community (or no response after 48 working hours), and have filled in the incident {term}`Incident Report`, then close the incident by:
-    - Closing the incident issue on GitHub
-    - Marking the FreshDesk ticket as {guilabel}`Closed`
+7. **Communicate when the incident is resolved**. When we believe the incident
+   is resolved, tell the Community Representative that things should be
+   back to normal. Then close out the incident by:
+   - Marking the incident as "Resolved" in PagerDuty.
+   - Marking the FreshDesk ticket as {guilabel}`Closed`.
 
 [^note-on-delegation]: If you cannot find somebody to take on this work, or feel uncomfortable delegating, the {term}`Project Manager` should help you, and is empowered to delegate on your behalf.
 
+## Creating the Incident Report
+
+Once the incident is resolved, we must create an {term}`Incident Report`. This helps us understand what went wrong,
+and how we can improve our systems to prevent a recurrence. This is a *very important* part of making our infrastructure
+and human processes more stable and stress-free over time, so we should try to do this after each incident. The
+**Incident Commander** is responsible for making sure the Incident Report is done, even if they are not the
+person writing it.
+
+Note that we *must* practice a [blameless culture](https://www.blameless.com/sre/what-are-blameless-postmortems-do-they-work-how)
+around incident reports. Incidents are *always* caused by systemic issues, and hence solutions must be systemic
+too. Go out of your way to make sure there is no finger-pointing.
+
+We use PagerDuty's [postmortem](https://support.pagerduty.com/docs/postmortems) feature to create the Incident Report.
+This lets us easily pull notes and status updates from PagerDuty, as well as messages from Slack, into the incident report!
+
+1. Open the incident in the PagerDuty web interface, and click the "New Postmortem Report" button at the top. The incident
+   must already be resolved for this feature to be available.
+
+2. The "Owner of the Review Process" should be set to the Incident Commander, or someone else they delegate to explicitly.
+
+3. Set the "Impact Start Time" to our best guess for when the incident started (*not* when the report came in), and
+   the "Impact End Time" to when service was restored. Best guesses will do!
+
+4. Add the Slack channel we created for this incident as a "Data Source", with a time range that covers all
+   the messages there. You can add other channels too if the incident was discussed there. Click "Save Data Sources"
+   to populate the timeline below with messages from the Slack channels.
+
+5. Fill out the timeline! The goal is to be concise but make it possible for someone reading it to answer "what happened, and when?".
+   The timeline should include:
+
+   1. The beginning of the impact.
+   2. When the incident was brought to our attention, with a link to the source (Freshdesk ticket, slack message, etc).
+   3. When we responded to the incident. This would coincide with the creation of the PagerDuty incident.
+   4. Various debugging actions performed to ascertain the cause of the issue. Narrating what you do in the
+      Slack channel helps a lot here: it communicates your methods to others on the team, and makes it easier to improve
+      processes in the future.
+
+      Examples here would be things like `Looked at hub logs with "kubectl logs -n temple -l component=hub" and found <this>` or
+      `Opened the cloud console and discovered notifications about quota`. Pasting in commands is very helpful! This is an
+      important way for team members to learn from each other: what you take for granted is perhaps news to someone else, or
+      you might learn alternate ways of doing things!
+   5. Actions taken to attempt to fix the issue, and their outcome. Paste commands executed if possible, as well as any
+      GitHub PRs made. Putting this in Slack again helps.
+   6. Any extra communication from the community affected that helped.
+   7. When the impact was fixed, and how that was verified.
+   8. Whatever else you think would be helpful to someone who finds this incident report a few months from now, trying to fix a
+      similar incident.
+
+6. Fill out the "Analysis" section to the extent possible. In particular, the "Action Items" should be a list with items
+   linked out to GitHub issues created for follow-up. Perfection is the enemy of the good here. Save as you go.
+
+7. Click "Save & View Report" when you are done, and ask other members of the incident response team to review the incident report.
+   They might add missing context, additional action items / summary details, or redact information. The person listed as
+   the "Owner of the Review Process" is still responsible for making sure the rest of the process is completed.
+
+8. After sufficient review, and if the Incident Commander is happy with its completeness, set the Status dropdown at the top to "Reviewed".
+
+9. Download the PDF, and add it to the `2i2c-org/infrastructure` repository under the `incidents/` directory. This makes sure our
+   incidents are all *public*, so others can learn from them as well.
+
 ## Handing off Incident Commander status
 
 During an incident, it may be necessary to designate another person to be the Incident Commander.
@@ -112,8 +184,8 @@ This is encouraged and expected, especially for more complex or longer incidents
 To designate another team member as the Incident Commander, follow these steps:
 
 1. **Confirm with them** that they are able and willing to serve as the Incident Commander.
-2. **Update the Incident Report issue** by updating the Incident Commander name in the top comment.
-3. **Notify the team** with a comment in the Incident Report issue.
+2. **Reassign the incident on PagerDuty** to the new commander. This should produce a message in the Slack channel for this incident,
+   thus communicating the change to the rest of the team.
 
 ## Key terms