Add incident commander role + more steps to support process #422

Merged
merged 17 commits into from
Jun 27, 2022
25 changes: 25 additions & 0 deletions practices/development.md
@@ -9,6 +9,31 @@ This section describes how our development team carries out its planning and day
👉 [Here's a link to see all Pull Requests for which your review is requested](https://github.com/issues?q=is%3Aopen+archived%3Afalse+sort%3Aupdated-desc+user%3A2i2c-org+type%3Apr+review-requested%3A%40me)
:::

## Roles and team structure

We use the following roles to help us understand responsibilities and expectations around developing and operating our infrastructure.
For roles that are related to more specific actions like support and incidents, see [our managed service documentation](service:index).

(roles:project-manager)=
### Project Manager

We are piloting the use of a dedicated Project Manager to help our team plan and coordinate with one another.
See [this GitHub issue](https://github.com/2i2c-org/team-compass/issues/398) for our plans and experience with this pilot thus far.

(roles:hub-engineer)=
### Hub Engineer

The job of a Hub Engineer is to develop and operate deployment infrastructure for a hub, and to perform major upgrades or improvements to resolve issues that cannot be solved by a [Hub Administrator](roles:hub-administrator).
Hub engineers regularly work on the JupyterHub infrastructure and provide open source development for the technology that powers each hub.
People in these roles are generally affiliated with 2i2c.

#### Responsibilities

- Respond to support requests from the Community Representative(s)
- Perform major upgrades on hub infrastructure
- Debug and resolve major issues with a hub that require intervention from a Hub Engineer
- Perform open source development on technologies that are in use by the hubs

(coordination:sprints)=
## Team Sprints

120 changes: 120 additions & 0 deletions projects/managed-hubs/incidents.md
@@ -0,0 +1,120 @@
# Incident response


When an {term}`Incident` is declared, we trigger a special response in order to ensure that it is resolved quickly.
This section describes our incident response process, major roles and terminology, and what to expect.[^incident-refs]

[^incident-refs]: The [PagerDuty Incident Response Guide](https://response.pagerduty.com/) and the [Google SRE Incident response guide](https://sre.google/workbook/incident-response/) inspired much of the content on this page.

:::{admonition} In Beta!
:class: warning
We are currently working out our Incident Response process.
The content on this page might change over time, and we welcome suggested changes and pull requests!
:::

## Roles and team structure

An {term}`Incident Response Team` is formed when an {term}`Incident` has been declared.
The goal of the Incident Response Team is to collectively resolve incidents.

An Incident Response Team is generally made up of:

- An {term}`Incident Commander`
- The {term}`Support Steward`s
- One or more {term}`Subject Matter Expert`s (SMEs)

```{glossary}
Incident Response Team
The group of roles that collectively understand, plan, resolve, and communicate our actions around an {term}`Incident`. The people in these roles may change in a fluid manner, and one person may serve in multiple roles. A rough way to approximate this team is "the people that have communicated in internal and external channels to resolve an incident."

Incident Commander
The Incident Commander has the authority to plan and delegate action to others on the {term}`Incident Response Team`. They are **not expected** to take actions themselves. Their goal is to help the team make consistent and deliberate progress towards resolving an incident. They are the {term}`Source of Truth` about the current state and action plan surrounding an incident.

Subject Matter Expert
A member of the {term}`Incident Response Team` with expertise in an area of relevance to an Incident. SMEs have a variety of backgrounds and abilities, and they should be pulled in to the Response Team as needed by the {term}`Incident Commander`. Their goal is to take actions as directed by the {term}`Incident Commander` to resolve an incident.
```

## Communication channels

### External communication

- The {term}`Support Steward` team acts as the primary point of communication with external stakeholders like the {term}`Community Representative`s.
Member

IMO, specifically for incidents, this should be optional. This adds an additional load for the support steward and the incident commander (if they are different people). The IC can request the support steward to act as this if necessary, but otherwise I think we should not require this of the support steward. They should be able to delegate an incident to the incident commander, and then by default continue with their existing role. I think if the IC is the source of truth, they should by default be the person who communicates too - otherwise we're adding an entirely new person to this chain, and that is often extremely frustrating during an incident process for everyone involved.

So my suggestion is that the IC can ask someone else to be the point of contact (support steward or someone else) if they so choose to, but that is not the default.

Member Author

Your reasoning makes sense to me - one concern I have is that the incident commander needs to spend their cycles on resolving the incident, not necessarily also communicating it externally. Maybe the answer is to say the incident commander does this by default, but if they must log off or are otherwise overwhelmed, they should delegate external communication to another team member?

Member

Exactly right! The default is they do it but if in their judgement delegating it is the right thing - if the extra communication overhead is worth it (as it often is) they can. They just delegate it to someone, not necessarily the support steward.

Contributor

As long as the IC introduces themselves (or is introduced by the support steward) to the ticket submitter, I think it makes sense NOW to make this optional. I also like the idea of delegating the communication to someone else who is not necessarily the support steward (because we do not have a lot of people in support to address other tickets).

Why do I highlight the word NOW? Because when we get a dedicated support team, there should be a clear separation of boxes, IMHO. The support team should be handling the communication with the ticket submitters because they are trained and specialized to interact with people under stress who are looking for answers. An IC coming from the eng team is not well prepared for that interaction... and that might be a source of issues.

Member Author

yeah - thinking back to the Incident Commander docs from PagerDuty, I think this is the "external liaison" role. They also have a dedicated person/team to do that work. In our case, I think there are two things I worry about there:

  • We don't have the staffing capacity for this now, but maybe this will change in the future as @damianavila suggests.
  • If people are not awake at the same time, we pay a big communication penalty when we have bottlenecks of information. If one person must be the one to communicate externally, and that person just went to sleep, then it means no communication can occur until they return to work. This feels like a stressful situation given that we don't have the staffing to ensure seamless handoffs between time-zones all the time.

Contributor

Yep, this is why I think this is fine for now but we should change it in the future when we have enough capacity.

Member Author

OK I've updated this so that the Incident Commander is the communicator by default, and may delegate if they wish.

- We may interact with external stakeholders via comments in Incident Response issues if it helps resolve the incident more quickly.

### Internal communication

- The Slack channel [{guilabel}`#support-freshdesk`](https://2i2c.slack.com/archives/C028WU9PFBN) contains real-time communication about support issues. Use this to signal-boost support requests related to {term}`Incident`s.
- [Issues with the {guilabel}`incident` label](https://github.com/2i2c-org/infrastructure/issues?q=is%3Aopen+label%3A%22type%3A+Hub+Incident%22+sort%3Aupdated-desc) are where we track progress when [resolving incidents](support:incident-response).


(support:incident-response)=
## Incident response process

Incidents are a special kind of support ticket, because they are related to degraded service that immediately impacts communities.
We prioritize the resolution of incidents above all other kinds of work, and have a special process for tracking conversation and progress with them.

Here is the process that we follow for incidents:

1. **Acknowledge the incident**. Communicate with the Community Representative that there is an incident. Here is a template to get started:

```
Hello { NAME }, we have investigated this request and have concluded that
it is related to an incident that is causing diminished service for your
community.

We believe that this incident is related to { CONTEXT HERE } and will
investigate further on next actions. Information about our incident
response process can be found [in our team support documentation](https://team-compass.2i2c.org/en/latest/projects/managed-hubs/support.html).

We will open an incident report issue in [our infrastructure repository](https://github.com/2i2c-org/infrastructure)
where you can track progress if you wish.

We'll prioritize resolving this incident over our other work, and
will communicate with you throughout our attempts to resolve it.
We might be in touch with requests for clarifications if needed.
```
2. **Open an incident issue**.
For each {term}`Incident` we create a dedicated issue to track its progress. [{bdg-primary}`open an incident issue`](https://github.com/2i2c-org/infrastructure/issues/new?assignees=&labels=type%3A+Hub+Incident%2Csupport&template=3_incident-report.md&title=%5BIncident%5D+%7B%7B+TITLE+%7D%7D) and notify our engineering team via Slack.
3. **Try resolving the issue** and take notes while you gather information about it.
4. **If after 30 minutes the issue is not solved or you know you cannot resolve it**, ping our engineering team and our Project Manager in the {guilabel}`#support-freshdesk` channel so that they are aware of the incident.
Contributor

From the PM's perspective, the incident issue should be added to the sprint (cycle) and team backlogs, and the PM should do that.

Member Author

have added that in!

5. **Designate an {term}`Incident Commander`**. If the Support Steward wishes to designate someone other than themselves as Incident Commander, do this in the Incident issue.
Member

i like this.

Contributor

So, this line implies the support steward has the "power" to designate the IC. How will that power be exercised?
I think there should be some known expectations/details around this. For instance, can the designation be rejected, and what do we do if that happens?

Member Author

This is a good point - also important to think about the power dynamics here. If the support steward is a brand new team member with less experience than others, they may not feel comfortable just "delegating" to somebody else.

Is this a role that the Project Manager could play? I recall from 2i2c-org/infrastructure#1068 that one of the case studies there used a workflow like:

  • Support person tries to resolve themselves first
  • If they can't, they open an issue about this and discuss with the team manager (in our case, I think this would be the project manager)
  • Team manager then routes that work item to somebody else on the team.
  • Or if it is more complex, they discuss in their next team standup (I believe it is daily for them) and somebody is assigned to that work item out of that meeting

Contributor

> Is this a role that the Project Manager could play?

Yep, probably.
I would suggest still keeping the power to designate in the support steward's hands for the sake of simplicity and quickness... but putting the tie-breaker "power" in the PM's hands if some conflict arises.
I would also encourage the support steward to have a conversation and agreement with the future IC before the designation actually happens.

Member Author

I'll add a note about asking the Project Manager

6. **Investigate and resolve the incident**. The Incident Commander should follow the structure of the incident issue opened in the step above.
7. **Delegate to Subject Matter Experts as-needed**. The Incident Commander is empowered to delegate actions to Subject Matter Experts in order to investigate and resolve the incident quickly.
Contributor

I guess my previous question applies to this line as well 😉.

8. **Communicate every few hours**. The {term}`Incident Commander` is expected to communicate incident status and plan with the {term}`Support Steward`s, and the Support Stewards are expected to communicate to the {term}`Community Representative`s. They should provide periodic updates to communities as we attempt to resolve the incident. Here is a template to get started:
Member

See comment above about removing support steward from the line of fire here by default. I think it is especially important for us, given our diverse timezones, that we reduce communication overhead by default.


```
Hello { NAME }, this is a quick update on our progress resolving
your incident.

We believe the problem is { XXX } and are investigating { YYY }
to resolve it. We will keep you updated as we continue to make progress.
Please let us know if you have had more reports of issues,
or reports that your issues have gone away.
```
9. **Communicate when the incident is resolved**. When we believe the incident is resolved, communicate with the Community Representative that things should be back to normal. Mark the FreshDesk ticket as {guilabel}`Resolved`.
10. **Fill in the {term}`Incident Report`**. The Incident Commander should do this in partnership with the Incident Response Team.
11. **Close the incident ticket**. Once we have confirmation from the community (or no response after 48 working hours), and have filled in the {term}`Incident Report`, then close the incident by:
- Closing the incident issue on GitHub
- Marking the FreshDesk ticket as {guilabel}`Closed`
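
As a rough illustration of steps 2 and 4 above, here is a minimal Python sketch. It rebuilds the pre-filled "open an incident issue" link from its query parameters and checks the 30-minute escalation rule. The repository URL, labels, and template name are copied from the link in step 2; the function and constant names (`incident_issue_url`, `should_escalate`, `ESCALATION_THRESHOLD`) are hypothetical and not part of any 2i2c tooling.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode

# Copied from the "open an incident issue" link in step 2.
NEW_ISSUE_URL = "https://github.com/2i2c-org/infrastructure/issues/new"

# Step 4: escalate if the incident is still unresolved after 30 minutes.
ESCALATION_THRESHOLD = timedelta(minutes=30)


def incident_issue_url(title: str) -> str:
    """Build a pre-filled new-issue URL for an incident (hypothetical helper)."""
    params = {
        "assignees": "",
        "labels": "type: Hub Incident,support",
        "template": "3_incident-report.md",
        "title": f"[Incident] {title}",
    }
    # urlencode percent-encodes reserved characters and turns spaces into "+".
    return f"{NEW_ISSUE_URL}?{urlencode(params)}"


def should_escalate(opened_at: datetime, now: datetime, resolved: bool = False) -> bool:
    """Return True when an unresolved incident has been open for 30 minutes or more."""
    return not resolved and (now - opened_at) >= ESCALATION_THRESHOLD


if __name__ == "__main__":
    opened = datetime(2022, 6, 27, 12, 0, tzinfo=timezone.utc)
    print(incident_issue_url("Hub pods failing to start"))
    print(should_escalate(opened, opened + timedelta(minutes=45)))
```

This only covers the mechanical parts of the process. Deciding earlier that you cannot resolve the incident (also part of step 4) remains a judgment call for the person handling it.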

## Handing off Incident Commander status

During an incident, it may be necessary to designate another person to be the Incident Commander.
For example, this may happen if it is getting late in the current IC's time zone, if they feel burnt out from leading the incident response, or if there is someone with better visibility or experience to be the Incident Commander.
This is encouraged and expected, especially for more complex or longer incidents!

To designate another team member as the Incident Commander, follow these steps:

1. **Confirm with them** that they are able and willing to serve as the Incident Commander
Contributor

Suggested change
1. **Confirm with them** that they are able and willing to serve as the Incident Commander
1. **Confirm with them** that they are able and willing to serve as the Incident Commander.

2. **Update the Incident Report issue** by updating the Incident Commander name in the top comment.
3. **Notify the team** with a comment in the Incident Report issue.

## Key terms

```{glossary}
Incident Report
A document that describes what went wrong during an incident and what we'll do to avoid it in the future. When we have an {term}`Incident`, we create an Incident Report issue.
This helps us explain what went wrong, and directs actions to avoid the incident in the future. Its goal is to identify improvements to process, technology, and team dynamics that can avoid incidents like this in the future. It is **not** meant to point fingers at anybody and care should be taken to avoid making it seem like any one person is at fault[^post-mortems].
```

[^post-mortems]: See the [Google SRE post-mortem culture](https://sre.google/sre-book/postmortem-culture/) and the [Blameless guide to post-mortems](https://www.blameless.com/sre/what-are-blameless-postmortems-do-they-work-how) for some guidelines.
3 changes: 2 additions & 1 deletion projects/managed-hubs/index.md
@@ -1,3 +1,4 @@
(service:index)=
# Service for Managed JupyterHubs

The Managed JupyterHub Service is a special project that is run by 2i2c.
@@ -38,5 +39,5 @@ We keep a table with all of our currently-running JupyterHubs at this location:
pricing.md
sales.md
support.md
roles.md
incidents.md
```
76 changes: 0 additions & 76 deletions projects/managed-hubs/roles.md

This file was deleted.
