Add incident commander role + more steps to support process #422

Merged 17 commits on Jun 27, 2022
4 changes: 4 additions & 0 deletions _static/custom.css
@@ -0,0 +1,4 @@
/* When we have glossaries with multiple items, only display the first */
dl.glossary dt + dt {
  display: none;
}
4 changes: 2 additions & 2 deletions about/strategy.md
@@ -39,15 +39,15 @@ As a start, we wish to launch a **Managed JupyterHubs** service, and will begin
- 2i2c manages JupyterHubs for at least two institutions.
- 2i2c manages more lightweight, community-specific JupyterHubs for several smaller groups in research and education.
- 2i2c manages a "generic" JupyterHub that is not tied to any single institution or group.
- 2i2c has a beta-level business model for the first iteration of our Managed JupyterHub service.
- 2i2c has a beta-level sustainability model for the first iteration of our Collaborative JupyterHub Service.
Review comment (Member):

IMO Managed is a more straightforward and broadly accepted term for what we do, while Collaborative is generally overloaded to the point of being a buzzword. I'd suggest we keep calling it 'Managed'

- 2i2c has built relationships with cloud providers that facilitate our ability to serve these hubs to our users.

## Launch major collaborations

**Rationale:**

2i2c wishes to conduct focused development in collaboration with others in research and education.
We wish to engage in several major projects that will support infrastructure that aligns with our mission, and that also feeds into our Managed JupyterHub service.
We wish to engage in several major projects that will support infrastructure that aligns with our mission, and that also feeds into our Collaborative JupyterHub Service.

**Objectives:**

2 changes: 1 addition & 1 deletion code-of-conduct/index.md
@@ -34,7 +34,7 @@ In addition, the 2i2c community and experience often extends outside those space

The 2i2c Code of Conduct does not apply to interactions between users of a Managed JupyterHub, though we encourage leaders in those communities to adopt a Code of Conduct for their hub infrastructure. The Code of Conduct does apply to any interaction between a user of a Managed JupyterHub and a 2i2c Team Member.

:::\{important}
:::{important}
When in doubt, please [report unacceptable behavior](coc:reporting) to us. If someone’s behavior outside of a 2i2c space makes you feel unsafe at 2i2c, that is absolutely relevant and actionable for us.
:::

28 changes: 28 additions & 0 deletions practices/development.md
@@ -9,6 +9,34 @@ This section describes how our development team carries out its planning and day
👉 [Here's a link to see all Pull Requests for which your review is requested](https://github.com/issues?q=is%3Aopen+archived%3Afalse+sort%3Aupdated-desc+user%3A2i2c-org+type%3Apr+review-requested%3A%40me)
:::

## Roles and team structure

We use the following roles to help us understand responsibilities and expectations around developing and operating our infrastructure.
For roles that are related to more specific actions like support and incidents, see [our managed service documentation](service:index).

```{glossary}

Project Manager
Project Managers
We are piloting the use of a dedicated Project Manager to help our team plan and coordinate with one another.
See [this GitHub issue](https://github.com/2i2c-org/team-compass/issues/398) for our plans and experience with this pilot thus far.

Hub Engineer
Hub Engineers

The job of a Hub Engineer is to develop and operate deployment infrastructure for a hub, and to perform major upgrades or improvements to resolve issues that cannot be solved by a {term}`Hub Administrator`.
Hub Engineers regularly work on the JupyterHub infrastructure and provide open source development for the technology that powers each hub.
People in these roles are generally affiliated with 2i2c.


**Responsibilities**

- Respond to support requests from the Community Representative(s)
- Perform major upgrades on hub infrastructure
- Debug and resolve major issues with a hub that require intervention from a Hub Engineer
- Perform open source development on technologies that are in use by the hubs
```

(coordination:sprints)=
## Team Sprints

4 changes: 2 additions & 2 deletions practices/expectations.md
@@ -39,8 +39,8 @@ You are expected to participate in this workflow with other members of the team.

If you're an engineer at 2i2c, plan to do a combination of these things:

1. **Development for the [Managed JupyterHub Service](../projects/managed-hubs/index.md).** This is a collection of JupyterHubs that we run for various customers/communities in research/education. We are constantly updating and improving this infrastructure!
2. **Operation for the [Managed JupyterHub Service](../projects/managed-hubs/index.md).** In addition to developing the infrastructure, we also operate it and provide support. We try to divide this work so that we share the responsibility of operation and support.
1. **Development for the [Collaborative JupyterHub Service](../projects/managed-hubs/index.md).** This is a collection of JupyterHubs that we run for various customers/communities in research/education. We are constantly updating and improving this infrastructure!
2. **Operation for the [Collaborative JupyterHub Service](../projects/managed-hubs/index.md).** In addition to developing the infrastructure, we also operate it and provide support. We try to divide this work so that we share the responsibility of operation and support.
3. **Focused development in collaboration with research and education communities.** In addition to running infrastructure, we also collaborate around specific projects, usually involving tools in the Jupyter ecosystem. You may work with researchers on improving their tool, or on building something new for another group's project.
4. **Outreach, teaching, and community-building.** In addition to technical work, we also encourage 2i2c team members to make interactive computing more accessible and powerful for others via outreach, particularly for under-served communities.
5. **Open source development and support.** We run all of our infrastructure on open source tools - usually from communities for which we are not the sole leader. It is crucial that we participate in these communities and support them in addition to doing 2i2c-specific work.
4 changes: 2 additions & 2 deletions projects/index.md
@@ -32,14 +32,14 @@ The [organizational foundation project](https://github.com/orgs/2i2c-org/project

## Managed Hub Service launch

The Managed JupyterHub Service is a special project of 2i2c that is [developed and described at this website](managed-hubs/index.md).
The Collaborative JupyterHub Service is a special project of 2i2c that is [developed and described at this website](managed-hubs/index.md).

:::{admonition} Tracking deliverables
- [the high level goals and strategy are described here](https://docs.2i2c.org/en/latest/about/strategy/index.html)
- [the roadmap for the service is described here](https://docs.2i2c.org/en/latest/about/strategy/roadmap.html)
:::

More information about the Managed JupyterHub Service can be found in these sections:
More information about the Collaborative JupyterHub Service can be found in these sections:

(projects:jmte-pangeo)=
## Pangeo Hub Infrastructure development
127 changes: 127 additions & 0 deletions projects/managed-hubs/incidents.md
@@ -0,0 +1,127 @@
# Incident response


When an {term}`Incident` is declared, we trigger a special response in order to ensure that it is resolved quickly.
This section describes our incident response process, major roles and terminology, and what to expect.[^pager-duty][^google-sre][^acm-blog][^wikimedia-clinic-duty]

[^pager-duty]: The [PagerDuty Incident Response Guide](https://response.pagerduty.com/) is a good description of the Incident Commander role and how it relates to similar roles.

[^google-sre]: The [Google SRE Incident response guide](https://sre.google/workbook/incident-response/) has a wealth of information about incident response and distributed SRE teams.

[^acm-blog]: [This ACM blog post](https://queue.acm.org/detail.cfm?id=3380779) describes the complexity of coordinating across a team of distributed responders during an incident, and notes places where Incident Commander roles may actually hinder responsiveness. It is a good lesson in the complexity of incidents with distributed teams!

[^wikimedia-clinic-duty]: The [Wikimedia Clinic Duty](https://wikitech.wikimedia.org/wiki/SRE/Clinic_Duty#Responsibilities) process also inspired our process here, and is a great overall workflow around distributed SRE.

:::{admonition} In Beta!
:class: warning
We are currently working out our Incident Response process.
The content on this page might change over time, and we welcome suggested changes and pull requests!
:::

## Roles and team structure

An {term}`Incident Response Team` is formed when an {term}`Incident` has been declared.
The goal of the Incident Response Team is to collectively resolve incidents.

An Incident Response Team is generally made up of:

- An {term}`Incident Commander`
- The {term}`Support Stewards`
- One or more {term}`Subject Matter Experts` (SMEs)

```{glossary}
Incident Response Team
The group of roles that collectively understand, plan, resolve, and communicate our actions around an {term}`Incident`. The people in these roles may change in a fluid manner, and one person may serve in multiple roles. A rough way to approximate this team is "the people that have communicated in internal and external channels to resolve an incident."

Incident Commander
The Incident Commander has the authority to plan and delegate action to others on the {term}`Incident Response Team`. They are **not expected** to take actions themselves. Their goal is to help the team make consistent and deliberate progress towards resolving an incident. They are the {term}`Source of Truth` about the current state and action plan surrounding an incident.

External Liason
External Liasons
The person who is responsible for communicating with external stakeholders during an incident. This is either the {term}`Incident Commander`, or somebody to whom they delegate this role. Every few working hours, they should communicate the status of the incident, updates about our current thinking and what we have tried, and any expected upcoming changes.

Subject Matter Expert
Subject Matter Experts
A member of the {term}`Incident Response Team` with expertise in an area relevant to an Incident. SMEs have a variety of backgrounds and abilities, and they should be pulled into the Response Team as needed by the {term}`Incident Commander`. Their goal is to take actions as directed by the {term}`Incident Commander` to resolve an incident.
```

## Communication channels

### External communication

- The {term}`Incident Commander` acts as the primary point of communication with external stakeholders like the {term}`Community Representative`s.
- They may **delegate** this responsibility to another team member if they wish (e.g., to the {term}`Support Steward` team).
- We may interact with external stakeholders via comments in Incident Response issues if it helps resolve the incident more quickly.

### Internal communication

- The Slack channel [{guilabel}`#support-freshdesk`](https://2i2c.slack.com/archives/C028WU9PFBN) contains real-time communication about support issues. Use this to signal-boost support requests related to {term}`Incidents`.
- [Issues with the {guilabel}`incident` label](https://github.com/2i2c-org/infrastructure/issues?q=is%3Aopen+label%3A%22type%3A+Hub+Incident%22+sort%3Aupdated-desc) are where we track progress when [resolving incidents](support:incident-response).
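
The issue link above filters by a search query that is URL-encoded, which makes it hard to see which label it uses. As a small illustration (the helper name `search_query` is ours and is not part of any 2i2c tooling), the query can be decoded with Python's standard library:

```python
from urllib.parse import parse_qs, urlsplit

# URL copied from the bullet above.
INCIDENT_ISSUES_URL = (
    "https://github.com/2i2c-org/infrastructure/issues"
    "?q=is%3Aopen+label%3A%22type%3A+Hub+Incident%22+sort%3Aupdated-desc"
)


def search_query(url: str) -> str:
    """Return the decoded `q` search parameter from an issues URL."""
    # parse_qs decodes percent-escapes and turns "+" into spaces.
    return parse_qs(urlsplit(url).query)["q"][0]


print(search_query(INCIDENT_ISSUES_URL))
# is:open label:"type: Hub Incident" sort:updated-desc
```

The decoded query shows that the filter matches open issues carrying the `type: Hub Incident` label, sorted by most recent update.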


(support:incident-response)=
## Incident response process

Incidents are a special kind of support ticket, because they are related to degraded service that immediately impacts communities.
We prioritize the resolution of incidents above all other kinds of work, and have a special process for tracking conversation and progress with them.

Here is the process that we follow for incidents:

1. **Acknowledge the incident**. Communicate with the Community Representative that there is an incident. Use this canned response as a start for responding:

```{button-link} https://2i2c.freshdesk.com/a/admin/canned_responses/folders/80000143608/responses/80000247490/edit
:color: primary

Incident first response template
```

2. **Open an incident issue**.
For each {term}`Incident` we create a dedicated issue to track its progress. [{bdg-primary}`open an incident issue`](https://github.com/2i2c-org/infrastructure/issues/new?assignees=&labels=type%3A+Hub+Incident%2Csupport&template=3_incident-report.md&title=%5BIncident%5D+%7B%7B+TITLE+%7D%7D) and notify our engineering team via Slack.
3. **Try resolving the issue** and take notes while you gather information about it.
4. **If after 30 minutes the issue is not solved or you know you cannot resolve it**
- Ping our engineering team and our Project Manager in the {guilabel}`#support-freshdesk` channel so that they are aware of the incident.
- Add the incident issue to [our team backlog](https://github.com/orgs/2i2c-org/projects/22/).
5. **Designate an {term}`Incident Commander`**. Do this in the Incident issue. By default, this is the Support Steward.
- Confirm that the Incident Commander has the bandwidth and ability to serve in this role.
- If not, delegate this to another team member.[^note-on-delegation]
6. **Designate an {term}`External Liason`**. Do this in the Incident issue. By default, this is the Incident Commander, though they may delegate this to others.[^note-on-delegation]
7. **Investigate and resolve the incident**. The Incident Commander should follow the structure of the incident issue opened in the step above.
8. **Delegate to Subject Matter Experts as-needed**. The Incident Commander is empowered to delegate actions to Subject Matter Experts in order to investigate and resolve the incident quickly.[^note-on-delegation]
9. **Communicate our status every few hours**. The {term}`External Liason` is expected to communicate incident status and plan with the {term}`Community Representative`s. They should provide periodic updates that describe the current state of the incident, what we have tried, and our intended next steps. Here is a canned response to get started:

```{button-link} https://2i2c.freshdesk.com/a/admin/canned_responses/folders/80000143608/responses/80000247492/edit
:color: primary

Incident update template
```

10. **Communicate when the incident is resolved**. When we believe the incident is resolved, communicate with the Community Representative that things should be back to normal. Mark the FreshDesk ticket as {guilabel}`Resolved`.
11. **Fill in the {term}`Incident Report`**. The Incident Commander should do this in partnership with the Incident Response Team.
12. **Close the incident ticket**. Once we have confirmation from the community (or no response after 48 working hours), and have filled in the {term}`Incident Report`, close the incident by:
- Closing the incident issue on GitHub
- Marking the FreshDesk ticket as {guilabel}`Closed`

[^note-on-delegation]: If you cannot find somebody to take on this work, or feel uncomfortable delegating, the {term}`Project Manager` should help you, and is empowered to delegate on your behalf.
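
The escalation rule and the role defaults in the process above can be sketched in a few lines of Python. This is only an illustrative model, not tooling that 2i2c uses: the `Incident` class, its field names, and the example names are all assumptions made for this sketch. It encodes three rules from the steps above: escalate after 30 minutes (or sooner if you know you cannot resolve the issue), the Incident Commander defaults to the Support Steward, and the external liaison defaults to the Incident Commander.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Step 4 above: escalate if the issue is unresolved after 30 minutes.
ESCALATE_AFTER = timedelta(minutes=30)


@dataclass
class Incident:
    """Hypothetical in-memory model of an incident (illustration only)."""

    opened_at: datetime
    support_steward: str
    incident_commander: Optional[str] = None
    external_liaison: Optional[str] = None
    resolved: bool = False
    cannot_resolve: bool = False

    def needs_escalation(self, now: datetime) -> bool:
        """True when the engineering team and Project Manager should be pinged."""
        if self.resolved:
            return False
        return self.cannot_resolve or (now - self.opened_at) >= ESCALATE_AFTER

    def commander(self) -> str:
        """The Incident Commander; by default, the Support Steward."""
        return self.incident_commander or self.support_steward

    def liaison(self) -> str:
        """The external liaison; by default, the Incident Commander."""
        return self.external_liaison or self.commander()


opened = datetime(2022, 6, 27, 9, 0)
incident = Incident(opened_at=opened, support_steward="steward")
print(incident.needs_escalation(opened + timedelta(minutes=10)))  # False
print(incident.needs_escalation(opened + timedelta(minutes=45)))  # True
print(incident.liaison())  # steward
```

Modelling the defaults as fallbacks means that explicitly delegating a role only requires setting one field, and everything that is not delegated stays with the Support Steward.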

## Handing off Incident Commander status

During an incident, it may be necessary to designate another person to be the Incident Commander.
For example, this may happen if it is getting late in the current IC's time zone, if they feel burnt out from leading the incident response, or if someone else has better visibility or experience to be the Incident Commander.
This is encouraged and expected, especially for more complex or longer incidents!

To designate another team member as the Incident Commander, follow these steps:

1. **Confirm with them** that they are able and willing to serve as the Incident Commander.
2. **Update the Incident Report issue** by updating the Incident Commander name in the top comment.
3. **Notify the team** with a comment in the Incident Report issue.
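
As a rough sketch of the hand-off steps above, the function below refuses to proceed without confirmation, updates the recorded Incident Commander, and returns the text for the notification comment. The `report` dictionary, the function name, and the example names are assumptions made for illustration. In practice these steps are carried out by hand in the GitHub issue.

```python
def hand_off_commander(report: dict, new_commander: str, confirmed: bool) -> str:
    """Apply the three hand-off steps to a hypothetical in-memory report."""
    # Step 1: the new Incident Commander must have agreed to take over.
    if not confirmed:
        raise ValueError("Confirm with the new Incident Commander before handing off.")
    # Step 2: update the Incident Commander name recorded in the report.
    previous = report.get("incident_commander", "unassigned")
    report["incident_commander"] = new_commander
    # Step 3: return the comment text used to notify the team.
    return f"Incident Commander handed off from {previous} to {new_commander}."


report = {"incident_commander": "alice"}
print(hand_off_commander(report, "bob", confirmed=True))
# Incident Commander handed off from alice to bob.
```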

## Key terms

```{glossary}
Incident Report
Incident Reports
A document that describes what went wrong during an incident and what we'll do to avoid it in the future. When we have an {term}`Incident`, we create an Incident Report issue.
This helps us explain what went wrong, and directs actions to avoid the incident in the future. Its goal is to identify improvements to process, technology, and team dynamics that can avoid incidents like this in the future. It is **not** meant to point fingers at anybody and care should be taken to avoid making it seem like any one person is at fault[^post-mortems].
```

[^post-mortems]: See the [Google SRE post-mortem culture](https://sre.google/sre-book/postmortem-culture/) and the [Blameless guide to post-mortems](https://www.blameless.com/sre/what-are-blameless-postmortems-do-they-work-how) for some guidelines.
40 changes: 10 additions & 30 deletions projects/managed-hubs/index.md
@@ -1,30 +1,17 @@
# Service for Managed JupyterHubs
(service:index)=
# Collaborative JupyterHub Service

The Managed JupyterHub Service is a special project that is run by 2i2c.
It is an ongoing service, and thus is less development-oriented than projects that are funded by grants and collaborations.
The Collaborative JupyterHub Service is an ongoing service to **sustain and scale** a collaborative, community-driven, interactive computing service for communities in research and education.

The key goal of the 2i2c Managed JupyterHub Service is to launch a self-sustaining service that provides managed JupyterHub distributions to customers in research and education.
**[`docs.2i2c.org`](https://docs.2i2c.org) has most of the information about this service**.

We are currently in the pilot phase of this project, with the goal of creating a basic shared infrastructure that can deploy, configure, and manage many JupyterHub distributions.
We are also building a sales and support pipeline around this infrastructure.
The sections here contain information that is more relevant to 2i2c team members (like support process documentation).

## Where is information located?

**Infrastructure and development** happens in [the `infrastructure/` repository](https://github.com/2i2c-org/infrastructure).
This is both the deployment infrastructure for the 2i2c JupyterHubs, as well as documentation and team coordination around developing and running them.

**The Hub Engineer's Guide** describes how to develop and operate the 2i2c `infrastructure/` infrastructure.
It is generally designed for 2i2c engineers to follow and learn.
You can find it at [infrastructure.2i2c.org](https://infrastructure.2i2c.org).

**The Hub Administrator's Guide** provides guidance for Hub Administrators to customize and control their hub.
It is community-facing, and meant as a more external view on the 2i2c Hubs Pilot.
It also describes high-level information about the pilot in general.
It is located at [docs.2i2c.org](https://docs.2i2c.org) (or [the `docs/` repository](https://github.com/2i2c-org/docs))

**Strategy and roadmaps** for the Managed Hubs Pilot are located in the Hub Administrator's guide.

For other information about the Managed Hubs Pilot, see the sections below.
```{toctree}
sales.md
support.md
incidents.md
```

## A list of running JupyterHubs

@@ -33,10 +20,3 @@ We keep a table with all of our currently-running JupyterHubs at this location:
## Other project information

- [this google folder](https://drive.google.com/drive/folders/1HEEfyT2h_fKeqKdsz9Ftiw9Be1Uj48D6?usp=sharing) has most information and brainstorms regarding this project

```{toctree}
pricing.md
sales.md
support.md
roles.md
```
6 changes: 0 additions & 6 deletions projects/managed-hubs/pricing.md

This file was deleted.
