Define support ticket urgency levels and practices for how to deal with them #1154

choldgraf · 2022-03-29T17:48:24Z

Context

An important part of the support process is understanding how "urgent" a ticket is. For example, some tickets are general requests that we can finish in a few weeks, while others are immediate fires that we need to fix ASAP. Having a system to categorize these tickets will help us make better decisions about our operational time, and reduce the stress of not knowing whether we need to drop everything to fix something.

Proposal

We should define:

a few levels of "urgency" for our support tickets
criteria for how to categorize tickets into these levels
processes that we follow to resolve tickets at each level

As a result of this, we may need to do further development work to improve our support practices, such as setting up a PagerDuty-like system, but we'll understand that better once we write out the high-level structure.

Implementation guide

See the parent issue for a lot of references to support processes with urgency levels:

Define some first-line and second-line support processes #1068

yuvipanda · 2022-03-29T19:05:07Z

I think an important first step is to define an incident in an objective way, independent of how urgent the user thinks the problem is. An incident is when one of the following is true:

Users can't log in to the hub
Users can't start a server
(For dask-gateway) Users can't start dask workers

I think when any of these are true, we should 'declare an incident'. https://sre.google/workbook/incident-response/ has some good ideas on what to do when that happens, inspired by actual fire incidents in the literal wild.
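To make the "objective" part concrete, the criteria above could be written as a small predicate. This is only a minimal sketch: `TicketReport` and its field names are hypothetical and do not correspond to any existing 2i2c tooling.

```python
from dataclasses import dataclass


@dataclass
class TicketReport:
    """Hypothetical summary of what a support ticket reports."""

    users_cannot_log_in: bool = False
    users_cannot_start_server: bool = False
    uses_dask_gateway: bool = False
    users_cannot_start_dask_workers: bool = False


def is_incident(report: TicketReport) -> bool:
    """Return True if any of the incident criteria listed above holds."""
    # The dask criterion only applies to hubs that run dask-gateway.
    dask_failure = (
        report.uses_dask_gateway and report.users_cannot_start_dask_workers
    )
    return (
        report.users_cannot_log_in
        or report.users_cannot_start_server
        or dask_failure
    )
```

Note that the predicate takes no "urgency" input from the user at all, which is the point of defining an incident this way.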

yuvipanda · 2022-04-04T17:37:32Z

Taking a page out of https://sre.google/workbook/incident-response/#putting-best-practices-into-practice, here's a very specific, first-draft process recommendation for an incident management workflow.

When a ticket comes in, we perform the following test:

Is it a report about users not being able to log in?
Is it a report about users not being able to start their server?
(For dask-gateway) is it a report about dask-servers not working?

If any of these criteria are met, the support steward declares an incident, by doing the following:

Open an issue in this repo, using https://github.com/2i2c-org/infrastructure/issues/new?assignees=&labels=type%3A+Hub+Incident%2Csupport&template=3_incident-report.md&title=%5BIncident%5D+%7B%7B+TITLE+%7D%7D (we will refine this too)
Assign an Incident commander for this particular incident. I like the definition in https://response.pagerduty.com/before/different_roles/, which says they:

a. Commands and coordinates the incident response, delegating roles as needed. By default, the IC assumes all roles that have not been delegated yet.
b. Communicates effectively.
c. Stays in control of the incident response.
d. Works with other responders to resolve the incident.

The expectation should be that this is a different person than the support steward, to reduce the load on the support steward and recognize that they are not responsible for resolving all outages. We should define a process for figuring out who gets to be incident commander separately from this, but this comment just needs to acknowledge that this is a separate role from what the support steward is doing.
Respond to the support ticket by acknowledging the incident, tagging in the incident commander.

The incident commander is responsible for investigating the issue, pulling in people if necessary, and keeping the customer informed via the support ticket. They can also tag out when it is no longer working hours for them. We should engineer a process that makes this viable.
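As a rough illustration of the role mechanics described above (the IC holds every role that has not been delegated, and can tag out to someone else), here is a minimal sketch. The `Incident` class, its method names, and the role names used in the example are all hypothetical and not part of any existing tooling.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Incident:
    """Hypothetical record of a declared incident."""

    title: str
    support_steward: str
    commander: str
    # Maps a role name to the person it was delegated to.
    delegated: Dict[str, str] = field(default_factory=dict)

    def delegate(self, role: str, person: str) -> None:
        """The IC hands a specific role to another responder."""
        self.delegated[role] = person

    def owner_of(self, role: str) -> str:
        """By default, the IC assumes every role not delegated yet."""
        return self.delegated.get(role, self.commander)

    def hand_off(self, new_commander: str) -> None:
        """Tag out, for example at the end of working hours."""
        # The role moves to another person; existing delegations stay.
        self.commander = new_commander


# Example with made-up names and role labels.
incident = Incident("Users can't log in", support_steward="steward", commander="alice")
incident.delegate("communications", "bob")
incident.hand_off("carol")
```

After the hand-off in this example, `carol` holds every undelegated role while `bob` keeps communications, which is what treating the IC as a role rather than a person would look like.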

How does this sound as a start? I can make this into a PR and we can iterate.

yuvipanda · 2022-04-04T17:44:09Z

https://response.pagerduty.com/before/different_roles/ is also a good read.

yuvipanda · 2022-04-04T17:54:10Z

Couple more vague thoughts:

It should deeply respect working hours rules, so we don't expect people to be 'up all night' (or outside their working hours, whatever they are). Tagging someone else in to this role is to be expected, so the process should be built around this being a role rather than a person.
Incidents should be rare, and we should have an appropriate post-mortem process to try to fix their causes. If we are spending too much time on incidents, we could use error budget techniques (https://sre.google/sre-book/embracing-risk/) to reduce that time.
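For reference, the basic error budget arithmetic looks roughly like the sketch below. The 99% availability target, the 30-day window, and the incident durations are made-up numbers for illustration only; nothing in this thread proposes a specific SLO.

```python
from typing import Iterable


def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability in the window for an availability SLO."""
    return (1.0 - slo) * window_days * 24 * 60


def budget_remaining(
    slo: float, incident_minutes: Iterable[float], window_days: int = 30
) -> float:
    """Budget left after subtracting the time spent in incidents."""
    return error_budget_minutes(slo, window_days) - sum(incident_minutes)


# Hypothetical numbers: a 99% target over 30 days allows roughly 432 minutes
# of downtime; two incidents of 90 and 45 minutes use up 135 of them.
remaining = budget_remaining(0.99, [90, 45])
```

If the remaining budget goes negative, that is the signal to spend more time on reliability work and less on new features.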

damianavila · 2022-04-04T21:10:32Z

The expectation should be that this is a different person than the support steward, to reduce the load on the support steward and recognize that they are not responsible for resolving all outages. We should define a process for figuring out who gets to be incident commander separately from this, but this comment just needs to acknowledge that this is a separate role from what the support steward is doing.

So, are we thinking about having 2 people in support AND one incident commander from the rest of the team?
Or does one of the two support folks assume the incident commander role?

yuvipanda · 2022-06-23T02:33:16Z

I think this is closed by 2i2c-org/team-compass#422

damianavila · 2022-06-23T21:32:37Z

I think there are some additional pieces on this merged PR: 2i2c-org/docs#143

@choldgraf, do you want to keep this open for something else?

choldgraf · 2022-06-24T07:41:52Z

Let's say that 2i2c-org/team-compass#422 closes this one, and we can continue iterating in new / more focused issues from there?

choldgraf mentioned this issue Mar 29, 2022

Define some first-line and second-line support processes #1068

Closed

3 tasks

choldgraf added the Enhancement and 🏷️ team-process labels Mar 29, 2022

damianavila added this to Sprint Board Apr 12, 2022

damianavila moved this to Todo 👍 in Sprint Board Apr 12, 2022

damianavila removed this from Sprint Board Apr 12, 2022

choldgraf mentioned this issue Jun 22, 2022

Add incident commander role + more steps to support process 2i2c-org/team-compass#422

Merged

2 tasks

damianavila added this to DEPRECATED Engineering and Product Backlog Jun 23, 2022

damianavila moved this to In progress in DEPRECATED Engineering and Product Backlog Jun 23, 2022

damianavila moved this from In progress to Waiting in DEPRECATED Engineering and Product Backlog Jun 23, 2022

damianavila moved this from Waiting to In progress in DEPRECATED Engineering and Product Backlog Jun 24, 2022

damianavila assigned choldgraf Jun 24, 2022

choldgraf closed this as completed in 2i2c-org/team-compass#422 Jun 27, 2022

Repository owner moved this from In progress to Complete in DEPRECATED Engineering and Product Backlog Jun 27, 2022

damianavila mentioned this issue Jul 11, 2022

[blog] Quarter 2 update 2i2c-org/team-compass#452

Closed

6 tasks
