Add incident commander role + more steps to support process #422
- They assess whether they can resolve it quickly, and potentially do so.
- If they cannot resolve it, then we raise this support issue with our engineering team.
- If the issue is an {term}`Incident`
- We will prioritize resolving it over everything else.
I like this sentence!
projects/managed-hubs/support.md
Outdated
:::
:::{note}
We currently keep this term intentionally vague, and ask that communities are respectful of our time when making change requests.
We are investigating the support budget that we should give to each community, and will update here when we have specific numbers in mind.
<3
projects/managed-hubs/support.md
Outdated
1. **Acknowledge the incident**. Communicate with the Community Representative that there is an incident. Here is a template to get started:
``` |
We can make this a freshdesk template too?
If that is possible then I agree! I haven't looked into this but if it exists, then we should make a template and that should be the "source of truth"
Update: added incident commander
After some feedback I've made some more edits with the following main changes:
@yuvipanda want to take a look and we can discuss?
EDIT EDIT: muahahah I have finally gotten the wifi code for the cafe beneath my apartment,
I think I like the direction this is going in, and I am curious to see how this will play out in practice with this process formalized. I left a few suggestions and questions about things that I didn't understand. Thanks for working on this @choldgraf ✨
Co-authored-by: Georgiana Elena <[email protected]>
Thanks for those comments @GeorgianaElena ! I believe I've addressed them all and also added in a section about handing off IC status to others. Let me know if you have other thoughts or suggestions!
Co-authored-by: Sarah Gibson <[email protected]>
🚀
Thank you very much for writing this up, @choldgraf! I absolutely love it and I think it's an improvement over our current process.
I've left some inline comments about reducing the burden on the support steward when the IC is a different person from the support steward. One of the important outcomes here is to make sure the support stewards don't burn out, and I think the default state when there is a separate IC should be that the support steward is no longer responsible for that (unless they are asked to). For example, if there are multiple ongoing incidents, this puts a particularly big burden on the support steward. The communication overhead may also be significant, as there are now two extra places where communication needs to happen by default. I recognize that this might vary with individual IC style, but I think the default should be that we don't require the support steward to do this. Instead, the IC can call in someone to help with communications - this person can be the support steward, or someone else. I'd rather have us codify that than default to adding another duty to the support steward role.
projects/managed-hubs/incidents.md
Outdated
### External communication

- The {term}`Support Steward` team acts as the primary point of communication with external stakeholders like the {term}`Community Representative`s.
IMO, specifically for incidents, this should be optional. This adds an additional load for the support steward and the incident commander (if they are different people). The IC can request the support steward to act as this if necessary, but otherwise I think we should not require this of the support steward. They should be able to delegate an incident to the incident commander, and then by default continue with their existing role. I think if the IC is the source of truth, they should by default be the person who communicates too - otherwise we're adding an entirely new person to this chain, and that is often extremely frustrating during an incident process for everyone involved.
So my suggestion is that the IC can ask someone else to be the point of contact (support steward or someone else) if they so choose to, but that is not the default.
Your reasoning makes sense to me - one concern I have is that the incident commander needs to spend their cycles on resolving the incident, not necessarily also communicating it externally. Maybe the answer is to say the incident commander does this by default, but if they must log off or are otherwise overwhelmed, they should delegate another team member to provide external communication?
Exactly right! The default is they do it but if in their judgement delegating it is the right thing - if the extra communication overhead is worth it (as it often is) they can. They just delegate it to someone, not necessarily the support steward.
As long as the IC introduces themselves (or is introduced by the support steward) to the ticket submitter, I think it makes sense NOW to make this optional. I also like the idea of delegating the communication to someone else who is not necessarily the support steward (because we do not have a lot of people in support to address other tickets).
Why do I highlight the word NOW? Because when we get a dedicated support team, there should be a clear separation of boxes, IMHO. The support team should handle the communication with the ticket submitters because they are trained and specialized to interact with people under stress who are looking for answers. An IC coming from the eng team is not well prepared for that interaction... and that might be a source of issues.
Yeah - thinking of the Incident Commander docs from PagerDuty, I think this is the "external liaison" role. They also have a dedicated person/team to do that work. In our case, I think there are two things I worry about there:
- We don't have the staffing capacity for this now, but maybe this will change in the future as @damianavila suggests.
- If people are not awake at the same time, we pay a big communication penalty when we have bottlenecks of information. If one person must be the one to communicate externally, and that person just went to sleep, then it means no communication can occur until they return to work. This feels like a stressful situation given that we don't have the staffing to ensure seamless handoffs between time-zones all the time.
Yep, this is why I think this is fine for now but we should change it in the future when we have enough capacity.
OK I've updated this so that the Incident Commander is the communicator by default, and may delegate if they wish.
projects/managed-hubs/incidents.md
Outdated
For each {term}`Incident` we create a dedicated issue to track its progress. [{bdg-primary}`open an incident issue`](https://github.com/2i2c-org/infrastructure/issues/new?assignees=&labels=type%3A+Hub+Incident%2Csupport&template=3_incident-report.md&title=%5BIncident%5D+%7B%7B+TITLE+%7D%7D) and notify our engineering team via Slack.
3. **Try resolving the issue** and take notes while you gather information about it.
4. **If after 30 minutes the issue is not solved or you know you cannot resolve it**, ping our engineering team and our Project Manager in the {guilabel}`#support-freshdesk` channel so that they are aware of the incident.
5. **Designate an {term}`Incident Commander`**. If the Support Steward wishes to designate someone other than themselves as Incident Commander, do this in the Incident issue.
i like this.
So, this line implies the support steward has the "power" to designate the IC. How will that power be exercised?
I think there should be some known expectations/details around this. For instance, can the designation be rejected, and what do we do if that happens?
This is a good point - also important to think about the power dynamics here. If the support steward is a brand new team member with less experience than others, they may not feel comfortable just "delegating" to somebody else.
Is this a role that the Project Manager could play? I recall from 2i2c-org/infrastructure#1068 that one of the case studies there used a workflow like:
- Support person tries to resolve themselves first
- If they can't, they open an issue about this and discuss with the team manager (in our case, I think this would be the project manager)
- Team manager then routes that work item to somebody else on the team.
- Or if it is more complex, they discuss in their next team standup (I believe it is daily for them) and somebody is assigned to that work item out of that meeting
> Is this a role that the Project Manager could play?
Yep, probably.
I would suggest still keeping the power to designate in the support steward's hands for the sake of simplicity and speed... but putting the tie-breaker "power" in the PM's hands if some conflict arises.
I would also encourage the support steward to have a conversation and agreement with the future IC before the designation actually happens.
I'll add a note about asking the Project Manager
projects/managed-hubs/incidents.md
Outdated
5. **Designate an {term}`Incident Commander`**. If the Support Steward wishes to designate someone other than themselves as Incident Commander, do this in the Incident issue.
6. **Investigate and resolve the incident**. The Incident Commander should follow the structure of the incident issue opened in the step above.
7. **Delegate to Subject Matter Experts as-needed**. The Incident Commander is empowered to delegate actions to Subject Matter Experts in order to investigate and resolve the incident quickly.
8. **Communicate every few hours**. The {term}`Incident Commander` is expected to communicate incident status and plan with the {term}`Support Steward`s, and the Support Stewards are expected to communicate to the {term}`Community Representative`s. They should provide periodic updates to communities as we attempt to resolve the incident. Here is a template to get started:
See comment above about removing the support steward from the line of fire here by default. Given our diverse timezones, I think it is especially important that we reduce communication overhead by default.
OK I believe that I've addressed each of the comments above! Please let me know if that makes sense or if you'd like to see any other edits!
projects/managed-hubs/incidents.md
Outdated
To designate another team member as the Incident Commander, follow these steps:

1. **Confirm with them** that they are able and willing to serve as the Incident Commander
1. **Confirm with them** that they are able and willing to serve as the Incident Commander
1. **Confirm with them** that they are able and willing to serve as the Incident Commander.
projects/managed-hubs/support.md
Outdated
When a support request is made that requires an action from a 2i2c engineer, a Support Steward should describe this change in a GitHub issue, and add it to the [Sprint Board](coordination:sprint-board).
Think about an engineering team member that likely has the skills and capacity needed, and ask them if they are willing to take on resolving this issue.
Try not to ask the same person for support help many times in a row - we should spread the work needed to address support issues across the team.
Here is a rough idea of our rationale to follow for arriving at a specific number:
You mention arriving at "a specific number" but I think you did not mention that specific number above, is that intended?
From your previous comment, that number would be:
> we could say that those various hub types correspond to 34/20 ~= 1 hours, or 34 / 8 ~= 4 hours of support each month.
I know you can derive that from the rationale you have included but I feel it needs a conclusion like the above one.
I'll try to make it clear that this number is not yet specific, and that we want to keep it as a rationale only, but no precise numbers.
Hey all - I've addressed @damianavila's comment above, and I've also replaced the response template text with links to our FreshDesk templates, which I have just created (that addressed the other comment from @yuvipanda): https://2i2c.freshdesk.com/helpdesk/canned_responses/folders/80000143608/responses/80000247490/edit
I think that this should now be relatively ready to go unless there are further comments. There might be some link failures that are dependent on another PR to get in, but I'd prefer if we solve them in follow-ups rather than block this PR if that's OK.
projects/managed-hubs/incidents.md
Outdated
```{button-link} https://2i2c.freshdesk.com/helpdesk/canned_responses/folders/80000143608/responses/80000247490/edit
This link is "not found" for me, it redirects to https://2i2c.freshdesk.com/a/notfound (and I am logged in).
oops, I think I fixed this
Question: should this be in docs.2i2c.org/about ?
I made a few more edits to this content, and I find myself naturally wanting to look for it in the "About the service" section of our 2i2c service documentation (somewhere in here: https://docs.2i2c.org/en/latest/about/overview.html). I feel like this section describes the human infrastructure and strategy behind the service, and is a hybrid of documentation for our team (to know what to follow) and other communities (to know what to expect). Do others agree that this content would make more sense inside docs.2i2c.org ? If so, I am curious what you think about the other content in the "Managed Hubs Service" section in our Team Compass. I kind-of feel like this could also live in
Curious if others have thoughts
38262c2 to 497e156
Updates to this PR
I spoke with @damianavila in particular (and others in passing) about the idea of moving these docs to docs.2i2c.org, and in general the consensus seemed to be that we should keep "2i2c-specific" docs in our team compass to avoid cluttering up the service docs at docs.2i2c.org. So, the latest commit adds a few more updates and reorganizations to try and lean into this separation of duties a bit more. It does a few main things:
I love the incident commander stuff, and we can iterate as we move forward.
I'm very confused by 'Collaborative' replacing 'Managed' service though.
@@ -39,15 +39,15 @@ As a start, we wish to launch a **Managed JupyterHubs** service, and will begin
- 2i2c manages JupyterHubs for at least two institutions.
- 2i2c manages more lightweight, community-specific JupyterHubs for several smaller groups in research and education.
- 2i2c manages a "generic" JupyterHub that is not tied to any single institution or group.
- 2i2c has a beta-level business model for the first iteration of our Managed JupyterHub service.
- 2i2c has a beta-level sustainability model for the first iteration of our Collaborative JupyterHub Service.
IMO Managed is a more straightforward and broadly accepted term for what we do, while Collaborative is generally overloaded to the point of being a buzzword. I'd suggest we keep calling it 'Managed'
I'm happy to get this merged without the
I opened up the issue below to discuss terminology etc, so we can merge this one in:
Context
Our current support process is under-specified given the different kinds of support issues that we may get. This PR adds a few major concepts to our Support process documentation:
It also adds some extra contextual information, terminology, etc to help us get on the same page.
What are we missing
ref: 2i2c-org/infrastructure#1068 (comment)
also related to: 2i2c-org/docs#143 and 2i2c-org/infrastructure#1118
closes 2i2c-org/infrastructure#1154 closes 2i2c-org/infrastructure#1155