Update service objectives page #143
Merged
Commits (13):
5bba928 Updating service objectives page (choldgraf)
870b57f Update service objectives (choldgraf)
556878d Outages (choldgraf)
65964f4 Updates to support around incidents (choldgraf)
faf8833 working day (choldgraf)
e7baf13 More notes (choldgraf)
b663d20 updates (choldgraf)
1c48ac0 Change and guidance (choldgraf)
aff4dd4 Update about/service-objectives.md (choldgraf)
04b0d71 Update when we respond to incidents (yuvipanda)
5044a98 Fix typo (yuvipanda)
210919a about/2i2c (choldgraf)
46c31be Update team link (choldgraf)
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,134 @@ | ||
# Service Level Objectives | ||
|
||
This page describes the **Service Level Objectives** (SLOs) of 2i2c's infrastructure services[^slos]. | ||
These are our goals in running infrastructure for the communities that we serve. | ||
They indicate what our users can expect when using the infrastructure we support. | ||
They will evolve over time as we get feedback and learn how to best deliver impact via our services. | ||
|
||
:::{note} | ||
2i2c does not currently have a **Service Level Agreement** (SLA), and the SLOs here are not legally binding. | ||
We aim to create SLAs once we learn more about our capacity to fulfill them sustainably.[^zenodo] | ||
::: | ||
|
||
|
||
(objectives:stability)= | ||
## Availability and uptime | ||
|
||
The infrastructure that 2i2c runs should be available to its communities 24/7, and with minimal human intervention needed to maintain this level of performance. | ||
We invest in continuous development to improve the resiliency and efficiency of the infrastructure that we run, following best practices in service design and engineering in the cloud. | ||
|
||
- Communities should feel comfortable relying on 2i2c's services for critical educational and research needs. | ||
- There should not be prolonged periods of service disruption for any community. | ||
- We will invest in monitoring and reporting infrastructure to detect outages quickly and before they impact end-users. | ||
- When outages do occur, we will prioritize these over other work that our team is doing. | ||
|
||
:::{admonition} To be refined... | ||
It is a known anti-pattern to define an ambiguous SLO like "24/7". | ||
Truly meeting such an objective is nearly impossible and extremely costly. | ||
In the future, we plan to run an audit of our infrastructure and practices, and design quantifiable uptime targets for our SLOs. | ||
::: | ||
|
||
|
||
(objectives:intentional-downtime)= | ||
### Intentional downtime | ||
|
||
In some cases there may be intentional downtime for the infrastructure that we run. | ||
For example, we may need to perform major maintenance or infrastructure transitions. | ||
|
||
- We will communicate with communities before any intentional downtime. | ||
- We will aim for downtime windows that happen outside of heavy usage. | ||
- We will communicate with communities when the expected downtime is over. | ||
|
||
(objectives:reduced-capacity)= | ||
### Reduced team capacity | ||
|
||
There are some periods of time when we have **expected reduced capacity**. | ||
These are periods of time when we are less strict about adhering to the service objectives on this page. | ||
This ensures that our work practices are sustainable and fair for our team. | ||
|
||
Here are periods of expected reduced capacity: | ||
|
||
- Weekends | ||
- The first and last weeks of the year. | ||
- Periods of overlapping international holidays. | ||
|
||
If this is disruptive to your community's activities, please reach out and we can discuss. | ||
However, we encourage you to avoid planning mission-critical events or actions during periods of expected reduced capacity. | ||
|
||
:::{admonition} A note on timezones | ||
Remember that 2i2c's team is distributed globally, and our working time zone may be different from yours. | ||
We aim to have team members whose working hours overlap with those of the communities we serve, but there may occasionally be mismatches in working hours. | ||
::: | ||
|
||
(objectives:support)= | ||
## Support responsiveness | ||
|
||
Support is one of the most important services that 2i2c provides, especially when there are problems or outages. | ||
For this reason, we commit to developing a support process that is efficient in responding to issues that communities bring to us. | ||
We define three types of support with 2i2c: | ||
|
||
- **Incidents** are requests connected with significantly degraded service for one or more communities. For example, a system-wide outage or users being unable to log in. | ||
- **Change Requests** are general requests for changes or improvements to a community's hub. For example, updating the environment or improving an open source tool. | ||
- **Guidance Requests** are questions or requests for conversations to discuss infrastructure decisions, provide guidance, etc. | ||
|
||
Below are our objectives broken down by the type of support they relate to. | ||
|
||
:::{seealso} | ||
- See [](../support.md) for more information about contacting support. | ||
- See [](tc:support:process) for our team's support process. | ||
::: | ||
|
||
### General support objectives | ||
|
||
- We have a dedicated communications channel for support (see [](../support.md)). | ||
- At least one team member is always tasked with monitoring this channel. | ||
- Our support team is communicative, helpful, and [abides by our Code of Conduct](tc:code-of-conduct). | ||
|
||
### Incident support objectives | ||
|
||
Our goal is to respond to, communicate about, and resolve support requests more rapidly during incidents. | ||
Our ability to meet these objectives will depend on when incidents are reported relative to the working hours of our support team. | ||
|
||
- We will triage and respond to Incidents within 8 working hours **at most**. We will on average respond to Incidents within **2 working hours**. | ||
|
||
- We will prioritize resolving Incidents over any Change Requests. | ||
- For major or complex outages, we will re-direct capacity on our engineering team to resolve them. | ||
|
||
### Change and Guidance Request support objectives | ||
|
||
- We will triage Change and Guidance Requests and respond to them within 24 working hours. | ||
|
||
- We will prioritize resolving Change and Guidance Requests by balancing them against our other development priorities as described in {doc}`our Support Team Process documentation <tc:projects/managed-hubs/support>`. | ||
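
The working-hour targets in the support objectives above (8 working hours at most for Incidents, 24 working hours for Change and Guidance Requests) can be turned into calendar deadlines. The sketch below is only an illustration under stated assumptions: this page does not define a working day, so the 09:00 to 17:00, Monday to Friday schedule and the helper name `add_working_hours` are hypothetical, and time zones and holidays are ignored.

```python
from datetime import datetime, time, timedelta

# Assumed working day (hypothetical): the page does not define one.
WORK_START = time(9, 0)
WORK_END = time(17, 0)


def add_working_hours(reported: datetime, hours: float) -> datetime:
    """Return the moment that is `hours` working hours after `reported`.

    Weekends are skipped, matching the "reduced capacity" periods listed
    on this page. Holidays and time zones are not handled.
    """
    remaining = timedelta(hours=hours)
    current = reported
    while True:
        day_open = datetime.combine(current.date(), WORK_START)
        day_close = datetime.combine(current.date(), WORK_END)
        # Only weekdays (Mon=0 .. Fri=4) before closing time count.
        if current.weekday() < 5 and current < day_close:
            current = max(current, day_open)
            available = day_close - current
            if remaining <= available:
                return current + remaining
            remaining -= available
        # Otherwise continue from the start of the next calendar day.
        next_day = current.date() + timedelta(days=1)
        current = datetime.combine(next_day, WORK_START)


# An Incident reported late on a Friday, with the 8-working-hour maximum.
incident_deadline = add_working_hours(datetime(2022, 3, 4, 16, 0), 8)
# A Change Request reported Monday morning, with the 24-working-hour target.
change_deadline = add_working_hours(datetime(2022, 3, 7, 9, 0), 24)
```

Under these assumptions, a report made one hour before the end of a Friday uses that hour and carries the remaining seven into Monday, which is why reports made just before a weekend can see a longer wall-clock wait.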
|
||
(objectives:cost)= | ||
## Costs and cloud flexibility | ||
|
||
Our communities rely on us to keep their cloud costs as low as possible. | ||
They also rely on us to provide infrastructure that is dynamic and meets the needs of diverse communities. | ||
|
||
There is an inherent tension between doing things quickly (which generally requires using extra resources to encourage speed) and cost efficiency (because you pay for those extra resources). | ||
This is particularly relevant during sharp increases in hub usage. | ||
|
||
- Communities should feel comfortable that moderate increases in usage will not result in instability, and that this flexibility does not result in unexpectedly high cloud costs. | ||
- We should provide this flexibility in a way that is sustainable for our team. | ||
- If infrastructure has steady but semi-random usage, we should prioritize cost efficiency. | ||
- If infrastructure will have known spikes of activity, we may temporarily favor speed over cost by asking for extra resources from the cloud provider. | ||
- If spikes in activity will come just after a holiday or weekend, we may make these changes a few days early to avoid working off-hours. | ||
|
||
:::{seealso} | ||
See [](pricing/index.md) for more information about costs. | ||
::: | ||
|
||
|
||
(objectives:updates)= | ||
## Upgrades and maintenance | ||
|
||
By continuously upgrading the cloud infrastructure and software environments that our hubs offer, we improve the experience of the communities that we serve by giving them new features, enhancements, and bug and security fixes. | ||
|
||
We aim to continuously upgrade this infrastructure in a way that minimizes the risk of instability or outages. | ||
|
||
- We will keep our hubs relatively up-to-date with the latest [JupyterHub](https://jupyterhub.readthedocs.io) and [Zero to JupyterHub](https://z2jh.jupyter.org) releases. | ||
- We will ensure that our hub infrastructure is compatible with the latest software releases in the common open source ecosystems we provide. | ||
- We will support open source communities in making regular updates and releases to their tools. | ||
|
||
[^slos]: For more about the difference between Service Level Objectives, Agreements, and Indicators, see [the Google SRE handbook](https://sre.google/sre-book/service-level-objectives/). | ||
|
||
[^zenodo]: This practice is inspired by [Zenodo's intentional lack of Service Level Agreements](https://about.zenodo.org/principles/). |
I would add a "To be refined" admonition to this section too as we haven't yet defined how this communication and scheduling would happen, e.g., 2i2c-org/team-compass#423