diff --git a/about/2i2c.md b/about/2i2c.md index 1da2c30..95aaf7c 100644 --- a/about/2i2c.md +++ b/about/2i2c.md @@ -31,7 +31,7 @@ Here are a few of the major projects our team memebers have been involved in ove ## 2i2c has expertise in open source workflows and Jupyter 2i2c's team is comprised of several "[Distinguished Contributors](https://jupyter.org/about)" in the Jupyter ecosystem, which is a crucial technical component of this service. -We are [core team members of JupyterHub and Binder](https://jupyterhub-team-compass.readthedocs.io/en/latest/team.html), and make regular contributions across the Jupyter ecosystem. +We are [core team members of JupyterHub and Binder](https://jupyterhub-team-compass.readthedocs.io/en/latest/team/index.html), and make regular contributions across the Jupyter ecosystem. Moreover, our team has many years of experience with all aspects of the Jupyter stack and we are comfortable interacting with open source communities everywhere. This makes 2i2c uniquely capable of both utilizing and improving this technology through upstream contributions. diff --git a/about/overview.md b/about/overview.md index a1cc74f..ad98d55 100644 --- a/about/overview.md +++ b/about/overview.md @@ -41,7 +41,7 @@ The service is currently in an **alpha phase**, and may evolve as we learn more 2i2c will operate and manage a 2i2c JupyterHub deployment for use by you and your community, accessible via a web URL. 2i2c will handle the design, configuration, development, and ongoing operation of the hub infrastructure. The following sections describe several common activities that the 2i2c team will perform as a part of your managed JupyterHub. -In addition, see our [Service Level Objectives](strategy/service-objectives.md) for an explanation of what we aim to accomplish in terms of uptime, reliability, and support for these services. 
+In addition, see our [Service Level Objectives](service-objectives.md) for an explanation of what we aim to accomplish in terms of uptime, reliability, and support for these services. ### JupyterHub Setup diff --git a/about/service-objectives.md b/about/service-objectives.md new file mode 100644 index 0000000..9a5833f --- /dev/null +++ b/about/service-objectives.md @@ -0,0 +1,140 @@ +# Service Level Objectives + +This page describes the **Service Level Objectives** (SLOs) of 2i2c's infrastructure services[^slos]. +These are our goals in running infrastructure for the communities that we serve. +They indicate what our users can expect when using the infrastructure we support. +They will evolve over time as we get feedback and learn how to best deliver impact via our services. + +:::{note} +2i2c does not currently have a **Service Level Agreement** (SLA), and the SLOs here are not legally-binding. +We aim to create SLAs once we learn more about our capacity to fulfill them sustainably.[^zenodo] +::: + + +(objectives:stability)= +## Availability and uptime + +The infrastructure that 2i2c runs should be available to its communities 24/7, and with minimal human intervention needed to maintain this level of performance. +We invest in continuous development to improve the resiliency and efficiency of the infrastructure that we run, following best-practices in service design and engineering in the cloud. + +- Communities should feel comfortable relying on 2i2c's services for critical educational and research needs. +- There should not be prolonged periods of service disruption for any community. +- We will invest in monitoring and reporting infrastructure to detect outages quickly and before they impact end-users. +- When outages do occur, we will prioritize these over other work that our team is doing. + +:::{admonition} To be refined... +It is a known anti-pattern to define an ambiguous SLO like "24/7". 
+Truly meeting such an objective is nearly impossible and extremely costly. +In the future, we plan to run an audit of our infrastructure and practices, and design quantifiable uptime targets for our SLOs. +::: + + +(objectives:intentional-downtime)= +### Intentional downtime + +In some cases there may be intentional downtime for the infrastructure that we run. +For example, if we need to undergo major maintenance of infrastructure transitions. + +- We will communicate with communities before any intentional downtime. +- We will aim for downtime windows that happen outside of heavy usage. +- We will communicate with communities when the expected downtime is over. +:::{admonition} This may change +We are still exploring how to effectively communicate and schedule work around intentional downtime, and our processes may change. +[See this issue for example](https://github.com/2i2c-org/team-compass/issues/423). +::: + +(objectives:reduced-capacity)= +### Reduced team capacity + +There are some periods of time when we have **expected reduced capacity**. +These are periods of time when we are less strict about adhering to the service objectives on this page. +This ensures that our work practices are sustainable and fair for our team. + +Here are periods of expected reduced capacity: + +- Weekends +- The first and last weeks of the year. +- Periods of overlapping international holidays. + +If this is disruptive to your community's activities, please reach out and we can discuss. +However, we encourage you to avoid planning mission-critical events or actions during periods of expected reduced capacity. + +:::{admonition} A note on timezones +Remember that 2i2c's team is distributed globally, and our working time zone may be different from yours. +We aim to have team members in time zones that are working at the same time as the communities we serve, but there may occasionally be mismatches in working hours. 
+::: + +(objectives:support)= +## Support responsiveness + +Support is one of the most important services that 2i2c provides, especially when there are problems or outages. +For this reason, we commit to developing a support process that is efficient in responding to issues that communities bring to us. +We define three types of support with 2i2c: + +- **Incidents** are requests connected with significantly degraded service for one or more communities. For example, a system-wide outage or inability of users to log in. +- **Change Requests** are general requests for changes or improvements to a community's hub. For example, updating the environment or improving an open source tool. +- **Guidance Requests** are questions or requests for conversations to discuss infrastructure decisions, provide guidance, etc. + +Below are our objectives broken down by the type of support they relate to. + +:::{seealso} +- See [](../support.md) for more information about contacting support. +- See [](tc:support:process) for our team's support process. +::: + +### General support objectives + +- We have a dedicated communications channel for support (see [](../support.md)). +- At least one team member is always tasked with monitoring this channel. +- Our support team is communicative, helpful, and [abides by our Code of Conduct](tc:code-of-conduct). + +### Incident support objectives + +Our goal is to be more rapid in responding, communicating, and resolving support requests during incidents. +Our ability to meet these objectives will depend on the times they are reported relative to the working hours of our support team. + +- We will triage and respond to Incidents within **at most one working day**[^working-day]. We will **on average** respond to Incidents significantly faster than this, but do not commit to a specific timeline until we gain more experience. +- We will prioritize resolving Incidents over any other Change requests. 
+- For major or complex outages, we will re-direct capacity on our engineering team to resolve them. + +[^working-day]: We define a "working day" as a continuous 24 hour period between Monday and Friday. Our team and communities we serve are split across many time zones, and thus we use this more general definition of a working day rather than something timezone-specific. + +### Change and Guidance Request support objectives + +- We will triage Change and Guidance requests and respond to them within one working day. +- We will prioritize resolving Change and Guidance Requests by balancing them against our other development priorities as described in {doc}`our Support Team Process documentation `. + +(objectives:cost)= +## Costs and cloud flexibility + +Our communities rely on us to keep their cloud costs as low as possible. +They also rely on us to provide infrastructure that is dynamic and meets the needs of diverse communities. + +There is an inherent tension between doing things quickly (which generally requires using extra resources to encourage speed) and cost efficiency (because you pay for those extra resources). +This is particularly relevant during sharp increases in hub usage. + +- Communities should feel comfortable that moderate increases in usage will not result in instability, and that this flexibility does not result in unexpectedly high cloud costs. +- We should provide this flexibility in a way that is sustainable for our team. +- If infrastructure requires steady, but semi-random usage, we should prioritize cost efficiency. +- If infrastructure will have known spikes of activity, we may temporarily favor speed over cost by asking for extra resources from the cloud provider. +- If spikes in activity will come just after a holiday or weekend, we may make these changes a few days early to avoid working off-hours. + +:::{seealso} +See [](pricing/index.md) for more information about costs. 
+::: + + +(objectives:updates)= +## Upgrades and maintenance + +By continuously upgrading the cloud infrastructure and software environments that our hubs offer, we improve the experience of the communities that we serve by giving them new features, enhancements, and bug and security fixes. + +We aim to continuously upgrade this infrastructure in a way that minimizes the risk of instability or outages. + +- We will keep our hubs relatively up-to-date with the latest [JupyterHub](https://jupyterhub.readthedocs.io) and [Zero to JupyterHub](https://z2jh.jupyter.org) releases. +- We will ensure that our hub infrastructure is compatible with the latest software releases in the common open source ecosystems we provide. +- We will support open source communities in making regular updates and releases to their tools. + +[^slos]: For more about the difference between Service Level Objectives, Agreements, and Indicators, see [the Google SRE handbook](https://sre.google/sre-book/service-level-objectives/). + +[^zenodo]: This practice is inspired by [Zenodo's intentional lack of Service Level Agreements](https://about.zenodo.org/principles/). diff --git a/about/strategy/index.md b/about/strategy/index.md index 45c61bd..2be9590 100644 --- a/about/strategy/index.md +++ b/about/strategy/index.md @@ -8,7 +8,6 @@ We aim to run this pilot for several months, gaining experience and sharpening o This page describes the major strategy of the 2i2c Managed JupyterHubs pilot. ```{toctree} -service-objectives.md roadmap.md ``` diff --git a/about/strategy/service-objectives.md b/about/strategy/service-objectives.md deleted file mode 100644 index 796f46a..0000000 --- a/about/strategy/service-objectives.md +++ /dev/null @@ -1,75 +0,0 @@ -# Service Level Objectives and Principles - -This page describes the **Service Level Objectives** (SLOs) of 2i2c's infrastructure services[^slos]. -These describe our goals in running infrastructure for the communities that we serve. 
-They indicate what our users can expect when using the infrastructure we support. - -We design our infrastructure, and consistently hone our practices, to meet these objectives. -They evolve over time as we get feedback from communities we serve, and learn more about how to best deliver impact via our services. - -:::{note} -2i2c does not currently have a **Service Level Agreement** (SLAs), as this is generally a legally-binding document that involves calculation of risk via revenue lost during service outages. -We currently do not have the capacity to design and litigate strict SLAs, and believe that we will have the most impact by instead committing to service **objectives** that are transparent and follow best practices.[^zenodo] - -We may revisit this in the future depending on the feedback we get from other communities! -::: - -## High availability - -The infrastructure that 2i2c runs should be available to its communities 24/7, and with minimal human intervention needed to maintain this level of performance. -We invest in continous development to improve the resiliency and efficiency of the infrastructure that we run, following best-practices in service design and engineering in the cloud. - -:::{admonition} To be refined... -It is a known anti-pattern to define an ambiguous SLO like "24/7". -Truly meeting such an objective is nearly impossible and extremely costly. -In the future, we plan to run an audit of our infrastructure and practices, and design quantifiable uptime targets for our SLOs. -::: - -## Balance speed and cost - -There is an inherent tension between doing things quickly (which generally requires using extra resources to encourage speed) and cost efficiency (because you pay for those extra resources). -This is particularly relevant during **scaling events**. -These are moments when the infrastructure has enough usage that it must grow the cloud resources available to handle the new load. 
- -2i2c strives to build infrastructure that strikes a balance that depends on the particular use-case. -If infrastructure requires steady, but semi-random usage, we should prioritize cost efficiency. -If infrastructure will have known spikes of activity at the same time, we may temporarily favor speed over cost by asking for extra resources from the cloud provider. - -:::{note} -If your community requires a change in the infrastructure that occurs over a weekend, we will generally try to do this on the Friday beforehand, rather than over the weekend, even if this means it will cost marginally more in cloud infrastructure. -If we anticipate the cost to be significant, we will discuss with you ahead of time. -::: - -## Support responsiveness - -We have a dedicated communications channel for support at `support@2i2c.org`, and somebody on the engineering team is always tasked with monitoring this channel. - -When questions come in on the support channel, we triage them based on whether they cover a major problem for the community (e.g., if there is a major hub outage). - -If this is the case, we strive to respond as quickly as possible to mobilize the right team members and fix the problem. -We will communicate with the Community Representative throughout this process, and let them know when the problem has been resolved. -In general, we aim to respond to all support questions within 24 hours - though we strive for more quick responses if the issue is critical. - -## Intentional downtime - -In some cases there may be intentional downtime for the infrastructure that we run. -For example, if we need to undergo major maintenance of infrastructure transitions, it may necessitate bringing down the infrastructure for a few hours. - -In these cases, we will communicate with the Community Representative ahead of time, to inform them of our intentions and give an opportunity for them to tell us when this will be least disruptive. 
-We will then carry out our maintenance as quickly as possible, with minimal downtime, and notify the community representative(s) when this has been complete. - -## Holidays, weekends, and expected downtime - -Expected downtime are periods of time when there is generally less availability from the team (as well as from the communities we serve). -This includes weekends and heavy holiday periods like the end of the year. - -While we strive for our services to be available 24/7, we also believe in the importance of protecting weekends and holiday time for our team. -During expected downtime periods, you should expect reduced responsiveness in our support channels, and no promises about our ability to respond to questions or issues with your infrastructure. -We may agree to perform some of these operations during expected downtime, but this should be the exception, not the rule. - -If this is disruptive to your community's activies, please reach out and we can discuss. -However, we encourage you to avoid planning mission-critical events or actions during periods of expected downtime. - -[^slos]: For more about the difference between Service Level Objectives, Agreements, and Indicators, see [the Google SRE handbook](https://sre.google/sre-book/service-level-objectives/). - -[^zenodo]: This practice is inspired by [Zenodo's intentional lack of Service Level Agreements](https://about.zenodo.org/principles/). \ No newline at end of file diff --git a/conf.py b/conf.py index d27e992..dab32a1 100644 --- a/conf.py +++ b/conf.py @@ -61,6 +61,7 @@ } rediraffe_redirects = { + "about/strategy/service-objectives.md": "about/service-objectives.md", } # Disable linkcheck for anchors because it throws false errors for any JS anchors diff --git a/index.md b/index.md index 029db3b..297d62c 100644 --- a/index.md +++ b/index.md @@ -13,6 +13,7 @@ These sections describe the hub service at an organizational level. 
:caption: About the service about/overview about/pricing/index +about/service-objectives about/strategy/index ```