forked from kubernetes/contributor-site
Merge pull request kubernetes#440 from fsmunoz/sig-arch-prod-readines…
…s-spotlight Add SIG Architecture Production Readiness spotlight
Showing 1 changed file with 138 additions and 0 deletions.
content/en/blog/2023/sig-architecture-prod-readiness-spotlight.md
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,138 @@ | ||
--- | ||
layout: blog | ||
title: "Spotlight on SIG Architecture: Production Readiness" | ||
slug: sig-architecture-production-readiness-spotlight-2023 | ||
date: 2023-11-02 | ||
canonicalUrl: https://www.k8s.dev/blog/2023/24/05/sig-architecture-prod-readiness-spotlight-2023/ | ||
author: "Frederico Muñoz (SAS Institute)" | ||
--- | ||
|
||
_This is the second interview of a SIG Architecture Spotlight series that will cover the different | ||
subprojects. In this blog, we will cover the [SIG Architecture: Production Readiness | ||
subproject](https://github.com/kubernetes/community/blob/master/sig-architecture/README.md#production-readiness-1)_. | ||
|
||
In this SIG Architecture spotlight, we talked with [Wojciech Tyczynski](https://github.com/wojtek-t) | ||
(Google), lead of the Production Readiness subproject. | ||
|
||
## About SIG Architecture and the Production Readiness subproject | ||
|
||
**Frederico (FSM)**: Hello Wojciech, could you tell us a bit about yourself, your role and how you | ||
got involved in Kubernetes? | ||
|
||
**Wojciech Tyczynski (WT)**: I started contributing to Kubernetes in January 2015. At that time, | ||
Google (where I was and still am working) decided to start a Kubernetes team in the Warsaw office | ||
(in addition to already existing teams in California and Seattle). I was lucky enough to be one of | ||
the seeding engineers for that team. | ||
|
||
After two months of onboarding and helping with different tasks across the project towards the 1.0 | ||
launch, I took ownership of the scalability area and led the effort to make Kubernetes support | ||
clusters with 5000 nodes. I’m still involved in [SIG Scalability](https://github.com/kubernetes/community/blob/master/sig-scalability/README.md) | ||
as its Technical Lead. Because scalability is such a cross-cutting topic, that was the start of a | ||
journey: I began contributing to many other areas, including, over time, SIG Architecture. | ||
|
||
**FSM**: In SIG Architecture, why specifically the Production Readiness subproject? Was it something | ||
you had in mind from the start, or was it an unexpected consequence of your initial involvement in | ||
scalability? | ||
|
||
**WT**: After reaching that milestone of [Kubernetes supporting 5000-node clusters](https://kubernetes.io/blog/2017/03/scalability-updates-in-kubernetes-1-6/), | ||
one of the goals was to ensure that Kubernetes would not degrade its scalability properties over time. While | ||
a non-scalable implementation is always fixable, designing non-scalable APIs or contracts is | ||
problematic. I was looking for a way, without introducing too much overhead, to ensure that people | ||
think about scalability when they create new features and capabilities. | ||
|
||
This is when I joined forces with [John Belamaric](https://github.com/johnbelamaric) and | ||
[David Eads](https://github.com/deads2k) and created a Production Readiness subproject within SIG | ||
Architecture. While setting the bar for scalability was only one of a few motivations for it, it | ||
ended up fitting quite well. At the same time, I was already involved in the overall reliability of | ||
the system internally, so other goals of Production Readiness were also close to my heart. | ||
|
||
**FSM**: To anyone new to how SIG Architecture works, how would you describe the main goals and | ||
areas of intervention of the Production Readiness subproject? | ||
|
||
**WT**: The goal of the Production Readiness subproject is to ensure that any feature that is added | ||
to Kubernetes can be reliably used in production clusters. This primarily means that those features | ||
are observable, scalable, and supportable, and that they can always be safely enabled and, in case of | ||
production issues, also disabled. | ||
|
||
## Production readiness and the Kubernetes project | ||
|
||
**FSM**: Architectural consistency being one of the goals of the SIG, is this made more challenging | ||
by the [distributed and open nature of Kubernetes](https://www.cncf.io/reports/kubernetes-project-journey-report/)? | ||
Do you feel this impacts the approach that Production Readiness has to take? | ||
|
||
**WT**: The distributed nature of Kubernetes certainly impacts Production Readiness, because it | ||
makes thinking about aspects like enablement/disablement or scalability more challenging. To be more | ||
precise, when enabling or disabling features that span multiple components you need to think about | ||
version skew between them and design for it. For scalability, changes in one component may actually | ||
result in problems for a completely different one, so it requires a good understanding of the whole | ||
system, not just individual components. But it’s also what makes this project so interesting. | ||
|
||
**FSM**: Those running Kubernetes in production will have their own perspective on things. How do | ||
you capture this feedback? | ||
|
||
**WT**: Fortunately, we aren’t talking about _"them"_ here, we’re talking about _"us"_: all of us are | ||
working for companies that are managing large fleets of Kubernetes clusters and we’re involved in | ||
that too, so we suffer from those problems ourselves. | ||
|
||
So while we’re trying to get feedback (our annual PRR survey is very important for us), it rarely | ||
reveals completely new problems; rather, it shows their scale. And we try to react to it: | ||
changes like "Beta APIs off by default" happen in reaction to the data that we observe. | ||
|
||
**FSM**: On the topic of reaction, that made me think of how the [Kubernetes Enhancement Proposal (KEP)](https://github.com/kubernetes/enhancements/blob/master/keps/NNNN-kep-template/README.md) | ||
template has a Production Readiness Review (PRR) section, which is tied to the graduation | ||
process. Was this something born out of identified insufficiencies? How would you describe the | ||
results? | ||
|
||
**WT**: As mentioned above, the overall goal of the Production Readiness subproject is to ensure | ||
that every newly added feature can be reliably used in production. It’s not possible to enforce that | ||
by a central team - we need to make it everyone's problem. | ||
|
||
To achieve it, we wanted to ensure that everyone designing their new feature is thinking about safe | ||
enablement, scalability, observability, supportability, etc. from the very beginning: not when the | ||
implementation starts, but during the design. Given that KEPs are effectively Kubernetes design | ||
docs, making it part of the KEP template was the way to achieve the goal. | ||
|
||
**FSM**: So, in a way making sure that feature owners have thought about the implications of their | ||
proposal. | ||
|
||
**WT**: Exactly. We have already observed that just by requiring feature owners to think through the | ||
PRR aspects (by having them fill in the PRR questionnaire), many of the original issues are going | ||
away. Sure, as PRR approvers we’re still catching gaps, but even the initial versions of KEPs are | ||
better now than they were a couple of years ago when it comes to productionisation aspects, which | ||
is exactly what we wanted to achieve: spreading the culture of thinking about reliability in its | ||
widest possible meaning. | ||
|
||
**FSM**: We've been talking about the PRR process. Could you describe it for our readers? | ||
|
||
**WT**: The [PRR process](https://github.com/kubernetes/community/blob/master/sig-architecture/production-readiness.md) | ||
is fairly simple - we just want to ensure that you think through the productionisation aspects of | ||
your feature early enough. If you do your job, it’s just a matter of answering some questions in the | ||
KEP template and getting approval from a PRR approver (in addition to regular SIG approval). If you | ||
didn’t think about those aspects earlier, it may require spending more time and potentially revising | ||
some decisions, but that’s exactly what we need to make the Kubernetes project reliable. | ||
|
||
## Helping with Production Readiness | ||
|
||
**FSM**: Production Readiness seems to be one area where a good deal of prior exposure is required | ||
in order to be an effective contributor. Are there also ways for someone newer to the project to | ||
contribute? | ||
|
||
**WT**: PRR approvers have to have a deep understanding of the whole Kubernetes project to catch | ||
potential issues. Kubernetes is such a large project now with so many nuances that people who are | ||
new to the project can simply miss the context, no matter how senior they are. | ||
|
||
That said, there are many ways that you may implicitly help. Increasing the reliability of | ||
particular areas of the project by improving their observability and debuggability, increasing test | ||
coverage, and building new kinds of tests (upgrade, downgrade, chaos, etc.) will help us a lot. Note | ||
that the PRR subproject is focused on keeping the bar at the design level, but we should also care | ||
equally about the implementation. For that, we’re relying on individual SIGs and code approvers, so | ||
having people there who are aware of productionisation aspects, and who deeply care about it, will | ||
help the project a lot. | ||
|
||
**FSM**: Thank you! Any final comments you would like to share with our readers? | ||
|
||
**WT**: I would like to highlight and thank all contributors for their cooperation. While the PRR | ||
adds some additional work for them, we see that people care about it, and what’s even more | ||
encouraging is that with every release the quality of the answers improves, and questions such as | ||
"do I really need a metric reflecting if my feature works" or "is downgrade really that important" | ||
no longer come up. |