
KEP-3636: CSI Drivers in Windows as HostProcess Pods

Release Signoff Checklist

Items marked with (R) are required prior to targeting to a milestone / release.

  • (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
  • (R) KEP approvers have approved the KEP status as implementable
  • (R) Design details are appropriately documented
  • (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
    • e2e Tests for all Beta API Operations (endpoints)
    • (R) Ensure GA e2e tests meet requirements for Conformance Tests
    • (R) Minimum Two Week Window for GA e2e tests to prove flake free
  • (R) Graduation criteria is in place
  • (R) Production readiness review completed
  • (R) Production readiness review approved
  • "Implementation History" section is up-to-date for milestone
  • User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
  • Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Summary

CSI enables third-party storage providers to write and deploy plugins without the need to alter the core Kubernetes codebase.

A CSI Driver in Kubernetes has two main components: a controller plugin which runs in the control plane and a node plugin which runs on every node.

The node component of a CSI Driver requires direct access to the host to make block devices and/or filesystems available to the kubelet; for example, CSI Drivers use the mkfs(8) and mount(8) commands to format and mount filesystems. CSI Drivers running on Windows nodes can't execute the equivalent Windows commands because Windows containers can't run privileged operations. To work around this issue, a proxy binary called CSI Proxy was introduced to perform privileged storage operations on behalf of CSI Drivers: a CSI Driver connects to the gRPC API that CSI Proxy exposes as named pipes on the host and invokes the CSI Proxy API services, which execute privileged PowerShell commands to mount and format filesystems on the CSI Driver's behalf. CSI Proxy became GA in Kubernetes 1.22.

At around the same time, SIG Windows introduced [HostProcess containers](https://kubernetes.io/blog/2021/08/16/windows-hostprocess-containers/). This feature enables running containers as processes on the host (hence the name); with it, CSI Drivers can directly perform the same privileged storage operations that CSI Proxy performed for them. This KEP explains the process of transitioning CSI Drivers to HostProcess containers.

Glossary

Reference for a few terms used throughout this document:

Motivation

CSI Proxy enabled running CSI in Windows nodes using a client/server model. The server is the CSI Proxy binary running as a Windows service in the node, and the client is the CSI Driver, which communicates with CSI Proxy on every CSI call done on the node plugin. While this model has worked fine, it has a few drawbacks:

  • Different deployment model than Linux - On Linux, privileged containers are used to perform the privileged storage operations (format/mount). Windows containers, however, aren't privileged. To work around this, the CSI Driver runs as a non-privileged container and relays privileged operations to CSI Proxy. In deployment manifests, the Windows component needs an additional section to mount the named pipes exposed by CSI Proxy as a hostPath volume.
  • Additional component in the host to maintain - The cluster administrator needs to install and run CSI Proxy during the node bootstrap. The cluster administrator also needs to think about the upgrade workflow in addition to upgrading the CSI Driver.
  • Difficult releases of bugfixes & features - After a bugfix, a new version of CSI Proxy must be built and redeployed in the cluster. After a feature is merged, in addition to redeploying a new version of CSI Proxy, the CSI Driver needs to be updated to a new version of the CSI Proxy client and connect to the new named pipes. This workflow is not as simple as the Linux counterpart, which only needs a Go dependency update.
  • Multiple API versions to maintain - As part of the original design of CSI Proxy, it was decided to create a new protobuf version whenever there were breaking changes (such as updates to the protobuf services & messages). This led to multiple versions of the API (v1alphaX, v1betaX, v1). In addition, adding a new feature requires creating a new API version, e.g. v2alpha1 (see this PR as an example of adding methods to the Volume API Group).

In 1.22, SIG Windows introduced HostProcess containers as an alternative way to run containers. HostProcess containers run directly on the host and behave like a regular process.

Using HostProcess containers in CSI Drivers enables CSI Drivers to perform the privileged storage operations directly. Most of the drawbacks in the client/server model are no longer present in the new model.

Goals

  • Identify the pros/cons of the different ways to transition CSI Drivers to become HostProcess containers - This includes changes in dependent components like CSI Proxy, as well as defining the changes in the CSI Drivers.
  • Identify the security implications of running CSI Drivers as HostProcess containers - Like their Linux counterpart, HostProcess containers need to have security policies in place limiting the scenarios in which they are enabled. We provide an analysis on the security implications in this KEP.

Non-Goals

  • Improve the performance of CSI Drivers in Windows - There should be an improvement in the performance by removing the communication aspects between the CSI Driver and CSI Proxy (the protobuf serialization/deserialization, the gRPC call through named pipes). However, this improvement might not be noticeable, as most of the latency comes from doing the format/mount operations through powershell commands, which is outside the scope of this change.
  • Define security implementation details - A goal is to understand the security implications of enabling HostProcess containers. We aim to provide guidelines but not implementation details about the components that need to be installed in the cluster.

Proposal

As part of the transition of CSI Drivers to HostProcess containers, we would like to:

  • Refactor the CSI Proxy codebase to become a Go library, replacing the current client/server model.
  • Define guidelines for the transition of CSI Drivers to HostProcess containers, including changes in the Go code, deployment, and security considerations.

Notes/Constraints/Caveats

HostProcess containers run as processes in the host. One of the differences with a privileged Linux container is that there's no filesystem isolation. This means that enabling HostProcess containers should be done for system components only. This point will be expanded on in the detailed design.

Risks and Mitigations

Security implications of HostProcess containers will be reviewed by the SIG Windows team and the SIG Storage team initially.

One risk of enabling the HostProcess containers feature is not having sufficient security policies for workloads in the cluster. If workloads can be deployed as HostProcess containers, or if there's an escalation that allows non-privileged pods to become HostProcess containers, then those workloads have complete access to the host filesystem. This includes the tokens under /var/lib/kubelet as well as the volumes of other pods inside /var/lib/kubelet/.

Design Details

The following paragraphs summarize the architecture of CSI Proxy, how CSI Drivers use it, the purpose of the conversion layer of CSI Proxy that enables backward compatibility with previous API Versions, and a description of the files generated by the conversion layer used in CSI Proxy.

CSI Proxy has a client/server design with two main components:

  • a binary that runs on the host (the CSI Proxy server). This binary can execute privileged storage operations on the host. Once configured to run as a Windows service, it creates named pipes on startup for all the versions of the API Groups defined in the codebase.
  • client Go libraries that CSI Drivers and addons import to connect to the CSI Proxy server. The methods and objects available in the library are defined with protobuf. On startup, the CSI Driver initializes a client for each version of the API Groups it requires; each client connects to its pre-configured named pipe on the host and issues requests through gRPC (a rough sketch of this connection setup follows this list).
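For illustration, the snippet below sketches the kind of connection setup the generated client code performs. It is a hedged sketch, not the actual generated code: the pipe name follows the conventional \\.\pipe\csi-proxy-<API Group>-<version> pattern, and dialCSIProxyPipe is a helper name introduced here.

import (
  "context"
  "net"

  winio "github.com/Microsoft/go-winio"
  "google.golang.org/grpc"
  "google.golang.org/grpc/credentials/insecure"
)

// dialCSIProxyPipe opens a gRPC connection over a Windows named pipe,
// e.g. \\.\pipe\csi-proxy-volume-v1 (hypothetical helper, for illustration only).
func dialCSIProxyPipe(pipePath string) (*grpc.ClientConn, error) {
  dialer := func(ctx context.Context, target string) (net.Conn, error) {
    // named pipes are dialed with go-winio rather than TCP
    return winio.DialPipeContext(ctx, target)
  }
  return grpc.Dial(
    pipePath,
    grpc.WithTransportCredentials(insecure.NewCredentials()),
    grpc.WithContextDialer(dialer),
  )
}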

CSI Driver implementers can write a Windows-specific implementation of the node component of the CSI Driver. In that implementation, the CSI Driver uses the imported CSI Proxy client libraries to issue privileged storage operations. Assuming that a volume was created and attached to a node by the controller component of the CSI Driver, the kubelet makes the following CSI calls to the CSI Driver (a rough sketch of the publish/unpublish steps follows the lists).

Volume set up

  • NodeStageVolume - Create a Windows volume, format it to NTFS, and create a partition access path in the node (global mount).
  • NodePublishVolume - Create a symlink from the kubelet Pod-PVC path to the global path (pod mount).

Volume tear down

  • NodeUnpublishVolume - Remove the symlink created above.
  • NodeUnstageVolume - Remove the partition access path.

CSI Proxy is designed to be backwards compatible, and a single binary running in the Windows node can serve requests from multiple CSI Proxy client versions. We're able to do this, because the CSI Proxy binary will create named pipes on startup for all the versions available in every API Group (e.g. the Volume, Disk, Filesystem, SMB groups). In addition, there's a conversion layer in the CSI Proxy binary that transforms client version specific requests to server "version agnostic" requests, which are then processed by the CSI Proxy binary. The following diagram shows the conversion process (from the CSI Proxy development docs):

(Diagram: CSI Proxy client/server conversion process)

Understanding the conversion layer will help in the transition to HostProcess containers, as most of the code that clients use to communicate with the CSI Proxy server is generated. The conversion layer's objective is to generate Go code that maps versioned client requests to version-agnostic server requests. It does so by analyzing the generated api.pb.go files (produced by protoc from the protobuf files) for each version of the API Groups and generating multiple files for different purposes (taking the Volume API Group as an example):

  • <version>/server_generated.go
    • The gRPC server implementation of the methods of a versioned API Group. Each method receives a versioned request and expects a versioned response. The code generated follows this pattern:
func v1Foo(v1Request v1FooRequest) v1FooResponse {

  // convert versioned request to server request (version agnostic)
  fooRequest = convertV1FooRequestToFooRequest(v1Request)

  // process request (server handler)
  fooResponse = server.Foo(fooRequest)

  // convert server response (version agnostic) to versioned response
  v1Response = convertFooResponseToV1FooResponse(fooResponse)

  return v1Response
}
  • types_generated.go - Collects all the methods available across all the versions of an API Group so that the server has a corresponding implementation for each of them. The generator reads all the methods found across the volume/<version>/api.pb.go files and generates an interface, containing every method found, that the server must implement; in the example above, the server interface will have the Foo method.
  • <version>/conversion_generated.go - The generated implementation of the conversion functions shown above (e.g. convertV1FooRequestToFooRequest, convertFooResponseToV1FooResponse); a rough sketch of such a pair follows this list. In some cases, the conversion code generator produces a nested data structure that's not built correctly; there's an additional file with overrides for the functions that were generated incorrectly.
  • <API Group>/<version>/client_generated.go - Generated in the client libraries for users of the CSI Proxy client. It creates proxy methods corresponding to the api.pb.go methods of the versioned API Group. This file defines the logic to create a connection to the corresponding named pipe, build a gRPC client out of it, and store that client for later use; as a result, the proxy methods don't need a reference to the gRPC client.
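For reference, the generated conversion functions are plain field-by-field mappings between the versioned (protobuf-generated) types and the version-agnostic internal types. A hedged sketch, reusing the hypothetical Foo example from above with made-up fields:

// toy stand-ins for the versioned and the version-agnostic request/response structs
type v1FooRequest struct{ Path string }
type v1FooResponse struct{ Exists bool }
type fooRequest struct{ Path string }
type fooResponse struct{ Exists bool }

// the generated conversions copy (or convert) each field explicitly
func convertV1FooRequestToFooRequest(in v1FooRequest) fooRequest {
  return fooRequest{Path: in.Path}
}

func convertFooResponseToV1FooResponse(in fooResponse) v1FooResponse {
  return v1FooResponse{Exists: in.Exists}
}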

Prerequisite: Make CSI Proxy an embedded library without a server component

If we configure the Windows node component of a CSI Driver/Addon to be a Windows HostProcess pod, then it'll be able to use the same powershell commands that we use in the server code of CSI Proxy. The idea is to use the server code of CSI Proxy as a library in CSI Drivers/Addons. With this, we also remove the server component.

As described in the Windows HostProcess Pod guide, we'd need to configure the PodSpec of the node component of the CSI Driver/Addon that runs on Windows nodes with:

spec:
  securityContext:
    windowsOptions:
      hostProcess: true
      runAsUserName: "NT AUTHORITY\\SYSTEM"

Implementation idea 1: Update the conversion layer to use the server code gRPC

Modify the implementation of <API Group>/<version>/client_generated.go so that it calls the server implementation directly (which should be part of the imported go module). The current implementation uses w.client which is the gRPC client:

func (w *Client) GetVolumeStats(
  context context.Context,
  request *v1.GetVolumeStatsRequest,
  opts ...grpc.CallOption,
) (*v1.GetVolumeStatsResponse, error) {
        return w.client.GetVolumeStats(context, request, opts...)
}

The new implementation should use the server code instead. In the server code, volumeserver is the version-agnostic server that each versioned server (volumeservervX) wraps. E.g.,

import v1 "github.com/kubernetes-csi/csi-proxy/client/api/volume/v1"
import volumeserver "github.com/kubernetes-csi/csi-proxy/pkg/server/volume"
import volumeserverv1 "github.com/kubernetes-csi/csi-proxy/pkg/server/volume/v1"

// initialize all the versioned volume servers i.e. do what cmd/csi-proxy does but on the client
serverImpl := volumeserver.NewServer()

// shim that would need to be auto generated for every version
serverv1 := volumeserverv1.NewVersionedServer(serverImpl)

// client still calls the conversion handler code
func (w *Client) GetVolumeStats(
  context context.Context,
  request *v1.GetVolumeStatsRequest,
) (*v1.GetVolumeStatsResponse, error) {
        return serverv1.GetVolumeStats(context, request)
}

(Diagram: reusing the CSI Proxy client/server code inside a HostProcess pod)

Pros:

  • We get to reuse the protobuf code.
  • We would still support the client/server model, as this is a new method that clients would use.
  • We only need to change the client import paths to use the alternative version that doesn't connect to the server with gRPC, which minimizes the changes necessary in the client code.

Cons:

  • New APIs would need to be added to the protobuf file, and we would need to run the code generation tool again, with the rule of not modifying already released API Groups. This means that we would also need to create another API Group version for a new API.
  • We still have two distinct concepts of version: the Go module version and the API version. Given that we want to use CSI Proxy as a library, it makes sense to use the Go module version as the source of truth and implement a single API version in each Go module version.

Implementation idea 2: Update the CSI Drivers to use the server code directly (preferred)

Modify the client code to use the server API handlers directly, which in turn call the server implementation. This means the concept of an "API version" is removed from the codebase; clients instead import and use the internal server structs (the request and response objects).

Currently, the GCE PD CSI Driver uses the v1 Filesystem API Group as follows:

// note the API version in the imports
import fsapi "github.com/kubernetes-csi/csi-proxy/client/api/filesystem/v1"
import fsclient "github.com/kubernetes-csi/csi-proxy/client/groups/filesystem/v1"
func NewCSIProxyMounterV1() (*CSIProxyMounterV1, error) {
        fsClient, err := fsclient.NewClient()
        if err != nil {
                return nil, err
        }
        return &CSIProxyMounterV1{
                FsClient:     fsClient,
        }, nil
}

// ExistsPath - Checks if a path exists. Unlike util ExistsPath, this call does not perform follow link.
func (mounter *CSIProxyMounterV1) PathExists(path string) (bool, error) {
        isExistsResponse, err := mounter.FsClient.PathExists(context.Background(),
                &fsapi.PathExistsRequest{
                        Path: mount.NormalizeWindowsPath(path),
                })
        if err != nil {
                return false, err
        }
        return isExistsResponse.Exists, err
}

// usage
csiProxyV1, _ := NewCSIProxyMounterV1()
csiProxyV1.PathExists(path)

Internally, the PathExists call lives in the <API Group>/<version>/client_generated.go file described above, which performs the execution through gRPC. In this proposal, we'd use the server implementation instead:

// note that there is no version in the import
import fsserver "github.com/kubernetes-csi/csi-proxy/pkg/server/filesystem"
import fsserverimpl "github.com/kubernetes-csi/csi-proxy/pkg/server/filesystem/impl"
import fsapi "github.com/kubernetes-csi/csi-proxy/pkg/os/filesystem"

// no need to initialize a gRPC client, however the server handler impl is initialized instead
// no need for a versioned client

func NewCSIProxyMounter() (*CSIProxyMounter, error) {
  fsServer, err := fsserver.NewServer(fsapi.New())
  if err != nil {
    return nil, err
  }
  return &CSIProxyMounter{
    FsServer: fsServer,
  }, nil
}

// ExistsPath - Checks if a path exists. Unlike util ExistsPath, this call does not perform follow link.
func (mounter *CSIProxyMounter) PathExists(path string) (bool, error) {
        isExistsResponse, err := mounter.FsServer.PathExists(context.Background(),
                &fsserverimpl.PathExistsRequest{
                        Path: mount.NormalizeWindowsPath(path),
                },
                // 3rd arg is the version, remove the version here too!
        )
        if err != nil {
                return false, err
        }
        return isExistsResponse.Exists, err
}

// usage
csiProxy, _ := NewCSIProxyMounter()
csiProxy.PathExists(path)

(Diagram: CSI Proxy as a library)

Pros:

  • We remove the concept of API Version & the conversion layer and instead consider the go mod version as the API version. This is how other libraries like k8s.io/mount-utils work.
    • Version dependent server validation in the API handler layer is removed.
    • Legacy structs for older API versions are removed.
  • New APIs are easier to add. Only the server handler & impl code is modified, so there’s no need for the code generation tool anymore.

Cons:

  • The client goes through a bigger diff. Every occurrence of a call to a CSI Proxy method needs to be modified to use the server handler & impl code, but this penalty is paid only once.
    • Legacy interface implementations for the v1beta API in the CSI Drivers are removed.
  • As we no longer use protobuf to define the API and use internal structs instead, we'd need to update the API docs to be directly generated from source code (including the comments around server handler methods and internal server structs).

It is worth noting that at this point, the notion of a server is no longer valid, as CSI Proxy has become a library. We can take this opportunity to reorganize the packages by

  1. Moving /pkg/server/<API Group> and /pkg/server/<API Group>/impl to /pkg/<API Group>
  2. Moving /pkg/os/<API Group> to /pkg/<API Group>/api

The new structure looks like:

pkg
├── disk
│   ├── api
│   │   ├── api.go
│   │   └── types.go
│   ├── disk.go
│   └── types.go
├── iscsi
│   ├── api
│   │   ├── api.go
│   │   └── types.go
│   ├── iscsi.go
│   └── types.go

There are also three minor details we can take care of while we’re migrating:

  1. The two structs under pkg/shared/disk/types.go are only ever referenced by pkg/os/disk, so they can be safely added to pkg/disk/api/types.go.
  2. The FS server receives workingDirs as an input, in addition to the OS API. It's only used to restrict which directories CSI Proxy is allowed to operate on. Now that this control is part of the CSI Driver, we can safely remove it.
  3. pkg/os/filesystem is no longer necessary, as the implementation just calls out to the Go standard library os package (a rough sketch follows this list). We can deprecate it in the release notes and remove it in a future release.
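As a rough illustration of the third point, a path-existence check needs nothing beyond the standard library (a hedged sketch, not the actual csi-proxy code):

import "os"

// pathExists reports whether path exists without following symlinks,
// mirroring what pkg/os/filesystem does via the standard library.
func pathExists(path string) (bool, error) {
  _, err := os.Lstat(path)
  if err == nil {
    return true, nil
  }
  if os.IsNotExist(err) {
    return false, nil
  }
  return false, err
}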

Implementation idea 3: Convert CSI Proxy to a Library of Functions

With the new changes, CSI Proxy is effectively just a library of Go functions mapping to Windows commands. The notion of servers and clients is no longer relevant, so it makes sense to restructure the package into a library of functions, with each API Group's interfacing functions and types provided under pkg/<API Group> (right now, these files sit at pkg/server/<API Group>/server.go and pkg/server/<API Group>/impl/types.go). The OS-facing API at /pkg/os is kept as is, and the corresponding OS API struct is initialized globally inside each pkg/<API Group> (to allow substituting it during testing). All other code can be safely deleted.

// there is now only one single import
import fs "github.com/kubernetes-csi/csi-proxy/pkg/fs"

// there is no longer a need to initialize a server
func NewCSIProxyMounter() *CSIProxyMounter {
  return &CSIProxyMounter{
  }
}

// ExistsPath - Checks if a path exists. Unlike util ExistsPath, this call does not perform follow link.
func (*CSIProxyMounter) PathExists(path string) (bool, error) {
        // both mounter.FsServer and fsserverimpl are changed to just fs
        isExistsResponse, err := fs.PathExists(context.Background(),
                &fs.PathExistsRequest{
                        Path: mount.NormalizeWindowsPath(path),
                },
        )
        if err != nil {
                return false, err
        }
        return isExistsResponse.Exists, err
}

// usage
csiProxy := NewCSIProxyMounter()
csiProxy.PathExists(path)

// at test time
fs.UseAPI(mockAPI)
// run tests…
fs.ResetAPI()

This is the most invasive option of all three. Specifically, we combine the two imports into one and move to a pure function paradigm. However, the method implementation sees very minimal changes, requiring only import path updates.

Pros:

  • Like implementation idea 2, we switch to a single notion of version via Go modules.
  • The pure function paradigm more accurately reflects the nature of the new design, which simplifies how clients use the library.
  • Like implementation idea 2, new APIs are easier to add by moving away from code generation.

Cons:

  • There is now an implicit dependency on the OS API package-level variable. Testing can still be done by substituting the variable with a mock implementation at test time (see the sketch after this list).
  • More work (2 imports -> 1, remove server initialization, replace function call and request type package names) needs to be done by clients to adapt to the new change, though it’s not that much more than implementation idea 2. Again, the price is only paid once.
  • Like impl idea 2, we also need to transition our API doc generation to generate from Go source.
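A hedged sketch of how the package-level OS API variable mentioned above could look inside a pkg/fs package; the interface, struct, and method set are made up for illustration:

import "os"

// API is the OS-facing surface of the package; tests substitute it with a mock.
type API interface {
  PathExists(path string) (bool, error)
}

// osAPI is the default implementation backed by OS calls.
type osAPI struct{}

func (osAPI) PathExists(path string) (bool, error) {
  _, err := os.Lstat(path)
  if os.IsNotExist(err) {
    return false, nil
  }
  return err == nil, err
}

// package-level variable holding the active implementation
var api API = osAPI{}

// UseAPI swaps in a mock implementation at test time.
func UseAPI(a API) { api = a }

// ResetAPI restores the default OS-backed implementation.
func ResetAPI() { api = osAPI{} }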

Comparison Matrix

| | Idea 1: Update the conversion layer to use the server code gRPC | Idea 2: Update the CSI Drivers to use the server code directly (preferred) | Idea 3: Convert CSI Proxy to a Library of Functions |
| --- | --- | --- | --- |
| Adoption cost | Minimal (only changing imports) | Considerable (imports and API calls) | Considerable (imports, API calls, and initialization) |
| Future development | Still needs code generation and protobuf | Directly add methods to Go code, but keeps the legacy notion of a "server" | Directly add functions to Go code; codebase cleaned up |
| Versioning | Both Go mod version and API version are maintained | Go mod version only | Go mod version only |
| Testing | Current tests should still work | Current tests should still work | OS API mocking needs to be subbed in, as there is an implicit dependency |
| Support for legacy client/server model | Still supported | Not supported | Not supported |

Maintenance of the new model and existing client/server model of CSI Proxy

The library-development branch will be used for the development of this model, while master keeps the existing client/server model. We plan to create alpha tags on the library-development branch and use them in CSI Drivers. Once integrated, we will create a v2 tag and make library-development the new default: master will point to the new implementation, whereas the legacy code will be maintained on the v1.x branch.

v1.x will still be open for urgent bug fixes but new features should be developed in the v2 codebase.

Security analysis

  • Install the Pod Security Admissions controller and use Pod Security Standards

    • Embrace the least privilege principle, quoting Enforcing Pod Security Standards | Kubernetes
      • Namespaces that lack any configuration at all should be considered significant gaps in your cluster security model. We recommend taking the time to analyze the types of workloads occurring in each namespace, and by referencing the Pod Security Standards, decide on an appropriate level for each of them. Unlabeled namespaces should only indicate that they've yet to be evaluated.
      • Namespaces allowing privileged workloads should establish and enforce appropriate access controls.
      • For workloads running in those permissive namespaces, maintain documentation about their unique security requirements. If at all possible, consider how those requirements could be further constrained.
    • In namespaces without privileged workloads:
    kubectl label --overwrite ns my-existing-namespace \
      pod-security.kubernetes.io/enforce=restricted \
      pod-security.kubernetes.io/enforce-version=v1.25
    
    • Both the baseline and restricted Pod Security Standards disallow the creation of HostProcess pods (docs).
  • Create a Windows user with limited permissions to create files under the kubelet-controlled path C:\var\lib\kubelet

Test Plan

Unit tests

For CSI Proxy we already have unit tests inside pkg/<API Group>. These tests are run on presubmit for every PR.

Examples:

Integration tests

For CSI Proxy, we already have integration tests inside integrationtests. These tests are run on presubmit for every PR.

Examples:

e2e tests

OSS storage e2e tests run out of tree. We plan to migrate at least one CSI Driver to use the CSI Proxy library and verify that the existing e2e tests pass.

Graduation Criteria

Most of the code used by CSI Drivers through CSI Proxy is already GA. This KEP defines a new mechanism to run the same code that the CSI Driver currently executes through CSI Proxy directly inside the CSI Driver.

Upgrade / Downgrade Strategy

The following is a list of items that need to happen in different components of CSI in Windows for CSI Drivers to become HostProcess containers:

CSI Proxy

  • Start a development branch for the upcoming work (library-development).
  • Refactor the filesystem, disk, volume, system, iSCSI, SMB API Groups out of the current client/server.
  • Remove the client/server code from the codebase.
  • Update the unit and integration tests to work with the refactored code.
  • Run the integration tests in a HostProcess container.
  • Update the README and DEVELOPMENT docs.
  • Once the above items are completed, we can create an alpha tag in the library-development branch to import in CSI Drivers.

CSI Driver

  • Update the CSI Proxy library to the alpha v2 tag from the library-development branch.
  • Update the codebase import to use the server implementation directly instead of the client library.
  • Update the CSI Driver deployment manifest with the HostProcess container fields in the PodSpec.
  • Run the e2e tests.

Version Skew Strategy

Previously, CSI Proxy had a different release cycle than the CSI Driver: each binary had its own version and supported multiple CSI Proxy client versions. Once CSI Proxy becomes a library, the version will be managed by the Go module version instead (similar to kubernetes/mount-utils).

Production Readiness Review Questionnaire

Feature Enablement and Rollback

How can this feature be enabled / disabled in a live cluster?
  • Feature gate (also fill in values in kep.yaml)
    • Feature gate name:
    • Components depending on the feature gate:
  • Other
    • Describe the mechanism:
    • Will enabling / disabling the feature require downtime of the control plane?
    • Will enabling / disabling the feature require downtime or reprovisioning of a node? (Do not assume Dynamic Kubelet Config feature is enabled).
Does enabling the feature change any default behavior?
Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
What happens if we reenable the feature if it was previously rolled back?
Are there any tests for feature enablement/disablement?

Rollout, Upgrade and Rollback Planning

How can a rollout or rollback fail? Can it impact already running workloads?
What specific metrics should inform a rollback?
Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

Monitoring Requirements

How can an operator determine if the feature is in use by workloads?
How can someone using this feature know that it is working for their instance?
  • Events
    • Event Reason:
  • API .status
    • Condition name:
    • Other field:
  • Other (treat as last resort)
    • Details:
What are the reasonable SLOs (Service Level Objectives) for the enhancement?
What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
  • Metrics
    • Metric name:
    • [Optional] Aggregation method:
    • Components exposing the metric:
  • Other (treat as last resort)
    • Details:
Are there any missing metrics that would be useful to have to improve observability of this feature?

Dependencies

Does this feature depend on any specific services running in the cluster?

Scalability

Will enabling / using this feature result in any new API calls?
Will enabling / using this feature result in introducing new API types?
Will enabling / using this feature result in any new calls to the cloud provider?
Will enabling / using this feature result in increasing size or count of the existing API objects?
Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?

Troubleshooting

How does this feature react if the API server and/or etcd is unavailable?
What are other known failure modes?
What steps should be taken if SLOs are not being met to determine the problem?

Implementation History

Drawbacks

Alternatives

Infrastructure Needed (Optional)