[0025] Add Administrative Tasks and Failure Modes
Signed-off-by: Micah Hausler <[email protected]>
1 parent fa36479, commit e059283
Showing 1 changed file with 82 additions and 26 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,41 +1,53 @@ | ||
--- | ||
id: 0025 | ||
title: Tinkerbell on Kubernetes | ||
title: Support for the Kubernetes Resource Model | ||
status: ideation | ||
authors: Micah Hausler <[email protected]> | ||
--- | ||
|
||
## Summary | ||
|
||
This is a proposal to architect Tinkerbell as a Kubernetes native application. It is a rearchitecture of the 'control plane' backend, and leaves the 'data plane' components of Tink workers and actions unchanged. | ||
This is a proposal to architect Tinkerbell as an application using the [Kubernetes Resource Model][krm]. | ||
It is a rearchitecture of the 'control plane' backend, and leaves the 'data plane' components of Tink workers and actions unchanged. | ||
|
||
[krm]: https://github.com/kubernetes/community/blob/master/contributors/design-proposals/architecture/resource-management.md | ||
|
||
## Goals and not Goals | ||
|
||
Goals: | ||
* Compatibility with existing Tink workers, workflow execution, and actions. | ||
* More easily support non-request serving controllers in Tinkerbell. | ||
* **Compatibility with existing Tink workers, workflow execution, and actions.** | ||
* **More easily support non-request serving controllers in Tinkerbell.** | ||
In this architecture, controllers like PBnJ could leverage Kubernetes primitives like [Custom Resource Definitions][crds] (CRDs), [WATCH APIs][watch], and [Field Management][fm] to complete workflow steps. | ||
* Migrate existing components of Hegel, Boots, and Tinkerbell API to use Kubernetes as the datastore | ||
* Reduce the security surface of the Tinkerbell API. | ||
* **Reduce the security surface of the Tinkerbell API.** | ||
Implementing multiple authorization modes is a non-trivial task. | ||
The fewer APIs, authorization options, and lines of code that exist, the fewer opportunities there are for security issues to arise. | ||
Tinkerbell is a high-value component of data center infrastructure, so protection of DHCP infrastructure and BMC/IPMI management needs to be treated as such. | ||
* Support a highly-available architecture. | ||
* **Support a highly-available architecture.** | ||
Postgres is a fantastic database, but managing high-availability with graceful failover is not trivial. | ||
Using an alternative data store that better tolerates failure would help operators achieve higher availability without requiring downtime for upgrades or failover. | ||
* **Migrate existing components**. | ||
Hegel, Boots, and Tinkerbell API will be modified to use Kubernetes as the datastore. | ||
* **Support migration of existing installations.** | ||
Migration tooling and documentation will be provided for existing Workflows, Hardware, and Templates stored in Tinkerbell. | ||
Migration should be as straightforward as: | ||
* Creating Tinkerbell CRDs in the Kubernetes API | ||
* Running a provided migration command | ||
* Restarting each Tinkerbell component targeting Kubernetes (ex: `--k8s-mode=true`) | ||
|
||
Non Goals: | ||
* Implement attribute-based authorization in Tink API. | ||
This is intentionally descoped from this proposal, but could be implemented in a separate proposal. | ||
* Require Tinkerbell to be operated as pods inside Kubernetes. | ||
* **Implement attribute-based authorization in Tink API.** | ||
This is intentionally descoped from this proposal and can be implemented in a separate proposal. | ||
* **Require Tinkerbell to be operated as pods inside Kubernetes.** | ||
The Kubernetes API would become a dependency of Tinkerbell, but that API could exist in a cloud provider or on-premise. | ||
* Require the use of Cluster API ([CAPI][capi]). | ||
* **Require the use of Cluster API ([CAPI][capi]).** | ||
The [Tinkerbell CAPI provider][capt] (CAPT) is mentioned in this proposal only for reference of a known Tinkerbell client. | ||
Implementation of this proposal will necessitate changes in CAPT, but that is not the core motivation for this proposal. | ||
* Implementation of PBnJ as a Kubernetes controller | ||
* Make the Tink worker a client of Kubernetes. | ||
* **Implementation of PBnJ as a Kubernetes controller** | ||
* **Make the Tink worker a client of Kubernetes.** | ||
Kubernetes doesn't natively support robust attribute-based identities for non-node identities. | ||
It does have the [Node Authorizer][node-authorizer], but that is specific to authorization of Kubelet communication to the Kubernetes API. | ||
* **Provide a zero-downtime migration experience.** | ||
Migration tooling and documentation will be provided, but migrating in-progress workflows is out of scope. | ||
|
||
[crds]: https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/ | ||
[watch]: https://kubernetes.io/docs/reference/using-api/api-concepts/#efficient-detection-of-changes | ||
|
@@ -83,13 +95,60 @@ By using the Kubernetes API as the datastore, the motivations are addressed in t | |
* All Kubernetes clients (Boots, Hegel, PBnJ, etc) can have least-permission RBAC policies to limit the permissions for each respective client. | ||
* High-availability Kubernetes and etcd can be delegated to a cloud provider or an on-premise Kubernetes cluster | ||
|
||
### Tradeoffs | ||
### Administrative Tasks | ||
|
||
**Migration from Postgres** | ||
|
||
In order to support an easier migration from Postgres, we will add a `--kubernetes-mode` or equivalent flag to the Tinkerbell components and for some period of time support two alternate modes: Tinkerbell as it exists today, and the Kubernetes Resource Model. | ||
|
||
TODO: How long to support both modes? | ||
|
||
A zero-downtime migration is not in scope for this change mainly because it would add significant complexity and require mirroring data between Postgres and Kubernetes. | ||
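
The dual-mode period described above could be wired up roughly as follows. This is only a minimal sketch, assuming a boolean `--kubernetes-mode` flag as named in the proposal; the component name `tinkerbell-component`, the `selectBackend` helper, and the backend labels are hypothetical and not part of any existing Tinkerbell code.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// Backend labels are hypothetical names used only in this sketch.
const (
	backendPostgres   = "postgres"
	backendKubernetes = "kubernetes"
)

// selectBackend parses a component's command-line arguments and reports
// which datastore the component should use. Postgres stays the default so
// that existing installations keep working until they opt in.
func selectBackend(args []string) (string, error) {
	// "tinkerbell-component" is a placeholder name, not a real binary.
	fs := flag.NewFlagSet("tinkerbell-component", flag.ContinueOnError)
	k8sMode := fs.Bool("kubernetes-mode", false,
		"store state in the Kubernetes API instead of Postgres")
	if err := fs.Parse(args); err != nil {
		return "", err
	}
	if *k8sMode {
		return backendKubernetes, nil
	}
	return backendPostgres, nil
}

func main() {
	backend, err := selectBackend(os.Args[1:])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(2)
	}
	fmt.Println("datastore backend:", backend)
}
```

Keeping Postgres as the default means the flag is purely opt-in during the transition, which matches the intent of supporting both modes for some period of time.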
|
||
**Bootstrapping** | ||
|
||
One of the stated design goals is to not require running the Tinkerbell control plane as pods in the Kubernetes cluster. | ||
Installation of Tinkerbell using Kubernetes as a datastore will differ in the following ways: | ||
* A Kubernetes cluster will need to exist | ||
* A Kubernetes [validating admission webhook][validating-webhook] for CRD types will need to be hosted and reachable by the Kubernetes API server. | ||
This component is most easily operated as a pod in a cluster, but could be operated and hosted externally. | ||
It does not require access to the on-premise Tinkerbell control plane | ||
* Tinkerbell API, Hegel, and Boots will need network connectivity and credentials to the Kubernetes API. | ||
As processes existing outside of the Kubernetes cluster, this will most likely be [x509 client certificates][x509-certs] or exported [Kubernetes Service Account tokens][sa-tokens]. | ||
* Hardware and Templates will be seeded using `kubectl` instead of `tink-cli` | ||
|
||
Should operators want to run Tinkerbell in an on-premise Kubernetes cluster and use that cluster as the data store, there arises a need to bootstrap the Tinkerbell control plane. | ||
In order to support this mode, we will need to support migration from either a cloud Kubernetes cluster or a temporary bootstrap cluster created with [`kind`][kind] to a long-lived on-premise management cluster. | ||
|
||
[validating-webhook]: https://kubernetes.io/docs/reference/access-authn-authz/admission-controllers/#validatingadmissionwebhook | ||
[x509-certs]: https://kubernetes.io/docs/reference/access-authn-authz/authentication/#x509-client-certs | ||
[sa-tokens]: https://kubernetes.io/docs/reference/access-authn-authz/authentication/#service-account-tokens | ||
[kind]: https://kind.sigs.k8s.io/ | ||
|
||
TODO | ||
**Migrating between Kubernetes clusters, Backups, and Restoration** | ||
|
||
In order to support the previously mentioned bootstrap cluster model, migration from one cluster to another will need to be supported. | ||
Backups of Hardware, Templates, and Workflows can be made using [`kubectl`][kubectl], but simple restoration is not possible as the [`/status` subresource][status-subresource] on CRDs makes the `.status` field a separate API call. | ||
|
||
Concerns around data backup and failover were previously delegated to Postgres administration and now become a Kubernetes administration issue. | ||
|
||
The `tink-cli` will need to implement backup and restore functionality to support both these administrative concerns. | ||
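
To illustrate why restoration takes two API calls: when the `/status` subresource is enabled, writes to the main resource endpoint ignore `.status`, and writes to `/status` ignore everything except `.status`. A restore therefore has to send one payload to each endpoint. The following is a minimal sketch of splitting a backed-up object into those two payloads. The `splitForRestore` helper and the sample object (its `apiVersion` and state value) are hypothetical, and the sketch deliberately ignores server-populated metadata such as `resourceVersion`, which a real restore would also need to handle.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// splitForRestore takes one backed-up object as JSON and returns two payloads:
//   - create: the object without .status, for the main resource endpoint
//   - status: the full object including .status, for the /status endpoint
//     (nil when the backup carries no status)
func splitForRestore(backup []byte) (create, status []byte, err error) {
	var obj map[string]json.RawMessage
	if err = json.Unmarshal(backup, &obj); err != nil {
		return nil, nil, err
	}
	st, hasStatus := obj["status"]
	delete(obj, "status")
	if create, err = json.Marshal(obj); err != nil {
		return nil, nil, err
	}
	if !hasStatus {
		return create, nil, nil
	}
	obj["status"] = st
	status, err = json.Marshal(obj)
	return create, status, err
}

func main() {
	// Hypothetical backed-up Workflow; field values are illustrative only.
	backup := []byte(`{
		"apiVersion": "tinkerbell.org/v1alpha1",
		"kind": "Workflow",
		"metadata": {"name": "example"},
		"spec": {},
		"status": {"state": "example-state"}
	}`)
	create, status, err := splitForRestore(backup)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("create payload: %s\n", create)
	fmt.Printf("status payload: %s\n", status)
}
```

A restore command would send the first payload when creating the object and then send the second payload to the object's `/status` endpoint, which is the extra step that a plain `kubectl`-based restore does not perform.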
|
||
[kubectl]: https://kubernetes.io/docs/reference/kubectl/overview/ | ||
[status-subresource]: https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definitions/#status-subresource | ||
|
||
### Failure modes | ||
|
||
**Kubernetes unavailability** | ||
|
||
This change introduces a dependency on Kubernetes being available. | ||
Concerns around data backup and failover were previously delegated to Postgres administration and now become a Kubernetes administration issue. | ||
As mentioned previously, backup and restoration commands will be added to restore from Kubernetes API unavailability and could help support switching to a different underlying Kubernetes cluster should the primary cluster become unavailable. | ||
|
||
### User Experience | ||
|
||
In order to use Tinkerbell, clients would interact with the Kubernetes API. In order to provision a machine, the steps would be: | ||
In order to use Tinkerbell, clients would interact with the Kubernetes API. | ||
In order to provision a machine, the steps would be: | ||
|
||
1. User creates a Hardware CRD object in Kubernetes. | ||
This is analogous to the current `tink hardware push < hardware.json` command. | ||
|
@@ -127,7 +186,8 @@ service WorkflowService { | |
} | ||
``` | ||
|
||
The structure of the Kubernetes CRD types will start out with those defined in the [Tinkerbell CAPI Provider][capt-types]. The code defining those types would be migrated into the tinkerbell/tink GitHub repository so other clients could import the Go packages. | ||
The structure of the Kubernetes CRD types will start out similar to those defined in the [Tinkerbell CAPI Provider][capt-types]. | ||
The code defining those types will be migrated into the tinkerbell/tink GitHub repository so other clients can import the Go packages. | ||
|
||
[capt-types]: https://github.com/tinkerbell/cluster-api-provider-tink/tree/main/tink/api/v1alpha1 | ||
|
||
|
@@ -158,7 +218,8 @@ RBAC policies would govern the level of access to the Kubernetes API. | |
PBnJ will require: | ||
* Read-only access to Hardware CRDs to discover management interface connectivity | ||
* Mutating access to Workflow CRDs to execute workflow steps like power cycling and BIOS management | ||
* Some level of secret access to connect to management interfaces. Implementation and design of that access is out of scope of this proposal. | ||
* Some level of secret access to connect to management interfaces. | ||
Implementation and design of that access is out of scope of this proposal. | ||
|
||
**Tink API** | ||
|
||
|
@@ -169,7 +230,8 @@ The Tinkerbell API will require: | |
|
||
**Tinkerbell Workflow Controller** | ||
|
||
There will need to be a controller to process creation of workflows. It will need: | ||
There will need to be a controller to process creation of workflows. | ||
It will need: | ||
* Read access to Hardware CRDs in Kubernetes | ||
* Read access to Template CRDs in Kubernetes | ||
* Write access to Workflow CRDs `.spec` in Kubernetes | ||
|
@@ -186,13 +248,7 @@ Kubernetes cluster administrators can define custom levels of access with RBAC p | |
|
||
## Alternatives | ||
|
||
There are two primary alternatives to achieve some of the stated design goals: | ||
|
||
Add a `--kubernetes-mode` flag to the Tinkerbell components and support two alternate modes: Tinkerbell as it exists today, and Tinkerbell on Kubernetes. | ||
This would allow two alternate deployment/database models for operators who do not want to introduce Kubernetes to their deployment model. | ||
The biggest disadvantage of this alternative would be added complexity in supporting two separate paths of the Tinkerbell codebase. | ||
|
||
Leave the Tinkerbell API alone, and leverage/modify the Tinkerbell API's [internal Database interface][tink-db-iface] to function on top of Kubernetes or a key-value datastore like [etcd][etcd]. | ||
The primary alternative to achieve some of the stated design goals would leave the Tinkerbell API alone, and leverage/modify the Tinkerbell API's [internal Database interface][tink-db-iface] to function on top of Kubernetes or a key-value datastore like [etcd][etcd]. | ||
This alternative would help with high-availability deployments, but all the other motivations would remain unaddressed, and need to be implemented in Tinkerbell's API. | ||
|
||
[tink-db-iface]: https://github.com/tinkerbell/tink/blob/0f46dc0/db/db.go#L21-L60 | ||
|