[0025] Add Administrative Tasks and Failure Modes
Signed-off-by: Micah Hausler <[email protected]>
micahhausler committed Aug 30, 2021
1 parent fa36479 commit e059283
Showing 1 changed file with 82 additions and 26 deletions.
108 changes: 82 additions & 26 deletions proposals/0025/README.md
@@ -1,41 +1,53 @@
---
id: 0025
title: Tinkerbell on Kubernetes
title: Support for the Kubernetes Resource Model
status: ideation
authors: Micah Hausler <[email protected]>
---

## Summary

This is a proposal to architect Tinkerbell as a Kubernetes native application. It is a rearchitecture of the 'control plane' backend, and leaves the 'data plane' components of Tink workers and actions unchanged.
This is a proposal to architect Tinkerbell as an application using the [Kubernetes Resource Model][krm].
It is a rearchitecture of the 'control plane' backend, and leaves the 'data plane' components of Tink workers and actions unchanged.

[krm]: https://github.com/kubernetes/community/blob/master/contributors/design-proposals/architecture/resource-management.md

## Goals and not Goals

Goals:
* Compatibility with existing Tink workers, workflow execution, and actions.
* More easily support non-request serving controllers in Tinkerbell.
* **Compatibility with existing Tink workers, workflow execution, and actions.**
* **More easily support non-request serving controllers in Tinkerbell.**
In this architecture, controllers like PBnJ could leverage Kubernetes primitives like [Custom Resource Definitions][crds] (CRDs), [WATCH APIs][watch], and [Field Management][fm] to complete workflow steps.
* Migrate existing components of Hegel, Boots, and Tinkerbell API to use Kubernetes as the datastore
* Reduce the security surface of the Tinkerbell API.
* **Reduce the security surface of the Tinkerbell API.**
Implementing multiple authorization modes is a non-trivial task.
The fewer APIs, authorization options, and lines of code that exist, the fewer opportunities there are for security issues to arise.
Tinkerbell is a high-value component of data center infrastructure, so protection of DHCP infrastructure and BMC/IPMI management needs to be treated as such.
* Support a highly-available architecture.
* **Support a highly-available architecture.**
Postgres is a fantastic database, but managing high-availability with graceful failover is not trivial.
Using an alternative data store that better supports failure would better help operators to have higher availability and not require downtime for upgrades or failover.
* **Migrate existing components**.
Hegel, Boots, and Tinkerbell API will be modified to use Kubernetes as the datastore.
* **Support migration of existing installations.**
Migration tooling and documentation will be provided for existing Workflows, Hardware, and Templates stored in Tinkerbell.
Migration should be as straightforward as:
* Creating Tinkerbell CRDs in the Kubernetes API
* Running a provided migration command
* Restarting each Tinkerbell component targeting Kubernetes (ex: `--k8s-mode=true`)

Non Goals:
* Implement attribute-based authorization in Tink API.
This is intentionally descoped from this proposal, but could be implemented in a separate proposal.
* Require Tinkerbell to be operated as pods inside Kubernetes.
* **Implement attribute-based authorization in Tink API.**
This is intentionally descoped from this proposal and can be implemented in a separate proposal.
* **Require Tinkerbell to be operated as pods inside Kubernetes.**
The Kubernetes API would become a dependency of Tinkerbell, but that API could exist in a cloud provider or on-premise.
* Require the use of Cluster API ([CAPI][capi]).
* **Require the use of Cluster API ([CAPI][capi]).**
The [Tinkerbell CAPI provider][capt] (CAPT) is mentioned in this proposal only for reference of a known Tinkerbell client.
Implementation of this proposal will necessitate changes in CAPT, but that is not the core motivation for this proposal.
* Implementation of PBnJ as a Kubernetes controller
* Make the Tink worker a client of Kubernetes.
* **Implementation of PBnJ as a Kubernetes controller**
* **Make the Tink worker a client of Kubernetes.**
Kubernetes doesn't natively support robust attribute-based identities for non-node identities.
It does have the [Node Authorizer][node-authorizer], but that is specific to authorization of Kubelet communication to the Kubernetes API.
* **Provide a zero-downtime migration experience.**
Migration tooling and documentation will be provided, but migrating in-progress workflows is out of scope.

[crds]: https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/
[watch]: https://kubernetes.io/docs/reference/using-api/api-concepts/#efficient-detection-of-changes
@@ -83,13 +95,60 @@ By using the Kubernetes API as the datastore, the motivations are addressed in t
* All Kubernetes clients (Boots, Hegel, PBnJ, etc) can have least-permission RBAC policies to limit the permissions for each respective client.
* High-availability Kubernetes and etcd can be delegated to a cloud provider or an on-premise Kubernetes cluster

### Tradeoffs
### Administrative Tasks

**Migration from Postgres**

In order to support an easier migration from Postgres, we will add a `--kubernetes-mode` or equivalent flag to the Tinkerbell components and for some period of time support two alternate modes: Tinkerbell as it exists today, and the Kubernetes Resource Model.
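
The dual-mode period described above could be wired up roughly as follows. This is a minimal sketch, not the proposed implementation: the flag name follows the `--kubernetes-mode` mentioned in this section (the final name may differ), and the `Backend` type and `parseBackend` function are hypothetical names introduced only for illustration.

```go
package main

import (
	"errors"
	"flag"
)

// Backend names the datastore a Tinkerbell component talks to.
// Hypothetical type, introduced for this sketch only.
type Backend string

const (
	BackendPostgres   Backend = "postgres"   // Tinkerbell as it exists today
	BackendKubernetes Backend = "kubernetes" // Kubernetes Resource Model
)

// parseBackend parses a component's command-line arguments and reports
// which datastore mode was requested. Postgres stays the default so that
// existing installations keep working until operators opt in.
func parseBackend(args []string) (Backend, error) {
	fs := flag.NewFlagSet("tinkerbell-component", flag.ContinueOnError)
	// Assumed flag name; the proposal says "`--kubernetes-mode` or equivalent".
	k8s := fs.Bool("kubernetes-mode", false, "store data in the Kubernetes API instead of Postgres")
	if err := fs.Parse(args); err != nil {
		return "", err
	}
	if fs.NArg() > 0 {
		return "", errors.New("unexpected positional arguments")
	}
	if *k8s {
		return BackendKubernetes, nil
	}
	return BackendPostgres, nil
}
```

A component's entry point would call `parseBackend(os.Args[1:])` once at startup and construct the matching datastore client, which keeps the two code paths separated behind a single decision point.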

TODO: How long to support both modes?

A zero-downtime migration is not in scope for this change mainly because it would add significant complexity and require mirroring data between Postgres and Kubernetes.

**Bootstrapping**

One of the stated design goals is to not require running the Tinkerbell control plane as pods in the Kubernetes cluster.
Installation of Tinkerbell using Kubernetes as a datastore will differ in the following ways:
* A Kubernetes cluster will need to exist
* A Kubernetes [validating admission webhook][validating-webhook] for CRD types will need to be hosted and reachable by the Kubernetes API server.
This component is most easily operated as a pod in a cluster, but could be operated and hosted externally.
It does not require access to the on-premise Tinkerbell control plane.
* Tinkerbell API, Hegel, and Boots will need network connectivity and credentials to the Kubernetes API.
As processes existing outside of the Kubernetes cluster, this will most likely be [x509 client certificates][x509-certs] or exported [Kubernetes Service Account tokens][sa-tokens].
* Hardware and Templates will be seeded using `kubectl` instead of `tink-cli`
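
To make the two credential options for out-of-cluster components concrete, the following sketch builds an HTTP client from either an x509 client certificate or a bearer token. It uses only the Go standard library and is an illustration under assumptions: a real implementation would more likely load a kubeconfig through the Kubernetes Go client, and the function names and file-path parameters here are hypothetical.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"fmt"
	"net/http"
	"os"
)

// bearerTransport attaches a service account token to every request.
type bearerTransport struct {
	token string
	next  http.RoundTripper
}

func (t *bearerTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	// A RoundTripper must not modify the caller's request, so work on a clone.
	r := req.Clone(req.Context())
	r.Header.Set("Authorization", "Bearer "+t.token)
	return t.next.RoundTrip(r)
}

// newTokenClient returns a client that authenticates with an exported
// service account token. A nil base falls back to http.DefaultTransport.
func newTokenClient(token string, base http.RoundTripper) *http.Client {
	if base == nil {
		base = http.DefaultTransport
	}
	return &http.Client{Transport: &bearerTransport{token: token, next: base}}
}

// newCertClient returns a client that authenticates with an x509 client
// certificate and trusts only the given CA bundle (the cluster CA).
func newCertClient(caFile, certFile, keyFile string) (*http.Client, error) {
	caPEM, err := os.ReadFile(caFile)
	if err != nil {
		return nil, fmt.Errorf("read CA bundle: %w", err)
	}
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(caPEM) {
		return nil, fmt.Errorf("no CA certificates found in %s", caFile)
	}
	cert, err := tls.LoadX509KeyPair(certFile, keyFile)
	if err != nil {
		return nil, fmt.Errorf("load client key pair: %w", err)
	}
	tr := &http.Transport{TLSClientConfig: &tls.Config{
		RootCAs:      pool,
		Certificates: []tls.Certificate{cert},
		MinVersion:   tls.VersionTLS12,
	}}
	return &http.Client{Transport: tr}, nil
}
```

Either client can then be pointed at the Kubernetes API server address. Which option fits better depends on how the operator issues and rotates credentials for processes running outside the cluster.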

Should operators want to run Tinkerbell in an on-premise Kubernetes cluster and use that cluster as the data store, there arises a need to bootstrap the Tinkerbell control plane.
In order to support this mode, we will need to support migration from either a cloud Kubernetes cluster or a temporary bootstrap cluster created with [`kind`][kind] to a long-lived on-premise management cluster.

[validating-webhook]: https://kubernetes.io/docs/reference/access-authn-authz/admission-controllers/#validatingadmissionwebhook
[x509-certs]: https://kubernetes.io/docs/reference/access-authn-authz/authentication/#x509-client-certs
[sa-tokens]: https://kubernetes.io/docs/reference/access-authn-authz/authentication/#service-account-tokens
[kind]: https://kind.sigs.k8s.io/

TODO
**Migrating between Kubernetes clusters, Backups, and Restoration**

In order to support the previously mentioned bootstrap cluster model, migration from one cluster to another will need to be supported.
Backups of Hardware, Templates, and Workflows can be made using [`kubectl`][kubectl], but simple restoration is not possible as the [`/status` subresource][status-subresource] on CRDs makes the `.status` field a separate API call.

Concerns around data backup and failover were previously delegated to Postgres administration and now become a Kubernetes administration issue.

The `tink-cli` will need to implement backup and restore functionality to support both these administrative concerns.
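
The restoration problem can be sketched as follows: because `.status` is written through a separate `/status` call, a restore tool has to split each backed-up object into a create payload and a status payload. The function below is a hypothetical illustration, not the planned `tink-cli` behavior. It also drops metadata fields that the API server populates itself, since a stale `resourceVersion` must not be sent on create.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// serverPopulated lists metadata fields that the API server sets and that
// should not be carried over from a backup into a create request.
var serverPopulated = []string{"resourceVersion", "uid", "creationTimestamp", "generation", "managedFields"}

// splitBackup takes one backed-up object as JSON and returns:
//   - the body to send when creating the object (no .status, cleaned metadata)
//   - the saved .status, to be written afterwards through the /status
//     subresource (nil if the backup had no status)
func splitBackup(raw []byte) ([]byte, json.RawMessage, error) {
	var obj map[string]json.RawMessage
	if err := json.Unmarshal(raw, &obj); err != nil {
		return nil, nil, fmt.Errorf("decode object: %w", err)
	}
	status := obj["status"]
	delete(obj, "status")

	metaRaw, ok := obj["metadata"]
	if !ok {
		return nil, nil, errors.New("object has no metadata")
	}
	var meta map[string]json.RawMessage
	if err := json.Unmarshal(metaRaw, &meta); err != nil {
		return nil, nil, fmt.Errorf("decode metadata: %w", err)
	}
	for _, field := range serverPopulated {
		delete(meta, field)
	}
	cleaned, err := json.Marshal(meta)
	if err != nil {
		return nil, nil, fmt.Errorf("encode metadata: %w", err)
	}
	obj["metadata"] = cleaned

	body, err := json.Marshal(obj)
	if err != nil {
		return nil, nil, fmt.Errorf("encode object: %w", err)
	}
	return body, status, nil
}
```

A restore would create the object from the first return value, read back the new `resourceVersion`, and then update the `/status` subresource with the saved status. That second step is why a plain re-apply of a backup does not restore Workflow state.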

[kubectl]: https://kubernetes.io/docs/reference/kubectl/overview/
[status-subresource]: https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definitions/#status-subresource

### Failure modes

**Kubernetes unavailability**

This change introduces a dependency on Kubernetes being available.
Concerns around data backup and failover were previously delegated to Postgres administration and now become a Kubernetes administration issue.
As mentioned previously, backup and restoration commands will be added to help recover from Kubernetes API unavailability, and could help support switching to a different underlying Kubernetes cluster should the primary cluster become unavailable.

### User Experience

In order to use Tinkerbell, clients would interact with the Kubernetes API. In order to provision a machine, the steps would be:
In order to use Tinkerbell, clients would interact with the Kubernetes API.
In order to provision a machine, the steps would be:

1. User creates a Hardware CRD object in Kubernetes.
This is analogous to the current `tink hardware push < hardware.json` command.
@@ -127,7 +186,8 @@ service WorkflowService {
}
```

The structure of the Kubernetes CRD types will start out with those defined in the [Tinkerbell CAPI Provider][capt-types]. The code defining those types would be migrated into the tinkerbell/tink GitHub repository so other clients could import the Go packages.
The structure of the Kubernetes CRD types will start out similar to those defined in the [Tinkerbell CAPI Provider][capt-types].
The code defining those types will be migrated into the tinkerbell/tink GitHub repository so other clients could import the Go packages.

[capt-types]: https://github.com/tinkerbell/cluster-api-provider-tink/tree/main/tink/api/v1alpha1

@@ -158,7 +218,8 @@ RBAC policies would govern the level of access to the Kubernetes API.
PBnJ will require:
* Read-only access to Hardware CRDs to discover management interface connectivity
* Mutating access to Workflow CRDs to execute workflow steps like power cycling and BIOS management
* Some level of secret access to connect to management interfaces. Implementation and design of that access is out of scope of this proposal.
* Some level of secret access to connect to management interfaces.
Implementation and design of that access is out of scope of this proposal.

**Tink API**

@@ -169,7 +230,8 @@ The Tinkerbell API will require:

**Tinkerbell Workflow Controller**

There will need to be a controller to process creation of workflows. It will need:
There will need to be a controller to process creation of workflows.
It will need:
* Read access to Hardware CRDs in Kubernetes
* Read access to Template CRDs in Kubernetes
* Write access to Workflow CRDs `.spec` in Kubernetes
@@ -186,13 +248,7 @@ Kubernetes cluster administrators can define custom levels of access with RBAC p

## Alternatives

There are two primary alternatives to achieve some of the stated design goals:

Add a `--kubernetes-mode` flag to the Tinkerbell components and support two alternate modes: Tinkerbell as it exists today, and Tinkerbell on Kubernetes.
This would allow two alternate deployment/database models for operators who do not want to introduce Kubernetes to their deployment model.
The biggest disadvantage of this alternative would be added complexity in supporting two separate paths of the Tinkerbell codebase.

Leave the Tinkerbell API alone, and leverage/modify the Tinkerbell API's [internal Database interface][tink-db-iface] to function on top of Kubernetes or a key-value datastore like [etcd][etcd].
The primary alternative to achieve some of the stated design goals would be to leave the Tinkerbell API alone and leverage/modify the Tinkerbell API's [internal Database interface][tink-db-iface] to function on top of Kubernetes or a key-value datastore like [etcd][etcd].
This alternative would help with high-availability deployments, but all the other motivations would remain unaddressed and would need to be implemented in Tinkerbell's API.

[tink-db-iface]: https://github.com/tinkerbell/tink/blob/0f46dc0/db/db.go#L21-L60