Commit d02c914: Added Tinkerbell on Kubernetes Proposal
micahhausler committed Aug 27, 2021
Showing 4 changed files with 201 additions and 0 deletions.
2 changes: 2 additions & 0 deletions .gitignore
.*.swp
.DS_Store
199 changes: 199 additions & 0 deletions proposals/0025/README.md
---
id: 0025
title: Tinkerbell on Kubernetes
status: ideation
authors: Micah Hausler <[email protected]>
---

## Summary

This is a proposal to architect Tinkerbell as a Kubernetes native application. It is a rearchitecture of the 'control plane' backend, and leaves the 'data plane' components of Tink workers and actions unchanged.

## Goals and non-Goals

Goals:
* Compatibility with existing Tink workers, workflow execution, and actions.
* More easily support non-request serving controllers in Tinkerbell.
In this architecture, controllers like PBnJ could leverage Kubernetes primitives like [Custom Resource Definitions][crds] (CRDs), [WATCH APIs][watch], and [Field Management][fm] to complete workflow steps.
* Migrate existing components of Hegel, Boots, and Tinkerbell API to use Kubernetes as the datastore
* Reduce the security surface of the Tinkerbell API.
Implementing multiple authorization modes is a non-trivial task.
The fewer APIs, authorization options, and lines of code that exist, the fewer opportunities there are for security issues to arise.
Tinkerbell is a high-value component of data center infrastructure, controlling DHCP infrastructure and BMC/IPMI management, and it needs to be protected as such.
* Support a highly-available architecture.
Postgres is a fantastic database, but managing high availability with graceful failover is not trivial.
An alternative data store with better failure handling would help operators achieve higher availability without requiring downtime for upgrades or failover.

Non Goals:
* Implement attribute-based authorization in Tink API.
This is intentionally descoped from this proposal, but could be implemented in a separate proposal.
* Require Tinkerbell to be operated as pods inside Kubernetes.
The Kubernetes API would become a dependency of Tinkerbell, but that API could exist in a cloud provider or on-premise.
* Require the use of Cluster API ([CAPI][capi]).
The [Tinkerbell CAPI provider][capt] (CAPT) is mentioned in this proposal only for reference of a known Tinkerbell client.
Implementation of this proposal will necessitate changes in CAPT, but that is not the core motivation for this proposal.
* Implement PBnJ as a Kubernetes controller.
* Make the Tink worker a client of Kubernetes.
Kubernetes doesn't natively support robust attribute-based identities for non-node identities.
It does have the [Node Authorizer][node-authorizer], but that is specific to authorization of Kubelet communication to the Kubernetes API.

[crds]: https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/
[watch]: https://kubernetes.io/docs/reference/using-api/api-concepts/#efficient-detection-of-changes
[fm]: https://kubernetes.io/docs/reference/using-api/server-side-apply/#field-management
[capi]: https://cluster-api.sigs.k8s.io/
[capt]: https://github.com/tinkerbell/cluster-api-provider-tink

## Content

### Architectural Motivation

Tinkerbell is very flexible and uses standard protocols like PXE, DHCP, and HTTP to support provisioning hardware.
In order to support more features, more API machinery work will be required.
Listed below are some of the contributing architectural motivations.

* Streaming updates to clients is difficult with the current database
* Multiple-worker workflows will require numerous API changes
* There is no authorization in the Tinkerbell API today.
In order to support a least-privilege access model, several new authorization modes would need to be supported:
* An attribute-based authorization method for Tink workers.
At the time of writing, any Tink worker is authorized to list all hardware data, including possibly sensitive metadata for other hardware.
Today, the Tinkerbell Cluster API (CAPI) provider stores machine CA private key data in Tinkerbell hardware metadata.
Ideally, some authorizer that could make decisions based on authentication identity would be used (The Kubernetes analog would be the [Node Authorizer][node-authorizer]).
The authentication method in this model could be a network identity, a TPM, or some other bootstrapped authentication credential.
* A role-based access method for administrative clients.
* Either a role-based access method, or some forwarding mechanism for passthrough clients like Boots and Hegel.
* Operating high-availability Postgres is non-trivial in many environments.

[node-authorizer]: https://kubernetes.io/docs/reference/access-authn-authz/node/
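The attribute-based authorization described above can be illustrated with a minimal sketch, analogous to the Node Authorizer: a worker may read only the hardware record it owns, rather than listing all hardware. All type and field names here are hypothetical, invented for illustration, and are not part of any Tinkerbell API.

```go
package main

import "fmt"

// Hardware is a hypothetical, simplified hardware record.
type Hardware struct {
	ID       string
	OwnerMAC string // identity of the worker this record belongs to
	Metadata string // possibly sensitive per-machine data
}

// authorizeRead grants a worker read access only to hardware it owns,
// instead of today's model where any worker can list all hardware data.
func authorizeRead(workerMAC string, hw Hardware) bool {
	return workerMAC == hw.OwnerMAC
}

func main() {
	hw := Hardware{ID: "hw-1", OwnerMAC: "aa:bb:cc:dd:ee:ff", Metadata: "machine secrets"}
	fmt.Println(authorizeRead("aa:bb:cc:dd:ee:ff", hw)) // owning worker: allowed
	fmt.Println(authorizeRead("11:22:33:44:55:66", hw)) // other worker: denied
}
```

A real authorizer would resolve the worker's identity from a network identity, TPM quote, or bootstrap credential before making this decision.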

If we only focus on the Tinkerbell API and its clients as they exist in Q3 2021, the architecture looks something like the following:

[![current-architecture](./tink-arch-1.png)](./tink-arch-1.png)

In this proposal, all non-worker clients of Tinkerbell would become Kubernetes clients.

[![proposed-architecture](./tink-arch-2.png)](./tink-arch-2.png)

By using the Kubernetes API as the datastore, the motivations are addressed in the following ways:

* Kubernetes natively supports streaming watches for CRD types.
Any process (PBnJ, Tink API, CAPT, etc.) that needs streaming updates can leverage Kubernetes watches.
* Multi-worker workflows will require some backend changes, but that will be opaque to Tink worker clients.
* The only authorization mode the Tinkerbell API itself still needs to support is one for Tink workers, since every other client authenticates to Kubernetes instead.
* All Kubernetes clients (Boots, Hegel, PBnJ, etc) can have least-permission RBAC policies to limit the permissions for each respective client.
* High-availability Kubernetes and etcd can be delegated to a cloud provider or an on-premise Kubernetes cluster
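The streaming-update pattern in the first bullet can be sketched without any Kubernetes client library as a consumer of typed events; a real implementation would use a client-go informer on the Workflow CRD, but the shape of the consumer is the same. The event type below is a simplified stand-in, not a real API type.

```go
package main

import "fmt"

// WorkflowEvent is a simplified stand-in for a Kubernetes watch event
// on a Workflow custom resource.
type WorkflowEvent struct {
	Type  string // "ADDED", "MODIFIED", "DELETED"
	Name  string
	State string
}

// consume mirrors what a watcher such as PBnJ or the Tink API would do:
// react to each event as it streams in, rather than polling a database,
// and keep the latest observed state per workflow.
func consume(events <-chan WorkflowEvent) map[string]string {
	latest := make(map[string]string)
	for ev := range events {
		switch ev.Type {
		case "ADDED", "MODIFIED":
			latest[ev.Name] = ev.State
		case "DELETED":
			delete(latest, ev.Name)
		}
	}
	return latest
}

func main() {
	events := make(chan WorkflowEvent, 3)
	events <- WorkflowEvent{Type: "ADDED", Name: "wf-1", State: "PENDING"}
	events <- WorkflowEvent{Type: "MODIFIED", Name: "wf-1", State: "RUNNING"}
	events <- WorkflowEvent{Type: "ADDED", Name: "wf-2", State: "PENDING"}
	close(events)
	fmt.Println(consume(events)) // map[wf-1:RUNNING wf-2:PENDING]
}
```

The point of the sketch is that consumers receive changes as they happen, which is exactly what the Postgres-backed API makes difficult today.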

### Tradeoffs

TODO

### User Experience

In order to use Tinkerbell, clients would interact with the Kubernetes API. To provision a machine, the steps would be:

1. User creates a Hardware CRD object in Kubernetes.
This is analogous to the current `tink hardware push < hardware.json` command.
1. User creates a Template CRD object in Kubernetes.
This is analogous to the current `tink template create < template.json` command
1. User creates a Workflow CRD object in Kubernetes.
All the user would need to specify in the object `.spec` would be a reference to the Template object, and mappings for devices.
This is analogous to the current `tink workflow create -t $TEMPLATE_ID -r "{\"device_1\":\"$MAC_ADDR\"}"` command.
1. The Tinkerbell API would include a Kubernetes workflow controller that would watch for created workflows, and fill out the `.status` with most of the logic that currently exists in the [`CreateWorkflow()`][createwfrpc] RPC.
1. The Tinkerbell API could subscribe to Workflow CRD changes, and stream them on to Tinkerbell worker clients over the existing [`GetWorkflowContexts()`][getwfctxs] streaming RPC.
1. Tinkerbell worker clients would continue to call the existing Tinkerbell APIs to execute workflows.
The Tinkerbell API would store updates in the Workflow CRD's `.status`.

[createwfrpc]: https://github.com/tinkerbell/tink/blob/b217be8/grpc-server/workflow.go#L19-L72
[getwfctxs]: https://github.com/tinkerbell/tink/blob/a56e5cf9/protos/workflow/workflow.proto#L75
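The controller step above (filling out `.status` from a Template plus device mappings) amounts to rendering the template with the workflow's device map. A minimal sketch using Go's `text/template` follows; the template syntax and field names are illustrative stand-ins, not Tinkerbell's actual template format.

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// renderTemplate expands a template with the device mappings supplied in
// a Workflow's spec (e.g. {"device_1": "<mac>"}), roughly what the
// workflow controller would do before writing the result to .status.
func renderTemplate(tmpl string, devices map[string]string) (string, error) {
	t, err := template.New("workflow").Parse(tmpl)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := t.Execute(&buf, devices); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	out, err := renderTemplate(
		"worker: {{.device_1}}",
		map[string]string{"device_1": "aa:bb:cc:dd:ee:ff"},
	)
	if err != nil {
		panic(err)
	}
	fmt.Println(out) // worker: aa:bb:cc:dd:ee:ff
}
```

Because the rendered output lives in `.status`, re-running the controller on the same `.spec` is idempotent, which fits the Kubernetes reconcile model.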

## System Context Diagram

See above diagrams.

## APIs

The following [Tinkerbell Workflow APIs][wf-apis] are used by the Tink worker, and would remain.
All other Tinkerbell APIs would be removed.

[wf-apis]: https://github.com/tinkerbell/tink/blob/f6aa3930/protos/workflow/workflow.proto

```protobuf
service WorkflowService {
  rpc GetWorkflowContexts(WorkflowContextRequest) returns (stream WorkflowContext) {}
  rpc GetWorkflowActions(WorkflowActionsRequest) returns (WorkflowActionList) {}
  rpc ReportActionStatus(WorkflowActionStatus) returns (Empty) {}
  rpc GetWorkflowData(GetWorkflowDataRequest) returns (GetWorkflowDataResponse) {}
  rpc UpdateWorkflowData(UpdateWorkflowDataRequest) returns (Empty) {}
}
```

The structure of the Kubernetes CRD types will start out with those defined in the [Tinkerbell CAPI Provider][capt-types]. The code defining those types would be migrated into the tinkerbell/tink GitHub repository so other clients could import the Go packages.

[capt-types]: https://github.com/tinkerbell/cluster-api-provider-tink/tree/main/tink/api/v1alpha1

## Threat Model

### Actors

**Machines**

In this proposal, a machine is not trusted any more than in Tinkerbell's API today.
By design, machines will have less access in this proposal than they do today, limited to the APIs listed above.
As it is in Tinkerbell today, the network is trusted when provisioning hardware.

**Hegel**

Hegel will only require read-only access to Hardware CRDs in Kubernetes.
RBAC policies would govern the level of access to the Kubernetes API.

Hegel will need to guard against confused-deputy attacks and not return metadata to the wrong client.

**Boots**

Boots will only require read-only access to Hardware CRDs in Kubernetes.
RBAC policies would govern the level of access to the Kubernetes API.

**PBnJ**

PBnJ will require:
* Read-only access to Hardware CRDs to discover management interface connectivity
* Mutating access to Workflow CRDs to execute workflow steps like power cycling and BIOS management
* Some level of secret access to connect to management interfaces. Implementation and design of that access is out of scope of this proposal.

**Tink API**

The Tinkerbell API will require:
* Read access to Hardware CRDs in Kubernetes
* Read access to Template CRDs in Kubernetes
* Write access to Workflow CRDs in Kubernetes

**Tinkerbell Workflow Controller**

There will need to be a controller to process creation of workflows. It will need:
* Read access to Hardware CRDs in Kubernetes
* Read access to Template CRDs in Kubernetes
* Write access to the `.status` of Workflow CRDs in Kubernetes

**Tinkerbell Validating Webhook**

A validating webhook will be used to validate Templates and Workflows as they are created.
It will not require any Kubernetes API access, but needs to be reachable by the Kubernetes API server.

**Tinkerbell Administrator**

Tinkerbell administrators might require varying levels of access in order to perform CRUD operations on any or all Tinkerbell types.
Kubernetes cluster administrators can define custom levels of access with RBAC policies to grant Tinkerbell admins the correct least-privilege level of access.

## Alternatives

There are two primary alternatives to achieve some of the stated design goals:

1. Add a `--kubernetes-mode` flag to the Tinkerbell components and support two alternate modes: Tinkerbell as it exists today, and Tinkerbell on Kubernetes.
   This would allow two alternate deployment/database models for operators who do not want to introduce Kubernetes to their deployment model.
   The biggest disadvantage of this alternative is the added complexity of supporting two separate paths through the Tinkerbell codebase.
2. Leave the Tinkerbell API alone, and leverage/modify the Tinkerbell API's [internal Database interface][tink-db-iface] to function on top of Kubernetes or a key-value datastore like [etcd][etcd].
   This alternative would help with high-availability deployments, but all the other motivations would remain unaddressed and would need to be implemented in Tinkerbell's API.

[tink-db-iface]: https://github.com/tinkerbell/tink/blob/0f46dc0/db/db.go#L21-L60
[etcd]: https://etcd.io/
Binary file added proposals/0025/tink-arch-1.png
Binary file added proposals/0025/tink-arch-2.png
