Added Tinkerbell on Kubernetes Proposal

tinkerbell · Aug 27, 2021 · d02c914
1 parent ad6afad
commit d02c914
Showing 4 changed files with 201 additions and 0 deletions.
diff --git a/.gitignore b/.gitignore
@@ -0,0 +1,2 @@
+.*.swp
+.DS_Store
diff --git a/proposals/0025/README.md b/proposals/0025/README.md
@@ -0,0 +1,199 @@
+---
+id: 0025
+title: Tinkerbell on Kubernetes
+status: ideation
+authors: Micah Hausler <[email protected]>
+---
+
+## Summary
+
+This is a proposal to architect Tinkerbell as a Kubernetes-native application. It rearchitects the 'control plane' backend and leaves the 'data plane' components (Tink workers and actions) unchanged.
+
+## Goals and not Goals
+
+Goals:
+* Compatibility with existing Tink workers, workflow execution, and actions.
+* More easily support non-request-serving controllers in Tinkerbell.
+  In this architecture, controllers like PBnJ could leverage Kubernetes primitives like [Custom Resource Definitions][crds] (CRDs), [WATCH APIs][watch], and [Field Management][fm] to complete workflow steps.
+* Migrate the existing Hegel, Boots, and Tinkerbell API components to use Kubernetes as the datastore.
+* Reduce the security surface of the Tinkerbell API.
+  Implementing multiple authorization modes is a non-trivial task.
+  The fewer APIs, authorization options, and lines of code that exist, the fewer opportunities there are for security issues to arise.
+  Tinkerbell is a high-value component of data center infrastructure, so protection of DHCP infrastructure and BMC/IPMI management needs to be treated accordingly.
+* Support a highly-available architecture.
+  Postgres is a fantastic database, but managing high-availability with graceful failover is not trivial.
+  Using an alternative data store that handles failure better would help operators achieve higher availability without requiring downtime for upgrades or failover.
+
+Non-Goals:
+* Implement attribute-based authorization in Tink API.
+  This is intentionally descoped from this proposal, but could be implemented in a separate proposal.
+* Require Tinkerbell to be operated as pods inside Kubernetes.
+  The Kubernetes API would become a dependency of Tinkerbell, but that API could exist in a cloud provider or on-premise.
+* Require the use of Cluster API ([CAPI][capi]).
+  The [Tinkerbell CAPI provider][capt] (CAPT) is mentioned in this proposal only for reference of a known Tinkerbell client.
+  Implementation of this proposal will necessitate changes in CAPT, but that is not the core motivation for this proposal.
+* Implement PBnJ as a Kubernetes controller.
+* Make the Tink worker a client of Kubernetes.
+  Kubernetes doesn't natively support robust attribute-based identities for non-node identities.
+  It does have the [Node Authorizer][node-authorizer], but that is specific to authorization of Kubelet communication to the Kubernetes API.
+
+[crds]: https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/
+[watch]: https://kubernetes.io/docs/reference/using-api/api-concepts/#efficient-detection-of-changes
+[fm]: https://kubernetes.io/docs/reference/using-api/server-side-apply/#field-management
+[capi]: https://cluster-api.sigs.k8s.io/
+[capt]: https://github.com/tinkerbell/cluster-api-provider-tink
+
+## Content
+
+### Architectural Motivation
+
+Tinkerbell is very flexible and uses standard protocols like PXE, DHCP, and HTTP to support provisioning hardware.
+In order to support more features, more API machinery work will be required.
+Listed below are some of the contributing architectural motivations.
+
+* Streaming updates to clients is difficult with the current database
+* Multiple-worker workflows will require numerous API changes
+* There is no authorization in the Tinkerbell API today.
+  In order to support a least-privilege access model, several new authorization modes would need to be supported
+  * An attribute-based authorization method for Tink workers.
+    At the time of writing, any Tink worker is authorized to list all hardware data, including possibly sensitive metadata for other hardware.
+    Today, the Tinkerbell Cluster API (CAPI) provider stores machine CA private key data in Tinkerbell hardware metadata.
+    Ideally, some authorizer that could make decisions based on authentication identity would be used (The Kubernetes analog would be the [Node Authorizer][node-authorizer]).
+    The authentication method in this model could be a network identity, a TPM, or some other bootstrapped authentication credential.
+  * A role-based access method for administrative clients.
+  * Either a role-based access method, or some forwarding mechanism for passthrough clients like Boots and Hegel.
+* Operating high-availability Postgres is non-trivial in many environments.
+
+[node-authorizer]: https://kubernetes.io/docs/reference/access-authn-authz/node/
+
+If we focus only on the Tinkerbell API clients as they exist in Q3 2021, the architecture looks something like the following:
+
+[![current-architecture](./tink-arch-1.png)](./tink-arch-1.png)
+
+In this proposal, all non-worker clients of Tinkerbell would become Kubernetes clients.
+
+[![proposed-architecture](./tink-arch-2.png)](./tink-arch-2.png)
+
+By using the Kubernetes API as the datastore, the motivations are addressed in the following ways:
+
+* Kubernetes natively supports streaming watches for CRD types.
+  Any process (PBnJ, Tink API, CAPT, etc.) that needs streaming updates can leverage Kubernetes Watches.
+* Multi-worker workflows will require some backend changes, but that will be opaque to Tink worker clients.
+* The only authorization mode the Tinkerbell API itself will need to support is the one for Tink workers.
+  All Kubernetes clients (Boots, Hegel, PBnJ, etc.) can be given least-privilege RBAC policies that limit the permissions of each respective client.
+* High-availability Kubernetes and etcd can be delegated to a cloud provider or an on-premise Kubernetes cluster.
+
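To illustrate the streaming point above, here is a minimal sketch of how a client might fold a stream of Kubernetes watch events into a local cache. The event shape (`type` plus `object`) follows the Kubernetes watch API; the `tinkerbell.org/v1alpha1` group/version and the `status.state` field are assumptions made for illustration, and a real client would read events from the API server rather than from a list.

```python
from typing import Dict, Iterable

# Assumed group/version for the Tinkerbell CRDs; not confirmed by the proposal.
API_VERSION = "tinkerbell.org/v1alpha1"


def apply_watch_events(events: Iterable[dict]) -> Dict[str, dict]:
    """Fold a stream of watch events into a cache keyed by namespace/name."""
    cache: Dict[str, dict] = {}
    for event in events:
        kind = event["type"]
        if kind not in ("ADDED", "MODIFIED", "DELETED"):
            # Skip event types that do not describe an object change.
            continue
        meta = event["object"]["metadata"]
        key = f"{meta['namespace']}/{meta['name']}"
        if kind == "DELETED":
            cache.pop(key, None)
        else:
            cache[key] = event["object"]
    return cache


def workflow(name: str, state: str) -> dict:
    """Build a tiny Workflow object; the status field is illustrative only."""
    return {
        "apiVersion": API_VERSION,
        "kind": "Workflow",
        "metadata": {"namespace": "default", "name": name},
        "status": {"state": state},
    }


SAMPLE_EVENTS = [
    {"type": "ADDED", "object": workflow("wf-a", "PENDING")},
    {"type": "ADDED", "object": workflow("wf-b", "PENDING")},
    {"type": "MODIFIED", "object": workflow("wf-a", "RUNNING")},
    {"type": "DELETED", "object": workflow("wf-b", "PENDING")},
]

if __name__ == "__main__":
    for key, obj in apply_watch_events(SAMPLE_EVENTS).items():
        print(key, obj["status"]["state"])
```

Keeping the fold as a pure function over events makes it easy to test without a cluster; the same function could be fed from a real watch stream.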
+### Tradeoffs
+
+TODO
+
+### User Experience
+
+In order to use Tinkerbell, clients would interact with the Kubernetes API. In order to provision a machine, the steps would be:
+
+1. User creates a Hardware CRD object in Kubernetes.
+   This is analogous to the current `tink hardware push < hardware.json` command.
+1. User creates a Template CRD object in Kubernetes.
+   This is analogous to the current `tink template create < template.json` command.
+1. User creates a Workflow CRD object in Kubernetes.
+   All the user would need to specify in the object `.spec` would be a reference to the Template object, and mappings for devices.
+   This is analogous to the current `tink workflow create -t $TEMPLATE_ID -r "{\"device_1\":\"$MAC_ADDR\"}"` command.
+1. The Tinkerbell API would include a Kubernetes workflow controller that would watch for created workflows, and fill out the `.status` with most of the logic that currently exists in the [`CreateWorkflow()`][createwfrpc] RPC.
+1. The Tinkerbell API could subscribe to Workflow CRD changes, and stream them on to Tinkerbell worker clients over the existing [`GetWorkflowContexts()`][getwfctxs] streaming RPC.
+1. Tinkerbell worker clients would continue to call the existing Tinkerbell APIs to execute workflows.
+   The Tinkerbell API would store updates in a Workflow CRD `.status`.
+
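The first three steps above can be sketched as three manifests. This is only an illustration: the group/version `tinkerbell.org/v1alpha1` and every `spec` field name used here (`mac`, `data`, `templateRef`, `hardwareMap`) are hypothetical placeholders, since the proposal defers the actual type definitions to the CAPT `v1alpha1` package.

```python
import json

# Assumed group/version; the real value would come from the migrated CAPT types.
API_VERSION = "tinkerbell.org/v1alpha1"


def manifest(kind: str, name: str, spec: dict) -> dict:
    """Build a namespaced custom-resource manifest as a plain dict."""
    return {
        "apiVersion": API_VERSION,
        "kind": kind,
        "metadata": {"name": name, "namespace": "default"},
        "spec": spec,
    }


def workflow_manifest(name: str, template_name: str, devices: dict) -> dict:
    """Workflow spec: a Template reference plus device mappings.

    templateRef and hardwareMap are hypothetical field names.
    """
    return manifest(
        "Workflow",
        name,
        {"templateRef": template_name, "hardwareMap": dict(devices)},
    )


if __name__ == "__main__":
    mac = "00:00:00:00:00:01"
    objects = [
        manifest("Hardware", "machine-1", {"mac": mac}),
        manifest("Template", "ubuntu-install", {"data": "<template contents>"}),
        workflow_manifest("provision-machine-1", "ubuntu-install", {"device_1": mac}),
    ]
    for obj in objects:
        print(json.dumps(obj, indent=2))
```

Because `kubectl` accepts JSON manifests as well as YAML, each printed object could be saved to a file and applied once the corresponding CRDs are installed.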
+[createwfrpc]: https://github.com/tinkerbell/tink/blob/b217be8/grpc-server/workflow.go#L19-L72
+[getwfctxs]: https://github.com/tinkerbell/tink/blob/a56e5cf9/protos/workflow/workflow.proto#L75
+
+## System-context-diagram
+
+See above diagrams.
+
+## APIs
+
+The following [Tinkerbell Workflow APIs][wf-apis] are used by the Tink worker, and would remain.
+All other Tinkerbell APIs would be removed.
+
+[wf-apis]: https://github.com/tinkerbell/tink/blob/f6aa3930/protos/workflow/workflow.proto
+
+```protobuf
+service WorkflowService {
+  rpc GetWorkflowContexts(WorkflowContextRequest) returns (stream WorkflowContext) {}
+  rpc GetWorkflowActions(WorkflowActionsRequest) returns (WorkflowActionList) {}
+  rpc ReportActionStatus(WorkflowActionStatus) returns (Empty) {}
+  rpc GetWorkflowData(GetWorkflowDataRequest) returns (GetWorkflowDataResponse) {}
+  rpc UpdateWorkflowData(UpdateWorkflowDataRequest) returns (Empty) {}
+}
+```
+
+The structure of the Kubernetes CRD types will start out with those defined in the [Tinkerbell CAPI Provider][capt-types]. The code defining those types would be migrated into the tinkerbell/tink GitHub repository so other clients could import the Go packages.
+
+[capt-types]: https://github.com/tinkerbell/cluster-api-provider-tink/tree/main/tink/api/v1alpha1
+
+## Threat Model
+
+### Actors
+
+**Machines**
+
+In this proposal, a machine is not trusted any more than in Tinkerbell's API today.
+By design, machines in this proposal will have less access than they do today: only the APIs listed above.
+As it is in Tinkerbell today, the network is trusted when provisioning hardware.
+
+**Hegel**
+
+Hegel will only require read-only access to Hardware CRDs in Kubernetes.
+RBAC policies would govern the level of access to the Kubernetes API.
+
+Hegel will need to guard against confused-deputy attacks and not return metadata to the wrong client.
+
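One way to think about the confused-deputy concern is that Hegel should select Hardware by an identity it observes itself, not one the client supplies. The sketch below assumes that identity is the connection's peer IP address and that Hardware objects carry hypothetical `spec.ips` and `spec.metadata` fields; neither is defined by this proposal.

```python
from typing import List, Optional


def metadata_for_peer(peer_ip: str, hardware: List[dict]) -> Optional[dict]:
    """Return metadata only for the single Hardware object that owns peer_ip.

    peer_ip must come from the connection itself (the socket peer address),
    never from a request parameter or header the client controls.
    """
    matches = [hw for hw in hardware if peer_ip in hw["spec"].get("ips", [])]
    if len(matches) != 1:
        # Unknown or ambiguous peers get nothing rather than a best guess.
        return None
    return matches[0]["spec"].get("metadata", {})


# Illustrative data using documentation-range addresses.
HARDWARE = [
    {"metadata": {"name": "machine-1"},
     "spec": {"ips": ["192.0.2.10"], "metadata": {"hostname": "machine-1"}}},
    {"metadata": {"name": "machine-2"},
     "spec": {"ips": ["192.0.2.11"], "metadata": {"hostname": "machine-2"}}},
]

if __name__ == "__main__":
    print(metadata_for_peer("192.0.2.10", HARDWARE))
    print(metadata_for_peer("192.0.2.99", HARDWARE))
```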
+**Boots**
+
+Boots will only require read-only access to Hardware CRDs in Kubernetes.
+RBAC policies would govern the level of access to the Kubernetes API.
+
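As a rough illustration of the read-only access described for Hegel and Boots, the following sketch builds a `ClusterRole` manifest and checks requests against its rules. The rule structure (`apiGroups`, `resources`, `verbs`) is the standard RBAC shape; the group `tinkerbell.org`, the plural resource name `hardware`, and the role name are assumptions. The matcher is deliberately simplified and ignores `resourceNames` and `nonResourceURLs`.

```python
READ_ONLY_HARDWARE_ROLE = {
    "apiVersion": "rbac.authorization.k8s.io/v1",
    "kind": "ClusterRole",
    "metadata": {"name": "tinkerbell-hardware-reader"},  # hypothetical name
    "rules": [
        {
            "apiGroups": ["tinkerbell.org"],  # assumed API group
            "resources": ["hardware"],        # assumed plural resource name
            "verbs": ["get", "list", "watch"],
        }
    ],
}


def _matches(allowed: list, value: str) -> bool:
    """A rule field matches on an exact entry or the "*" wildcard."""
    return "*" in allowed or value in allowed


def is_allowed(role: dict, api_group: str, resource: str, verb: str) -> bool:
    """Simplified RBAC check: any single rule must match group, resource, and verb."""
    return any(
        _matches(rule.get("apiGroups", []), api_group)
        and _matches(rule.get("resources", []), resource)
        and _matches(rule.get("verbs", []), verb)
        for rule in role["rules"]
    )


if __name__ == "__main__":
    for verb in ("get", "watch", "update", "delete"):
        print(verb, is_allowed(READ_ONLY_HARDWARE_ROLE, "tinkerbell.org", "hardware", verb))
```

A role like this would still need a `RoleBinding` or `ClusterRoleBinding` to the service account that Hegel or Boots runs as.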
+**PBnJ**
+
+PBnJ will require:
+* Read-only access to Hardware CRDs to discover management interface connectivity
+* Mutating access to Workflow CRDs to execute workflow steps like power cycling and BIOS management
+* Some level of secret access to connect to management interfaces. Implementation and design of that access is out of scope of this proposal.
+
+**Tink API**
+
+The Tinkerbell API will require:
+* Read access to Hardware CRDs in Kubernetes
+* Read access to Template CRDs in Kubernetes
+* Write access to Workflow CRDs in Kubernetes
+
+**Tinkerbell Workflow Controller**
+
+There will need to be a controller to process creation of workflows. It will need:
+* Read access to Hardware CRDs in Kubernetes
+* Read access to Template CRDs in Kubernetes
+* Write access to Workflow CRDs `.status` in Kubernetes
+
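The User Experience section describes this controller as watching for created workflows and filling out their `.status`. A minimal sketch of one reconciliation pass follows, under several assumptions: the field names (`templateRef`, `hardwareMap`, `data`, `renderedTemplate`) and state names are hypothetical, and the plain string substitution only stands in for whatever template rendering the existing `CreateWorkflow()` logic performs.

```python
import copy


def reconcile(workflow: dict, templates: dict) -> dict:
    """Return a copy of workflow with .status filled in from its Template.

    templates maps Template names to Template objects.
    """
    result = copy.deepcopy(workflow)
    if result.get("status", {}).get("state"):
        # Already processed; reconciling again changes nothing.
        return result
    spec = result["spec"]
    template = templates.get(spec["templateRef"])
    if template is None:
        result["status"] = {"state": "FAILED", "message": "template not found"}
        return result
    rendered = template["spec"]["data"]
    for device, mac in spec.get("hardwareMap", {}).items():
        # Simplified stand-in for real template rendering.
        rendered = rendered.replace("{{." + device + "}}", mac)
    result["status"] = {"state": "PENDING", "renderedTemplate": rendered}
    return result


if __name__ == "__main__":
    templates = {"ubuntu-install": {"spec": {"data": "worker: {{.device_1}}"}}}
    wf = {
        "spec": {
            "templateRef": "ubuntu-install",
            "hardwareMap": {"device_1": "00:00:00:00:00:01"},
        }
    }
    print(reconcile(wf, templates)["status"])
```

Returning a modified copy instead of mutating the input keeps the pass idempotent, which is the usual expectation for a Kubernetes controller that may see the same object many times.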
+**Tinkerbell Validating Webhook**
+
+A validating webhook will be used to validate Templates and Workflows as they are created.
+It will not require any Kubernetes API access, but needs to be reachable by the Kubernetes API server.
+
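For illustration, a validating webhook receives an `AdmissionReview` request and answers with an `AdmissionReview` response carrying the same `uid` and an `allowed` flag. The envelope below follows the `admission.k8s.io/v1` shape; the validation rules and the `spec` field names they inspect are hypothetical, as the proposal does not specify what the webhook checks.

```python
from typing import List


def validate(obj: dict) -> List[str]:
    """Return a list of validation errors (hypothetical rules and field names)."""
    errors = []
    spec = obj.get("spec", {})
    if obj.get("kind") == "Workflow" and not spec.get("templateRef"):
        errors.append("spec.templateRef must be set")
    if obj.get("kind") == "Template" and not spec.get("data"):
        errors.append("spec.data must not be empty")
    return errors


def review(admission_review: dict) -> dict:
    """Build the AdmissionReview response for an incoming AdmissionReview request."""
    request = admission_review["request"]
    errors = validate(request["object"])
    response = {"uid": request["uid"], "allowed": not errors}
    if errors:
        response["status"] = {"code": 422, "message": "; ".join(errors)}
    return {
        "apiVersion": "admission.k8s.io/v1",
        "kind": "AdmissionReview",
        "response": response,
    }


if __name__ == "__main__":
    incoming = {
        "apiVersion": "admission.k8s.io/v1",
        "kind": "AdmissionReview",
        "request": {"uid": "example-uid", "object": {"kind": "Workflow", "spec": {}}},
    }
    print(review(incoming))
```

Since the handler only inspects the object it is sent, it needs no Kubernetes API credentials of its own, which matches the access described above.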
+**Tinkerbell Administrator**
+
+Tinkerbell administrators might require varying levels of access in order to perform CRUD operations on any or all Tinkerbell types.
+Kubernetes cluster administrators can define custom levels of access with RBAC policies to grant Tinkerbell admins the correct least-privilege level of access.
+
+## Alternatives
+
+There are two primary alternatives to achieve some of the stated design goals:
+
+Add a `--kubernetes-mode` flag to the Tinkerbell components and support two alternate modes: Tinkerbell as it exists today, and Tinkerbell on Kubernetes.
+This would allow two alternate deployment/database models for operators who do not want to introduce Kubernetes to their deployment model.
+The biggest disadvantage of this alternative would be added complexity in supporting two separate paths of the Tinkerbell codebase.
+
+Leave the Tinkerbell API alone, and leverage/modify the Tinkerbell API's [internal Database interface][tink-db-iface] to function on top of Kubernetes or a key-value datastore like [etcd][etcd].
+This alternative would help with high-availability deployments, but all the other motivations would remain unaddressed and would still need to be implemented in Tinkerbell's API.
+
+[tink-db-iface]: https://github.com/tinkerbell/tink/blob/0f46dc0/db/db.go#L21-L60
+[etcd]: https://etcd.io/
diff --git a/proposals/0025/tink-arch-1.png b/proposals/0025/tink-arch-1.png
diff --git a/proposals/0025/tink-arch-2.png b/proposals/0025/tink-arch-2.png