
Add support for the Kubernetes Resource Model #46

Merged · 1 commit merged into tinkerbell:main on Nov 24, 2021

Conversation

@micahhausler (Contributor) commented Aug 27, 2021

Description

Added proposal 0026: Support for the Kubernetes Resource Model.


View rendered proposals/0026/README.md

@rawkode (Contributor) commented Aug 27, 2021

Rather than needing a fully fledged Kubernetes cluster, it would be cool to see this work with kcp

https://github.com/kcp-dev/kcp

@micahhausler (Contributor, Author) replied:

> Rather than needing a fully fledged Kubernetes cluster, it would be cool to see this work with kcp
>
> https://github.com/kcp-dev/kcp

That would be super cool. As long as KCP supports CRDs, admission webhooks, finalizers, watches, leases, and RBAC (which I think it does), it should be portable.

@micahhausler changed the title from "Added Tinkerbell on Kubernetes Proposal" to "Add support for the Kubernetes Resource Model" on Aug 30, 2021
@andy-v-h commented Sep 1, 2021

I like the idea of creating in-tree k8s support, specifically making it easier to do a hitless reload on hegel by updating a ConfigMap, for example. However, I'm not sure a CRD-based approach is the direction that would serve us best. My primary concern is the additional load on etcd and resiliency there. A scenario I find concerning is one where you have relatively small amounts of compute available for running Tinkerbell while requiring it to rely on etcd rather than an external DB service. And in a DR scenario, how would a recovered Tinkerbell not re-provision all of my hosts?

@micahhausler (Contributor, Author) replied:

> I like the idea of creating in-tree k8s support, specifically making it easier to do a hitless reload on hegel by updating a ConfigMap, for example. However, I'm not sure a CRD-based approach is the direction that would serve us best. My primary concern is the additional load on etcd and resiliency there.

ConfigMaps in Kubernetes are unstructured and don't allow for strong typing, which causes lots of practical validation problems. CRDs are roughly equivalent in terms of efficiency and load, and they allow for much easier client development. etcd specifically can be operated in many different modes, from a single node (lower resiliency) to, more typically, three nodes for increased availability. Neither etcd nor Kubernetes needs to be operated on the same compute that Tinkerbell runs on: this proposal is agnostic to where Kubernetes is run, and operators can connect Tinkerbell to Kubernetes in a managed cloud environment where compute resources are less of a concern.
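To illustrate the typing difference, here's a minimal sketch; the `Hardware` fields and validation rules are hypothetical, not this proposal's schema. A CRD compiles validation into an OpenAPI schema enforced by the API server, while a ConfigMap hands every consumer raw strings to parse and validate:

```go
package v1alpha1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// HardwareSpec is a hypothetical strongly typed CRD spec. The kubebuilder
// markers compile into an OpenAPI schema, so the API server rejects
// malformed objects before they are ever stored in etcd.
type HardwareSpec struct {
	// +kubebuilder:validation:Pattern=`^([0-9a-f]{2}:){5}[0-9a-f]{2}$`
	MAC string `json:"mac"`

	// +kubebuilder:validation:Minimum=0
	BootPriority int `json:"bootPriority,omitempty"`
}

// Hardware is the typed object a client works with directly.
type Hardware struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`
	Spec              HardwareSpec `json:"spec"`
}

// A ConfigMap, by contrast, surfaces the same data as map[string]string
// (cm.Data), and every consumer must re-implement parsing and validation.
```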

> A scenario I find concerning is one where you have relatively small amounts of compute available for running Tinkerbell while requiring it to rely on etcd rather than an external DB service.

Neither Kubernetes nor etcd would be required to run on the same compute as Tinkerbell. This would be equivalent to connecting to an external DB service.

> And in a DR scenario, how would a recovered Tinkerbell not re-provision all of my hosts?

Check out the "Migrating between Kubernetes clusters, Backups, and Restoration" section under Administrative Tasks, but the short version is that the `.status` of Workflows would be included in backups and restores, so machines would not be re-provisioned.

Does that address your concerns?

@splaspood commented:
I would love to see Tinkerbell support this as one of many possible data backends, forcing us to create a well-defined interface between tink server and any number of possible underlying data stores.

@displague (Member) commented Sep 15, 2021

Like @splaspood, I wonder what the Tinkerbell services would look like if the backend could be toggled between a local filesystem (manifest.d/-style directories), Postgres, and Kubernetes resources.

In order to support these three different storage models, a pluggable data interface would benefit from a Tinkerbell data event system.

Rather than having Tinkerbell poll a database, or watch a directory, or watch a set of Kubernetes CRs and ConfigMaps, Tinkerbell could await data events while the data interface modules take on the polling and watching (see the sketch below).

Does this mean that Tinkerbell takes on another gRPC service? Perhaps.
Does Tinkerbell need caching or relationship mapping between resources? Maybe? (I don't think so.)

What events would Tinkerbell need to listen for from pluggable data modules?
What features does Tinkerbell need to update data stored in generic data modules?
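A minimal Go sketch of what such a pluggable, event-driven data interface could look like; all of the names here are illustrative, not an agreed design:

```go
package data

import "context"

// EventType distinguishes the changes a backend can report.
type EventType string

const (
	Created EventType = "created"
	Updated EventType = "updated"
	Deleted EventType = "deleted"
)

// Event is delivered to Tinkerbell whenever a backend observes a change,
// regardless of whether the backend polls Postgres, watches a manifest.d/
// directory, or watches Kubernetes resources.
type Event struct {
	Type     EventType
	Resource string // e.g. "hardware", "template", "workflow"
	ID       string
	Payload  []byte
}

// Backend is the pluggable interface: the module owns the polling or
// watching, and Tinkerbell just awaits events on the returned channel.
type Backend interface {
	Get(ctx context.Context, resource, id string) ([]byte, error)
	Put(ctx context.Context, resource, id string, payload []byte) error
	Watch(ctx context.Context) (<-chan Event, error)
}
```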

@displague (Member) commented Sep 15, 2021

Reconsidering my last comment: rather than inventing a new format of event drivers and data passing (which are some of the problems we are trying to solve with this proposal), by taking the K8s API as the data and eventing backend (not necessarily with containers involved) we inherit a standard and well-evolved set of tools and opinions that bring stability to the API and data backend.

We also solve the pluggable data backend, because we can use services like https://github.com/k3s-io/kine to provide PostgreSQL, MySQL, and yet-to-be-added backends.

@micahhausler (Contributor, Author) replied:

> Reconsidering my last comment: rather than inventing a new format of event drivers and data passing (which are some of the problems we are trying to solve with this proposal), by taking the K8s API as the data and eventing backend (not necessarily with containers involved) we inherit a standard and well-evolved set of tools and opinions that bring stability to the API and data backend.
>
> We also solve the pluggable data backend, because we can use services like https://github.com/k3s-io/kine to provide PostgreSQL, MySQL, and yet-to-be-added backends.

YES! This is exactly it!

@jacobweinstock (Member) commented:

This is a great proposal, and the WIP demo at the community meeting was really impressive. I do like the idea of putting this behind a flag.

@tstromberg added the priority/important-longterm label ("Important over the long term, but may not be staffed and/or may need multiple releases to complete.") on Sep 21, 2021
mergify bot added a commit to tinkerbell/tink that referenced this pull request on Oct 1, 2021

Signed-off-by: Micah Hausler <[email protected]>

## Description

This includes 4 changes that are preparatory for future work to migrate to the Kubernetes data model.

* Propagate the stream context in the streaming API. Previously `context.Background()` was used, but the stream provides a context object (see the sketch after this list).
* Refactored a shadowed `context` package name. Because a method variable was named `context`, no `context` package calls could be made inside that method.
* Added `context.Context` to the `GetWorkflowsForWorker()` database API. This plumbs the context from the API call down into the `d.instance.QueryContext()` call.
* Refactored the database interface. I added a new interface, `WorkerWorkflow`, with the methods used by the APIs the Tink Worker invokes. This is essentially a no-op for now.
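A minimal sketch of the first and third changes, using stand-in types since the generated Tink gRPC interfaces aren't reproduced here:

```go
package tinksketch

import "context"

// WorkflowStream stands in for a generated gRPC server stream; real
// grpc-go server streams expose Context() exactly like this.
type WorkflowStream interface {
	Context() context.Context
	Send(workflow string) error
}

// Database mirrors the refactored interface: GetWorkflowsForWorker now
// takes a context, which is plumbed into d.instance.QueryContext().
type Database interface {
	GetWorkflowsForWorker(ctx context.Context, workerID string) ([]string, error)
}

type server struct{ db Database }

// streamWorkflows uses the stream's own context instead of
// context.Background(), so a client disconnect or deadline cancels the
// database query rather than leaking it.
func (s *server) streamWorkflows(workerID string, stream WorkflowStream) error {
	ctx := stream.Context() // previously: ctx := context.Background()
	workflows, err := s.db.GetWorkflowsForWorker(ctx, workerID)
	if err != nil {
		return err
	}
	for _, wf := range workflows {
		if err := stream.Send(wf); err != nil {
			return err
		}
	}
	return nil
}
```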

## Why is this needed

See tinkerbell/proposals#46

## How Has This Been Tested?

Locally ran tests.

## How are existing users impacted? What migration steps/scripts do we need?

No impact to existing users

## Checklist:

I have:

- [ ] updated the documentation and/or roadmap (if required)
- [ ] added unit or e2e tests
- [ ] provided instructions on how to upgrade
@displague (Member) left a review:
Thanks for this excellent and well-thought-out proposal, @micahhausler.

I've left some comments as I read through the document; many of the questions I raise are answered elsewhere, which suggests to me that some information could be reordered for clarity. There were a few details that I felt were missing, and I've said as much (this may be hard to gather, as I left follow-up comments).

I'd recommend going through this feedback and the previous feedback and making sure that any details given in feedback comments and responses have made their way into the proposal.

This proposal, to me, is one of the most significant to be raised. Well done!

At the time of writing, any Tink worker is authorized to list all hardware data, including possibly sensitive metadata for other hardware.
Today, the Tinkerbell Cluster API (CAPI) provider stores machine CA private key data in Tinkerbell hardware metadata.
Ideally, some authorizer that could make decisions based on authentication identity would be used (The Kubernetes analog would be the [Node Authorizer][node-authorizer]).
The authenticatoin method in this model could be a network identity, a TPM, or some other bootstrapped authentication credential.
Review comment (Member):
Suggested change:
- The authenticatoin method in this model could be a network identity, a TPM, or some other bootstrapped authentication credential.
+ The authentication method in this model could be a network identity, a TPM, or some other bootstrapped authentication credential.


[![current-architecture](./tink-arch-1.png)](.tink-arch-2.png)

In this proposal, all non-worker clients of Tinkerbell would become Kubernetes clients.
Review comment (Member):
Suggested change:
- In this proposal, all non-worker clients of Tinkerbell would become Kubernetes clients.
+ In this proposal, all non-worker clients of Tinkerbell may become Kubernetes clients.

This really comes down to whether or not PostgreSQL support is preserved at all, and then whether the user chooses it.

1. User creates a Workflow CRD object in Kubernetes.
All the user would need to specify in the object `.spec` would be a reference to the Template object, and mappings for devices.
This is analogous to the current `tink workflow create -t $TEMPLATE_ID -r "{\"device_1\":\"$MAC_ADDR\"}"` command.
1. The Tinkerbell API would include a Kubernetes workflow controller that would watch for created workflows, and fill out the `.status` with most of the logic that currently exists in the [`CreateWorkflow()`][createwfrpc] RPC.
Review comment (Member):
The format of status is not discussed here, but something to consider, and to call out if utilized, is conditioned statuses: https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md#typical-status-properties
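For illustration, a hedged sketch of a condition-based `.status` using the standard `metav1.Condition` type from the conventions linked above; the surrounding `WorkflowStatus` fields are hypothetical, not the proposal's final schema:

```go
package v1alpha1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// WorkflowStatus is a hypothetical status block following the typical
// status properties from the Kubernetes API conventions.
type WorkflowStatus struct {
	// State is a coarse summary, e.g. "Pending", "Running", "Succeeded".
	State string `json:"state,omitempty"`

	// Conditions carry the fine-grained detail: each metav1.Condition has
	// a Type, a Status (True/False/Unknown), a Reason, a Message, and a
	// LastTransitionTime, exactly as the conventions describe.
	Conditions []metav1.Condition `json:"conditions,omitempty"`
}
```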

Review comment (Member):
crossplane/crossplane-runtime#198 includes some relatable examples, benefits, and counter-point.

@micahhausler (Contributor, Author) replied:
Do you want to see an API schema in this proposal? I was thinking it would be better to discuss and iterate on specific API structures in tinkerbell/tink#542

Review comment:
Speaking of Crossplane, it looks like @displague started on https://github.com/crossplane-contrib/provider-tinkerbell a while back, which would at least provide a KRM compliant shim to Tinkerbell. Perhaps an alternative that wouldn't require rearchitecting the project?

@displague (Member) commented Oct 25, 2021:
(Going a bit off-topic to consider what this KRM implementation means for a Crossplane Runtime Model (XRM) implementation, and perhaps how they could be the same)

> Speaking of Crossplane, it looks like @displague started on a Tinkerbell Crossplane provider a while back, which would provide a KRM compliant shim to Tinkerbell

Typically the XRM spec mirrors an external cloud provider API spec, as the aforementioned Tinkerbell Crossplane provider does. But if the Tinkerbell spec is implemented as KRM and does not implement XRM, then a Crossplane provider would need to interact with Tinkerbell through a mirroring KRM spec. The cluster could then hold two nearly identical resources representing the same composable resource.

We could overlook this, as Pods and CronJobs or ReplicaSets have a thin wrapping relationship.

However, this suggests to me that it should be possible for a controller to implement an XRM compatible interface while adopting the KRM. The benefits of doing so would be that Crossplane compositions (XR) can be applied, managed by Crossplane. The underlying service, Tinkerbell bare metal orchestration, would transparently serve as a Crossplane Provider. The Tinkerbell service would not be dependent on Crossplane's installation to add this layer of compatibility.

One complication that comes to mind is the expectation that Crossplane resources have a Provider; in this approach, the resources would be the provider. Perhaps, like the Crossplane Helm provider, the provider configuration would refer to the host cluster? Would the default be to report with the local cluster? Would alternative provider configurations offer connection details for a non-KRM Tinkerbell backplane? Another KRM Tinkerbell backplane?

Adopting Crossplane's packaging mechanisms could provide additional benefits especially for service updates. Although, in this approach, it's not clear to me what benefits this holds over Helm or Kustomize manifests.

What does this mean in practice? The Tinkerbell KRM implementation (starting with tinkerbell/tink#542) would adopt XRM types like Referencers, ResourceStatus, ResourceSpec, and ProviderConfig (see the sketch after the links below).

https://crossplane.io/docs/v1.4/contributing/provider_development_guide.html#provider-development-guide
https://blog.crossplane.io/crossplane-vs-cloud-infrastructure-addons/ (by @negz) provides the relevant Crossplane context.
https://danielmangum.com/posts/crossplane-infrastructure-llvm/ (by @hasheddan) discusses the versatility of the Crossplane resource model.
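To make the last point concrete, a speculative sketch of a Tinkerbell CRD adopting crossplane-runtime's XRM types; the Tinkerbell-specific fields are hypothetical:

```go
package v1alpha1

import (
	xpv1 "github.com/crossplane/crossplane-runtime/apis/common/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// WorkflowSpec embeds xpv1.ResourceSpec, which carries the XRM fields
// (ProviderConfigReference, WriteConnectionSecretToReference, ...) that
// Crossplane tooling expects on a managed resource.
type WorkflowSpec struct {
	xpv1.ResourceSpec `json:",inline"`

	// TemplateRef and HardwareMap are hypothetical Tinkerbell fields.
	TemplateRef string            `json:"templateRef"`
	HardwareMap map[string]string `json:"hardwareMap,omitempty"`
}

// WorkflowStatus embeds xpv1.ResourceStatus, which provides the XRM
// condition set (Ready, Synced).
type WorkflowStatus struct {
	xpv1.ResourceStatus `json:",inline"`
}

type Workflow struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   WorkflowSpec   `json:"spec"`
	Status WorkflowStatus `json:"status,omitempty"`
}
```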

Review comment (Member, follow-up):
This is likely too speculative to be actionable. This would be the first controller to act in this XRM compatible way without being first-and-foremost a Crossplane provider.


## APIs

The following [Tinkerbell Workflow APIs][wf-apis] are used by the Tink worker, and would remain.
Review comment (Member):
Why can these services not be initially replaced by the Kubernetes model? Do you see them eventually being removed?

Do status conditions, events, logs, and additional CRDs offer the means to move these entities to the Kubernetes API?

.. I think I see.

These are the APIs needed for the provisioning machine to get actions and update state. That could be more clearly defined.

Why would Tink workers not become Kubernetes clients, for example? Each machine could use a unique service account which would open up more possibilities (potentially reducing the machine threat model).

@micahhausler (Contributor, Author) replied:

> Why can these services not be initially replaced by the Kubernetes model? Do you see them eventually being removed?

The main reason to only include support for worker APIs is to shorten the migration period and the amount of effort required. The ultimate goal is to make all non-worker Tink clients talk to Kubernetes, so instead of a two-step migration where boots/hegel still talk to the Tink API, we go directly to the new model. So yes, I do see them eventually being removed.

> These are the APIs needed for the provisioning machine to get actions and update state. That could be more clearly defined.

Will do.

> Why would Tink workers not become Kubernetes clients, for example? Each machine could use a unique service account which would open up more possibilities (potentially reducing the machine threat model)

I am not inclined to make Tink workers K8s clients at this time. The main reason is that I want to limit what an individual (unauthenticated/untrusted) worker can do or see in the API. K8s authz lacks the ability to easily express something like "this worker should only be able to view the specific CRD that matches the worker's name." We basically want attribute-based authz, and K8s is all role-based. Yes, you could create a specific RBAC role for each worker, but that gets messy (see the sketch below). It's much easier to have a trusted deputy like the Tink API that can correctly and securely propagate state changes to the storage layer.
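For context, a hedged sketch of the per-worker RBAC shape dismissed above; the API group and resource names are hypothetical stand-ins for the eventual Tinkerbell CRDs:

```go
package rbacsketch

import (
	rbacv1 "k8s.io/api/rbac/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// roleForWorker builds the kind of one-Role-per-worker policy the comment
// calls messy: RBAC can only scope reads to named objects, so every worker
// needs its own Role (and RoleBinding) naming exactly its own Workflow.
func roleForWorker(workerName string) *rbacv1.Role {
	return &rbacv1.Role{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "worker-" + workerName,
			Namespace: "tinkerbell",
		},
		Rules: []rbacv1.PolicyRule{{
			APIGroups:     []string{"tinkerbell.org"}, // hypothetical group
			Resources:     []string{"workflows"},
			Verbs:         []string{"get", "watch"},
			ResourceNames: []string{workerName}, // one rule per worker
		}},
	}
}
```

Every new worker then needs its own Role and RoleBinding, which is exactly the bookkeeping burden the comment refers to.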

@jacobweinstock previously approved these changes on Nov 2, 2021

@jacobweinstock (Member) left a review:
I feel that this has been vetted quite a bit and has been open for comment for a while now. I believe @micahhausler has addressed all comments.

@thebsdbox previously approved these changes on Nov 2, 2021

@thebsdbox (Contributor) left a review:

Awesome, I'd love to see this moved forward.

@micahhausler (Contributor, Author) commented:

I've updated the proposal and addressed comments.

@micahhausler (Contributor, Author) commented:

@displague Fixed the cutoffs and did a once-over of the whole thing.

@displague previously approved these changes on Nov 16, 2021

@displague (Member) left a review:

Thanks for the details and revisiting the document for follow-ups, @micahhausler!

@micahhausler (Contributor, Author) commented:

Updated the status to Published.

@jacobweinstock (Member) commented Nov 18, 2021

Hey @tstromberg, when/if you have some cycles, would you mind checking this out again, please?

@tstromberg (Contributor) left a review:
Let it be merged! Thank you for addressing the bootstrapping concerns.

mergify bot merged commit 8390df9 into tinkerbell:main on Nov 24, 2021

mergify bot added a commit to tinkerbell/smee that referenced this pull request on May 5, 2022
Signed-off-by: Micah Hausler <[email protected]>

## Description

The new backend uses the Kubernetes CRD types in Tinkerbell core. Boots could use a bit of refactoring/reorganization in how the backends work, but for now I've tried to fit into the existing structure.
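A rough sketch of what a CRD-backed lookup could look like with controller-runtime; the import path for the Tinkerbell core types and the field-index key are assumptions, not the actual code in this branch:

```go
package backend

import (
	"context"
	"fmt"

	"sigs.k8s.io/controller-runtime/pkg/client"

	// Assumed import path for Tinkerbell core's CRD types.
	"github.com/tinkerbell/tink/pkg/apis/core/v1alpha1"
)

// HardwareByMAC resolves a DHCP request to a Hardware object via a
// field index. The index key ".spec.interfaces.dhcp.mac" is an assumption
// and would need to be registered with a FieldIndexer at manager setup.
func HardwareByMAC(ctx context.Context, c client.Client, mac string) (*v1alpha1.Hardware, error) {
	var list v1alpha1.HardwareList
	if err := c.List(ctx, &list, client.MatchingFields{".spec.interfaces.dhcp.mac": mac}); err != nil {
		return nil, err
	}
	if len(list.Items) == 0 {
		return nil, fmt.Errorf("no hardware found for MAC %s", mac)
	}
	return &list.Items[0], nil
}
```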

## Why is this needed

This is the boots side of work for tinkerbell/proposals#46

Fixes: #

## How Has This Been Tested?

Tested with real bare-metal workers using an image built from this branch. I will add E2E tests in a follow-up PR, or can add them to this branch.

## How are existing users impacted? What migration steps/scripts do we need?

No impact to existing users

## Checklist:

I have:

- [ ] updated the documentation and/or roadmap (if required)
- [ ] added unit or e2e tests
- [ ] provided instructions on how to upgrade
Labels: priority/important-longterm, ready-to-merge