This repository has been archived by the owner on Jun 28, 2023. It is now read-only.

Offer minimal deployment model that supports development and experimentation of Tanzu #2266

Closed
joshrosso opened this issue Oct 18, 2021 · 58 comments · Fixed by #2376
Labels: kind/feature (A request for a new feature), proposal/acccepted (Change is accepted)

Comments

@joshrosso (Contributor) commented Oct 18, 2021

Asks

  1. Read this proposal
  2. Try out the proposed model
  3. Vote 👍 or 👎 on this issue
  4. If you have additional feedback, respond to this issue

We aim to accept or reject this proposal 60 days after opening (12/17/2021).

Proposal

⚠️ The name of this feature was still being determined while this proposal was discussed, so you may see it referred to as local, standalone, and unmanaged-cluster in various places. We have decided to move forward with the name unmanaged-cluster. Please note that references to local or standalone (not standalone-cluster) mean unmanaged-cluster.

🚨 This proposal has been partially implemented to help further the conversation around whether we should accept it in this project. Read here for details on how to try it and design details. 🚨

Standalone clusters (SAC) are our attempt to provide workload clusters without the need for a long-running management cluster. With this, we intended to:

  • Minimize resource requirements for clusters
  • Reduce onboarding time from download to cluster

To accomplish this, we re-purposed Cluster API and extended TKG-Lib (via tanzu-framework) to create the standalone cluster model. With this model in use for many months, we've learned that:

  1. SAC users often wanted a single-node or local cluster.
  2. Users wanting more complex cluster-lifecycle management need managed clusters.
  3. Our usage of cluster-api and tkg-lib created a poor UX for 1 (above) and a stunted experience for 2 (above).
  4. Maintaining support for this functionality in our dependencies has had significant overhead, as it's a use case tangential to their primary purpose.

We believe that solving 1 is high-value for those using Tanzu. We also believe that attempting to replicate cluster-lifecycle management on a single node comes at an inappropriate cost (via dependencies).

We propose the deprecation of standalone-cluster in favor of introducing local (clusters).

High-level implementation details

In Tanzu, a Management Cluster processes a TanzuKubernetesRelease (TKR). It uses the TKR to determine how to create a workload cluster.

In the local model, we'll move the management cluster’s TKR processing client-side. After processing the TKR, we have all the information needed to create a Tanzu [workload] cluster that looks similar to one created by a management-cluster. See the following depiction of this relationship.

(Diagram: the TKR is processed client-side, and the resulting cluster properties are handed to a local provider that creates the cluster.)

As seen above, after parsing the TKR (client-side) and understanding properties of the to-be-created cluster, we can call into a local provider to create a minimal cluster. By leveraging a provider abstraction (interface), we can insulate ourselves from the underlying details of how the infra/host/desktop-env are created. What matters is we receive a kubeconfig with admin access to the API server.
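To make the abstraction concrete, here is a minimal sketch of what such a provider interface could look like in Go. The names (ClusterProvider, ClusterConfig, etc.) are illustrative, not the actual implementation; the only contract the proposal relies on is "create a cluster, hand back an admin kubeconfig."

```go
package provider

import "context"

// ClusterConfig carries the properties derived from parsing a TKR
// client-side: Kubernetes version, node image, and so on.
// Field names here are illustrative only.
type ClusterConfig struct {
	Name              string
	KubernetesVersion string
	NodeImage         string
	PortMappings      []string
}

// ClusterProvider abstracts the thing that actually creates the local
// cluster (kind, CAPD, a VM-based tool, ...). Callers only care that
// Create returns an admin kubeconfig.
type ClusterProvider interface {
	// Create bootstraps a cluster and returns the raw admin kubeconfig.
	Create(ctx context.Context, cfg ClusterConfig) (kubeconfig []byte, err error)
	// Delete tears the cluster down.
	Delete(ctx context.Context, name string) error
	// List returns the names of clusters this provider knows about.
	List(ctx context.Context) ([]string, error)
}
```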

Our initial provider implementation will be kind because it's widely accepted in the Kubernetes community. The following gif demonstrates the bootstrap UX for a local Tanzu cluster.

GIFs are broken up to save file size; there is one each for cluster creation, cluster init, and cluster list / deletion.
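As a reference point, a kind-backed implementation along the lines of the provider sketched above might look roughly like the following. This assumes kind's public Go API in sigs.k8s.io/kind/pkg/cluster and is a sketch, not the project's actual code.

```go
package kindprovider

import (
	"context"

	"sigs.k8s.io/kind/pkg/cluster"
)

// KindProvider delegates cluster creation to kind's Go API.
type KindProvider struct {
	provider *cluster.Provider
}

func New() *KindProvider {
	return &KindProvider{provider: cluster.NewProvider()}
}

// Create bootstraps a kind cluster using the node image derived from the
// TKR and returns the admin kubeconfig.
func (k *KindProvider) Create(ctx context.Context, name, nodeImage string) ([]byte, error) {
	if err := k.provider.Create(name, cluster.CreateWithNodeImage(nodeImage)); err != nil {
		return nil, err
	}
	kubeconfig, err := k.provider.KubeConfig(name, false)
	if err != nil {
		return nil, err
	}
	return []byte(kubeconfig), nil
}

// Delete removes the kind cluster.
func (k *KindProvider) Delete(ctx context.Context, name string) error {
	return k.provider.Delete(name, "")
}

// List returns the kind clusters present on the host.
func (k *KindProvider) List(ctx context.Context) ([]string, error) {
	return k.provider.List()
}
```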

local is good for:

  • Workstation use cases of Tanzu
  • Minimal (non-prod) resourced use cases of Tanzu
  • CI/CD validations atop Tanzu
  • Validation of newly created TKRs

local is not meant for:

  • Cluster-lifecycle simulation
    • For this, use management-cluster.

Additionally, this approach would inherently solve many issues we face today:

  • Resource utilization would be drastically reduced.
  • WSL/Docker Desktop could be natively supported.
  • Workstation/Host restarts would be supported.
  • Added support for running multiple clusters.
  • Testing of different component versions (e.g. kapp-controller) would be possible.
  • We would delete our forks of cluster-api and tanzu-framework.

For in-depth implementation details, please see our PR.

Release Plan

  • 0.10.0 will feature this new model alongside the existing standalone-cluster model.
  • In 0.10.0, the existing standalone-cluster model will print a deprecation notice to the user.
  • In 0.11.0, we'll remove the existing standalone-cluster model.

FAQ

This section will be updated as questions come in

  • Q: What about non-local standalone-clusters (AWS, Azure, vSphere)?
    • A: Execution of this proposal will cause a gap for AWS, Azure, and vSphere as there will no longer be non-managed clusters available. However, standalone-clusters are essentially very limited management clusters with a few components ripped out. For users wanting to test and deploy a single cluster in one of those environments, we encourage simply creating a management-cluster and scheduling workloads to it. This is not our production-ready advice, but it gets you the exact functionality (plus some) of the existing standalone-cluster model.
@joshrosso joshrosso added kind/feature A request for a new feature proposal/pending Capability has not yet been accepted by TCE project. Work should not start until accepted. labels Oct 18, 2021
@joshrosso joshrosso added this to the v0.10.0 milestone Oct 18, 2021
@dims commented Oct 18, 2021

❤️ this! +1 to kind

@vrabbi (Contributor) commented Oct 18, 2021

This would be great. I think the issue of host restarts will remain in multi-node local clusters but could be solved this way with single-node clusters.
The only worry is that divergence from CAPI and utilizing a different bootstrapping mechanism makes it further from the standard Tanzu deployment and could mean some things won't work the same. For example, CAPI may set via CABPK some defaults in terms of encryption, ciphers, feature flags, etc. that kind may not. By doing that, we can't commit anymore to the same things working on a local cluster as on a managed cluster, unless I'm missing something.

@joshrosso (Contributor, Author)

The only worry is that divergence from CAPI and utilizing a different bootstrapping mechanism makes it further from the standard Tanzu deployment and could mean some things won't work the same. For example, CAPI may set via CABPK some defaults in terms of encryption, ciphers, feature flags, etc. that kind may not. By doing that, we can't commit anymore to the same things working on a local cluster as on a managed cluster, unless I'm missing something.

Great point. There are ways to translate some of these things to kind (or other providers brought in by the local model): feature flags, for example. My intuition and hope is that customization in CABPK won't fundamentally change the behavior of workload clusters such that packages, etc., work significantly differently.

However, if it did, we could parse these kubeadm customization(s) locally and ensure their behaviors propagate into the underlying provider.
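For illustration, translating a kubeadm-style feature-gate customization into kind could look something like the sketch below, which uses kind's v1alpha4 cluster configuration. This is a hypothetical example of the kind of propagation being discussed, not an agreed design.

```go
package translate

import "sigs.k8s.io/kind/pkg/apis/config/v1alpha4"

// ApplyFeatureGates copies feature gates parsed from a kubeadm/CABPK-style
// customization onto a kind cluster config, so the local cluster honors
// the same flags a managed workload cluster would.
func ApplyFeatureGates(cfg *v1alpha4.Cluster, gates map[string]bool) {
	if cfg.FeatureGates == nil {
		cfg.FeatureGates = map[string]bool{}
	}
	for name, enabled := range gates {
		cfg.FeatureGates[name] = enabled
	}
}
```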

@joshrosso joshrosso changed the title Replace tanzu standalone-cluster model with tanzu local model Proposal: Replace tanzu standalone-cluster model with tanzu local model Oct 18, 2021
@jorgemoralespou (Contributor)

Here are my comments:
Pros:

  • Speed of provisioning
  • Seems to work
  • Provides an easy way to support multiple local clusters and restarts (although restart doesn't work today)

Cons:

  • Not aligned with the rest of the ClusterAPI way of doing things (although I'm not sure this is a pure con)
  • Does not allow creating remote "standalone" clusters, which many developers might want/need, as they would probably not run management clusters (too complicated), and reusing the management-cluster for regular use does not work easily because of differences in controllers, etc.

Things that would be important for developers (or similarly less-experienced users):

  • Being able to start/stop (lifecycle) a cluster, in addition to create, delete, list
  • Being able to reconfigure the cluster in case of IP roam. If I create a cluster and the IP of my machine changes, I want the cluster to still work.
  • Ability to inject a trusted CA into the nodes (to create trust with the host by reusing a host-created/personal CA)
  • Listing in a verbose mode (so one can see status within the lifecycle of a cluster)
  • Ability to easily switch contexts between running clusters
  • Ability to disable those Kind icons at the beginning of each output line
  • Easy definition/configuration of the clusters, since it no longer adheres to the ClusterAPI config
  • Ability to modify the host to access the nodes/ingress via a DNS.

Overall I like this approach, even if it removes some of the benefits the "standalone" cluster had, given how hard implementing a local controller via KCP seemed; I would have preferred that route, though, to align more with ClusterAPI and TKG/Tanzu.

@qnetter (Contributor) commented Oct 18, 2021

I'm having a slow day - what are the benefits over creating a management cluster via CAPD and scheduling workloads on it?

@jorgemoralespou (Contributor)

I have seen some:

  • speed, 10x faster to have a cluster
  • resources required, as you're only creating one cluster at any time (versus 2 for standalone: the bootstrap one and the standalone one)
  • support for many local customizations that CAPD does not provide, as it was mostly designed for testing

I'm sure there are more.

CAPD is not a proper infrastructure provider in ClusterAPI, as it was designed for the single purpose of unit tests. The code would need to be modified to become a proper ClusterAPI infrastructure provider. There's a lot of technical debt that makes things harder. I guess that's the biggest hidden benefit.

@joshrosso (Contributor, Author)

Exactly what @jorgemoralespou said, plus:

  • To respect TKRs and spin up a more realistic Tanzu workload cluster, you still need to create a workload cluster from the management cluster
    • So this means you need a bootstrap cluster + mgmt cluster + workload cluster to get to where this proposal gets.
  • Regarding speed and resources: one of the biggest drivers for this change is how resource-intensive CAPD can be. We've helped countless users with bootstrapping issues, which are often rooted in resource constraints.

@qnetter (Contributor) commented Oct 18, 2021

I'm pretty sure the lack of standalone clusters (or the equivalent) on other providers is not a problem, especially given the no-reboot limitation. Do we have a time and resource comparison? I understand the concepts, but I'm wondering 10x what :)

@jorgemoralespou (Contributor)

3 minutes (a local cluster) versus 30 minutes (a standalone cluster) on my machine

@jorgemoralespou (Contributor) commented Oct 18, 2021

I personally don't like Kind and the fact that this proposal misaligns with ClusterAPI, but given the huge difference in experience: I have never wanted to use a CAPD standalone cluster, but I will definitely use local clusters.

@joshrosso (Contributor, Author)

I'm pretty sure the lack of standalone clusters or the equivalent, especially given the no-reboot limitation, on other providers is not a problem. Do we have a time and resource comparison? I understand the concepts but I'm wondering 10x what :)

Don't take this data as scientific, but here's what I got on a very old 2-core Linux box (running a bunch of random stuff):

  • cluster bootstrap (includes installing the tce user managed repo): 2m37.770s
    • unless optimized, used to be 10-30min
  • cluster delete: 0m2.967s
    • unless optimized, used to be 10-30min

@vrabbi (Contributor) commented Oct 18, 2021

Also, by using kind, which has support for things other than Docker, such as podman, it opens the door to such an integration into local clusters if the whole Docker Desktop licensing thing becomes an issue and people move away from it. Linux isn't an issue in that regard, but Mac and Windows users, which I believe would be the vast majority of use cases for TCE local clusters, could benefit from supporting a different container runtime for running the cluster itself.

@joshrosso (Contributor, Author)

I agree. This also speaks to why it's important we get the provider interface right. Beyond kind, we could support a variety of underlying models, as long as, post-cluster-create, we get passed back an admin kubeconfig.

@randomvariable commented Oct 18, 2021

Strong agree with this. If I were a workload developer, I'd be less concerned with simulating cluster lifecycle, and testing high-availability aspects of the workload is more likely to depend on the attributes of the particular cloud I'm deploying to (AZs, storage, etc.), which CAPD isn't a good enough approximation of to be useful.

@nrb (Contributor) commented Oct 19, 2021

@jorgemoralespou Can you say more about this? What differences specifically?

reusing the management-cluster for regular use does not work easily because of difference in controllers, etc...

My understanding is the difference between a management cluster and a workload cluster is 3-5 controllers running in the management cluster for the CAPI information. Other than that, I thought they were identical.

Coming from my Kubernetes app development background, this would have been very helpful for testing locally against a Kubernetes API server. It would have been less useful for certain constructs (backing up volume data with Velero), but as mentioned above, that was often cloud-platform dependent in any case.

@jorgemoralespou (Contributor)

My understanding is the difference between a management cluster and a workload cluster is 3-5 controllers running in the management cluster for the CAPI information. Other than that, I thought they were identical.

The management cluster also has a couple of controllers that install packages on the workload cluster (addon-manager and capabilities-manager), if I'm not mistaken. That's one of the reasons standalone clusters and workload clusters differ when upgrading kapp-controller, as an example.

@vincepri

Have we thought about creating a Kind provider for Cluster API rather than trying to replicate the lifecycle model?

@joshrosso (Contributor, Author) commented Oct 19, 2021

rather than trying to replicate the lifecycle model?

Where do you feel this proposal is replicating the lifecycle model?

The proposal's intent was to say that we don't need a lifecycle model. We just need to bootstrap a cluster on a single node.

For those reading this proposal, we're largely advocating to stay out of the cluster lifecycle problem domain.

On a technical level, our implementation/proposal calls an API equivalent to cluster create.

What that API invokes under the hood can be anything. For example it can:

  • Call kind
    • Note: this is our default/reference implementation because it requires next to no resources, no bootstrap cluster, is widely adopted, and is very fast.
  • Call CAPD
  • Call automation for kvm/esxi/fusion/etc

Once the thing managing the lifecycle finishes bootstrapping the cluster, we receive a kubeconfig back.

Then, that's when this plugin really steps in to do its work. And it makes its decisions on what to do on the cluster based on the declaration of the distribution (which exists in the TKR).
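A rough sketch of that flow, using hypothetical, simplified stand-ins for the real TKR parsing and package-installation logic (none of these names are from the actual codebase):

```go
package unmanaged

import "context"

// TKR is an illustrative, trimmed-down view of the data parsed from a
// TanzuKubernetesRelease client-side.
type TKR struct {
	NodeImage    string
	CorePackages []string
}

// Provider is the part of the provider abstraction this flow needs.
type Provider interface {
	Create(ctx context.Context, name, nodeImage string) (kubeconfig []byte, err error)
}

func parseTKR(path string) (*TKR, error) { /* read + decode the TKR YAML */ return &TKR{}, nil }

func installComponents(ctx context.Context, kubeconfig []byte, tkr *TKR) error {
	// Install kapp-controller, CNI, and the core packages the TKR declares.
	return nil
}

// CreateUnmanagedCluster wires the pieces together: parse the TKR
// client-side, ask the provider to bootstrap a cluster, then install the
// TKR-declared components using the returned admin kubeconfig.
func CreateUnmanagedCluster(ctx context.Context, tkrPath, name string, p Provider) error {
	tkr, err := parseTKR(tkrPath)
	if err != nil {
		return err
	}
	kubeconfig, err := p.Create(ctx, name, tkr.NodeImage)
	if err != nil {
		return err
	}
	return installComponents(ctx, kubeconfig, tkr)
}
```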

Hope this helps, but please let me know if there's overlap I'm not seeing.

@vincepri commented Oct 20, 2021

Thanks for the added context @joshrosso, the above makes sense.

When I started reading through the issue, from the problem statement it seemed that the issues were mostly around the speed of standalone cluster execution. Have we explored avenues that would help both tkg-lib and cluster-api create a CAPD (or similar) based cluster in the minimum amount of time possible?

What's the general role of a standalone/local cluster for our users? From our docs:

This enables our users to try out many projects and technology in the Tanzu portfolio with a reduced barrier of entry.

Is a kind cluster enough for all use cases? What are the implications of not having an active management-workload cluster in this case? If we don't need a lifecycle model, would we never need access to Cluster API primitives like Cluster, ClusterClass (soon), or MachineDeployments?

@randomvariable commented Oct 20, 2021

What are the implications of not having an active management-workload cluster in this case? If we don't need a lifecycle model, would we never need access to Cluster API primitives like Cluster, ClusterClass (soon), or MachineDeployments?

The way I see this is that it's mostly about the local application development workflow for your average business-unit app developer: having the ability to provision a local kind/minikube/whatever cluster as fast as possible and deploying some Tanzu add-ons to give it a Tanzu look and feel. In these instances, we're not really concerned about a full lifecycle model, I think.

I think, however, that maintaining clusterctl save/restore and some of the use cases from the existing standalone cluster, where CAPI does the provisioning, is still useful for everything that isn't "I need a Tanzu-flavoured k8s on my laptop right now and don't eat all my RAM".

@timothysc

I like the idea. I think we should update the docs to outline the user stories of when you would use (A) vs. (B). There may be some gotchas around config parameters, but as we tinker we'll know more. E.g., will Pinniped, Contour, etc. just work?

@vincepri

This all makes sense, thanks folks — it's definitely good to have more context, appreciate all the responses

@randomvariable

Will Pinniped, Contour, etc. just work?

I can see the need to get Contour working in local clusters, but I don't think Pinniped is going to be that useful since the persona this is intended for isn't going to have permissions or the desire to hook up their local dev cluster to an IdP.

@joshrosso (Contributor, Author) commented Jan 4, 2022

A few updates:

  • The decided-on name is unmanaged-cluster.
  • This proposal is approved.
  • In our 0.10.0 release, this model will be available alongside the existing standalone-cluster model.
  • The existing standalone-cluster model will output a deprecation warning to users.

@garrying (Contributor) commented Jan 4, 2022

Minor feedback/thought on the terminal experience of the proof-of-concept, given it is introducing newish output patterns: the secondary text could get difficult to read depending on the minimum-contrast config of the terminal and the user's color scheme. For example, on Solarized, the text is barely visible. (Screenshot: Screen Shot 2022-01-04 at 4.52.58 PM.)

Related: #2730 where we're starting to think about improving visibility of processes.

@stmcginnis (Contributor)

The color formatting looks great... when we are running in a terminal theme that fits well with it. But it's hard to guarantee that, so I think we should either see if we can find some way to query the terminal to get color recommendations based on the theme, or we should just go with the default color and just use indentation to make it easier to read.

There may also be color-blindness concerns with the way we are doing it now.

@joshrosso (Contributor, Author)

or we should just go with the default color and just use indentation to make it easier to read.

^this. The indentation is adequate.

@jpmcb (Contributor) commented Jan 6, 2022

A few thoughts on the color / contrast problem:

  • We should follow the $NO_COLOR env variable standard and disable colors if that is present: https://no-color.org/
  • We should also check whether the terminal supports colors. We should be able to inspect this via $TERM == dumb (see the sketch after this list)
  • And we should check if the terminal is a TTY terminal (we might already be doing this ..)
  • It wouldn't be a bad idea to introduce a color library so that we aren't using raw escape characters to create colors in our code. Something like: https://github.com/fatih/color
    • I don't think there's a way to inspect the terminal's color palette and dynamically set the colors based on what the user is using. But I believe using a color library would enable us to set common sets of colors that should be supported by most color themes (or at least the popular ones) and avoid some of the poor-contrast issues we've created. I agree the indentation is adequate for understanding the UX flow, and if someone really doesn't want the colors, they could set the flag on the command or use $NO_COLOR (when/if we implement that)
  • All in all however, this shouldn't block this getting out to users since it's possible to bypass the colors entirely. All of the above would be user experience enhancements to an already working UX.
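A minimal sketch of what those checks could look like in Go, assuming the fatih/color library and golang.org/x/term for TTY detection (the helper name is illustrative):

```go
package main

import (
	"fmt"
	"os"

	"github.com/fatih/color"
	"golang.org/x/term"
)

// disableColorIfNeeded turns colored output off when NO_COLOR is set,
// when TERM is "dumb", or when stdout is not a TTY (e.g. piped to a file).
func disableColorIfNeeded() {
	if os.Getenv("NO_COLOR") != "" ||
		os.Getenv("TERM") == "dumb" ||
		!term.IsTerminal(int(os.Stdout.Fd())) {
		color.NoColor = true
	}
}

func main() {
	disableColorIfNeeded()
	color.New(color.FgCyan).Println("Creating cluster...")
	fmt.Println("done")
}
```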

@joshrosso joshrosso added proposal/acccepted Change is accepted and removed proposal/pending Capability has not yet been accepted by TCE project. Work should not start until accepted. labels Jan 10, 2022
@kartiklunkad26 (Contributor)

What would the docs look like for unmanaged-cluster in the context of the existing documentation? I haven't really seen anything about docs in the proposal.

@kcoriordan (Contributor)

I'm looking at this today, and tracking here: #2808

@joshrosso joshrosso changed the title Proposal: Introduce a new minimal bootstrap option to replace the existing standalone-cluster plugin Introduce a minimal deployment model that supports development and experimentation of Tanzu Jan 14, 2022
@joshrosso joshrosso changed the title Introduce a minimal deployment model that supports development and experimentation of Tanzu Offer minimal deployment model that supports development and experimentation of Tanzu Jan 14, 2022
@garrying (Contributor)

The links to the proposed model are pointing to an empty README. For posterity, here's a link to the original README contents: https://github.com/vmware-tanzu/community-edition/blob/db06202fdd79271e4b5e80a0aa76387ca78917f0/cli/cmd/plugin/standalone-cluster/README.md
