
[WIP] Enhancement for installing OpenShift natively via Cluster API #1479

Conversation

JoelSpeed (Contributor)

We are exploring the option of installing OpenShift via Cluster API, by creating a Bootstrap and ControlPlane provider implementation as well as some supplemental infrastructure provisioning controllers. This enhancement details the expected workflow for this, assuming that we already have a working Cluster API ControlPlane.

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 19, 2023

openshift-ci bot commented Sep 19, 2023

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all


openshift-ci bot commented Sep 19, 2023

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from joelspeed. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


A Cluster API control plane and bootstrap provider will be created to handle the orchestration and configuration of the OpenShift cluster during the bootstrap process.
The control plane provider will be responsible for creating (and destroying) the bootstrap node, and provisioning the control plane nodes once the bootstrap node is ready.
The bootstrap provider will be responsible for generating the correct ignition data for the bootstrap node, control plane nodes, and worker nodes.
Contributor

How do the bootstrap provider and MCO intersect? Is the bootstrap provider just creating ignition stubs pointing to the MCO for worker and control plane (as the installer does today)?

Contributor Author

Yeah that's correct. I think in the future we could start to merge some of the responsibilities, but at the moment we assume no connectivity between the guest cluster and the management cluster, so having the guest pull configuration from the management cluster isn't expected.
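As a rough illustration of the stub approach described above, here is a minimal Go sketch of a pointer ignition config that merges in the real configuration from the Machine Config Server. The structs, URL, and port are placeholders for illustration; this is not the actual installer or MCO code.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Minimal subset of the Ignition v3 config schema needed for a pointer/stub
// config. A real provider would use the upstream ignition types; these structs
// are defined locally only to keep the sketch self-contained.
type ignConfig struct {
	Ignition ignition `json:"ignition"`
}

type ignition struct {
	Version string    `json:"version"`
	Config  mergeSpec `json:"config"`
}

type mergeSpec struct {
	Merge []source `json:"merge"`
}

type source struct {
	Source string `json:"source"`
}

// pointerIgnition returns a stub ignition config that tells the node to fetch
// its real configuration from the Machine Config Server for the given pool.
func pointerIgnition(mcsURL, pool string) ([]byte, error) {
	cfg := ignConfig{
		Ignition: ignition{
			Version: "3.2.0",
			Config: mergeSpec{
				Merge: []source{{Source: fmt.Sprintf("%s/config/%s", mcsURL, pool)}},
			},
		},
	}
	return json.Marshal(cfg)
}

func main() {
	// Hypothetical in-cluster MCS endpoint; the real URL and TLS material come
	// from the installer-generated assets.
	data, _ := pointerIgnition("https://api-int.example.openshift.local:22623", "worker")
	fmt.Println(string(data))
}
```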


## Proposal

A Cluster API control plane and bootstrap provider will be created to handle the orchestration and configuration of the OpenShift cluster during the bootstrap process.
Contributor

What are these providers? Are they controllers?

Contributor Author (@JoelSpeed, Sep 20, 2023)

Yeah, the idea is to build small controllers that handle the different parts of the bootstrap process and report back via status objects.

I plan to flesh out exactly what they do in the implementation details section later.
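To make that concrete, here is a hedged sketch, assuming controller-runtime and a hypothetical OpenShiftControlPlane GVK, of one such small controller reporting progress through the resource's status. None of the group, version, or status fields shown here are settled API.

```go
package controllers

import (
	"context"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// controlPlaneGVK is a hypothetical GVK for the OpenShiftControlPlane resource.
var controlPlaneGVK = schema.GroupVersionKind{
	Group:   "controlplane.openshift.io",
	Version: "v1alpha1",
	Kind:    "OpenShiftControlPlane",
}

// OpenShiftControlPlaneReconciler sketches one of the small controllers: it
// reconciles a single OpenShiftControlPlane and reports progress back via
// the resource's status.
type OpenShiftControlPlaneReconciler struct {
	client.Client
}

func (r *OpenShiftControlPlaneReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	cp := &unstructured.Unstructured{}
	cp.SetGroupVersionKind(controlPlaneGVK)
	if err := r.Get(ctx, req.NamespacedName, cp); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// ... create/observe the bootstrap node, generate ignition, etc. ...

	// Report back via status, e.g. mark the control plane initialized once the
	// bootstrap process has produced a functional API server.
	if err := unstructured.SetNestedField(cp.Object, true, "status", "initialized"); err != nil {
		return ctrl.Result{}, err
	}
	return ctrl.Result{}, r.Status().Update(ctx, cp)
}
```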


#### Phase 1

1. Leverage an existing Cluster API control plane to provision infrastructure for OpenShift clusters.
Contributor

I think the content here is great, but these strike me more as implementation details than goals. Consider moving this to the proposal section, and replacing with goals, which would be things like "Enable day-2 management of infrastructure" (I'm not sure if that is a valid goal for this phase, but is just an example).


The `cluster` phase will now skip the `ignition-configs` phase and will instead apply the Cluster API resources generated in the `manifests` phase to the Cluster API control plane.

The installer will directly apply the Cluster API resources to the Cluster API control plane.
Contributor

This sounds a lot like oc apply. Is there a future where the 2 command line tools converge at all?

Contributor Author

I think that's up to the installer team and oc folks to decide, but you're right, it is effectively just running an oc apply at this point to take manifests from the laptop or whatever and apply them to the control plane for CAPI.

I know there's prior art for the installer embedding binaries, so it's possible we embed the oc binary and run it as a subprocess for this purpose rather than re-inventing the wheel. This would align with a current avenue of exploration: using subprocesses to run the temporary CAPI control plane for provisioning.
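For illustration, here is a minimal sketch of what that "effectively oc apply" step could look like if the installer did it directly with a Kubernetes client instead of subprocessing oc. The directory name and field manager are placeholders, not settled behaviour.

```go
package main

import (
	"context"
	"os"
	"path/filepath"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/client/config"
	"sigs.k8s.io/yaml"
)

// applyManifests reads the Cluster API manifests generated by the installer
// from a directory and server-side applies them to the (temporary) CAPI
// control plane, roughly what `oc apply --server-side` would do.
func applyManifests(ctx context.Context, c client.Client, dir string) error {
	paths, err := filepath.Glob(filepath.Join(dir, "*.yaml"))
	if err != nil {
		return err
	}
	for _, path := range paths {
		raw, err := os.ReadFile(path)
		if err != nil {
			return err
		}
		obj := &unstructured.Unstructured{}
		if err := yaml.Unmarshal(raw, &obj.Object); err != nil {
			return err
		}
		if err := c.Patch(ctx, obj, client.Apply,
			client.FieldOwner("openshift-install"), client.ForceOwnership); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	cfg, err := config.GetConfig() // kubeconfig pointing at the CAPI control plane
	if err != nil {
		panic(err)
	}
	c, err := client.New(cfg, client.Options{})
	if err != nil {
		panic(err)
	}
	if err := applyManifests(context.Background(), c, "cluster-api-manifests"); err != nil {
		panic(err)
	}
}
```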


#### Opinionated installer generated infrastructure definitions

The installer binary will be updated to transform the existing install config into Cluster API resources.
Contributor

Is there an upstream tool like this for CAPI or do they rely on everyone using the APIs directly?

Contributor Author

This very much depends on the distro and where you're using CAPI. A lot of folks interact directly with the resources, especially those who are leveraging CAPI on their own infrastructure, like certain customers I'm aware of. Then there are people who have integrated it into their product, and some of those have wrapped it. For example, Tanzu exposes an abstraction on top of the CAPI resources rather than exposing the resources directly, so they will have something similar to this logic.
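As a sketch of the kind of transformation the installer would perform, the following assumes the upstream Cluster API types plus a hypothetical OpenShiftControlPlane group/version; the namespace, names, and referenced kinds are illustrative only, and a real implementation would emit the InfraCluster, machine templates, and more from the rest of the install config.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/yaml"
)

// installConfig is a tiny stand-in for the fields of install-config.yaml that
// matter for this sketch.
type installConfig struct {
	ClusterName string
	Region      string
}

// clusterFromInstallConfig shows the shape of the transformation: one upstream
// Cluster object wiring together an infrastructure cluster and the
// (hypothetical) OpenShiftControlPlane.
func clusterFromInstallConfig(ic installConfig) *clusterv1.Cluster {
	return &clusterv1.Cluster{
		ObjectMeta: metav1.ObjectMeta{
			Name:      ic.ClusterName,
			Namespace: "openshift-cluster-api-guests", // placeholder namespace
		},
		Spec: clusterv1.ClusterSpec{
			InfrastructureRef: &corev1.ObjectReference{
				APIVersion: "infrastructure.cluster.x-k8s.io/v1beta2",
				Kind:       "AWSCluster",
				Name:       ic.ClusterName,
			},
			ControlPlaneRef: &corev1.ObjectReference{
				APIVersion: "controlplane.openshift.io/v1alpha1", // hypothetical group/version
				Kind:       "OpenShiftControlPlane",
				Name:       ic.ClusterName,
			},
		},
	}
}

func main() {
	out, _ := yaml.Marshal(clusterFromInstallConfig(installConfig{ClusterName: "demo", Region: "us-east-1"}))
	fmt.Print(string(out))
}
```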

This secret will be referenced in the `OpenShiftControlPlane` spec.

To allow the user to customise manifests, the installer will take all manifests from the `manifests` and `openshift` folders and wrap them into secrets to be applied to the cluster namespace.
Each secret will be annotated to indicate that it should be included in the ignition generation phase, and to identify whether it was a `manifest` file or `openshift` file.
Contributor

What sort of validation can we do for user-provided manifests? Something as simple as a syntax error won't be caught until the secret is unpacked so the manifest inside it can be applied to the new cluster, right?

Contributor Author

Correct, but I don't think we do any validation there today, so I think that is a pre-existing problem. The installer currently loads the files from disk and puts them into the bootstrap ignition directly, so if any file has been edited and is malformed today, it would result in the same UX.
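For context on what such a secret might look like, here is a hedged sketch of the wrapping step. The annotation keys and naming scheme are placeholders, since the enhancement only specifies that the secrets are annotated with their purpose and source folder; note also that, as discussed above, no validation of the file contents happens at this point.

```go
package installer

import (
	"os"
	"path/filepath"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Hypothetical annotation keys marking a secret for ignition generation and
// recording which folder the file came from ("manifests" or "openshift").
const (
	includeAnnotation = "install.openshift.io/include-in-ignition"
	sourceAnnotation  = "install.openshift.io/source-folder"
)

// wrapManifest packages a single on-disk manifest into a Secret destined for
// the cluster namespace on the CAPI management side.
func wrapManifest(clusterNamespace, sourceFolder, path string) (*corev1.Secret, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	name := filepath.Base(path)
	return &corev1.Secret{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "manifest-" + name,
			Namespace: clusterNamespace,
			Annotations: map[string]string{
				includeAnnotation: "true",
				sourceAnnotation:  sourceFolder,
			},
		},
		Data: map[string][]byte{name: data},
	}, nil
}
```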


The bootstrap provider will read the `OpenShiftControlPlane` spec to determine the install state and manifests required to complete the bootstrap ignition generation, and will reconstruct the required structure for the installer to complete the `ignition-configs` phase in-cluster.

Once all resources are applied, the installer will watch the Cluster API control plane resource status to determine when the cluster is ready.
Contributor

How quickly can we turn a failure into an error message on the console where the user ran openshift-install? How many layers of APIs need to propagate the error message?

Can the installer receive fine-grained status/logs?

Contributor Author

At each phase we know which resources need to be checked. So, the installer can do the following:

  • Watch the status of the Cluster object until it reports that it is provisioned
  • While watching the Cluster, until Cluster.Status.InfrastructureReady is true, watch the InfraCluster (e.g. AWSCluster) status
  • Once Cluster.Status.InfrastructureReady is true, watch OpenShiftControlPlane.Status until Initialized is true

Then it goes to watching clusteroperators in the way it already does.

So I'd expect the installer to start interpreting the conditions on these two objects and reporting the errors when they change. In my experience they are pretty good at updating the status.

That said, we can also stream the logs from the controllers in a debug mode if we wanted to. Perhaps rather than dumping to the terminal output, we could capture the CAPI logs into a debug file/folder, which I think we already do for Terraform.
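A rough sketch of that watch sequence from the installer side, assuming a polling loop over unstructured objects and a hypothetical OpenShiftControlPlane GVK; a real implementation would use watches and surface condition messages rather than a sleep loop.

```go
package install

import (
	"context"
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// waitForBootstrap waits for the Cluster to report infrastructure readiness,
// then for the (hypothetical) OpenShiftControlPlane to report Initialized,
// before the installer falls back to its existing clusteroperator watching.
func waitForBootstrap(ctx context.Context, c client.Client, key types.NamespacedName) error {
	clusterGVK := schema.GroupVersionKind{Group: "cluster.x-k8s.io", Version: "v1beta1", Kind: "Cluster"}
	cpGVK := schema.GroupVersionKind{Group: "controlplane.openshift.io", Version: "v1alpha1", Kind: "OpenShiftControlPlane"}

	waitFor := func(gvk schema.GroupVersionKind, fields ...string) error {
		for {
			obj := &unstructured.Unstructured{}
			obj.SetGroupVersionKind(gvk)
			if err := c.Get(ctx, key, obj); err != nil {
				return err
			}
			if ready, _, _ := unstructured.NestedBool(obj.Object, fields...); ready {
				return nil
			}
			// A real installer would also read status conditions here and echo
			// their messages to the user as they change.
			select {
			case <-ctx.Done():
				return ctx.Err()
			case <-time.After(10 * time.Second):
			}
		}
	}

	if err := waitFor(clusterGVK, "status", "infrastructureReady"); err != nil {
		return err
	}
	fmt.Println("infrastructure ready, waiting for control plane initialization")
	return waitFor(cpGVK, "status", "initialized")
}
```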


There are 2 ways to achieve this:
* By using the installer and customising the Cluster API resources generated by the installer.
* By manually crafting the infrastructure resources and applying them to the Cluster API control plane, for example, for externally supported platforms.
Contributor

The previous section talked about manifests wrapped in Secrets, too. Would users be expected to do that if they were using the API directly?

And the install state secret?

Contributor Author

The previous section talked about manifests wrapped in Secrets, too. Would users be expected to do that if they were using the API directly?

Yes, to a degree. If they want to customise the manifests, they need to generate them on their own machine and then pass them into the cluster somehow. Assuming we are still using CAPI, it's reasonable to expect that the installer can take the customised manifests and do this wrapping for them.

For the install state, I think this will be a temporary workaround to avoid rewriting the whole of the installer in one go, but yes, again, I'd probably expect that the installer will be responsible for this. That said, the installer can reconstruct the install state from the install-config. So if a user uploaded the install-config and the manifests correctly, that would be sufficient IIUC.

There's a bit more experimentation required here to achieve some of this, I think.

#### OpenShiftControlPlane

The `OpenShiftControlPlane` resource will be the configuration for the control plane provider implementation for Cluster API.
It must adhere to the upstream Cluster API [control plane provider API contract][control-plane-api-contract].
Contributor

How much of what's described below is part of that contract versus unique to OpenShift?

Contributor Author

The entirety of machineTemplate is part of the contract; I will try to make that obvious.

The additions on top of that, such as which manifests to load and the install state secret, are OpenShift-specific and need finessing via a POC to make sure they're what we want before we go too far down this road.
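To illustrate that split, here is a speculative sketch of how the spec might separate the upstream contract fields from the OpenShift-specific additions. Every field name below is a placeholder pending the POC mentioned above.

```go
package v1alpha1

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// OpenShiftControlPlaneSpec sketches a spec split between fields required by
// the upstream Cluster API control plane contract and OpenShift-specific
// additions.
type OpenShiftControlPlaneSpec struct {
	// Fields required (or conventional) for the upstream control plane contract.

	// Version is the version to install.
	Version string `json:"version"`

	// Replicas is the desired number of control plane machines.
	Replicas *int32 `json:"replicas,omitempty"`

	// MachineTemplate references the infrastructure machine template used to
	// create control plane machines; its shape is dictated by the contract.
	MachineTemplate MachineTemplate `json:"machineTemplate"`

	// OpenShift-specific additions below.

	// InstallStateSecret references the secret carrying the installer's
	// serialized install state, used to resume ignition generation in-cluster.
	InstallStateSecret corev1.LocalObjectReference `json:"installStateSecret"`

	// ManifestSelector selects the secrets wrapping user-supplied manifests
	// that should be included when generating the bootstrap ignition.
	ManifestSelector *metav1.LabelSelector `json:"manifestSelector,omitempty"`
}

// MachineTemplate mirrors the contract's machineTemplate structure.
type MachineTemplate struct {
	// InfrastructureRef points at the provider-specific machine template,
	// e.g. an AWSMachineTemplate.
	InfrastructureRef corev1.ObjectReference `json:"infrastructureRef"`
}
```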

@openshift-bot

Inactive enhancement proposals go stale after 28d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle stale.
Stale proposals rot after an additional 7d of inactivity and eventually close.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 25, 2023
@JoelSpeed (Contributor Author)

/remove-lifecycle stale

Intending to get back to this and address feedback soon

@openshift-ci openshift-ci bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 25, 2023
@openshift-bot

Inactive enhancement proposals go stale after 28d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle stale.
Stale proposals rot after an additional 7d of inactivity and eventually close.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 23, 2023
@JoelSpeed (Contributor Author)

/remove-lifecycle stale

@openshift-ci openshift-ci bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 23, 2023
@openshift-bot

Inactive enhancement proposals go stale after 28d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle stale.
Stale proposals rot after an additional 7d of inactivity and eventually close.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 22, 2023
@openshift-bot

Stale enhancement proposals rot after 7d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle rotten.
Rotten proposals close after an additional 7d of inactivity.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci openshift-ci bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Dec 29, 2023
@JoelSpeed (Contributor Author)

/remove-lifecycle rotten

@openshift-ci openshift-ci bot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Jan 3, 2024
@openshift-bot

Inactive enhancement proposals go stale after 28d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle stale.
Stale proposals rot after an additional 7d of inactivity and eventually close.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 1, 2024
@openshift-bot

Stale enhancement proposals rot after 7d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle rotten.
Rotten proposals close after an additional 7d of inactivity.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci openshift-ci bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Feb 8, 2024
@dhellmann (Contributor)

#1555 is changing the enhancement template in a way that will cause the header check in the linter job to fail for existing PRs. If this PR is merged within the development period for 4.16 you may override the linter if the only failures are caused by issues with the headers (please make sure the markdown formatting is correct). If this PR is not merged before 4.16 development closes, please update the enhancement to conform to the new template.

@openshift-bot

Rotten enhancement proposals close after 7d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Reopen the proposal by commenting /reopen.
Mark the proposal as fresh by commenting /remove-lifecycle rotten.
Exclude this proposal from closing again by commenting /lifecycle frozen.

/close

@openshift-ci openshift-ci bot closed this Feb 21, 2024
openshift-ci bot commented Feb 21, 2024

@openshift-bot: Closed this PR.

In response to this:

Rotten enhancement proposals close after 7d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Reopen the proposal by commenting /reopen.
Mark the proposal as fresh by commenting /remove-lifecycle rotten.
Exclude this proposal from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@dhellmann (Contributor)

(automated message) This pull request is closed with lifecycle/rotten. It does not appear to be linked to a valid Jira ticket. Should the PR be reopened, updated, and merged? If not, removing the lifecycle/rotten label will tell this bot to ignore it in the future.

8 similar comments
