
[WIP] Enhancement for installing OpenShift natively via Cluster API #1479

Conversation

JoelSpeed (Contributor)

We are exploring the option of installing OpenShift via Cluster API, by creating a Bootstrap and ControlPlane provider implementation as well as some supplemental infrastructure provisioning controllers. This enhancement details the expected workflow for this, assuming that we already have a working Cluster API ControlPlane.

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 19, 2023

openshift-ci bot commented Sep 19, 2023

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all


openshift-ci bot commented Sep 19, 2023

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from joelspeed. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


A Cluster API control plane and bootstrap provider will be created to handle the orchestration and configuration of the OpenShift cluster during the bootstrap process.
The control plane provider will be responsible for creating (and destroying) the bootstrap node, and provisioning the control plane nodes once the bootstrap node is ready.
The bootstrap provider will be responsible for generating the correct ignition data for the bootstrap node, control plane nodes, and worker nodes.
Contributor

How do the bootstrap provider and MCO intersect? Is the bootstrap provider just creating ignition stubs pointing to the MCO for worker and control plane (as the installer does today)?

Contributor Author

Yeah that's correct. I think in the future we could start to merge some of the responsibilities, but at the moment we assume no connectivity between the guest cluster and the management cluster, so having the guest pull configuration from the management cluster isn't expected.
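As a rough illustration of the stub approach described above, here is a minimal Go sketch of a pointer ignition config that merges in the real configuration from the Machine Config Server. The structs, URL, and port are placeholders for illustration; this is not the actual installer or MCO code.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Minimal subset of the Ignition v3 config schema needed for a pointer/stub
// config. A real provider would use the upstream ignition types; these structs
// are defined locally only to keep the sketch self-contained.
type ignConfig struct {
	Ignition ignition `json:"ignition"`
}

type ignition struct {
	Version string    `json:"version"`
	Config  mergeSpec `json:"config"`
}

type mergeSpec struct {
	Merge []source `json:"merge"`
}

type source struct {
	Source string `json:"source"`
}

// pointerIgnition returns a stub ignition config that tells the node to fetch
// its real configuration from the Machine Config Server for the given pool.
func pointerIgnition(mcsURL, pool string) ([]byte, error) {
	cfg := ignConfig{
		Ignition: ignition{
			Version: "3.2.0",
			Config: mergeSpec{
				Merge: []source{{Source: fmt.Sprintf("%s/config/%s", mcsURL, pool)}},
			},
		},
	}
	return json.Marshal(cfg)
}

func main() {
	// Hypothetical in-cluster MCS endpoint; the real URL and TLS material come
	// from the installer-generated assets.
	data, _ := pointerIgnition("https://api-int.example.openshift.local:22623", "worker")
	fmt.Println(string(data))
}
```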


## Proposal

A Cluster API control plane and bootstrap provider will be created to handle the orchestration and configuration of the OpenShift cluster during the bootstrap process.
Contributor

What are these providers? Are they controllers?

Contributor Author (@JoelSpeed, Sep 20, 2023)

Yeah, the idea is to build small controllers that handle the different parts of the bootstrap process and report back via status objects.

I plan to flesh out exactly what they do in the implementation details section later.
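To make that concrete, here is a hedged sketch, assuming controller-runtime and a hypothetical OpenShiftControlPlane GVK, of one such small controller reporting progress through the resource's status. None of the group, version, or status fields shown here are settled API.

```go
package controllers

import (
	"context"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// controlPlaneGVK is a hypothetical GVK for the OpenShiftControlPlane resource.
var controlPlaneGVK = schema.GroupVersionKind{
	Group:   "controlplane.openshift.io",
	Version: "v1alpha1",
	Kind:    "OpenShiftControlPlane",
}

// OpenShiftControlPlaneReconciler sketches one of the small controllers: it
// reconciles a single OpenShiftControlPlane and reports progress back via
// the resource's status.
type OpenShiftControlPlaneReconciler struct {
	client.Client
}

func (r *OpenShiftControlPlaneReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	cp := &unstructured.Unstructured{}
	cp.SetGroupVersionKind(controlPlaneGVK)
	if err := r.Get(ctx, req.NamespacedName, cp); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// ... create/observe the bootstrap node, generate ignition, etc. ...

	// Report back via status, e.g. mark the control plane initialized once the
	// bootstrap process has produced a functional API server.
	if err := unstructured.SetNestedField(cp.Object, true, "status", "initialized"); err != nil {
		return ctrl.Result{}, err
	}
	return ctrl.Result{}, r.Status().Update(ctx, cp)
}
```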


#### Phase 1

1. Leverage an existing Cluster API control plane to provision infrastructure for OpenShift clusters.
Contributor

I think the content here is great, but these strike me more as implementation details than goals. Consider moving this to the proposal section, and replacing with goals, which would be things like "Enable day-2 management of infrastructure" (I'm not sure if that is a valid goal for this phase, but is just an example).


The `cluster` phase will now skip the `ignition-configs` phase and will instead apply the Cluster API resources generated in the `manifests` phase to the Cluster API control plane.

The installer will directly apply the Cluster API resources to the Cluster API control plane.
Contributor

This sounds a lot like oc apply. Is there a future where the 2 command line tools converge at all?

Contributor Author

I think that's up to the installer team and oc folks to decide, but you're right, it is effectively just running an oc apply at this point to take manifests from the laptop or whatever and apply them to the control plane for CAPI.

I know there's prior art for the installer embedding binaries, so it's possible we embed the oc binary and run it as a subprocess for this purpose rather than re-inventing the wheel. This would align with a current avenue of exploration: using subprocesses to run the temporary CAPI control plane for provisioning.
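For illustration, here is a minimal sketch of what that "effectively oc apply" step could look like if the installer did it directly with a Kubernetes client instead of subprocessing oc. The directory name and field manager are placeholders, not settled behaviour.

```go
package main

import (
	"context"
	"os"
	"path/filepath"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/client/config"
	"sigs.k8s.io/yaml"
)

// applyManifests reads the Cluster API manifests generated by the installer
// from a directory and server-side applies them to the (temporary) CAPI
// control plane, roughly what `oc apply --server-side` would do.
func applyManifests(ctx context.Context, c client.Client, dir string) error {
	paths, err := filepath.Glob(filepath.Join(dir, "*.yaml"))
	if err != nil {
		return err
	}
	for _, path := range paths {
		raw, err := os.ReadFile(path)
		if err != nil {
			return err
		}
		obj := &unstructured.Unstructured{}
		if err := yaml.Unmarshal(raw, &obj.Object); err != nil {
			return err
		}
		if err := c.Patch(ctx, obj, client.Apply,
			client.FieldOwner("openshift-install"), client.ForceOwnership); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	cfg, err := config.GetConfig() // kubeconfig pointing at the CAPI control plane
	if err != nil {
		panic(err)
	}
	c, err := client.New(cfg, client.Options{})
	if err != nil {
		panic(err)
	}
	if err := applyManifests(context.Background(), c, "cluster-api-manifests"); err != nil {
		panic(err)
	}
}
```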


#### Opinionated installer generated infrastructure definitions

The installer binary will be updated to transform the existing install config into Cluster API resources.
Contributor

Is there an upstream tool like this for CAPI or do they rely on everyone using the APIs directly?

Contributor Author

This very much depends on the distro and where you're using CAPI. A lot of folks interact directly with the resources, especially those who are leveraging CAPI on their own infrastructure, like certain customers I'm aware of. Then there are people who have integrated it into their product, and some of those have wrapped it. For example, Tanzu exposes an abstraction on top of the CAPI resources rather than exposing the resources directly, so they will have something similar to this logic.
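As a sketch of the kind of transformation the installer would perform, the following assumes the upstream Cluster API types plus a hypothetical OpenShiftControlPlane group/version; the namespace, names, and referenced kinds are illustrative only, and a real implementation would emit the InfraCluster, machine templates, and more from the rest of the install config.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/yaml"
)

// installConfig is a tiny stand-in for the fields of install-config.yaml that
// matter for this sketch.
type installConfig struct {
	ClusterName string
	Region      string
}

// clusterFromInstallConfig shows the shape of the transformation: one upstream
// Cluster object wiring together an infrastructure cluster and the
// (hypothetical) OpenShiftControlPlane.
func clusterFromInstallConfig(ic installConfig) *clusterv1.Cluster {
	return &clusterv1.Cluster{
		ObjectMeta: metav1.ObjectMeta{
			Name:      ic.ClusterName,
			Namespace: "openshift-cluster-api-guests", // placeholder namespace
		},
		Spec: clusterv1.ClusterSpec{
			InfrastructureRef: &corev1.ObjectReference{
				APIVersion: "infrastructure.cluster.x-k8s.io/v1beta2",
				Kind:       "AWSCluster",
				Name:       ic.ClusterName,
			},
			ControlPlaneRef: &corev1.ObjectReference{
				APIVersion: "controlplane.openshift.io/v1alpha1", // hypothetical group/version
				Kind:       "OpenShiftControlPlane",
				Name:       ic.ClusterName,
			},
		},
	}
}

func main() {
	out, _ := yaml.Marshal(clusterFromInstallConfig(installConfig{ClusterName: "demo", Region: "us-east-1"}))
	fmt.Print(string(out))
}
```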

This secret will be referenced in the `OpenShiftControlPlane` spec.

To allow the user to customise manifests, the installer will take all manifests from the `manifests` and `openshift` folders and wrap them into secrets to be applied to the cluster namespace.
Each secret will be annotated to indicate that it should be included in the ignition generation phase, and to identify whether it was a `manifest` file or `openshift` file.
Contributor

What sort of validation can we do for user-provided manifests? Something as simple as a syntax error won't be caught until the secret is unpacked so the manifest inside it can be applied to the new cluster, right?

Contributor Author

Correct, but I don't think we do any validation there today, so I think that is a pre-existing problem. The installer currently loads the files from disk and puts them into the bootstrap ignition directly, so if any file has been edited and is malformed today, it would result in the same UX.
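For context on what such a secret might look like, here is a hedged sketch of the wrapping step. The annotation keys and naming scheme are placeholders, since the enhancement only specifies that the secrets are annotated with their purpose and source folder; note also that, as discussed above, no validation of the file contents happens at this point.

```go
package installer

import (
	"os"
	"path/filepath"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Hypothetical annotation keys marking a secret for ignition generation and
// recording which folder the file came from ("manifests" or "openshift").
const (
	includeAnnotation = "install.openshift.io/include-in-ignition"
	sourceAnnotation  = "install.openshift.io/source-folder"
)

// wrapManifest packages a single on-disk manifest into a Secret destined for
// the cluster namespace on the CAPI management side.
func wrapManifest(clusterNamespace, sourceFolder, path string) (*corev1.Secret, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	name := filepath.Base(path)
	return &corev1.Secret{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "manifest-" + name,
			Namespace: clusterNamespace,
			Annotations: map[string]string{
				includeAnnotation: "true",
				sourceAnnotation:  sourceFolder,
			},
		},
		Data: map[string][]byte{name: data},
	}, nil
}
```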


The bootstrap provider will read the `OpenShiftControlPlane` spec to determine the install state and manifests required to complete the bootstrap ignition generation, and will reconstruct the required structure for the installer to complete the `ignition-configs` phase in-cluster.

Once all resources are applied, the installer will watch the Cluster API control plane resource status to determine when the cluster is ready.
Contributor

How quickly can we turn a failure into an error message on the console where the user ran openshift-install? How many layers of APIs need to propagate the error message?

Can the installer receive fine-grained status/logs?

Contributor Author

At each phase we know which resources need to be checked. So, the installer can do the following:

  • Watch the status of the Cluster object until it reports that it is provisioned
  • While watching the Cluster, until Cluster.Status.InfrastructureReady is true, watch the InfraCluster (e.g. AWSCluster) status
  • Once Cluster.Status.InfrastructureReady is true, watch OpenShiftControlPlane.Status until Initialized is true

Then it goes to watching clusteroperators in the way it already does.

So I'd expect the installer to start interpreting the conditions on these two objects and reporting the errors when they change. In my experience they are pretty good at updating the status.

That said, we can also stream the logs from the controllers in a debug mode if we wanted to. Perhaps rather than dumping to the terminal output, we could capture the CAPI logs into a debug file/folder, which I think we already do for Terraform.
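A rough sketch of that watch sequence from the installer side, assuming a polling loop over unstructured objects and a hypothetical OpenShiftControlPlane GVK; a real implementation would use watches and surface condition messages rather than a sleep loop.

```go
package install

import (
	"context"
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// waitForBootstrap waits for the Cluster to report infrastructure readiness,
// then for the (hypothetical) OpenShiftControlPlane to report Initialized,
// before the installer falls back to its existing clusteroperator watching.
func waitForBootstrap(ctx context.Context, c client.Client, key types.NamespacedName) error {
	clusterGVK := schema.GroupVersionKind{Group: "cluster.x-k8s.io", Version: "v1beta1", Kind: "Cluster"}
	cpGVK := schema.GroupVersionKind{Group: "controlplane.openshift.io", Version: "v1alpha1", Kind: "OpenShiftControlPlane"}

	waitFor := func(gvk schema.GroupVersionKind, fields ...string) error {
		for {
			obj := &unstructured.Unstructured{}
			obj.SetGroupVersionKind(gvk)
			if err := c.Get(ctx, key, obj); err != nil {
				return err
			}
			if ready, _, _ := unstructured.NestedBool(obj.Object, fields...); ready {
				return nil
			}
			// A real installer would also read status conditions here and echo
			// their messages to the user as they change.
			select {
			case <-ctx.Done():
				return ctx.Err()
			case <-time.After(10 * time.Second):
			}
		}
	}

	if err := waitFor(clusterGVK, "status", "infrastructureReady"); err != nil {
		return err
	}
	fmt.Println("infrastructure ready, waiting for control plane initialization")
	return waitFor(cpGVK, "status", "initialized")
}
```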


There are 2 ways to achieve this:
* By using the installer and customising the Cluster API resources generated by the installer.
* By manually crafting the infrastructure resources and applying them to the Cluster API control plane, for example, for externally supported platforms.
Contributor

The previous section talked about manifests wrapped in Secrets, too. Would users be expected to do that if they were using the API directly?

And the install state secret?

Contributor Author

The previous section talked about manifests wrapped in Secrets, too. Would users be expected to do that if they were using the API directly?

Yes, to a degree. If they want to customise the manifests, they need to generate them on their own machine and then pass them into the cluster somehow. Assuming we are still using CAPI, it's reasonable to expect that the installer can take the customised manifests and do this wrapping for them.

For the install state, I think this will be a temporary workaround to avoid rewriting the whole of the installer in one go, but yes, again, I'd probably expect that the installer will be responsible for this. That said, the installer can reconstruct the install state from the install-config. So if a user uploaded the install-config and the manifests correctly, that would be sufficient IIUC.

There's a bit more experimentation required here to achieve some of this, I think.

#### OpenShiftControlPlane

The `OpenShiftControlPlane` resource will be the configuration for the control plane provider implementation for Cluster API.
It must adhere to the upstream Cluster API [control plane provider API contract][control-plane-api-contract].
Contributor

How much of what's described below is part of that contract versus unique to OpenShift?

Contributor Author

The entirety of machineTemplate is part of the contract; I will try to make that obvious.

The additions on top of that, such as which manifests to load and the install state secret, are OpenShift-specific and need finessing via a POC to make sure they're what we want before we go too far down this road.
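To illustrate that split, here is a speculative sketch of how the spec might separate the upstream contract fields from the OpenShift-specific additions. Every field name below is a placeholder pending the POC mentioned above.

```go
package v1alpha1

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// OpenShiftControlPlaneSpec sketches a spec split between fields required by
// the upstream Cluster API control plane contract and OpenShift-specific
// additions.
type OpenShiftControlPlaneSpec struct {
	// Fields required (or conventional) for the upstream control plane contract.

	// Version is the version to install.
	Version string `json:"version"`

	// Replicas is the desired number of control plane machines.
	Replicas *int32 `json:"replicas,omitempty"`

	// MachineTemplate references the infrastructure machine template used to
	// create control plane machines; its shape is dictated by the contract.
	MachineTemplate MachineTemplate `json:"machineTemplate"`

	// OpenShift-specific additions below.

	// InstallStateSecret references the secret carrying the installer's
	// serialized install state, used to resume ignition generation in-cluster.
	InstallStateSecret corev1.LocalObjectReference `json:"installStateSecret"`

	// ManifestSelector selects the secrets wrapping user-supplied manifests
	// that should be included when generating the bootstrap ignition.
	ManifestSelector *metav1.LabelSelector `json:"manifestSelector,omitempty"`
}

// MachineTemplate mirrors the contract's machineTemplate structure.
type MachineTemplate struct {
	// InfrastructureRef points at the provider-specific machine template,
	// e.g. an AWSMachineTemplate.
	InfrastructureRef corev1.ObjectReference `json:"infrastructureRef"`
}
```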

@openshift-bot

Inactive enhancement proposals go stale after 28d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle stale.
Stale proposals rot after an additional 7d of inactivity and eventually close.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 25, 2023
@JoelSpeed (Contributor Author)

/remove-lifecycle stale

Intending to get back to this and address feedback soon

@openshift-ci openshift-ci bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 25, 2023
@openshift-bot

Inactive enhancement proposals go stale after 28d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle stale.
Stale proposals rot after an additional 7d of inactivity and eventually close.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 23, 2023
@JoelSpeed (Contributor Author)

/remove-lifecycle stale

@openshift-ci openshift-ci bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 23, 2023
@openshift-bot

Inactive enhancement proposals go stale after 28d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle stale.
Stale proposals rot after an additional 7d of inactivity and eventually close.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 22, 2023
@openshift-bot

Stale enhancement proposals rot after 7d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle rotten.
Rotten proposals close after an additional 7d of inactivity.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci openshift-ci bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Dec 29, 2023
@JoelSpeed (Contributor Author)

/remove-lifecycle rotten

@openshift-ci openshift-ci bot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Jan 3, 2024
@openshift-bot

Inactive enhancement proposals go stale after 28d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle stale.
Stale proposals rot after an additional 7d of inactivity and eventually close.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 1, 2024
@openshift-bot

Stale enhancement proposals rot after 7d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle rotten.
Rotten proposals close after an additional 7d of inactivity.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci openshift-ci bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Feb 8, 2024
@dhellmann (Contributor)

#1555 is changing the enhancement template in a way that will cause the header check in the linter job to fail for existing PRs. If this PR is merged within the development period for 4.16 you may override the linter if the only failures are caused by issues with the headers (please make sure the markdown formatting is correct). If this PR is not merged before 4.16 development closes, please update the enhancement to conform to the new template.

@openshift-bot

Rotten enhancement proposals close after 7d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Reopen the proposal by commenting /reopen.
Mark the proposal as fresh by commenting /remove-lifecycle rotten.
Exclude this proposal from closing again by commenting /lifecycle frozen.

/close

@openshift-ci openshift-ci bot closed this Feb 21, 2024
openshift-ci bot commented Feb 21, 2024

@openshift-bot: Closed this PR.

In response to this:

Rotten enhancement proposals close after 7d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Reopen the proposal by commenting /reopen.
Mark the proposal as fresh by commenting /remove-lifecycle rotten.
Exclude this proposal from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@dhellmann (Contributor)

(automated message) This pull request is closed with lifecycle/rotten. It does not appear to be linked to a valid Jira ticket. Should the PR be reopened, updated, and merged? If not, removing the lifecycle/rotten label will tell this bot to ignore it in the future.

8 similar comments
