[WIP] Enhancement for installing OpenShift natively via Cluster API #1479
Conversation
Skipping CI for Draft Pull Request.
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by:
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
A Cluster API control plane and bootstrap provider will be created to handle the orchestration and configuration of the OpenShift cluster during the bootstrap process.
The control plane provider will be responsible for creating (and destroying) the bootstrap node, and for provisioning the control plane nodes once the bootstrap node is ready.
The bootstrap provider will be responsible for generating the correct ignition data for the bootstrap node, control plane nodes, and worker nodes.
How do the bootstrap provider and MCO intersect? Is the bootstrap provider just creating ignition stubs pointing to the MCO for worker and control plane (as the installer does today)?
Yeah, that's correct. I think in the future we could start to merge some of the responsibilities, but at the moment we assume no connectivity between the guest cluster and the management cluster, so having the guests pull from the management cluster isn't expected.
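To make the split of responsibilities concrete, below is a rough sketch (not part of the proposal text) of what the bootstrap provider could emit for a control plane machine under this model: a Cluster API bootstrap data Secret whose payload is only a stub ignition config that merges the full configuration from the Machine Config Server, much like the installer's pointer ignition configs today. The name, namespace, and endpoint are illustrative assumptions.

```yaml
# Hypothetical sketch: a bootstrap data Secret as a bootstrap provider would
# produce it for a control plane Machine. The payload is only a pointer
# (stub) ignition config; the full rendered config is still served by the
# Machine Config Server on the cluster being installed.
apiVersion: v1
kind: Secret
metadata:
  name: mycluster-master-0-bootstrap-data     # illustrative name
  namespace: openshift-cluster-api-guests     # illustrative namespace
type: cluster.x-k8s.io/secret                 # Secret type used for CAPI bootstrap data
stringData:
  format: ignition                            # hints to infra providers that the payload is ignition
  value: |
    {
      "ignition": {
        "version": "3.2.0",
        "config": {
          "merge": [
            { "source": "https://api-int.mycluster.example.com:22623/config/master" }
          ]
        }
      }
    }
```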
## Proposal
A Cluster API control plane and bootstrap provider will be created to handle the orchestration and configuration of the OpenShift cluster during the bootstrap process.
What are these providers? Are they controllers?
Yeah, the idea is building small controllers that handle the different parts of the bootstrap process and report back via status objects.
I plan to flesh out what they do exactly in the implementation details section later.
#### Phase 1
1. Leverage an existing Cluster API control plane to provision infrastructure for OpenShift clusters. |
I think the content here is great, but these strike me more as implementation details than goals. Consider moving this to the proposal section, and replacing with goals, which would be things like "Enable day-2 management of infrastructure" (I'm not sure if that is a valid goal for this phase, but is just an example).
The `cluster` phase will now skip the `ignition-configs` phase and will instead apply the Cluster API resources generated in the `manifests` phase to the Cluster API control plane.
The installer will directly apply the Cluster API resources to the Cluster API control plane. |
This sounds a lot like `oc apply`. Is there a future where the two command line tools converge at all?
I think that's up to the installer team and oc folks to decide, but you're right, it is effectively just running an `oc apply` at this point to take manifests from the laptop or wherever and apply them to the control plane for CAPI.
I know there's prior art for the installer embedding binaries, so it's possible we leverage the `oc` binary as a subprocess for this purpose rather than reinventing the wheel. This would align with a current exploration avenue of using subprocesses to run the temporary CAPI control plane for provisioning.
#### Opinionated installer-generated infrastructure definitions
The installer binary will be updated to transform the existing install config into Cluster API resources. |
Is there an upstream tool like this for CAPI or do they rely on everyone using the APIs directly?
This very much depends on the distro and where you're using CAPI. A lot of folks interact directly with the resources, especially those who are leveraging CAPI on their own infrastructure, like certain customers I'm aware of. Then there are people who have integrated it into their product, and some of those have wrapped it. E.g. Tanzu exposes an abstraction on top of CAPI resources rather than exposing the resources directly, so they will have something similar to this logic.
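As an illustration of what opinionated, installer-generated definitions might look like, here is a hedged sketch of resources that could be derived from an AWS install-config. The `OpenShiftControlPlane` group/version and the namespace are assumptions for this proposal rather than existing APIs.

```yaml
# Hypothetical sketch: Cluster API resources derived from an install-config
# (AWS shown as an example platform). Names and namespace are illustrative.
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: mycluster                                     # from install-config metadata.name
  namespace: openshift-cluster-api-guests
spec:
  infrastructureRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
    kind: AWSCluster
    name: mycluster
  controlPlaneRef:
    apiVersion: controlplane.openshift.io/v1alpha1    # hypothetical group/version
    kind: OpenShiftControlPlane
    name: mycluster
---
apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
kind: AWSCluster
metadata:
  name: mycluster
  namespace: openshift-cluster-api-guests
spec:
  region: us-east-1                                   # from install-config platform.aws.region
```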
This secret will be referenced in the `OpenShiftControlPlane` spec.
To allow the user to customise manifests, the installer will take all manifests from the `manifests` and `openshift` folders and wrap them into secrets to be applied to the cluster namespace.
Each secret will be annotated to indicate that it should be included in the ignition generation phase, and to identify whether it was a `manifest` file or `openshift` file.
What sort of validation can we do for user-provided manifests? Something as simple as a syntax error won't be caught until the secret is unpacked so the manifest inside it can be applied to the new cluster, right?
Correct, but I don't think we do any validation there today, so I think that is a pre-existing problem. The installer currently loads the files from disk and puts them into the bootstrap ignition directly, so if any file has been edited and is malformed today, it would result in the same UX
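For illustration, a customised manifest wrapped for in-cluster ignition generation might look roughly like the sketch below; the Secret name and annotation keys are placeholders for whatever convention the implementation settles on.

```yaml
# Hypothetical sketch: a user-customised manifest from the `openshift` folder,
# wrapped in a Secret so the bootstrap provider can include it when it
# regenerates the bootstrap ignition in-cluster. Annotation keys are placeholders.
apiVersion: v1
kind: Secret
metadata:
  name: mycluster-openshift-99-worker-chrony
  namespace: openshift-cluster-api-guests
  annotations:
    install.openshift.io/include-in-ignition: "true"   # include in ignition generation
    install.openshift.io/manifest-folder: openshift    # "manifests" or "openshift"
stringData:
  99-worker-chrony.yaml: |
    apiVersion: machineconfiguration.openshift.io/v1
    kind: MachineConfig
    metadata:
      name: 99-worker-chrony
      labels:
        machineconfiguration.openshift.io/role: worker
    spec:
      config:
        ignition:
          version: 3.2.0
```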
The bootstrap provider will read the `OpenShiftControlPlane` spec to determine the install state and manifests required to complete the bootstrap ignition generation and will reconstruct the required structure for the installer to complete the `ignition-configs` phase in cluster.
Once all resources are applied, the installer will watch the Cluster API control plane resource status to determine when the cluster is ready. |
How quickly can we turn a failure into an error message on the console where the user ran openshift-install? How many layers of APIs need to propagate the error message?
Can the installer receive fine-grained status/logs?
At each phase we know which resources need to be checked, so the installer can do the following:
- Watch the status of the `Cluster` object until it reports that it is provisioned.
- While watching the `Cluster`, until `Cluster.Status.InfrastructureReady: true`, watch the InfraCluster (e.g. `AWSCluster`) for status.
- Once `Cluster.Status.InfrastructureReady: true`, watch `OpenShiftControlPlane.Status` until `Initialized: true`.

Then it goes to watching clusteroperators in the way it already does.
So I'd expect the installer to start interpreting the conditions on these two objects and reporting the errors when they change. In my experience they are pretty good at updating the status.
That said, we can also stream the logs from the controllers in a debug mode if we wanted to. Perhaps rather than dumping to the terminal output, we could capture the CAPI logs into a debug file/folder, which I think we already do for terraform
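To make that watch sequence concrete, these are roughly the status fields the installer would be polling. `infrastructureReady`, `initialized`, and `ready` follow the upstream Cluster API and control plane provider contracts; the condition names on `OpenShiftControlPlane` are assumptions about what the provider might report.

```yaml
# Illustrative status excerpts only, not complete objects.
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: mycluster
status:
  phase: Provisioned
  infrastructureReady: true          # gates moving on to watching the control plane
---
apiVersion: controlplane.openshift.io/v1alpha1   # hypothetical group/version
kind: OpenShiftControlPlane
metadata:
  name: mycluster
status:
  initialized: true                  # bootstrap complete, per the control plane contract
  ready: false
  conditions:                        # condition names below are assumptions
    - type: BootstrapNodeReady
      status: "True"
    - type: ControlPlaneMachinesReady
      status: "False"
      reason: WaitingForMachines
      message: "2 of 3 control plane machines are provisioned"
```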
There are two ways to achieve this:
* By using the installer and customising the Cluster API resources generated by the installer.
* By manually crafting the infrastructure resources and applying them to the Cluster API control plane, for example, for externally supported platforms.
The previous section talked about manifests wrapped in Secrets, too. Would users be expected to do that if they were using the API directly?
And the install state secret?
> The previous section talked about manifests wrapped in Secrets, too. Would users be expected to do that if they were using the API directly?
Yes, to a degree. If they want to customise the manifests, they need to generate them on their own machine and then pass them into the cluster somehow. Assuming we are still using CAPI, it's reasonable to expect that the installer can take the customised manifests and do this wrapping for them.
For the install state, I think this will be a temporary workaround to avoid rewriting the whole of the installer in one go, but yes, again, I'd expect the installer to be responsible for this. That said, the installer can reconstruct the install state from the install-config, so if a user uploaded the install-config and the manifests correctly, that would be sufficient IIUC.
There's a bit more experimentation required here to achieve some of this, I think.
#### OpenShiftControlPlane
The `OpenShiftControlPlane` resource will be the configuration for the control plane provider implementation for Cluster API.
It must adhere to the upstream Cluster API [control plane provider API contract][control-plane-api-contract].
How much of what's described below is part of that contract versus unique to OpenShift?
The entirety of `machineTemplate` is part of the contract; I will try to make that obvious.
The additions on top of that, such as which manifests to load and the install state secret, are OpenShift specific and need finessing via a POC to make sure they're what we want before we go too far down this road.
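To show roughly where that split might land, here is a hedged sketch of an `OpenShiftControlPlane`: `replicas`, `version`, and `machineTemplate` mirror fields common to control plane providers implementing the upstream contract, while the install state reference is an OpenShift-specific placeholder pending the POC.

```yaml
# Hypothetical sketch of an OpenShiftControlPlane. Field names for the
# OpenShift-specific additions are placeholders pending the POC.
apiVersion: controlplane.openshift.io/v1alpha1   # hypothetical group/version
kind: OpenShiftControlPlane
metadata:
  name: mycluster
  namespace: openshift-cluster-api-guests
spec:
  replicas: 3
  version: "4.16.0"                              # illustrative
  # machineTemplate follows the upstream control plane provider contract:
  # an infrastructure template reference used to create control plane machines.
  machineTemplate:
    infrastructureRef:
      apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
      kind: AWSMachineTemplate
      name: mycluster-master
  # OpenShift-specific addition (assumption for this sketch): where to find the
  # install state, with customised manifests discovered from annotated Secrets
  # in the same namespace.
  installStateSecretRef:
    name: mycluster-install-state
```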
Inactive enhancement proposals go stale after 28d of inactivity. See https://github.com/openshift/enhancements#life-cycle for details. Mark the proposal as fresh by commenting `/remove-lifecycle stale`. If this proposal is safe to close now please do so with `/close`. /lifecycle stale
/remove-lifecycle stale
Intending to get back to this and address feedback soon.
Inactive enhancement proposals go stale after 28d of inactivity. See https://github.com/openshift/enhancements#life-cycle for details. Mark the proposal as fresh by commenting `/remove-lifecycle stale`. If this proposal is safe to close now please do so with `/close`. /lifecycle stale
/remove-lifecycle stale
Inactive enhancement proposals go stale after 28d of inactivity. See https://github.com/openshift/enhancements#life-cycle for details. Mark the proposal as fresh by commenting `/remove-lifecycle stale`. If this proposal is safe to close now please do so with `/close`. /lifecycle stale
Stale enhancement proposals rot after 7d of inactivity. See https://github.com/openshift/enhancements#life-cycle for details. Mark the proposal as fresh by commenting `/remove-lifecycle rotten`. If this proposal is safe to close now please do so with `/close`. /lifecycle rotten
/remove-lifecycle rotten
Inactive enhancement proposals go stale after 28d of inactivity. See https://github.com/openshift/enhancements#life-cycle for details. Mark the proposal as fresh by commenting `/remove-lifecycle stale`. If this proposal is safe to close now please do so with `/close`. /lifecycle stale
Stale enhancement proposals rot after 7d of inactivity. See https://github.com/openshift/enhancements#life-cycle for details. Mark the proposal as fresh by commenting `/remove-lifecycle rotten`. If this proposal is safe to close now please do so with `/close`. /lifecycle rotten
#1555 is changing the enhancement template in a way that will cause the header check in the linter job to fail for existing PRs. If this PR is merged within the development period for 4.16 you may override the linter if the only failures are caused by issues with the headers (please make sure the markdown formatting is correct). If this PR is not merged before 4.16 development closes, please update the enhancement to conform to the new template.
Rotten enhancement proposals close after 7d of inactivity. See https://github.com/openshift/enhancements#life-cycle for details. Reopen the proposal by commenting `/reopen`. /close
@openshift-bot: Closed this PR.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
(automated message) This pull request is closed with lifecycle/rotten. It does not appear to be linked to a valid Jira ticket. Should the PR be reopened, updated, and merged? If not, removing the lifecycle/rotten label will tell this bot to ignore it in the future.
We are exploring the option of installing OpenShift via Cluster API, by creating a Bootstrap and ControlPlane provider implementation as well as some supplemental infrastructure provisioning controllers. This enhancement details the expected workflow for this, assuming that we already have a working Cluster API ControlPlane.