
Adopt existing AKS clusters #1173

Closed · nprokopic opened this issue Feb 11, 2021 · 35 comments · Fixed by #4697
Labels: area/managedclusters · kind/proposal · lifecycle/frozen · priority/important-longterm · size/XXL

@nprokopic
Contributor

Goals

  1. Support in CAPZ for adopting an existing AKS cluster by creating the CAPI and CAPZ objects that describe it.

Non-Goals/Future Work

  1. Support in CAPZ for adopting an existing Kubernetes cluster that was created outside of CAPZ (with infrastructure similar to what is created today with AzureCluster and other related objects), by creating the CAPI and CAPZ objects that describe it.

User Story

As a user of Cluster API Provider Azure, I would like to start managing my existing AKS clusters from a CAPZ management cluster by creating CAPI and CAPZ objects that describe them, so that I don't have to create new clusters and migrate my workloads, which could cause significant downtime or might not even be possible.

Detailed Description

For an existing AKS cluster, preferably one that uses only features supported by CAPZ, I want to create CAPI (Cluster, MachinePool) and CAPZ (AzureManagedCluster, AzureManagedControlPlane, AzureManagedMachinePool) objects and submit them to the management cluster. The CAPZ controller will start reconciling the CAPZ objects, gracefully handle the fact that the AKS cluster already exists, and treat it as if it had been created by CAPZ.
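To make this concrete, here is a minimal sketch of the set of objects that would describe an existing AKS cluster. All names, IDs, versions, and API versions are illustrative placeholders and would have to match both the existing AKS cluster and the CRD versions of the CAPI/CAPZ release in use; this is not an official template.

```yaml
# Sketch only: every value below is a placeholder and must be adjusted
# to match the existing AKS cluster and the installed CAPI/CAPZ CRDs.
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: my-existing-aks
spec:
  controlPlaneRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
    kind: AzureManagedControlPlane
    name: my-existing-aks
  infrastructureRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
    kind: AzureManagedCluster
    name: my-existing-aks
---
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: AzureManagedControlPlane
metadata:
  name: my-existing-aks
spec:
  subscriptionID: 00000000-0000-0000-0000-000000000000  # existing subscription
  resourceGroupName: my-rg                               # existing resource group
  location: westeurope                                   # existing region
  version: v1.21.2                                       # existing Kubernetes version
---
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: AzureManagedCluster
metadata:
  name: my-existing-aks
---
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachinePool
metadata:
  name: pool0
spec:
  clusterName: my-existing-aks
  replicas: 3                 # node count of the existing system pool
  template:
    spec:
      clusterName: my-existing-aks
      bootstrap:
        dataSecretName: ""    # AKS manages node bootstrap, so no bootstrap config
      infrastructureRef:
        apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
        kind: AzureManagedMachinePool
        name: pool0
---
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: AzureManagedMachinePool
metadata:
  name: pool0
spec:
  mode: System                # matches the existing system node pool
  sku: Standard_D2s_v3        # existing VM size
  osDiskSizeGB: 30            # existing OS disk size
```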

It would be useful to have an e2e test for this scenario.

Contract changes [optional]

It is possible that some changes in CAPZ managed cluster object defaulting and validation would be required. For example, setting a default SSH key in AzureManagedControlPlane is causing issues in my current AKS cluster adoption experiments, as the Azure API returns the error "Changing property 'linuxProfile.ssh.publicKeys.keyData' is not allowed." I will create a separate issue for this.

Data model changes [optional]

It is possible that some changes in the managed cluster CRDs would be required (e.g. to solve the above issue with SSH keys, or other similar cases).

/kind proposal

@nprokopic
Contributor Author

Current blockers for this:

  1. OS type not explicitly specified for the managed node pool, described in OS type is not set for managed node pool when calling Azure API #1174

  2. SSH key update error

"error creating AzureManagedControlPlane default/aks6adopt: failed to reconcile managed cluster: failed to reconcile managed cluster aks6adopt: failed to create or update managed cluster, failed to begin operation: containerservice.ManagedClustersClient#CreateOrUpdate: Failure sending request: StatusCode=400 -- Original Error: Code=\"PropertyChangeNotAllowed\" Message=\"Changing property 'linuxProfile.ssh.publicKeys.keyData' is not allowed.\" Target=\"linuxProfile.ssh.publicKeys.keyData\"" "controller"="azuremanagedcontrolplane" "name"="aks6adopt" "namespace"="default"

I had my public key configured here, but the same would probably happen even without it, as the defaulting would set a newly generated key if one was not already set. This should be solvable by checking LinuxProfile.SSH on the existing cluster object (which is already fetched in the code) and then not trying to change those values.

Possibly, if the AKS cluster is initially created with an SSH key and that same SSH key is set in AzureManagedControlPlane, this error might not happen (see the sketch below). Need to check this.
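For illustration, pinning the key in the CR could look roughly like this; the spec values and the API version are placeholders, and the key would have to be exactly the one the existing cluster was created with:

```yaml
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1  # placeholder API version
kind: AzureManagedControlPlane
metadata:
  name: aks6adopt
  namespace: default
spec:
  resourceGroupName: my-rg   # placeholder
  location: westeurope       # placeholder
  version: v1.21.2           # placeholder
  # Must be byte-for-byte identical to the key the AKS cluster was created
  # with, so that defaulting does not generate a new key and trigger the
  # PropertyChangeNotAllowed error on update.
  sshPublicKey: "ssh-rsa AAAA... user@example"
```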

@CecileRobertMichon
Contributor

cc @alexeldeib

@nprokopic
Contributor Author

Regarding SSH keys: in theory it might work if the AKS cluster was created with a single SSH key and that same key is set in the CR.

I haven't gotten it to work yet; I'll update here after some more testing (mid next week, as I'm AFK at the beginning of the week).

@nprokopic
Contributor Author

Some concrete actions here would be to add an e2e test for adopting an existing AKS cluster that uses features currently supported in CAPZ. We can then iterate from there by adding more AKS features to CAPZ (I would happily submit a proposal for what to add next, or pick up an existing issue) and updating the e2e test to cover those.

(Before any of this, we probably need at least a basic e2e test for AKS cluster creation; as I understood from Cecile, there is none at the moment.)

@CecileRobertMichon CecileRobertMichon added the area/managedclusters Issues related to managed AKS clusters created through the CAPZ ManagedCluster Type label Mar 16, 2021
@alexeldeib
Contributor

Somehow I missed this one. It looks like #1175 fixed part of the blocker. Re: SSH keys, it should work without changes. I'd be interested to know if there are any remaining blockers for the adoption piece. We definitely want e2e either way.

@k8s-triage-robot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 27, 2021
@alexeldeib
Contributor

/remove-lifecycle stale
/help-wanted

We've added e2e since I made the previous comment. If someone could manually test this (easy), that would be enough to close this out IMO.

We could also add a separate e2e test for a pre-created AKS cluster, but I'm not sure how much value that delivers relative to the effort, versus polishing up other parts of the implementation.

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 27, 2021
@nprokopic
Contributor Author

> We've added e2e since I made the previous comment. If someone could manually test this (easy), that would be enough to close this out IMO.

I tested this some time ago, I believe when I opened this issue, and I got it working after fixing #1175.

> We could also add a separate e2e test for a pre-created AKS cluster, but I'm not sure how much value that delivers relative to the effort, versus polishing up other parts of the implementation.

Having an e2e test that ensures this keeps working would be useful (along with some docs describing how it works), since from what I saw even small, subtle changes can break it. For example, managed cluster creation works the same with or without #1174 (it does not need the default field value to be set), but that fix was required for adopting an existing managed cluster.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 8, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Dec 8, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@CecileRobertMichon
Contributor

/reopen

@k8s-ci-robot k8s-ci-robot reopened this Jan 10, 2022
@k8s-ci-robot
Contributor

@CecileRobertMichon: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot k8s-ci-robot added the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Dec 8, 2022
@CecileRobertMichon
Contributor

/remove-lifecycle rotten
/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. labels Dec 8, 2022
@kisahm

kisahm commented Jan 6, 2023

+1

Any news on this topic? Are there any ongoing activities?

@CecileRobertMichon
Contributor

@kisahm I don't believe anyone is currently working on this, but @mtougeron mentioned this topic a while back at CAPI office hours.

cc @jackfrancis @nojnhuh

@ericgoode

I saw a note above saying this was working. Is there any way to find out what was done to get it working?

@mtougeron
Member

I'm writing this up today. I should have something to share for you soon.

@ericgoode

It looks like in 1.9.2 I can adopt a CAPI-created cluster by just reapplying the generated config files, and it figures it all out. I am going to test an az CLI-created cluster as well.

@mtougeron
Member

There are ways to do it; I've written up my notes here, but there are some gotchas in the process depending on how you created the original cluster. I'm working with @dtzar to figure out whether we can adjust CAPZ or call those gotchas out better. Some of them might also be resolved when CAPZ is refactored to use ASO.

@dtzar
Contributor

dtzar commented Aug 2, 2023

One of the challenges here is that if you generate or create a CAPI/CAPZ YAML file for a cluster that uses features which do not exist in CAPZ yet, or if a mutating webhook modifies that YAML when you submit it to the cluster, then the source of truth (the CAPZ YAML) no longer matches what is deployed. See kubernetes-sigs/cluster-api#8668

The big question is: can we have an officially supported feature that allows this? We'd need to work through the potential dangers/pitfalls, e.g. a feature is now supported in the CAPZ YAML but its configuration does not match that of the AKS feature already deployed on the cluster. How does that get handled?

On a related tangent: this would be possible today in pure ASO using its asoctl command.

@dtzar
Contributor

dtzar commented Sep 25, 2023

Posting this related project for reference: https://github.com/tobiasgiese/cluster-api-migration

@dtzar dtzar added priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Sep 28, 2023
@dtzar dtzar moved this to Todo in CAPZ Planning Sep 28, 2023
@nojnhuh
Contributor

nojnhuh commented Feb 12, 2024

/assign

@nojnhuh
Contributor

nojnhuh commented Mar 12, 2024

I followed the steps @mtougeron linked above, cheating a bit by creating a CAPZ AKS cluster, orphaning it (removing it from CAPZ management but leaving the resources running in Azure), and reapplying the same templates, and the cluster gets adopted fine AFAICT.

I'll work on documenting this and adding an e2e test but am not anticipating needing any other changes to enable this.

@dtzar
Contributor

dtzar commented Mar 12, 2024

@nojnhuh There is a case where there might be a mutating webhook though, correct? So if that was in play this would fail, no?

@nojnhuh
Contributor

nojnhuh commented Mar 12, 2024

@dtzar AKS features that are unknown to CAPZ won't ultimately get set in the ASO YAML, so those won't be disabled. And if we make more features available in CAPZ and ensure they default to null, then a CAPZ upgrade managing an adopted cluster shouldn't claim control over those fields automatically.

And the mutating webhook is always in play, but if you craft the CAPZ YAML such that the defaulting either does nothing or does what you want, then I don't see how that would cause any issues.

@dtzar dtzar added this to the v1.15 milestone Apr 4, 2024
@github-project-automation github-project-automation bot moved this from Todo to Done in CAPZ Planning Apr 26, 2024