Add proposal for Azure Service Operator #3113
Conversation
Skipping CI for Draft Pull Request.
First draft is looking good! You thought of all the gotchas that I can think of.
- Leverage existing e2e tests
- Add unit tests for new ASO integration
- Run one-off tests against large clusters to catch performance regressions
We should talk about telemetry somewhere. Currently we have traces and metrics for every SDK call made in CAPZ (https://capz.sigs.k8s.io/developers/development.html#viewing-telemetry); if we move to ASO we will lose that. @mattchr does ASO currently emit traces/metrics for SDK calls?
It looks like ASO exposes `azure_successful_requests_total`, `azure_failed_requests_total`, and `azure_requests_time_seconds` Prometheus metrics, but I don't see any OpenTelemetry integration.
We don't have any OpenTelemetry integration currently. We have prom metrics for every SDK call made, but not traces. As I mentioned in my other comment, this is something we'd be open to improving, although I'm not sure how we'd get distributed tracing to work through CRs (so that you could have a top-level trace that spanned N ASO resource creations, for example).
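To make the CR-propagation idea concrete, here is a rough sketch (not existing ASO or CAPZ code; the annotation prefix and helper names are made up) of how a writer could inject its OpenTelemetry trace context into a resource's annotations and a reader could pick it back up:

```go
// Sketch only: the annotation prefix and helper names below are hypothetical,
// not existing ASO or CAPZ API.
package tracepropagation

import (
	"context"

	"go.opentelemetry.io/otel/propagation"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// annotationCarrier adapts an object's annotations to OpenTelemetry's TextMapCarrier,
// so the W3C traceparent/tracestate values can ride along on the resource.
type annotationCarrier struct{ obj metav1.Object }

func (c annotationCarrier) Get(key string) string {
	return c.obj.GetAnnotations()["trace.capz.example.com/"+key]
}

func (c annotationCarrier) Set(key, value string) {
	annotations := c.obj.GetAnnotations()
	if annotations == nil {
		annotations = map[string]string{}
	}
	annotations["trace.capz.example.com/"+key] = value
	c.obj.SetAnnotations(annotations)
}

func (c annotationCarrier) Keys() []string {
	keys := make([]string, 0, len(c.obj.GetAnnotations()))
	for k := range c.obj.GetAnnotations() {
		keys = append(keys, k)
	}
	return keys
}

// InjectTraceContext would be called by the writer (e.g. CAPZ) just before it
// creates or updates the ASO resource, so the current span becomes the parent.
func InjectTraceContext(ctx context.Context, obj metav1.Object) {
	propagation.TraceContext{}.Inject(ctx, annotationCarrier{obj: obj})
}

// ExtractTraceContext would be called by the reader (e.g. ASO) at the start of its
// reconcile so its spans are parented under the writer's trace.
func ExtractTraceContext(ctx context.Context, obj metav1.Object) context.Context {
	return propagation.TraceContext{}.Extract(ctx, annotationCarrier{obj: obj})
}
```

The hard part is the one called out above: ASO reconciles out-of-band, so a "parent" span may stay open far longer than a normal request. The sketch only shows that the plumbing itself is doable.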
First draft looks great to me as well, thank you for putting it together!
Thanks everyone for the feedback so far! I've addressed that for now in the form of bullet points and will start filling those sections out more.
### Graduation Criteria

ASO integration will not be kept behind a feature flag or progress through the usual alpha, beta, and stable phases. Instead, the transition will be made one Azure service interface at a time so as to distribute potential impact over time.
Agree that this seems prudent.
  Azure or Kubernetes API limits with fewer or smaller workload clusters being managed.
- Management cluster will have to manage many more Kubernetes resources per workload cluster
- Because ASO has not yet been proven as a mission-critical interface to Azure
Well-phrased. I agree with this as a risk.
I think it makes a good bit of sense to make a shared bet. As you called out, ASO is solving the "2. Interfacing with the Azure platform to manage creating, updating, and deleting that infrastructure" problem, so it should end up reducing the work CAPZ has to do there. It is still a risk, though, since the Azure Go SDK obviously has much broader adoption and is more mature (GA) than ASO currently is.
  used instead of the API or SDK directly
- Conflicting user installations of ASO or ASO resources
- Future breaking changes in ASO
- Lower-fidelity telemetry compared to what CAPZ tracks currently
This is something we'd love to work with you on, I think. We have some basic telemetry exposed already: https://azure.github.io/azure-service-operator/introduction/metrics/ - if you gave us a list of what exactly you wanted (or were losing in this migration), we could work to expose that data.
Or is the issue here more that you had integrations into the Azure SDK to track aggregate metrics, such as "time it takes to fully provision a cluster", that you'd be losing?
For the little bit I've used CAPZ's tracing, I've found it helpful to have a breakdown of how long each step in a single CAPZ reconciliation takes. Since that includes Azure API calls currently, I think my main concern was losing that kind of association between a CAPZ reconciliation and Azure API calls. I updated this section to mention that I don't think that would really matter though since Azure API calls would be happening in ASO completely out-of-band with CAPZ reconciliations. Or at least recreating that mapping seems like it would be unnecessarily difficult.
There have been discussions about tracking resource lifecycle and some related KEP work: https://groups.google.com/g/kubebuilder/c/tNI6ZpQ2loM/m/8rSX6HKVDgAJ. Correlation is going to be difficult. However, we might be able to trace with observed generation and namespace/name to get something close enough.
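As a rough sketch of that correlation idea (the attribute keys here are made up, not an existing convention in either project), the CAPZ reconcile span could simply record the identifiers needed to line it up with the ASO resource's status after the fact:

```go
// Sketch only: attribute keys are hypothetical; the point is recording enough
// identity (namespace/name plus the generation ASO has observed) to correlate a
// CAPZ reconcile span with ASO's own logs and metrics.
package telemetry

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/trace"
)

var tracer = otel.Tracer("capz.aso")

// StartASOResourceSpan starts a span for one ASO-backed resource within a CAPZ
// reconciliation and tags it with the fields needed for correlation.
func StartASOResourceSpan(ctx context.Context, namespace, name string, observedGeneration int64) (context.Context, trace.Span) {
	return tracer.Start(ctx, "reconcile.aso-resource",
		trace.WithAttributes(
			attribute.String("aso.resource.namespace", namespace),
			attribute.String("aso.resource.name", name),
			attribute.Int64("aso.resource.observed_generation", observedGeneration),
		))
}
```

That would not give a single end-to-end trace, but it keeps enough context to join the two sides when investigating a slow or failed provision.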
CAPZ interacts with some Azure services that do not represent infrastructure, and thus cannot be represented in ASO. Resource Health, for example, is "reconciled" by CAPZ currently by getting a resource's health status and reflecting that in the corresponding CAPZ resource, but does not create or update any distinct Azure resources. The new SDK could be used to implement this existing functionality without affecting other service interfaces' use of ASO. Implementing Resource Health in ASO is being tracked in https://github.com/Azure/azure-service-operator/issues/2762.

Also, use of the `clusterctl move` command will require extra manual steps to move ASO resources as documented here: https://azure.github.io/azure-service-operator/introduction/frequently-asked-questions/#what-is-the-best-practice-for-transferring-aso-resources-from-one-cluster-to-another. Specifically, before `clusterctl move` is run, each ASO resource under the ownership hierarchy of a Cluster must have its `serviceoperator.azure.com/reconcile-policy` annotation set to `skip`. The necessary ASO resources can be enumerated by invoking `clusterctl move --dry-run -v 1`. `clusterctl move` will automatically detect and move the ASO resources. Then after `clusterctl move` is complete, the annotation should be changed back to its previous state.
> Specifically, before `clusterctl move` is run, each ASO resource under the ownership hierarchy of a Cluster must have its `serviceoperator.azure.com/reconcile-policy` annotation set to `skip`
That's not a great experience for users. They shouldn't have to care or even know about ASO as it's an implementation detail of CAPZ and not something they opt into. I think it's okay for a user to have to apply the annotation in the context where they are directly using ASO, but in the case where the CAPZ controller is the one "using" ASO to provision resources, the CAPZ controller should be the one applying these annotations. This might be tricky and might require some changes to `clusterctl move`, but we should really try to avoid manual intervention from the user.
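For illustration, something along these lines (hypothetical helper names; the annotation key and `skip` value are the documented ASO ones) is roughly what I'd picture the CAPZ controller doing around a move, for example when the Cluster is paused and unpaused:

```go
// Sketch only: helper names are made up; serviceoperator.azure.com/reconcile-policy
// and the "skip" value are ASO's documented annotation and value.
package aso

import (
	"context"

	"sigs.k8s.io/controller-runtime/pkg/client"
)

const reconcilePolicyAnnotation = "serviceoperator.azure.com/reconcile-policy"

// pauseASOResource marks the resource so ASO stops reconciling it (and will not
// touch the underlying Azure resource) while it is being moved between clusters.
func pauseASOResource(ctx context.Context, c client.Client, obj client.Object) error {
	patch := client.MergeFrom(obj.DeepCopyObject().(client.Object))
	annotations := obj.GetAnnotations()
	if annotations == nil {
		annotations = map[string]string{}
	}
	annotations[reconcilePolicyAnnotation] = "skip"
	obj.SetAnnotations(annotations)
	return c.Patch(ctx, obj, patch)
}

// resumeASOResource removes the annotation again once the move is complete so ASO
// resumes normal reconciliation on the new management cluster.
func resumeASOResource(ctx context.Context, c client.Client, obj client.Object) error {
	patch := client.MergeFrom(obj.DeepCopyObject().(client.Object))
	annotations := obj.GetAnnotations()
	delete(annotations, reconcilePolicyAnnotation)
	obj.SetAnnotations(annotations)
	return c.Patch(ctx, obj, patch)
}
```

Whether that hooks into CAPZ's existing pause handling or requires changes to `clusterctl move` itself is the open question.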
Codecov Report

Patch coverage has no change and project coverage change: +11.07%.

Additional details and impacted files:

    @@            Coverage Diff             @@
    ##             main    #3113       +/-   ##
    ===========================================
    + Coverage   40.42%   51.50%   +11.07%
    ===========================================
      Files         241      182       -59
      Lines       29560    18054    -11506
    ===========================================
    - Hits        11951     9298     -2653
    + Misses      16700     8229     -8471
    + Partials      909      527      -382

See 109 files with indirect coverage changes. ☔ View full report in Codecov by Sentry.
I just pushed a couple small changes adding updates on the above. cc @dtzar
/lgtm
LGTM label has been added. Git tree hash: 0e3340eac8bb9c27333df20f45e2318541b27837
Officially starting lazy consensus on this, ending EOD 14 April (end of next week).
LGTM.
I think this summarizes the pros/cons of using ASO quite well.
I will leave the actual decision of whether the pros outweigh the cons to you experts, as I don't have great visibility into the costs/benefits for CAPZ as a project when comparing ASO to something like the Track 2 SDKs.
/lgtm
I can’t add any more than what many others have said before me in these PR threads. Great work @nojnhuh!
/approve
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: CecileRobertMichon

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment.
Great work! Kudos @nojnhuh! 🚀

Time for slash hold cancel? 🤠
/hold cancel
/pony
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
What type of PR is this?
/kind design
What this PR does / why we need it: This PR adds a proposal suggesting the adoption of Azure Service Operator in CAPZ to manage infrastructure in Azure instead of the Azure SDK.
Special notes for your reviewer:
Please confirm that if this PR changes any image versions, then that's the sole change this PR makes.
TODOs:
Release note: