New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

Sign up for GitHub

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Jump to bottom

Add proposal for Azure Service Operator #3113

Merged

k8s-ci-robot merged 1 commit into kubernetes-sigs:main from nojnhuh:aso-proposal

Apr 17, 2023

Contributor

nojnhuh commented Jan 27, 2023

What type of PR is this?
/kind design

What this PR does / why we need it: This PR adds a proposal suggesting the adoption of Azure Service Operator in CAPZ to manage infrastructure in Azure instead of the Azure SDK.

Special notes for your reviewer:

Please confirm that if this PR changes any image versions, then that's the sole change this PR makes.

TODOs:

squashed commits
includes documentation
adds unit tests

Release note:

NONE

Contributor

k8s-ci-robot commented Jan 27, 2023

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

k8s-ci-robot added release-note-none do-not-merge/work-in-progress kind/design cncf-cla: yes labels

k8s-ci-robot requested review from CecileRobertMichon and mboersma

January 27, 2023 21:41

k8s-ci-robot added the size/L label

mboersma reviewed

View reviewed changes

Contributor

mboersma left a comment

First draft is looking good! You thought of all the gotchas that I can think of.

dtzar mentioned this pull request

Enhancement Proposal: Break out cloud resource reconciliation #416

Closed

jackfrancis reviewed

View reviewed changes

docs/proposals/20230123-azure-service-operator.md Outdated Show resolved Hide resolved

jackfrancis reviewed

View reviewed changes

docs/proposals/20230123-azure-service-operator.md Outdated Show resolved Hide resolved

dtzar reviewed

View reviewed changes

docs/proposals/20230123-azure-service-operator.md Outdated Show resolved Hide resolved

CecileRobertMichon reviewed

View reviewed changes

docs/proposals/20230123-azure-service-operator.md Outdated Show resolved Hide resolved

CecileRobertMichon reviewed

View reviewed changes

docs/proposals/20230123-azure-service-operator.md Show resolved Hide resolved

CecileRobertMichon reviewed

View reviewed changes

docs/proposals/20230123-azure-service-operator.md Outdated Show resolved Hide resolved

CecileRobertMichon reviewed

View reviewed changes

docs/proposals/20230123-azure-service-operator.md Outdated Show resolved Hide resolved

CecileRobertMichon reviewed

View reviewed changes

docs/proposals/20230123-azure-service-operator.md Outdated Show resolved Hide resolved

CecileRobertMichon reviewed

View reviewed changes

docs/proposals/20230123-azure-service-operator.md Outdated

+              - Leverage existing e2e tests
+              - Add unit tests for new ASO integration
+              - Run one-off tests against large clusters to catch performance regressions

Contributor

CecileRobertMichon Jan 31, 2023

we should talk about telemetry somewhere. Currently we have traces and metrics for every SDK call made in CAPZ https://capz.sigs.k8s.io/developers/development.html#viewing-telemetry, if we move to ASO we will lose that. @mattchr does ASO currently emit traces/metrics for SDK calls?

Contributor Author

nojnhuh Jan 31, 2023

It looks like ASO exposes azure_successful_requests_total, azure_failed_requests_total, and azure_requests_time_seconds Prometheus metrics, but I don't see any OpenTelemetry integration.

matthchr Feb 3, 2023

We don't have any OpenTelemetry integration currently. We have prom metrics for every SDK call made, but not traces. As I mentioned in my other comment this is something we'd be open to improving, although I'm not sure how we'd get distributed tracing to work through CRs (so that you could have a top-level trace that spanned N ASO resource creations for example)

CecileRobertMichon reviewed

View reviewed changes

docs/proposals/20230123-azure-service-operator.md Outdated Show resolved Hide resolved

Member

nawazkh commented Jan 31, 2023

First draft looks great to me as well, thank you for putting it together!

nojnhuh commented

View reviewed changes

docs/proposals/20230123-azure-service-operator.md Outdated Show resolved Hide resolved

nojnhuh commented

View reviewed changes

Contributor Author

nojnhuh left a comment

Thanks everyone for the feedback so far! I've addressed that for now in the form of bullet points and will start filling those sections out more.

docs/proposals/20230123-azure-service-operator.md Outdated Show resolved Hide resolved

docs/proposals/20230123-azure-service-operator.md Outdated Show resolved Hide resolved

docs/proposals/20230123-azure-service-operator.md Outdated Show resolved Hide resolved

docs/proposals/20230123-azure-service-operator.md Outdated Show resolved Hide resolved

docs/proposals/20230123-azure-service-operator.md Outdated Show resolved Hide resolved

docs/proposals/20230123-azure-service-operator.md Show resolved Hide resolved

docs/proposals/20230123-azure-service-operator.md Outdated Show resolved Hide resolved

docs/proposals/20230123-azure-service-operator.md Outdated Show resolved Hide resolved

docs/proposals/20230123-azure-service-operator.md Outdated

+              - Leverage existing e2e tests
+              - Add unit tests for new ASO integration
+              - Run one-off tests against large clusters to catch performance regressions

Contributor Author

nojnhuh Jan 31, 2023

It looks like ASO exposes azure_successful_requests_total, azure_failed_requests_total, and azure_requests_time_seconds Prometheus metrics, but I don't see any OpenTelemetry integration.

docs/proposals/20230123-azure-service-operator.md Outdated Show resolved Hide resolved

CecileRobertMichon reviewed

View reviewed changes

docs/proposals/20230123-azure-service-operator.md Show resolved Hide resolved

CecileRobertMichon reviewed

View reviewed changes

docs/proposals/20230123-azure-service-operator.md Outdated Show resolved Hide resolved

matthchr reviewed

View reviewed changes

docs/proposals/20230123-azure-service-operator.md Outdated Show resolved Hide resolved

matthchr reviewed

View reviewed changes

docs/proposals/20230123-azure-service-operator.md Outdated


		### Graduation Criteria

		ASO integration will not be kept behind a feature flag or matriculate through the usual alpha, beta, and stable phases. Instead, the transition will be made one Azure service interface at a time so as to distribute potential impact over time.

matthchr Feb 3, 2023

Agree that this seems prudent.

docs/proposals/20230123-azure-service-operator.md Show resolved Hide resolved

docs/proposals/20230123-azure-service-operator.md Outdated

+                Azure or Kubernetes API limits with fewer or smaller workload clusters being managed.
+                - Management cluster will have to manage many more Kubernetes resources per
+                  workload cluster
+              - Because ASO has not yet been proven as a mission-critical interface to Azure

matthchr Feb 3, 2023

Well-phrased. I agree with this as a risk.

I think it makes a good bit of sense to make a shared bet. As you called out, ASO is solving the "2. Interfacing with the Azure platform to manage creating, updating, and deleting that infrastructure" problem, so it should end up reducing the work CAPZ has to do on that stuff, but this is a risk as obviously the Azure Go SDK has much broader adoption and is more mature (GA) than ASO is currently.

docs/proposals/20230123-azure-service-operator.md Outdated

+                used instead of the API or SDK directly
+              - Conflicting user installations of ASO or ASO resources
+              - Future breaking changes in ASO
+              - Lower-fidelity telemetry compared to what CAPZ tracks currently

matthchr Feb 3, 2023

This is something we'd love to work with you guys on I think. We have some basic telemetry exposed already: https://azure.github.io/azure-service-operator/introduction/metrics/ - if you gave us a list of what exactly you wanted (or were losing in this migration) we could work to expose that data.

Or is the issue here more than you had integrations into the Azure SDK to track aggregate metrics such as "time it takes to fully provision a cluster" that you'd be losing?

Contributor Author

nojnhuh Feb 22, 2023

For the little bit I've used CAPZ's tracing, I've found it helpful to have a breakdown of how long each step in a single CAPZ reconciliation takes. Since that includes Azure API calls currently, I think my main concern was losing that kind of association between a CAPZ reconciliation and Azure API calls. I updated this section to mention that I don't think that would really matter though since Azure API calls would be happening in ASO completely out-of-band with CAPZ reconciliations. Or at least recreating that mapping seems like it would be unnecessarily difficult.

Contributor

devigned Feb 27, 2023

There have been discussions about tracking resource lifecycle and some related KEP work: https://groups.google.com/g/kubebuilder/c/tNI6ZpQ2loM/m/8rSX6HKVDgAJ. Correlation is going to be difficult. However, we might be able to trace with observed generation and namespace/name to get something close enough.

docs/proposals/20230123-azure-service-operator.md Outdated Show resolved Hide resolved

CecileRobertMichon reviewed

View reviewed changes

docs/proposals/20230123-azure-service-operator.md Show resolved Hide resolved

k8s-ci-robot requested a review from devigned

February 3, 2023 23:23

dtzar reviewed

View reviewed changes

docs/proposals/20230123-azure-service-operator.md Show resolved Hide resolved

CecileRobertMichon reviewed

View reviewed changes

docs/proposals/20230123-azure-service-operator.md Outdated


		CAPZ interacts with some Azure services that do not represent infrastructure, and thus cannot be represented in ASO. Resource Health, for example, is "reconciled" by CAPZ currently by getting a resource's health status and reflecting that in the corresponding CAPZ resource, but does not create or update any distinct Azure resources. The new SDK could be used to implement this existing functionality without affecting other service interfaces' use of ASO. Implementing Resource Health in ASO is being tracked in https://github.com/Azure/azure-service-operator/issues/2762.

		Also, use of the `clusterctl move` command will require extra manual steps to move ASO resources as documented here: https://azure.github.io/azure-service-operator/introduction/frequently-asked-questions/#what-is-the-best-practice-for-transferring-aso-resources-from-one-cluster-to-another. Specifically, before `clusterctl move` is run, each ASO resource under the ownership hierarchy of a Cluster must have its `serviceoperator.azure.com/reconcile-policy` annotation set to `skip`. The necessary ASO resources can be enumerated by invoking `clusterctl move --dry-run -v 1`. `clusterctl move` will automatically detect and move the ASO resources. Then after `clusterctl move` is complete, the annotation should be changed back to its previous state.

Contributor

CecileRobertMichon Apr 4, 2023

Specifically, before clusterctl move is run, each ASO resource under the ownership hierarchy of a Cluster must have its serviceoperator.azure.com/reconcile-policy annotation set to skip

that's not a great experience for users. They shouldn't have to care or even know about ASO as it's an implementation detail of CAPZ and not something they opt into. I think it's okay for a user to have to apply the annotation in the context where they are directly using ASO, but in the case where the CAPZ controller is the one "using" ASO to provision resources, the CAPZ controller should be the one applying these annotations. This might be tricky and might require some changes to clusterctl move but we should really try to avoid manual intervention from the user.

nojnhuh mentioned this pull request

Allow providers to react when Clusters are paused and indicate in the Cluster's status when those actions are finished kubernetes-sigs/cluster-api#8473

Closed

codecov-commenter commented Apr 4, 2023

Codecov Report

Patch coverage has no change and project coverage change: +11.07 🎉

Comparison is base (4fc2041) 40.42% compared to head (e20a876) 51.50%.

Additional details and impacted files

@@             Coverage Diff             @@
##             main    #3113       +/-   ##
===========================================
+ Coverage   40.42%   51.50%   +11.07%     
===========================================
  Files         241      182       -59     
  Lines       29560    18054    -11506     
===========================================
- Hits        11951     9298     -2653     
+ Misses      16700     8229     -8471     
+ Partials      909      527      -382

see 109 files with indirect coverage changes

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

nojnhuh mentioned this pull request

[Feature] Export azure sdk ratelimit metric #2790

Closed

Contributor Author

nojnhuh commented Apr 6, 2023

I just pushed a couple small changes adding updates on clusterctl move and the gap in services that CAPZ uses that ASO doesn't support yet. Overall I think the proposal is complete even though there are a few identified gaps, but I think those are mostly implementation details don't affect how feasible it is overall to use ASO, so I'd advocate for starting lazy consensus on this again soon.

CecileRobertMichon reviewed

View reviewed changes

Contributor

CecileRobertMichon left a comment

/lgtm

k8s-ci-robot added the lgtm label

Contributor

k8s-ci-robot commented Apr 6, 2023

LGTM label has been added.

Git tree hash: 0e3340eac8bb9c27333df20f45e2318541b27837


          add Azure Service Operator proposal

f6768f9

Contributor Author

nojnhuh commented Apr 7, 2023

Officially starting lazy consensus on this, ending EOD 14 April (end of next week).

nojnhuh force-pushed the aso-proposal branch from 0984695 to f6768f9 Compare

April 7, 2023 18:27

nojnhuh mentioned this pull request

Integrate Azure Service Operator #3402

Closed

matthchr approved these changes

View reviewed changes

matthchr left a comment •

edited

Loading

LGTM.

I think this summarizes the pros/cons of using ASO quite well.

I will leave the actual decision of if the pros outweigh the cons to you experts as I don't have great visibility into the costs/benefits for CAPZ as a project when comparing ASO to something like the track2 SDKs.

Contributor

jackfrancis commented Apr 13, 2023

/lgtm

I can’t add any more than what many others have said before me in these PR threads.

Great work @nojnhuh!

k8s-ci-robot assigned jackfrancis

Contributor

CecileRobertMichon commented Apr 13, 2023

/approve
/hold for lazy consensus expiration

Contributor

k8s-ci-robot commented Apr 13, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: CecileRobertMichon

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [CecileRobertMichon]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot added the approved label

Member

nawazkh commented Apr 14, 2023

Great work! Kudos @nojnhuh! 🚀
/lgtm

k8s-ci-robot assigned nawazkh

Contributor Author

nojnhuh commented Apr 17, 2023

Time for slash hold cancel? 🤠

CecileRobertMichon reviewed

View reviewed changes

Contributor

CecileRobertMichon left a comment

/hold cancel

k8s-ci-robot removed the do-not-merge/hold label

Contributor

CecileRobertMichon commented Apr 17, 2023

/pony

Contributor

k8s-ci-robot commented Apr 17, 2023

@CecileRobertMichon:

In response to this:

/pony

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot merged commit 35f837e into kubernetes-sigs:main

nojnhuh deleted the aso-proposal branch

April 17, 2023 15:37

nojnhuh mentioned this pull request

add asogroups #3574

Merged

4 tasks

nojnhuh mentioned this pull request

feat(aso): add azure workload identity label to pod template #4452

Closed

4 tasks

nojnhuh mentioned this pull request

Add experimental ASO-based API #4713

Closed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Reviewers

willie-yao willie-yao left review comments

CecileRobertMichon CecileRobertMichon left review comments

dtzar dtzar left review comments

jackfrancis jackfrancis left review comments

mboersma mboersma left review comments

mtougeron mtougeron left review comments

matthchr matthchr approved these changes

Jont828 Awaiting requested review from Jont828

sonasingh46 Awaiting requested review from sonasingh46

devigned Awaiting requested review from devigned

Labels

approved cncf-cla: yes kind/design lgtm release-note-none size/L