Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix managed cluster diff so aks added fields are omitted #1800

Merged
merged 5 commits into from
Nov 11, 2021

Conversation

devigned
Copy link
Contributor

@devigned devigned commented Oct 27, 2021

What type of PR is this?
/kind bug

What this PR does / why we need it:
When an Azure Managed Cluster reconciles, will get into a never ending update loop trying to update the managed cluster due to the spec not matching the AKS computed fields for SKU and LoadBalancer related fields.

This PR does the following:

  • defaults the SKU if not specified in the spec
  • Sets the existing LB properties for the managed cluster to nil if the spec generated properties are nil
  • Sets the EffectiveOutboundIPs field to nil for the managed cluster if spec generated LB properties are not nil
  • Adds support for transient error handling in CP and MP controllers to backoff for known transient failure states.
  • AMMPs reconcile only after the CP is initialized since CP creation includes the initial specs for the AMMPs, this does not delay creation of those resources, but rather, waits until they first exist before checking state.

I don't particularly like this solution as it is brittle in the face of new computed defaults from the AKS side. I would much rather see a more robust solution where both the spec generated managed cluster and the fetched managed cluster are normalized, then diffed to ensure we are only comparing the properties that should trigger an update.

/cc @richardchen331 @alexeldeib

Please confirm that if this PR changes any image versions, then that's the sole change this PR makes.

TODOs:

  • squashed commits
  • includes documentation
  • adds unit tests

Release note:

fix managed cluster endless reconcile loop due to a differences in AKS computed fields and spec generated fields

@k8s-ci-robot k8s-ci-robot added the release-note Denotes a PR that will be considered when it comes time to generate release notes. label Oct 27, 2021
@k8s-ci-robot
Copy link
Contributor

@devigned: GitHub didn't allow me to request PR reviews from the following users: richardchen331.

Note that only kubernetes-sigs members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

What type of PR is this?
/kind bug

What this PR does / why we need it:
When an Azure Managed Cluster reconciles, will get into a never ending update loop trying to update the managed cluster due to the spec not matching the AKS computed fields for SKU and LoadBalancer related fields.

This PR does the following:

  • defaults the SKU if not specified in the spec
  • Sets the existing LB properties for the managed cluster to nil if the spec generated properties are nil
  • Sets the EffectiveOutboundIPs field to nil for the managed cluster if spec generated LB properties are not nil

I don't particularly like this solution as it is brittle in the face of new computed defaults from the AKS side. I would much rather see a more robust solution where both the spec generated managed cluster and the fetched managed cluster are normalized, then diffed to ensure we are only comparing the properties that should trigger an update.

/cc @richardchen331 @alexeldeib

Please confirm that if this PR changes any image versions, then that's the sole change this PR makes.

TODOs:

  • squashed commits
  • includes documentation
  • adds unit tests

Release note:

fix managed cluster endless reconcile loop due to a differences in AKS computed fields and spec generated fields

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. area/provider/azure Issues or PRs related to azure provider sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. labels Oct 27, 2021
@CecileRobertMichon
Copy link
Contributor

/test pull-cluster-api-provider-azure-e2e-exp

@devigned devigned force-pushed the fix-managedcluster-diff branch from 345fdb6 to 980431f Compare October 27, 2021 21:00
@devigned
Copy link
Contributor Author

/test pull-cluster-api-provider-azure-e2e-exp

@devigned devigned force-pushed the fix-managedcluster-diff branch from 980431f to e205c14 Compare October 27, 2021 21:13
@devigned
Copy link
Contributor Author

/test pull-cluster-api-provider-azure-e2e-exp

@devigned devigned force-pushed the fix-managedcluster-diff branch from e205c14 to 66fa35c Compare October 27, 2021 22:11
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Oct 27, 2021
@devigned
Copy link
Contributor Author

/test pull-cluster-api-provider-azure-e2e-exp

@devigned devigned force-pushed the fix-managedcluster-diff branch from 98fb4b6 to bf10ce4 Compare October 27, 2021 23:29
@devigned
Copy link
Contributor Author

/test pull-cluster-api-provider-azure-e2e-exp

@devigned devigned force-pushed the fix-managedcluster-diff branch 2 times, most recently from 9731287 to b71c3e3 Compare October 28, 2021 13:45
@devigned
Copy link
Contributor Author

/hold

Still investigating.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 28, 2021
@devigned
Copy link
Contributor Author

/test pull-cluster-api-provider-azure-e2e-exp

@devigned devigned force-pushed the fix-managedcluster-diff branch from b71c3e3 to 429ec98 Compare October 28, 2021 15:49
@devigned
Copy link
Contributor Author

/test pull-cluster-api-provider-azure-e2e-exp

@devigned devigned force-pushed the fix-managedcluster-diff branch 2 times, most recently from 3b8e8dd to fab59e5 Compare October 28, 2021 23:20
@devigned devigned force-pushed the fix-managedcluster-diff branch from fc72529 to 68e0cd0 Compare November 8, 2021 18:50
@devigned
Copy link
Contributor Author

devigned commented Nov 8, 2021

/test pull-cluster-api-provider-azure-e2e-exp

2 similar comments
@devigned
Copy link
Contributor Author

devigned commented Nov 8, 2021

/test pull-cluster-api-provider-azure-e2e-exp

@devigned
Copy link
Contributor Author

devigned commented Nov 9, 2021

/test pull-cluster-api-provider-azure-e2e-exp

@devigned
Copy link
Contributor Author

devigned commented Nov 10, 2021

Update: After debugging with @CecileRobertMichon and @jackfrancis we have determined the VMSSs created by the AKS cluster exist in the nodeResourceGroup, but the API is responding with an empty list. We are determining the root cause of this unexpected API response.

/hold

@LochanRn
Copy link
Member

Update: After debugging with @CecileRobertMichon and @jackfrancis we have determined the VMSSs created by the AKS cluster exist in the nodeResourceGroup, but the API is responding with an empty list. We are determining the root cause of this unexpected API response.

/hold

Is this only with all the type of nodepools ? or just user nodepools ?

@devigned
Copy link
Contributor Author

Is this only with all the type of nodepools ? or just user nodepools ?

I don't think it has anything to do with the type of node pools. We have reached out to msft teams responsible for this API, so I should have a definitive answer soon.

@jackfrancis
Copy link
Contributor

@devigned given that we have identified the source of test failures are we able to temporarily reduce regional coverage to those that are more likely to overcome these errors, and then move the changes in this PR forward?

In other words, these changes are not exclusively intended to "fix" test failures, right? I think we should remove the various debug changes that were added to help debug, restore this PR to its state of intended change, and retest the "exp" job until we get a passing test. (Like, let's not block these changes on Azure platform errors that are largely out of our control.)

@devigned devigned force-pushed the fix-managedcluster-diff branch from 34d783f to 6189796 Compare November 10, 2021 17:31
@k8s-ci-robot
Copy link
Contributor

@devigned: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-cluster-api-provider-azure-apidiff 6a9b3b9 link false /test pull-cluster-api-provider-azure-apidiff

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@devigned
Copy link
Contributor Author

/test pull-cluster-api-provider-azure-e2e-exp

Copy link
Contributor

@jackfrancis jackfrancis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@CecileRobertMichon
Copy link
Contributor

@devigned is this ready or did you want to run a couple more retests?

@CecileRobertMichon
Copy link
Contributor

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 10, 2021
@CecileRobertMichon
Copy link
Contributor

actually

/hold

I'll let @devigned remove the hold when we're good from a testing standpoint

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 10, 2021
@devigned
Copy link
Contributor Author

/test pull-cluster-api-provider-azure-e2e-exp

give it one more just for good measure.

@CecileRobertMichon
Copy link
Contributor

/lgtm
/approve
/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 10, 2021
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 10, 2021
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: CecileRobertMichon

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 10, 2021
@k8s-ci-robot k8s-ci-robot merged commit 0ac1bef into kubernetes-sigs:main Nov 11, 2021
@k8s-ci-robot k8s-ci-robot added this to the v1.1 milestone Nov 11, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/provider/azure Issues or PRs related to azure provider cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants