Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Zone redundant NAT Gateways #331

Merged
merged 11 commits into from
Jan 17, 2022
Merged

Conversation

kon-angelo
Copy link
Contributor

@kon-angelo kon-angelo commented Jun 30, 2021

How to categorize this PR?

/area control-plane
/kind enhancement
/platform azure

What this PR does / why we need it:
This PR allows the deployment of shoots with a new network setup for Azure. The new setup is akin to other extensions where it is possible to have finer-grained control over the networking of each availability zone. The main motivation is to have zone redundant NAT Gateways. Prior to this PR we used a single subnet and thus a single NAT Gateway instance. Using the new setup each subnet can potentially have its own NAT gateway instance and separate public IPs.

In addition, there is a migration path from our current zoned setup to this "enhanced" zone setup. For this to be possible, we impose the constraint that the first zone specified in the infrastructure must match the CIDR range of the existing workers subnet. The main reason is to preserve the subnet of an existing cluster and hence prevent having to teardown all the existing VMs to recreate the network setup. Instead, by preserving the original subnet we can have a controlled node migration (via MCM) to the new machineClasses and ensure minimal downtime during the migration.

Which issue(s) this PR fixes:
Fixes #https://github.com/orgs/gardener/projects/7#card-49971123

Special notes for your reviewer:
Currently missing:

  • documentation
  • a new terraform azurerm provider.

Unfortunately there is a major bug with with TF provider for Azure that we encountered (link) - when creating multiple "association" resources, the terraform provider is permanently stuck due to locking issues. With the current (at the time of writing) version we use it is happening consistently. The "custom" version of the terraformer used in this PR is build based on a newer version (2.60) and while the occurrence rate has fallen, it is still around 50% according to my tests and I consider it a showstopper until fixed. The custom terraform image is there to facilitate reviewers and testers.

We have a PR with a potential fix waiting for review (hashicorp/terraform-provider-azurerm#12267).

Release note:

Azure provider now supports a new network setup that allows for zone redundant NAT Gateways.

@gardener-robot gardener-robot added kind/api-change API change with impact on API users needs/second-opinion Needs second review by someone else area/control-plane Control plane related kind/enhancement Enhancement, improvement, extension platform/azure Microsoft Azure platform/infrastructure needs/review Needs review size/xl Size of pull request is huge (see gardener-robot robot/bots/size.py) labels Jun 30, 2021
@gardener-robot-ci-2 gardener-robot-ci-2 added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Jun 30, 2021
@gardener-robot-ci-3 gardener-robot-ci-3 added needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Jun 30, 2021
@gardener-robot-ci-3 gardener-robot-ci-3 added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Jul 1, 2021
@gardener-robot-ci-1 gardener-robot-ci-1 removed the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Jul 1, 2021
@gardener-robot-ci-3 gardener-robot-ci-3 added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Jul 1, 2021
@gardener-robot-ci-2 gardener-robot-ci-2 removed the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Jul 1, 2021
@testmachinery

This comment has been minimized.

@gardener gardener deleted a comment from testmachinery bot Jul 4, 2021
@kon-angelo

This comment has been minimized.

@testmachinery

This comment has been minimized.

@dkistner
Copy link
Member

dkistner commented Jul 5, 2021

/assign

@kon-angelo
Copy link
Contributor Author

/test

@testmachinery

This comment has been minimized.

@gardener-robot-ci-1 gardener-robot-ci-1 added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Jul 20, 2021
@gardener-robot-ci-2 gardener-robot-ci-2 removed the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Jul 20, 2021
@dkistner dkistner added reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Jul 22, 2021
Copy link
Member

@dkistner dkistner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now also done the validation part.

After an intensive pair review on that on functional and code level I have only minor comments lefts. After checking and addressing them where necessary I think we are finally good with this PR.

func ValidateInfrastructureConfigAgainstCloudProfile(oldInfra, infra *apisazure.InfrastructureConfig, shootRegion string, cloudProfile *gardencorev1beta1.CloudProfile, fld *field.Path) field.ErrorList {
allErrs := field.ErrorList{}

if len(infra.Networks.Zones) == 0 {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor hint: You could use your helper IsUsingSingleSubnetLayout() here ;)

}

allErrs = append(allErrs, validateVnetConfig(&config, infra.ResourceGroup, workerCIDR, nodes, pods, services, zonesPath, vNetPath)...)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you move this before the if else block?
Then we could eliminate the if else by returning after the helper.IsUsingSingleSubnetLayout if block is completed.

@gardener-robot-ci-1 gardener-robot-ci-1 added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Nov 29, 2021
@gardener-robot-ci-2 gardener-robot-ci-2 removed the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Nov 29, 2021
@gardener-robot gardener-robot added the needs/rebase Needs git rebase label Dec 1, 2021
@gardener-robot
Copy link

@kon-angelo You need rebase this pull request with latest master branch. Please check.

@gardener-robot-ci-3 gardener-robot-ci-3 added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Dec 13, 2021
@gardener-robot-ci-2 gardener-robot-ci-2 removed the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Dec 13, 2021
@kon-angelo
Copy link
Contributor Author

/test

@testmachinery
Copy link

testmachinery bot commented Dec 13, 2021

Testrun: e2e-rz54w
Workflow: e2e-rz54w-wf
Phase: Failed

+---------------------+---------------------+--------+----------+
|        NAME         |        STEP         | PHASE  | DURATION |
+---------------------+---------------------+--------+----------+
| infrastructure-test | infrastructure-test | Failed | 22m25s   |
+---------------------+---------------------+--------+----------+

@kon-angelo
Copy link
Contributor Author

/test

@testmachinery
Copy link

testmachinery bot commented Dec 13, 2021

Testrun: e2e-jzbw7
Workflow: e2e-jzbw7-wf
Phase: Succeeded

+---------------------+---------------------+-----------+----------+
|        NAME         |        STEP         |   PHASE   | DURATION |
+---------------------+---------------------+-----------+----------+
| infrastructure-test | infrastructure-test | Succeeded | 33m17s   |
+---------------------+---------------------+-----------+----------+

@gardener-robot-ci-3 gardener-robot-ci-3 added reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Dec 13, 2021
@dkistner
Copy link
Member

/lgtm
/unassign

@gardener-robot gardener-robot added reviewed/lgtm Has approval for merging and removed needs/rebase Needs git rebase needs/review Needs review needs/second-opinion Needs second review by someone else status/author-action Issue requires issue author action labels Dec 22, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/control-plane Control plane related kind/api-change API change with impact on API users kind/enhancement Enhancement, improvement, extension kind/test Test needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) platform/azure Microsoft Azure platform/infrastructure reviewed/lgtm Has approval for merging size/xl Size of pull request is huge (see gardener-robot robot/bots/size.py)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants