Skip to content
This repository was archived by the owner on Jan 11, 2023. It is now read-only.

Scaling Windows agent pools for Availability Sets and VMSS #2859

Merged
merged 23 commits into from
Jun 1, 2018
Merged

Scaling Windows agent pools for Availability Sets and VMSS #2859

merged 23 commits into from
Jun 1, 2018

Conversation

danigian
Copy link
Contributor

@danigian danigian commented May 4, 2018

What this PR does / why we need it:
This code implements the scaling for Windows agent pools (both for Availability Sets and Virtual Machine Scale Sets)

Which issue this PR fixes:
fixes #1972 #2649

Special notes for your reviewer:

If applicable:

  • documentation
  • unit tests
  • tested backward compatibility (ie. deploy with previous version, upgrade with this branch)

Release note:

@ghost ghost assigned danigian May 4, 2018
@ghost ghost added the in progress label May 4, 2018
@sozercan
Copy link
Member

sozercan commented May 6, 2018

For VMSS, you should be able to scale using portal or CLI, shouldn't need ./acs-engine scale

@danigian
Copy link
Contributor Author

danigian commented May 7, 2018

@sozercan I agree.
I thought that having a unified experience with acs-engine scale for AS and VMSS could have been a good idea

@qmalexander
Copy link

So @danigian . How is it going with this?

thomastaylor312 and others added 2 commits May 18, 2018 12:47
This updates client-go to v7.0.0 and adds the needed dependencies on
`k8s.io/api`. Also fixes a small issue with conflicting `uuid` library
versions
make build-vendor passed!
@danigian
Copy link
Contributor Author

@CecileRobertMichon waiting for PR #2954 to be approved as it solves the issues I had (#2869).
In order to test it, I merged changes from that PR into my branch and right now I have both AS and VMSS scaling (out and in) working again for Windows Agent Pools.

@CecileRobertMichon
Copy link
Contributor

That's great news thank you @danigian! cc: @jackfrancis

@CecileRobertMichon
Copy link
Contributor

@Damaggio #2954 just got merged so let's move this forward 🙂 The unit tests are failing because of lint / go import errors so I'll let you fix those.

@acs-bot acs-bot added size/XL and removed size/XXL labels May 22, 2018
@acs-bot acs-bot removed the size/XL label May 22, 2018
@acs-bot acs-bot added size/M and removed size/XXL labels May 22, 2018
@danigian
Copy link
Contributor Author

@CecileRobertMichon everything should be ok now
I tested the following scenarios:

  • AvailabilitySets Scale-out and Scale-in (both in hybrid and windows only clusters) with and without pods already assigned to the nodes
  • VirtualMachineScaleSets Scale-out and Scale-in (as the previous line)

Please do more tests if you want.

Thanks

@CecileRobertMichon CecileRobertMichon changed the title [WIP] Scaling Windows agent pools for Availability Sets and VMSS Scaling Windows agent pools for Availability Sets and VMSS May 24, 2018
@codecov
Copy link

codecov bot commented May 24, 2018

Codecov Report

Merging #2859 into master will decrease coverage by 0.01%.
The diff coverage is 32.35%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2859      +/-   ##
==========================================
- Coverage   51.59%   51.57%   -0.02%     
==========================================
  Files          97       97              
  Lines       14834    14861      +27     
==========================================
+ Hits         7654     7665      +11     
- Misses       6482     6496      +14     
- Partials      698      700       +2
Impacted Files Coverage Δ
pkg/operations/kubernetesupgrade/upgradecluster.go 46.85% <0%> (ø) ⬆️
cmd/scale.go 12.68% <5.55%> (-0.25%) ⬇️
pkg/armhelpers/utils/util.go 63.09% <71.42%> (+1.66%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3482257...a351ae0. Read the comment docs.

@CecileRobertMichon
Copy link
Contributor

Running back compat scale + upgrade tests

cmd/scale.go Outdated
@@ -78,6 +79,7 @@ func newScaleCmd() *cobra.Command {
f.BoolVar(&sc.classicMode, "classic-mode", false, "enable classic parameters and outputs")
f.StringVar(&sc.agentPoolToScale, "node-pool", "", "node pool to scale")
f.StringVar(&sc.masterFQDN, "master-FQDN", "", "FQDN for the master load balancer, Needed to scale down Kubernetes agent pools")
f.BoolVar(&sc.windowsPool, "windows", false, "true if windows pool")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it necessary to have a windows flag? Would it be possible to determine if the agentpool is windows based on the names of the VMs?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there is no other way or it is complicated, it's fine for now but we should add a section in docs about scaling windows agent pools and mention the flag there.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will work on this now as it should not be a problem removing the flag, using instead the OsType of the OsDisk object in the StorageProfile (should be better than using rather the name which could change in time). In a first moment I added the flag because I thought that could have been useful, but actually it is not

Copy link
Contributor Author

@danigian danigian May 28, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@CecileRobertMichon I ended up with using StorageProfile.ImageReference.Publisher to understand if it is a Windows Agent Pool as the OsType property in the OsDisk Object is not available when using VMSS.
Everything should work as before, but without the windows flag.

Please review this new commit

continue
}

osPublisher := *vm.StorageProfile.ImageReference.Publisher
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it'd be cleaner to use *vm.StorageProfile.OsDisk.OsType == Windows instead

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be cleaner, however I decided to use the osPublisher as it is the same check done for VMSS (which are not defining OsDisk.OsType).
Do you want me to change this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah that makes sense. All good then.

cmd/scale.go Outdated
if sc.windowsPool {
_, _, poolindex, err = utils.WindowsVMSSNameParts(*vmss.Name)
osPublisher := *vmss.VirtualMachineProfile.StorageProfile.ImageReference.Publisher
if strings.EqualFold(osPublisher, "MicrosoftWindowsServer") {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can have a bool var isWindows defined above and re-use it here instead of checking again

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure about what you mean here, sorry. The check happens only one time: when the vmss corresponding to the agent pool, we want to scale, is found.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My bad, I read the code too fast the first time and didn't realize this was VMSS.

@CecileRobertMichon
Copy link
Contributor

CecileRobertMichon commented May 30, 2018

I just ran an scale on \examples\windows\kubernetes.json and got the following error:

+ docker run --rm -v /var/lib/jenkins/workspace/k8s-scale:/go/src/github.com/Azure/acs-engine -w /go/src/github.com/Azure/acs-engine -e RESOURCE_GROUP=kubernetes-northcentralus-66441 -e REGION=northcentralus quay.io/deis/go-dev:v1.12.0 ./bin/acs-engine scale --subscription-id 3014546b-7d1c-4f80-8523-f24a9976fe6a --deployment-dir _output/kubernetes-northcentralus-66441 --location northcentralus --resource-group kubernetes-northcentralus-66441 --master-FQDN kubernetes-northcentralus-66441.northcentralus.cloudapp.azure.com --node-pool windowspool2 --new-node-count 5 --auth-method client_secret --client-id xxx --client-secret xxx
INFO[0000] validating...                                
INFO[0003] Name suffix: %s 20789360                     
panic: runtime error: index out of range

goroutine 1 [running]:
github.com/Azure/acs-engine/cmd.(*scaleCmd).run(0xc4203ab800, 0xc4203f2d80, 0xc4203a1cc0, 0x0, 0x14, 0x0, 0x0)
	/go/src/github.com/Azure/acs-engine/cmd/scale.go:240 +0x2c9f
github.com/Azure/acs-engine/cmd.newScaleCmd.func1(0xc4203f2d80, 0xc4203a1cc0, 0x0, 0x14, 0x0, 0x0)
	/go/src/github.com/Azure/acs-engine/cmd/scale.go:69 +0x52
github.com/Azure/acs-engine/vendor/github.com/spf13/cobra.(*Command).execute(0xc4203f2d80, 0xc4203a1b80, 0x14, 0x14, 0xc4203f2d80, 0xc4203a1b80)
	/go/src/github.com/Azure/acs-engine/vendor/github.com/spf13/cobra/command.go:647 +0x3e4
github.com/Azure/acs-engine/vendor/github.com/spf13/cobra.(*Command).ExecuteC(0xc4203f2000, 0xc4203f2d80, 0xc4203f2b40, 0xc4203f2900)
	/go/src/github.com/Azure/acs-engine/vendor/github.com/spf13/cobra/command.go:726 +0x2d4
github.com/Azure/acs-engine/vendor/github.com/spf13/cobra.(*Command).Execute(0xc4203f2000, 0xc4200e6008, 0x10e79f3)
	/go/src/github.com/Azure/acs-engine/vendor/github.com/spf13/cobra/command.go:685 +0x2b
main.main()
	/go/src/github.com/Azure/acs-engine/main.go:12 +0x74

Am I missing something?

@danigian
Copy link
Contributor Author

danigian commented May 31, 2018

@CecileRobertMichon i can't reproduce the error. This is what I've got now:

root@186d526c1672:/gopath/src/github.com/Azure/acs-engine# ./bin/acs-engine deploy --api-model damaggioscale.json -l northcentralus -g ncusrg --auth-method client_secret --client-id xxx --client-secret xxx --subscription-id xxx
INFO[0013] Starting ARM Deployment (ncusrg-1688597576). This will take some time... 
INFO[0577] Finished ARM Deployment (ncusrg-1688597576). Succeeded 

root@186d526c1672:/gopath/src/github.com/Azure/acs-engine# ./bin/acs-engine scale --deployment-dir _output/damaggioscale -l northcentralus -g ncusrg --master-FQDN damaggioscale.northcentralus.cloudapp.azure.com --node-pool windowspool2 --new-node-count 5 --auth-method client_secret --client-id xxx --client-secret xxx --subscription-id xxx
INFO[0000] validating...                                
INFO[0004] Name suffix: %s 36123647                     
0
1
INFO[0004] Found no resources with type Microsoft.Network/routeTables in the template.  source="scaling command line"
INFO[0004] Starting ARM Deployment (ncusrg-1312361338). This will take some time... 
INFO[0546] Finished ARM Deployment (ncusrg-1312361338). Succeeded

This is what I got from a deployment point of view
2018-05-31 10_45_43-
and this is what I see in kubectl get no
2018-05-31 10_52_00-ubuntudamaggio northeurope cloudapp azure com - remote desktop connection

Let me know

Update: I was also able to correctly run the upgrade on this cluster
2018-05-31 12_12_52-ubuntudamaggio northeurope cloudapp azure com - remote desktop connection

@CecileRobertMichon

This comment has been minimized.

@CecileRobertMichon

This comment has been minimized.

@CecileRobertMichon
Copy link
Contributor

CecileRobertMichon commented May 31, 2018

Update: Jenkins test was broken, tested back compat manually and it passed. This should be ready to go pending e2e and code review from @jackfrancis.

/lgtm
/approve

@jackfrancis
Copy link
Member

/approve

@acs-bot
Copy link

acs-bot commented May 31, 2018

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: CecileRobertMichon, jackfrancis

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [CecileRobertMichon,jackfrancis]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@CecileRobertMichon CecileRobertMichon merged commit c02ec75 into Azure:master Jun 1, 2018
@ghost ghost removed the in progress label Jun 1, 2018
@CecileRobertMichon
Copy link
Contributor

Thank you @danigian!! 🎉

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Cannot scale Windows agent pool (Kubernetes Cluster)
8 participants