
Azure: use per-vmss vmssvm incremental cache #93107

Merged

Conversation

bpineau
Contributor

@bpineau bpineau commented Jul 15, 2020

What type of PR is this?

/kind bug

What this PR does / why we need it:

In Azure's cloud provider, VMSS VM API accesses are mediated through a single cache that holds and refreshes all VMSS together.

Because of that we hit the VMSSVM.List API more often than necessary: a cache miss or expiration for one instance should only require re-listing a single VMSS, while today it costs O(n) API calls relative to the number of attached Scale Sets.

Under hard pressure (clusters with so many attached VMSS that they can't all be listed in one sequence of successive API calls), the controller manager can get stuck re-listing everything from scratch, then aborting the whole operation due to rate limits, which affects the whole Subscription.

This patch replaces the global VMSS VMs cache with per-VMSS VMs caches. Refreshes (VMSS VMs lists) are scoped to the single relevant VMSS; under severe throttling, the various caches can be refreshed incrementally.
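
For illustration, here is a minimal Go sketch of the per-VMSS cache layout described above. The type and field names (vmssVMCache, vmEntry, listVMs) are assumptions made for this example, not the PR's actual code:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// vmEntry stands in for a cached VMSS VM record.
type vmEntry struct {
	instanceID string
}

// vmssVMCache keeps one independently refreshable cache per scale set,
// so a miss on one VMSS no longer forces re-listing every attached VMSS.
type vmssVMCache struct {
	mu      sync.Mutex
	ttl     time.Duration
	byVMSS  map[string]map[string]vmEntry // vmssName -> nodeName -> entry
	refresh map[string]time.Time          // vmssName -> last refresh time
	listVMs func(vmssName string) (map[string]vmEntry, error)
}

// get returns the cached VM for (vmssName, nodeName), re-listing only the
// single relevant scale set when its cache is missing or expired.
func (c *vmssVMCache) get(vmssName, nodeName string) (vmEntry, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if last, ok := c.refresh[vmssName]; !ok || time.Since(last) > c.ttl {
		vms, err := c.listVMs(vmssName) // scoped list: one VMSS only
		if err != nil {
			return vmEntry{}, err
		}
		c.byVMSS[vmssName] = vms
		c.refresh[vmssName] = time.Now()
	}
	vm, ok := c.byVMSS[vmssName][nodeName]
	if !ok {
		return vmEntry{}, fmt.Errorf("node %q not found in VMSS %q", nodeName, vmssName)
	}
	return vm, nil
}

func main() {
	c := &vmssVMCache{
		ttl:     10 * time.Minute,
		byVMSS:  map[string]map[string]vmEntry{},
		refresh: map[string]time.Time{},
		listVMs: func(vmssName string) (map[string]vmEntry, error) {
			// Stand-in for an Azure VMSSVM.List call scoped to vmssName.
			return map[string]vmEntry{vmssName + "000000": {instanceID: "0"}}, nil
		},
	}
	vm, err := c.get("aks-nodepool1-", "aks-nodepool1-000000")
	fmt.Println(vm, err)
}
```

Under throttling, a failed list for one scale set only delays that scale set's refresh; the other per-VMSS caches keep serving their previous contents.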

Which issue(s) this PR fixes:

Fixes #93106

Special notes for your reviewer:

We assume VMSS nodes are named from the VMSS' computerNamePrefix + instance ID (or vmssName + instance ID when computerNamePrefix isn't specified), as described at https://docs.microsoft.com/en-us/azure/virtual-machine-scale-sets/virtual-machine-scale-sets-instance-ids and https://docs.microsoft.com/en-us/azure/templates/microsoft.compute/2018-10-01/virtualmachinescalesets#virtualmachinescalesetosprofile-object. Are there special cases not covered by those docs? If so, we can probably complement that optimistic lookup (trying the VMSS whose name prefix matches first, the happy path) with a fallback scan over the remaining scale sets. Even that non-optimal fallback path would still be an improvement, since in case of throttling we keep partial results (the per-VMSS caches) and refreshes stay incremental.
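
For illustration, here is a minimal Go sketch of that optimistic prefix-based lookup, under the naming assumption above. The names (scaleSet, vmssForNode) are invented for the example and are not the PR's actual code:

```go
package main

import (
	"fmt"
	"strings"
)

// scaleSet stands in for the VMSS metadata the cloud provider already caches.
type scaleSet struct {
	name               string
	computerNamePrefix string // may be empty; the VMSS name is used instead
}

// vmssForNode tries the happy path first: match the node name against each
// scale set's computerNamePrefix (or its name when no prefix is set).
// A fallback scan over the remaining scale sets could complement this if a
// node's name falls outside the documented prefix+instanceID convention.
func vmssForNode(nodeName string, scaleSets []scaleSet) (string, bool) {
	for _, ss := range scaleSets {
		prefix := ss.computerNamePrefix
		if prefix == "" {
			prefix = ss.name
		}
		if strings.HasPrefix(nodeName, prefix) {
			return ss.name, true
		}
	}
	return "", false
}

func main() {
	sets := []scaleSet{
		{name: "aks-nodepool1-vmss", computerNamePrefix: "aks-nodepool1-"},
		{name: "aks-nodepool2-vmss", computerNamePrefix: "aks-nodepool2-"},
	}
	fmt.Println(vmssForNode("aks-nodepool2-000003", sets)) // aks-nodepool2-vmss true
}
```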

Does this PR introduce a user-facing change?:

Azure: per-VMSS VMSS VMs cache to prevent throttling on clusters with many attached VMSS

/assign @andyzhangx @feiskyer
/sig cloud-provider
/area provider/azure

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. labels Jul 15, 2020
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. area/provider/azure Issues or PRs related to azure provider cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jul 15, 2020
@k8s-ci-robot
Contributor

Hi @bpineau. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@andyzhangx
Member

/ok-to-test
/priority important-soon
/sig cloud-provider
/area provider/azure
thanks for the contribution @bpineau

@k8s-ci-robot k8s-ci-robot added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Jul 15, 2020
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 18, 2020
@bpineau bpineau force-pushed the azure-per-vmss-vmssvm-incremental-cache branch from 212d0ca to c795853 Compare July 19, 2020 08:23
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 19, 2020
@bpineau bpineau force-pushed the azure-per-vmss-vmssvm-incremental-cache branch from c795853 to ceef4c4 Compare July 19, 2020 14:30
@nilo19
Member

nilo19 commented Jul 20, 2020

1/37 tests failed.

staticcheck (6m57s) reported:
vendor/k8s.io/legacy-cloud-providers/azure/azure_vmss_cache_test.go:118:2: this value of err is never used (SA4006)

Please review the above warnings. You can test via: hack/verify-staticcheck.sh <failing package>
If the above warnings do not make sense, you can exempt the line or file. See: https://staticcheck.io/docs/#ignoring-problems

could you please fix the error and re-run the tests?
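
For context, SA4006 flags an error value that is assigned but never read before being overwritten or going out of scope. A minimal illustrative Go sketch of the pattern and its usual fix (not the actual code from azure_vmss_cache_test.go):

```go
package main

import "fmt"

func firstCall() (int, error)  { return 1, nil }
func secondCall() (int, error) { return 2, nil }

func main() {
	// SA4006 fires when an assigned err is never used, e.g.:
	//
	//	v, err := firstCall()
	//	v, err = secondCall() // previous err was never read
	//
	// The fix is to check (or deliberately discard) err after each assignment.
	v, err := firstCall()
	if err != nil {
		fmt.Println("firstCall failed:", err)
		return
	}
	v, err = secondCall()
	if err != nil {
		fmt.Println("secondCall failed:", err)
		return
	}
	fmt.Println(v)
}
```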

@bpineau bpineau force-pushed the azure-per-vmss-vmssvm-incremental-cache branch 2 times, most recently from 3abf8fb to 61812bf Compare July 20, 2020 11:23
return node, nil
}

if len(nodeName) < 6 {
Member


could we use getScaleSetVMInstanceID() here to check whether a node is VMSS instance or not?
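
For reference, a sketch of what such a check could look like, assuming a helper that decodes the trailing 6 base-36 characters of a VMSS computer name into an instance ID; the implementation below is illustrative, not the provider's exact code:

```go
package main

import (
	"fmt"
	"strconv"
)

// getScaleSetVMInstanceID extracts the instance ID from a VMSS node name by
// decoding its trailing 6 base-36 characters; an error means the name does
// not follow the VMSS naming convention (likely not a VMSS instance).
func getScaleSetVMInstanceID(machineName string) (string, error) {
	if len(machineName) < 6 {
		return "", fmt.Errorf("not a VMSS instance name: %q", machineName)
	}
	id, err := strconv.ParseUint(machineName[len(machineName)-6:], 36, 64)
	if err != nil {
		return "", fmt.Errorf("not a VMSS instance name: %q", machineName)
	}
	return strconv.FormatUint(id, 10), nil
}

func main() {
	for _, name := range []string{"aks-nodepool1-000003", "master-0"} {
		id, err := getScaleSetVMInstanceID(name)
		fmt.Println(name, id, err)
	}
}
```

Using the error from such a helper as the "is this a VMSS instance" signal would avoid the bare len(nodeName) < 6 length check.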

Member

@feiskyer feiskyer left a comment


Thanks for fixing the issue. Adding it to the v1.19 milestone since it fixes a serious bug for scenarios with a large number of VMSS.
/milestone v1.19

@k8s-ci-robot k8s-ci-robot added this to the v1.19 milestone Jul 20, 2020
@feiskyer
Member

/retest

bpineau added 2 commits July 20, 2020 18:35
Azure's cloud provider VMSS VMs API accesses are mediated through
a cache holding and refreshing all VMSS together.

Due to that we hit VMSSVM.List API more often than we could: an
instance's cache miss or expiration should only require a single
VMSS re-list, while it's currently O(n) relative to the number of
attached Scale Sets.

Under hard pressure (clusters with many attached VMSS that can't all
be listed in one sequence of successive API calls) the controller
manager might be stuck trying to re-list everything from scratch,
then aborting the whole operation; then re-trying and re-triggering
API rate-limits, affecting the whole Subscription.

This patch replaces the global VMSS VMs cache by per-VMSS VMs caches.
Refreshes (VMSS VMs lists) are scoped to the single relevant VMSS; under
severe throttling the various caches can be incrementally refreshed.

Signed-off-by: Benjamin Pineau <[email protected]>
@bpineau bpineau force-pushed the azure-per-vmss-vmssvm-incremental-cache branch from 61812bf to fcb3f1f Compare July 20, 2020 16:37
@feiskyer
Member

/retest

1 similar comment
@nilo19
Member

nilo19 commented Jul 21, 2020

/retest

Member

@feiskyer feiskyer left a comment


/lgtm
/approve
/retest

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 21, 2020
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bpineau, feiskyer

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 21, 2020
@fejta-bot

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

1 similar comment
@fejta-bot

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

@k8s-ci-robot
Contributor

@bpineau: The following test failed, say /retest to rerun all failed tests:

Test name: pull-kubernetes-conformance-kind-ga-only-parallel
Commit: fcb3f1f
Details: link
Rerun command: /test pull-kubernetes-conformance-kind-ga-only-parallel

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@fejta-bot

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.
