
Azure: use per-vmss vmssvm incremental cache #93107

Merged

Conversation

bpineau
Contributor

@bpineau bpineau commented Jul 15, 2020

What type of PR is this?

/kind bug

What this PR does / why we need it:

In Azure's cloud provider, VMSS VM API accesses are mediated through a single cache that holds and refreshes all VMSS together.

Because of that we hit the VMSSVM.List API more often than necessary: a cache miss or expiration for one instance should only require re-listing a single VMSS, while today it costs O(n) API calls relative to the number of attached Scale Sets.

Under hard pressure (clusters with so many attached VMSS that they can't all be listed in one sequence of successive API calls), the controller manager can get stuck re-listing everything from scratch, then aborting the whole operation due to rate limits, which affects the whole Subscription.

This patch replaces the global VMSS VMs cache with per-VMSS VMs caches. Refreshes (VMSS VMs lists) are scoped to the single relevant VMSS; under severe throttling, the various caches can be refreshed incrementally.
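
For illustration, here is a minimal Go sketch of the per-VMSS cache layout described above. The type and field names (vmssVMCache, vmEntry, listVMs) are assumptions made for this example, not the PR's actual code:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// vmEntry stands in for a cached VMSS VM record.
type vmEntry struct {
	instanceID string
}

// vmssVMCache keeps one independently refreshable cache per scale set,
// so a miss on one VMSS no longer forces re-listing every attached VMSS.
type vmssVMCache struct {
	mu      sync.Mutex
	ttl     time.Duration
	byVMSS  map[string]map[string]vmEntry // vmssName -> nodeName -> entry
	refresh map[string]time.Time          // vmssName -> last refresh time
	listVMs func(vmssName string) (map[string]vmEntry, error)
}

// get returns the cached VM for (vmssName, nodeName), re-listing only the
// single relevant scale set when its cache is missing or expired.
func (c *vmssVMCache) get(vmssName, nodeName string) (vmEntry, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if last, ok := c.refresh[vmssName]; !ok || time.Since(last) > c.ttl {
		vms, err := c.listVMs(vmssName) // scoped list: one VMSS only
		if err != nil {
			return vmEntry{}, err
		}
		c.byVMSS[vmssName] = vms
		c.refresh[vmssName] = time.Now()
	}
	vm, ok := c.byVMSS[vmssName][nodeName]
	if !ok {
		return vmEntry{}, fmt.Errorf("node %q not found in VMSS %q", nodeName, vmssName)
	}
	return vm, nil
}

func main() {
	c := &vmssVMCache{
		ttl:     10 * time.Minute,
		byVMSS:  map[string]map[string]vmEntry{},
		refresh: map[string]time.Time{},
		listVMs: func(vmssName string) (map[string]vmEntry, error) {
			// Stand-in for an Azure VMSSVM.List call scoped to vmssName.
			return map[string]vmEntry{vmssName + "000000": {instanceID: "0"}}, nil
		},
	}
	vm, err := c.get("aks-nodepool1-", "aks-nodepool1-000000")
	fmt.Println(vm, err)
}
```

Under throttling, a failed list for one scale set only delays that scale set's refresh; the other per-VMSS caches keep serving their previous contents.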

Which issue(s) this PR fixes:

Fixes #93106

Special notes for your reviewer:

We assume VMSS nodes are named from the VMSS' computerNamePrefix + instance ID (or vmssName + instance ID when computerNamePrefix isn't specified), as described at https://docs.microsoft.com/en-us/azure/virtual-machine-scale-sets/virtual-machine-scale-sets-instance-ids and https://docs.microsoft.com/en-us/azure/templates/microsoft.compute/2018-10-01/virtualmachinescalesets#virtualmachinescalesetosprofile-object. Are there special cases not covered by those docs? If so, we can probably complement that optimistic lookup (trying the VMSS whose name prefix matches first, the happy path) with a fallback scan over the remaining scale sets. Even that non-optimal fallback path would still be an improvement, since in case of throttling we keep partial results (the per-VMSS caches) and refreshes stay incremental.
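
For illustration, here is a minimal Go sketch of that optimistic prefix-based lookup, under the naming assumption above. The names (scaleSet, vmssForNode) are invented for the example and are not the PR's actual code:

```go
package main

import (
	"fmt"
	"strings"
)

// scaleSet stands in for the VMSS metadata the cloud provider already caches.
type scaleSet struct {
	name               string
	computerNamePrefix string // may be empty; the VMSS name is used instead
}

// vmssForNode tries the happy path first: match the node name against each
// scale set's computerNamePrefix (or its name when no prefix is set).
// A fallback scan over the remaining scale sets could complement this if a
// node's name falls outside the documented prefix+instanceID convention.
func vmssForNode(nodeName string, scaleSets []scaleSet) (string, bool) {
	for _, ss := range scaleSets {
		prefix := ss.computerNamePrefix
		if prefix == "" {
			prefix = ss.name
		}
		if strings.HasPrefix(nodeName, prefix) {
			return ss.name, true
		}
	}
	return "", false
}

func main() {
	sets := []scaleSet{
		{name: "aks-nodepool1-vmss", computerNamePrefix: "aks-nodepool1-"},
		{name: "aks-nodepool2-vmss", computerNamePrefix: "aks-nodepool2-"},
	}
	fmt.Println(vmssForNode("aks-nodepool2-000003", sets)) // aks-nodepool2-vmss true
}
```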

Does this PR introduce a user-facing change?:

Azure: per-VMSS VMSS VMs cache to prevent throttling on clusters with many attached VMSS

/assign @andyzhangx @feiskyer
/sig cloud-provider
/area provider/azure

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. labels Jul 15, 2020
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. area/provider/azure Issues or PRs related to azure provider cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jul 15, 2020
@k8s-ci-robot
Contributor

Hi @bpineau. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@andyzhangx
Member

/ok-to-test
/priority important-soon
/sig cloud-provider
/area provider/azure
thanks for the contribution @bpineau

@k8s-ci-robot k8s-ci-robot added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Jul 15, 2020
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 18, 2020
@bpineau bpineau force-pushed the azure-per-vmss-vmssvm-incremental-cache branch from 212d0ca to c795853 Compare July 19, 2020 08:23
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 19, 2020
@bpineau bpineau force-pushed the azure-per-vmss-vmssvm-incremental-cache branch from c795853 to ceef4c4 Compare July 19, 2020 14:30
@nilo19
Member

nilo19 commented Jul 20, 2020

1/37 tests failed.

staticcheck (6m57s) reported:
vendor/k8s.io/legacy-cloud-providers/azure/azure_vmss_cache_test.go:118:2: this value of err is never used (SA4006)

Please review the above warnings. You can test via: hack/verify-staticcheck.sh <failing package>
If the above warnings do not make sense, you can exempt the line or file. See: https://staticcheck.io/docs/#ignoring-problems

could you please fix the error and re-run the tests?
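
For context, SA4006 flags an error value that is assigned but never read before being overwritten or going out of scope. A minimal illustrative Go sketch of the pattern and its usual fix (not the actual code from azure_vmss_cache_test.go):

```go
package main

import "fmt"

func firstCall() (int, error)  { return 1, nil }
func secondCall() (int, error) { return 2, nil }

func main() {
	// SA4006 fires when an assigned err is never used, e.g.:
	//
	//	v, err := firstCall()
	//	v, err = secondCall() // previous err was never read
	//
	// The fix is to check (or deliberately discard) err after each assignment.
	v, err := firstCall()
	if err != nil {
		fmt.Println("firstCall failed:", err)
		return
	}
	v, err = secondCall()
	if err != nil {
		fmt.Println("secondCall failed:", err)
		return
	}
	fmt.Println(v)
}
```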

@bpineau bpineau force-pushed the azure-per-vmss-vmssvm-incremental-cache branch 2 times, most recently from 3abf8fb to 61812bf Compare July 20, 2020 11:23
return node, nil
}

if len(nodeName) < 6 {
Member


could we use getScaleSetVMInstanceID() here to check whether a node is VMSS instance or not?
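
For reference, a sketch of what such a check could look like, assuming a helper that decodes the trailing 6 base-36 characters of a VMSS computer name into an instance ID; the implementation below is illustrative, not the provider's exact code:

```go
package main

import (
	"fmt"
	"strconv"
)

// getScaleSetVMInstanceID extracts the instance ID from a VMSS node name by
// decoding its trailing 6 base-36 characters; an error means the name does
// not follow the VMSS naming convention (likely not a VMSS instance).
func getScaleSetVMInstanceID(machineName string) (string, error) {
	if len(machineName) < 6 {
		return "", fmt.Errorf("not a VMSS instance name: %q", machineName)
	}
	id, err := strconv.ParseUint(machineName[len(machineName)-6:], 36, 64)
	if err != nil {
		return "", fmt.Errorf("not a VMSS instance name: %q", machineName)
	}
	return strconv.FormatUint(id, 10), nil
}

func main() {
	for _, name := range []string{"aks-nodepool1-000003", "master-0"} {
		id, err := getScaleSetVMInstanceID(name)
		fmt.Println(name, id, err)
	}
}
```

Using the error from such a helper as the "is this a VMSS instance" signal would avoid the bare len(nodeName) < 6 length check.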

Member

@feiskyer feiskyer left a comment


Thanks for fixing the issue. Adding it to the v1.19 milestone since it fixes a serious bug for scenarios with a large number of VMSS.
/milestone v1.19

@k8s-ci-robot k8s-ci-robot added this to the v1.19 milestone Jul 20, 2020
@feiskyer
Member

/retest

bpineau added 2 commits July 20, 2020 18:35
Azure's cloud provider VMSS VMs API accesses are mediated through
a cache holding and refreshing all VMSS together.

Due to that we hit VMSSVM.List API more often than we could: an
instance's cache miss or expiration should only require a single
VMSS re-list, while it's currently O(n) relative to the number of
attached Scale Sets.

Under hard pressure (clusters with many attached VMSS that can't all
be listed in one sequence of successive API calls) the controller
manager might be stuck trying to re-list everything from scratch,
then aborting the whole operation; then re-trying and re-triggering
API rate-limits, affecting the whole Subscription.

This patch replaces the global VMSS VMs cache by per-VMSS VMs caches.
Refreshes (VMSS VMs lists) are scoped to the single relevant VMSS; under
severe throttling the various caches can be incrementally refreshed.

Signed-off-by: Benjamin Pineau <[email protected]>
@bpineau bpineau force-pushed the azure-per-vmss-vmssvm-incremental-cache branch from 61812bf to fcb3f1f Compare July 20, 2020 16:37
@feiskyer
Member

/retest

1 similar comment
@nilo19
Member

nilo19 commented Jul 21, 2020

/retest

Member

@feiskyer feiskyer left a comment


/lgtm
/approve
/retest

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 21, 2020
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bpineau, feiskyer

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 21, 2020
@fejta-bot

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

1 similar comment
@fejta-bot

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

@k8s-ci-robot
Contributor

@bpineau: The following test failed, say /retest to rerun all failed tests:

Test name: pull-kubernetes-conformance-kind-ga-only-parallel
Commit: fcb3f1f
Details: link
Rerun command: /test pull-kubernetes-conformance-kind-ga-only-parallel

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@fejta-bot

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.
