Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

use getvmss api for spot instances in azure #6470

Merged

Conversation

gandhipr
Copy link
Contributor

@gandhipr gandhipr commented Jan 25, 2024

What type of PR is this?

/kind feature

What this PR does / why we need it:

This PR adds support for using getVMSS api for instances in order to collect most recent count in the spot nodepool

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?

getVMSS api will be called when using spot instances for having better updated information

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jan 25, 2024
@k8s-ci-robot k8s-ci-robot added area/provider/azure Issues or PRs related to azure provider area/cluster-autoscaler approved Indicates a PR has been approved by an approver from all required OWNERS files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jan 25, 2024
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 17, 2024
@gandhipr gandhipr force-pushed the prachigandhi/azure-getvmss-spot branch from e6c51a3 to 6ca161e Compare March 19, 2024 16:48
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 19, 2024
Copy link
Contributor

@tallaxes tallaxes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some comments. Please also update the PR name and description, and add a short summary of user-facing change

Comment on lines +257 to +264
if getVmssSizeRefreshPeriod := os.Getenv("AZURE_GET_VMSS_SIZE_REFRESH_PERIOD"); getVmssSizeRefreshPeriod != "" {
cfg.GetVmssSizeRefreshPeriod, err = strconv.Atoi(getVmssSizeRefreshPeriod)
if err != nil {
return nil, fmt.Errorf("failed to parse AZURE_GET_VMSS_SIZE_REFRESH_PERIOD %q: %v", getVmssSizeRefreshPeriod, err)
}
} else {
cfg.GetVmssSizeRefreshPeriod = VmssSizeRefreshPeriodDefault
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This logic uses either environment variable or default, never the value from cloud config file. If we don't support setting this from cloud config file - need to remove from docs. If we need to support it - should update the logic. (It looks like the same applies to the EnableDynamicInstanceList above?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. Looks like we're doing the same for others as well like - vmssVmsCacheTTL, vmssVmsCacheTTL, etc

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, those lack the "else" part, will preserve config file setting in the absence of env override

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume the cloud config file settings will already have been primed into the cfg object? If so this should work

...
} else if cfg.GetVmssSizeRefreshPeriod != 0 {
         cfg.GetVmssSizeRefreshPeriod = VmssSizeRefreshPeriodDefault
}

Is that what you're thinking @tallaxes?

Copy link
Contributor

@comtalyst comtalyst Jul 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From my understanding, line 127 will result in it being taken from cloud config file already.

GetVmssSizeRefreshPeriod int `json:"getVmssSizeRefreshPeriod" yaml:"getVmssSizeRefreshPeriod"`

Line 155-162 should take care of the parsing:

if configReader != nil {
	body, err := ioutil.ReadAll(configReader)
	if err != nil {
		return nil, fmt.Errorf("failed to read config: %v", err)
	}
	err = json.Unmarshal(body, cfg)
	if err != nil {
		return nil, fmt.Errorf("failed to unmarshal config body: %v", err)
	}
}

Then, on line 163 and below, the environment variables will be used only when cloud config file does not exist.

else {
	cfg.Cloud = os.Getenv("ARM_CLOUD")
	...
}

The problem I am seeing is when cloud config file exists, but does not contain getVmssSizeRefreshPeriod, and we set AZURE_GET_VMSS_SIZE_REFRESH_PERIOD, then this AZURE_GET_VMSS_SIZE_REFRESH_PERIOD will never be used and this configuration will goes to default.

My proposal: consider each variable individually rather than entire file at once when determining whether to use an environment variable. But not in this PR.

Copy link
Contributor

@rakechill rakechill Jul 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To your second point:

My proposal: consider each variable individually rather than entire file at once when determining whether to use an environment variable. But not in this PR.

Do you mean:

  1. read from config file fully and populate cfg
  2. then for each variable, check to see if it is already set in cfg
  3. if it's not, check to see if the corresponding env var is set + set in cfg if it is
  4. finally, if no cfg or env var is set, set it to default

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the assumption here is that people may set values with a mix of cloud config file and env vars?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's not true, right?

The logic is:

if env var is set AND not empty string, then convert string to int and set cfg field
else: use default

It is true (right?, you may help me check again). All these environment variable parsing, from line 163 to 314, is encased in a giant else statement. It will reach this part only if configReader == nil, which, if we track it down, indicates that the user never provided the path to the config file.

if configReader != nil {
	body, err := ioutil.ReadAll(configReader)
	if err != nil {
		return nil, fmt.Errorf("failed to read config: %v", err)
	}
	err = json.Unmarshal(body, cfg)
	if err != nil {
		return nil, fmt.Errorf("failed to unmarshal config body: %v", err)
	}
}
Then, on line 163 and below, the environment variables will be used only when cloud config file does not exist.

else {
	cfg.Cloud = os.Getenv("ARM_CLOUD")
	...
	// What we are looking at
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean:

  1. read from config file fully and populate cfg
  2. then for each variable, check to see if it is already set in cfg
  3. if it's not, check to see if the corresponding env var is set + set in cfg if it is
  4. finally, if no cfg or env var is set, set it to default

That's one way to go with it---with config file > env > default.
Another way is env > config file > default.

I actually prefer the latter more, given env being more delicate(?) than the config file, as well as having a use case I see in AKS.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now, I'm going to leave the logic as is.

However, want to explicitly call out that we are not handling a key case for most of these variables:

  • If config is not nil, but doesn't set a field
  • the env var for that field will never be picked up
  • the default for that field (if present) will never be picked up

@comtalyst is aiming to address this scenario in his PR to defork cloud-provider-azure

It's a rather large PR so we may determine that this logic is better to be merged separately. Either way, I think it's fine to keep this field using the same structure as the others and address this in a follow-up PR.

cluster-autoscaler/cloudprovider/azure/azure_scale_set.go Outdated Show resolved Hide resolved
cluster-autoscaler/cloudprovider/azure/azure_scale_set.go Outdated Show resolved Hide resolved
cluster-autoscaler/cloudprovider/azure/README.md Outdated Show resolved Hide resolved
@jackfrancis
Copy link
Contributor

/retest

@gandhipr gandhipr changed the title Prachigandhi/azure getvmss spot use getvmss api for spot instances in azure Apr 17, 2024
@gandhipr gandhipr force-pushed the prachigandhi/azure-getvmss-spot branch from 12fb2e9 to eb99a37 Compare April 17, 2024 19:06
@x13n
Copy link
Member

x13n commented May 7, 2024

Assigning since looks like this is being reviewed by you already.

/assign @tallaxes @rakechill

@k8s-ci-robot
Copy link
Contributor

@x13n: GitHub didn't allow me to assign the following users: rakechill.

Note that only kubernetes members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide

In response to this:

Assigning since looks like this is being reviewed by you already.

/assign @tallaxes @rakechill

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@jackfrancis
Copy link
Contributor

/test pull-cluster-autoscaler-e2e-azure

@jackfrancis
Copy link
Contributor

/retest

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 14, 2024
@k8s-ci-robot
Copy link
Contributor

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@rakechill
Copy link
Contributor

@x13n can you assign this to me?

@x13n
Copy link
Member

x13n commented Jun 27, 2024

Looks like I can, done!

@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 3, 2024
@@ -128,6 +131,9 @@ type Config struct {
// Jitter in seconds subtracted from the VMSS cache TTL before the first refresh
VmssVmsCacheJitter int `json:"vmssVmsCacheJitter" yaml:"vmssVmsCacheJitter"`

// GetVmssSizeRefreshPeriod (seconds) defines how frequently to call GET VMSS API to fetch VMSS info per nodegroup instance
GetVmssSizeRefreshPeriod int `json:"getVmssSizeRefreshPeriod" yaml:"getVmssSizeRefreshPeriod"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this qualifies as a nit, but maybe rename all of these properties that start w/ "get" so that they don't confuse readers that they are getter funcs. I think "vmssGet...|VmssGet..." instead would work without regressing the semantics for human readability?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think not quite applicable for this case. GetVMSS is the actual function we're calling. switching the order around almost makes it sound (to me) like we're calling a function that gets the size rather than the VMSS.

Copy link
Contributor

@jackfrancis jackfrancis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 19, 2024
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: comtalyst, gandhipr, jackfrancis, rakechill

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit c8e4721 into kubernetes:master Jul 19, 2024
7 checks passed
@@ -128,6 +131,9 @@ type Config struct {
// Jitter in seconds subtracted from the VMSS cache TTL before the first refresh
VmssVmsCacheJitter int `json:"vmssVmsCacheJitter" yaml:"vmssVmsCacheJitter"`

// GetVmssSizeRefreshPeriod (seconds) defines how frequently to call GET VMSS API to fetch VMSS info per nodegroup instance
GetVmssSizeRefreshPeriod int `json:"getVmssSizeRefreshPeriod" yaml:"getVmssSizeRefreshPeriod"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like I missed this, but GetVmssSizeRefreshPeriod being an int introduces an inconsistency with the fork, where it is time.Duration. This time it is not too bad, but let us all be careful next time. I could see cases like this getting worse if there are more dependencies.

@rakechill
Copy link
Contributor

/cherry-pick cluster-autoscaler-release-1.27

@rakechill rakechill deleted the prachigandhi/azure-getvmss-spot branch July 26, 2024 17:15
@k8s-infra-cherrypick-robot

@rakechill: only kubernetes org members may request cherry picks. If you are already part of the org, make sure to change your membership to public. Otherwise you can still do the cherry-pick manually.

In response to this:

/cherry-pick cluster-autoscaler-release-1.27

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@willie-yao
Copy link
Contributor

/cherry-pick cluster-autoscaler-release-1.27

@k8s-infra-cherrypick-robot

@willie-yao: #6470 failed to apply on top of branch "cluster-autoscaler-release-1.27":

Applying: introduce getVmssSizeRefreshPeriod env var
Applying: logic for getvmss call for spot, uniform vmss
Using index info to reconstruct a base tree...
M	cluster-autoscaler/cloudprovider/azure/azure_config.go
M	cluster-autoscaler/cloudprovider/azure/azure_manager_test.go
M	cluster-autoscaler/cloudprovider/azure/azure_scale_set.go
M	cluster-autoscaler/cloudprovider/azure/azure_scale_set_test.go
Falling back to patching base and 3-way merge...
Auto-merging cluster-autoscaler/cloudprovider/azure/azure_scale_set_test.go
Auto-merging cluster-autoscaler/cloudprovider/azure/azure_scale_set.go
Auto-merging cluster-autoscaler/cloudprovider/azure/azure_manager_test.go
CONFLICT (content): Merge conflict in cluster-autoscaler/cloudprovider/azure/azure_manager_test.go
Auto-merging cluster-autoscaler/cloudprovider/azure/azure_config.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0002 logic for getvmss call for spot, uniform vmss
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

In response to this:

/cherry-pick cluster-autoscaler-release-1.27

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/cluster-autoscaler area/provider/azure Issues or PRs related to azure provider cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants