Add ProvisioningRequest injector #6529

Merged 3 commits on Feb 28, 2024

@yaroslava-serdiuk yaroslava-serdiuk commented Feb 15, 2024

What type of PR is this?

/kind feature

What this PR does / why we need it:

Injects pods from ProvisioningRequests into the list of unschedulable pods, so that Cluster Autoscaler (CA) attempts a scale-up for them.

Part of the implementation of https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/proposals/provisioning-request.md
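
The injection step described above can be sketched in Go roughly as follows. This is a minimal illustration, not the code from this PR: `Pod`, `PodSet`, `ProvisioningRequest`, `podsFor`, and `injectPods` are hypothetical, simplified stand-ins for the real Kubernetes and Cluster Autoscaler types, and the pod naming scheme is an assumption made only for the example.

```go
package main

import "fmt"

// Pod is a hypothetical, simplified stand-in for a Kubernetes pod.
type Pod struct {
	Name string
}

// PodSet is a hypothetical group of identical pods requested together.
type PodSet struct {
	Template string // base name used for the generated pods (assumption)
	Count    int    // number of pods to generate
}

// ProvisioningRequest is a hypothetical, simplified request object.
type ProvisioningRequest struct {
	Name    string
	PodSets []PodSet
}

// podsFor expands every pod set of the request into individual pods.
func podsFor(pr ProvisioningRequest) []Pod {
	var pods []Pod
	for _, ps := range pr.PodSets {
		for i := 0; i < ps.Count; i++ {
			name := fmt.Sprintf("%s-%s-%d", pr.Name, ps.Template, i)
			pods = append(pods, Pod{Name: name})
		}
	}
	return pods
}

// injectPods returns a new slice holding the existing unschedulable pods
// followed by the pods generated from the request. The input slice is
// left unmodified.
func injectPods(unschedulable []Pod, pr ProvisioningRequest) []Pod {
	out := make([]Pod, 0, len(unschedulable))
	out = append(out, unschedulable...)
	return append(out, podsFor(pr)...)
}
```

Once the generated pods sit in the same list as ordinary pending pods, the existing scale-up path can consider them without knowing where they came from, which is the motivation stated above.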

Does this PR introduce a user-facing change?

Adds support for the Check-capacity ProvisioningClass. To enable the feature, set --enable-provisioning-requests=true. If the cluster has capacity for a ProvisioningRequest of the Check-capacity ProvisioningClass, the Provisioned=True condition is set and the capacity is reserved for 10 minutes, so that it is not handed out to other ProvisioningRequests.
Note: ClusterAutoscaler does not prevent other pods from being scheduled on the reserved capacity, and ClusterAutoscaler can still scale down reserved capacity.

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Feb 15, 2024
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Feb 15, 2024
@kisieland
Contributor

/lgtm

@k8s-ci-robot
Contributor

@kisieland: changing LGTM is restricted to collaborators

In response to this:

/lgtm

@yaroslava-serdiuk
Contributor Author

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 22, 2024
@yaroslava-serdiuk
Contributor Author

/unhold
/label tide/merge-method-squash

@k8s-ci-robot k8s-ci-robot added tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. and removed do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. labels Feb 23, 2024
@jayantjain93
Contributor

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 23, 2024
Contributor

@BigDarkClown BigDarkClown left a comment

Minor comments, otherwise lgtm

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 28, 2024
Contributor

@BigDarkClown BigDarkClown left a comment

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 28, 2024
@BigDarkClown
Contributor

/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: BigDarkClown, jayantjain93, yaroslava-serdiuk

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 28, 2024
@k8s-ci-robot k8s-ci-robot merged commit dffff4f into kubernetes:master Feb 28, 2024
6 checks passed
Edwinhr716 pushed a commit to Edwinhr716/autoscaler that referenced this pull request Mar 13, 2024
* Add ProvisioningRequests injector

* Add test case for Accepted conditions and add supported provreq classes list

* Use Passive clock
@yaroslava-serdiuk
Contributor Author

#6814

@yaroslava-serdiuk yaroslava-serdiuk deleted the provreq-reconcile branch May 28, 2024 23:34
aaronfern pushed a commit to gardener/autoscaler that referenced this pull request Jul 25, 2024
* Comment to explain why test is done on STS ownerRef

* add informer argument to clusterapi provider builder

This change adds the informer factory as an argument to the
`buildCloudProvider` function for clusterapi so that building with tags
will work properly.

* Add informer argument to the CloudProviders builder.

* clusterapi: add missing error check

* Add instanceType/region support in Helm chart for Hetzner cloud provider

* doc: cluster-autoscaler: Oracle provider: Add small security note

* doc: cluster-autoscaler: Oracle provider: Add small security note

* doc: cluster-autoscaler: Oracle provider: Add small security note

* Update charts/cluster-autoscaler/README.md

* Update Auto Labels of Subprojects

* check empty ProviderID in ali NodeGroupForNode

* add gce constructor with custom timeout

* update README.md.gotmpl and added Helm docs for Hetzner Cloud

* bump chart version

* use older helm-docs version and remove empty line in values comment

* add missing line breaks

* Update charts/cluster-autoscaler/Chart.yaml

Co-authored-by: Shubham <[email protected]>

* Reduce log spam in AtomicResizeFilteringProcessor

Also, introduce default per-node logging quotas. For now, identical to
the per-pod ones.

* Bump golang in /vertical-pod-autoscaler/pkg/updater

Bumps golang from 1.21.6 to 1.22.0.

---
updated-dependencies:
- dependency-name: golang
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <[email protected]>

* Bump golang in /vertical-pod-autoscaler/pkg/recommender

Bumps golang from 1.21.6 to 1.22.0.

---
updated-dependencies:
- dependency-name: golang
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <[email protected]>

* Bump golang in /vertical-pod-autoscaler/pkg/admission-controller

Bumps golang from 1.21.6 to 1.22.0.

---
updated-dependencies:
- dependency-name: golang
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <[email protected]>

* Update Chart.yaml

* Move estimatorBuilder from AutoscalingContext to Orchestrator Init

* VPA: bump golang.org/x/net to fix CVE-2023-39325

The version of golang.org/x/net currently used is vulnerable to
https://avd.aquasec.com/nvd/2023/cve-2023-39325/, bump it to fix that.

* Bump go version.

* Fix e2e test setup

* helm: enable clusterapi namespace autodiscovery

* Fix expectedToRegister to respect instances with nil status

* add option to keep node group backoff on OutOfResource error

* remove changes to backoff interface

* attach errors to scale-up request and add comments

* revert optionally keeping node group backoff

* remove RemoveBackoff from updateScaleRequests

* Add ProvisioningRequestProcessor (kubernetes#6488)

* Add kube-env to MigInfoProvider

* CA: GCE: add pricing for new Z3 machines

* Introduce LocalSSDSizeProvider interface for GCE

* Use KubeEnv in gce/templates.go

* Add templateName to kube-env to ensure that correct value is cached

* Add unit-tests

* extract create group to function

* Merged PR 1379: added retry for creatingAzureManager in case of throttled requests

added retry for forceRefresh in case of throttled requests
ran tests
MallocNanoZone=0 go test -race k8s.io/autoscaler/cluster-autoscaler/cloudprovider/azure -- passed

and commented out unit test -- commented out as it takes 10 minutes to complete

func TestCreateAzureManagerWithRetryError(t *testing.T) {
	ctrl := gomock.NewController(t)
	defer ctrl.Finish()
	mockVMClient := mockvmclient.NewMockInterface(ctrl)
	mockVMSSClient := mockvmssclient.NewMockInterface(ctrl)
	mockVMSSClient.EXPECT().List(gomock.Any(), "fakeId").Return([]compute.VirtualMachineScaleSet{}, retry.NewError(true, errors.New("test"))).AnyTimes()
	mockAzClient := &azClient{
		virtualMachinesClient:         mockVMClient,
		virtualMachineScaleSetsClient: mockVMSSClient,
	}
	manager, err := createAzureManagerInternal(strings.NewReader(validAzureCfg), cloudprovider.NodeGroupDiscoveryOptions{}, config.AutoscalingOptions{}, mockAzClient)
	assert.Nil(t, manager)
	assert.NotNil(t, err)
}

* docs: update outdated/deprecated taints in the examples

Refactor references to taints & tolerations, replacing master key
with control-plane across all the example YAMLs.

Signed-off-by: Feruzjon Muyassarov <[email protected]>

* CA FAQ: clarify the point about scheduling constraints blocking scale-down

* Add warning about vendor removal to Makefile build target

Signed-off-by: Feruzjon Muyassarov <[email protected]>

* fix: add missing ephemeral-storage resource definition

* Add BuildTestNodeWithAllocatable test utility method.

* Add ProvisioningRequest injector (kubernetes#6529)

* Add ProvisioningRequests injector

* Add test case for Accepted conditions and add supported provreq classes list

* Use Passive clock

* Consider preemption policy for expandable pods

* Fix a bug where atomic scale-down failure could affect subsequent atomic scale-downs

* Update gce_price_info.go

* Migrate from satori/go.uuid to google/uuid

* Delay force refresh by DefaultInterval when OCI GetNodePool call returns 404

* CA: update dependencies to k8s v1.30.0-alpha.3, go1.21.8

* Bump golang in /vertical-pod-autoscaler/pkg/admission-controller

Bumps golang from 1.22.0 to 1.22.1.

---
updated-dependencies:
- dependency-name: golang
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <[email protected]>

* Bump golang in /vertical-pod-autoscaler/pkg/updater

Bumps golang from 1.22.0 to 1.22.1.

---
updated-dependencies:
- dependency-name: golang
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <[email protected]>

* Bump golang in /vertical-pod-autoscaler/pkg/recommender

Bumps golang from 1.22.0 to 1.22.1.

---
updated-dependencies:
- dependency-name: golang
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <[email protected]>

* Update expander options for the AWS cloud provider README

* Remove shadow err variable in deleteCreatedNodesWithErros func

* fix memory leak in NodeDeleteTracker

* CA - Add 1.29 to version compatibility matrix

* ClusterAutoscaler: Put APIs in a separate go module

Signed-off-by: Yuki Iwai <[email protected]>

* Extend update-deps.sh so that we can automatically update k8s libraries in the apis pkg

Signed-off-by: Yuki Iwai <[email protected]>

* Clean up update-deps.sh

Signed-off-by: Yuki Iwai <[email protected]>

* Update apis version to v1.29.2

Signed-off-by: Yuki Iwai <[email protected]>

* Allow to override rancher provider settings

Currently it is only possible to set provider settings over yaml file.

This commit introduces env variables to override URL, token and cluster name.

If particular environment variable is set it overrides value supplied in yaml file.

Signed-off-by: Dinar Valeev <[email protected]>
Co-authored-by: Donovan Muller <[email protected]>

* Bump VPA version to 1.1.0

* Deprecate the Linode Cluster Autoscaler provider

Signed-off-by: Ondrej Kokes <[email protected]>

* add price info for n4

* update n4 price info format

* Set "pd-balanced" as DefaultBootDiskType

It is a default since v1.24
Ref: https://cloud.google.com/kubernetes-engine/docs/how-to/custom-boot-disks#specify

* Clarify VPA and HPA limitations

Signed-off-by: Luke Addison <[email protected]>

* Update ionos-cloud-sdk-go and mocks

* Update provider code

* Add cloud API request metrics.
* Fix and update README

* Ignore ionos-cloud-sdk-go spelling

* fix n4 price format

* Add listManagedInstancesResults to GceCache.

* [clusterapi] Do not skip nodegroups with minSize=maxSize

* [clusterapi] Update tests for nodegroups with minSize=maxSize

* add tests

* made changes to support MIGs that use regional instance templates

* modified current unit tests to support the new modifications

* added comment to InstanceTemplateNameType

* Ran hack/go-fmtupdate.h on mig_info_provider_test.go

* Use KubeEnv in gce/templates.go

* Add templateName to kube-env to ensure that correct value is cached

* rebased and resolved conflicts

* added fix for unit tests

* changed InstanceTemplateNameType to InstanceTemplateName

* separated url parser to its own function, created unit test for the function

* separated url parser to its own function, created unit test for the function

* added unit test with regional MIG

* Migrate GCE client to server side operation wait

* Track type of node group created/deleted in auto-provisioned group metrics.

* trigger tests

* fix comment

* Add AtomicScaleUp method to NodeGroup interface

* Add an option to Cluster Autoscaler that allows triggering new loops
more frequently: based on new unschedulable pods and every time a
previous iteration was productive.

* Refactor StartDeletion usage patterns and enforce periodic scaledown status processor calls.

* Bump golang to 1.22

* updated admission-controller to have adjustable --min-tls-version and --tls-ciphers

* CA: Move the ProvisioningRequest CRD to apis module

Signed-off-by: Yuki Iwai <[email protected]>

* Bump default VPA version to 1.1.0

As part of the 1.1.0 release: kubernetes#6388

* Format README

* Add chart versions

* Add script to update required chart versions in README

* Add chart version column in version matrix

* Move cluster-autoscaler update-chart-version-readme script to /hack

* Only check recent revisions when updating README

* Update min cluster-autoscaler chart for Kubernetes 1.29

* Remove unused NodeInfoProcessor

* Fix broken link in README.md to point to equinixmetal readme

* review comments - simplify retry logic

* CA: Before we perform go test, synchronizing go vendor

Signed-off-by: Yuki Iwai <[email protected]>

* Cleanup ProvReq wrapper

* Make the Estimate func accept pods grouped.

The grouping should be made by the schedulability equivalence
meaning we can introduce optimizations to the binpacking.

Introduce a benchmark that estimates capacity needed for 51k pods,
which can be grouped to two equivalence groups 50k and 1k.

* Update CAPI docs

Add a link to the sample manifest and update the image used in the
example.

Signed-off-by: Lennart Jern <[email protected]>

* Bump golang in /vertical-pod-autoscaler/pkg/updater

Bumps golang from 1.22.1 to 1.22.2.

---
updated-dependencies:
- dependency-name: golang
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <[email protected]>

* Bump golang in /vertical-pod-autoscaler/pkg/admission-controller

Bumps golang from 1.22.1 to 1.22.2.

---
updated-dependencies:
- dependency-name: golang
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <[email protected]>

* Bump golang in /vertical-pod-autoscaler/pkg/recommender

Bumps golang from 1.22.1 to 1.22.2.

---
updated-dependencies:
- dependency-name: golang
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <[email protected]>

* Introduce binpacking optimization for similar pods.

The optimization uses the fact that equivalent pods do not
need to be checked multiple times against already filled nodes.
This changes the time complexity from O(pods*nodes) to O(pods).

* CA: Fix apis vendoring

* Add g6 EC2 instance type for AWS

* Copyright boilerplate

* Lower errors verbosity for kube-env label missing

* parentController may be nil when owner isn't scalable

* Update ProvisioningClass API Group

* Fix Autoscaling for worker nodes with invalid ProviderID

This change fixes a bug that arises when the user's cluster includes worker nodes not from Hetzner Cloud, such as a Hetzner Dedicated server or any server resource other than Hetzner. It also corrects the behavior when a server has been physically deleted from Hetzner Cloud.

Signed-off-by: Maksim Paskal <[email protected]>

* Add tests for Pods owner that doesn't implement /scale

* Add provreqOrchestrator that handle ProvReq classes (kubernetes#6627)

* Add provreqOrchestrator that handle ProvReq classes

* Review remarks

* Review remarks

* Cluster Autoscaler: Sync k8s.io dependencies to k/k v1.30.0, bump Go to 1.22.2

* [v1.30] fix(hetzner): hostname label is not considered

The Node Group info we currently return does not include the
`kubernetes.io/hostname` label, which is usually set on every node.

This causes issues when the user has an unscheduled pod with a
`topologySpreadConstraint` on `topologyKey: kubernetes.io/hostname`.
cluster-autoscaler is unable to fulfill this constraint and does not
scale up any of the node groups.

Related to kubernetes#6715

* Remove the flag for enabling ProvisioningRequests

The API is not stable yet, we don't want people to depend on the
current version.

* fix: scale up broken for providers not implementing NodeGroup.GetOptions()

Properly handle calls to `NodeGroup.GetOptions()` that return
`cloudprovider.ErrNotImplemented` in the scale up path.

* Add --enable-provisioning-requests flag

* [cluster-autoscaler-release-1.30] Fix ProvisioningRequest update (kubernetes#6825)

* Fix ProvisioningRequest update

* Review remarks

---------

Co-authored-by: Yaroslava Serdiuk <[email protected]>

* Update k/k vendor to 1.30.1 for CA 1.30

* sync changes

* added sync changes file

* golint fix

* update vpa vendor

* fixed volcengine

* ran gofmt

* synched azure

* synched azure

* synched IT

* removed IT log file

* addressed review comments

---------

Signed-off-by: dependabot[bot] <[email protected]>
Signed-off-by: Feruzjon Muyassarov <[email protected]>
Signed-off-by: Yuki Iwai <[email protected]>
Signed-off-by: Dinar Valeev <[email protected]>
Signed-off-by: Ondrej Kokes <[email protected]>
Signed-off-by: Luke Addison <[email protected]>
Signed-off-by: Lennart Jern <[email protected]>
Signed-off-by: Maksim Paskal <[email protected]>
Co-authored-by: Kubernetes Prow Robot <[email protected]>
Co-authored-by: David Benque <[email protected]>
Co-authored-by: michael mccune <[email protected]>
Co-authored-by: shubham82 <[email protected]>
Co-authored-by: Markus Lehtonen <[email protected]>
Co-authored-by: Niklas Rosenstein <[email protected]>
Co-authored-by: Ky-Anh Huynh <[email protected]>
Co-authored-by: Niklas Rosenstein <[email protected]>
Co-authored-by: Guy Templeton <[email protected]>
Co-authored-by: daimaxiaxie <[email protected]>
Co-authored-by: daimaxiaxie <[email protected]>
Co-authored-by: Michal Pitr <[email protected]>
Co-authored-by: Daniel Kłobuszewski <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Artur Żyliński <[email protected]>
Co-authored-by: Alvaro Aleman <[email protected]>
Co-authored-by: Marco Voelz <[email protected]>
Co-authored-by: Jack Francis <[email protected]>
Co-authored-by: Yarin Miran <[email protected]>
Co-authored-by: Will Bowers <[email protected]>
Co-authored-by: Yaroslava Serdiuk <[email protected]>
Co-authored-by: Bartłomiej Wróblewski <[email protected]>
Co-authored-by: Anish Shah <[email protected]>
Co-authored-by: Mahmoud Atwa <[email protected]>
Co-authored-by: pawel siwek <[email protected]>
Co-authored-by: Miranda Craghead <[email protected]>
Co-authored-by: Feruzjon Muyassarov <[email protected]>
Co-authored-by: Kuba Tużnik <[email protected]>
Co-authored-by: Johnnie Ho <[email protected]>
Co-authored-by: Walid Ghallab <[email protected]>
Co-authored-by: Karol Wychowaniec <[email protected]>
Co-authored-by: oksanabaza <[email protected]>
Co-authored-by: Vijay Bhargav Eshappa <[email protected]>
Co-authored-by: David <[email protected]>
Co-authored-by: Damika Gamlath <[email protected]>
Co-authored-by: Ashish Pani <[email protected]>
Co-authored-by: Yuki Iwai <[email protected]>
Co-authored-by: Dinar Valeev <[email protected]>
Co-authored-by: Donovan Muller <[email protected]>
Co-authored-by: Luiz Antonio <[email protected]>
Co-authored-by: Ondrej Kokes <[email protected]>
Co-authored-by: Yuan <[email protected]>
Co-authored-by: Luke Addison <[email protected]>
Co-authored-by: Mario Valderrama <[email protected]>
Co-authored-by: Max Fedotov <[email protected]>
Co-authored-by: Daniel-Redeploy <[email protected]>
Co-authored-by: Edwinhr716 <[email protected]>
Co-authored-by: Maksym Fuhol <[email protected]>
Co-authored-by: Allen Mun <[email protected]>
Co-authored-by: mewa <[email protected]>
Co-authored-by: Aayush Rangwala <[email protected]>
Co-authored-by: prachigandhi <[email protected]>
Co-authored-by: Daniel Gutowski <[email protected]>
Co-authored-by: Lennart Jern <[email protected]>
Co-authored-by: mendelski <[email protected]>
Co-authored-by: ceuity <[email protected]>
Co-authored-by: Maksim Paskal <[email protected]>
Co-authored-by: Julian Tölle <[email protected]>
Co-authored-by: k8s-infra-cherrypick-robot <[email protected]>
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files.
area/cluster-autoscaler
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA.
kind/feature Categorizes issue or PR as related to a new feature.
lgtm "Looks good to me", indicates that a PR is ready to be merged.
size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges.
5 participants