
fix: Fix scheduling benchmarking to not bail due to limits or same UUID #1930

Merged

Conversation

jonathan-innis
Member

@jonathan-innis jonathan-innis commented Jan 25, 2025

Fixes #N/A

Description

This change fixes the scheduling benchmarking so that it accurately exercises the real scheduling function. It modifies the NodePools and pods that are fed into the scheduler so that nodes are provisioned for all of the pods and the scheduler doesn't bail early.

NOTE: We still have an issue where not all pods schedule when running these benchmarks, due to the randomness in the generated pods. We should resolve this so that every pod schedules and we get more accurate benchmarking results.

Before Change

=== RUN   TestSchedulingProfile
scheduled 10110 against 32 nodes in total in 89.520585ms with minValues included false 112934.91882341921 pods/sec
400 instances 10 pods    2 nodes  930.06µs per scheduling     93.006µs per pod
400 instances 100 pods   5 nodes  1.190305ms per scheduling   11.903µs per pod
400 instances 500 pods   5 nodes  7.598401ms per scheduling   15.196µs per pod
400 instances 1000 pods  5 nodes  3.146214ms per scheduling   3.146µs per pod
400 instances 1500 pods  5 nodes  2.547879ms per scheduling   1.698µs per pod
400 instances 2000 pods  5 nodes  4.580475ms per scheduling   2.29µs per pod
400 instances 5000 pods  5 nodes  13.870389ms per scheduling  2.774µs per pod
--- PASS: TestSchedulingProfile (31.24s)
PASS

After Change

=== RUN   TestSchedulingProfile
scheduled 10110 against 1682 nodes in total in 14.009408027s 721.6579016411855 pods/sec
400 instances 10 pods    1 nodes    7.265591ms per scheduling    726.559µs per pod
400 instances 100 pods   16 nodes   272.166927ms per scheduling  2.721669ms per pod
400 instances 500 pods   83 nodes   998.235323ms per scheduling  1.99647ms per pod
400 instances 1000 pods  166 nodes  1.732335042s per scheduling  1.732335ms per pod
400 instances 1500 pods  250 nodes  1.111092833s per scheduling  740.728µs per pod
400 instances 2000 pods  333 nodes  1.607259417s per scheduling  803.629µs per pod
400 instances 5000 pods  833 nodes  7.663446417s per scheduling  1.532689ms per pod
--- PASS: TestSchedulingProfile (31.04s)
PASS

Trying this change on v1.0.5 (pre-scheduling improvements)

=== RUN   TestSchedulingProfile
scheduled 10110 against 1682 nodes in total in 2m12.738220963s 76.16495028073416 pods/sec
400 instances 10 pods    1 nodes    15.160012ms per scheduling      1.516001ms per pod
400 instances 100 pods   16 nodes   2.702137138s per scheduling     27.021371ms per pod
400 instances 500 pods   83 nodes   1.691394437s per scheduling     3.382788ms per pod
400 instances 1000 pods  166 nodes  1.72736325s per scheduling      1.727363ms per pod
400 instances 1500 pods  250 nodes  4.13193325s per scheduling      2.754622ms per pod
400 instances 2000 pods  333 nodes  9.276838084s per scheduling     4.638419ms per pod
400 instances 5000 pods  833 nodes  1m53.193394792s per scheduling  22.638678ms per pod
--- PASS: TestSchedulingProfile (209.84s)
PASS

You can see the extremely long scheduling loops that appear when running this against 5000 pods.

How was this change tested?

go test -bench=.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@k8s-ci-robot
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 25, 2025
@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jan 25, 2025
@jonathan-innis jonathan-innis changed the title fix: Fix scheduling benchmarking to not bail due to limits fix: Fix scheduling benchmarking to not bail due to limits or same UUID Jan 25, 2025
@coveralls

Pull Request Test Coverage Report for Build 12960791270

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.04%) to 81.141%

Totals Coverage Status
Change from base Build 12954850842: 0.04%
Covered Lines: 9078
Relevant Lines: 11188

💛 - Coveralls

@jonathan-innis jonathan-innis marked this pull request as ready for review January 26, 2025 17:37
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 26, 2025
@rschalo
Contributor

rschalo commented Jan 27, 2025

Nice fix.
/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 27, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jonathan-innis, rschalo

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit 07e3d6e into kubernetes-sigs:main Jan 27, 2025
13 checks passed