
fix: Fix scheduling benchmarking to not bail due to limits or same UUID #1930

Merged

Conversation

jonathan-innis
Member

@jonathan-innis jonathan-innis commented Jan 25, 2025

Fixes #N/A

Description

This change fixes the scheduling benchmarking so that it accurately exercises the real scheduling function. It modifies the NodePools and pods that are fed into the scheduler so that nodes are provisioned for all of the pods and the scheduler doesn't bail early.

NOTE: We still have an issue where not all pods schedule when running these benchmarks, due to the randomness in the generated pods. We should resolve this so that every pod schedules and we get more accurate benchmarking results.

Before Change

=== RUN   TestSchedulingProfile
scheduled 10110 against 32 nodes in total in 89.520585ms with minValues included false 112934.91882341921 pods/sec
400 instances 10 pods    2 nodes  930.06µs per scheduling     93.006µs per pod
400 instances 100 pods   5 nodes  1.190305ms per scheduling   11.903µs per pod
400 instances 500 pods   5 nodes  7.598401ms per scheduling   15.196µs per pod
400 instances 1000 pods  5 nodes  3.146214ms per scheduling   3.146µs per pod
400 instances 1500 pods  5 nodes  2.547879ms per scheduling   1.698µs per pod
400 instances 2000 pods  5 nodes  4.580475ms per scheduling   2.29µs per pod
400 instances 5000 pods  5 nodes  13.870389ms per scheduling  2.774µs per pod
--- PASS: TestSchedulingProfile (31.24s)
PASS

After Change

=== RUN   TestSchedulingProfile
scheduled 10110 against 1682 nodes in total in 14.009408027s 721.6579016411855 pods/sec
400 instances 10 pods    1 nodes    7.265591ms per scheduling    726.559µs per pod
400 instances 100 pods   16 nodes   272.166927ms per scheduling  2.721669ms per pod
400 instances 500 pods   83 nodes   998.235323ms per scheduling  1.99647ms per pod
400 instances 1000 pods  166 nodes  1.732335042s per scheduling  1.732335ms per pod
400 instances 1500 pods  250 nodes  1.111092833s per scheduling  740.728µs per pod
400 instances 2000 pods  333 nodes  1.607259417s per scheduling  803.629µs per pod
400 instances 5000 pods  833 nodes  7.663446417s per scheduling  1.532689ms per pod
--- PASS: TestSchedulingProfile (31.04s)
PASS

Trying this change on v1.0.5 (pre-scheduling improvements)

=== RUN   TestSchedulingProfile
scheduled 10110 against 1682 nodes in total in 2m12.738220963s 76.16495028073416 pods/sec
400 instances 10 pods    1 nodes    15.160012ms per scheduling      1.516001ms per pod
400 instances 100 pods   16 nodes   2.702137138s per scheduling     27.021371ms per pod
400 instances 500 pods   83 nodes   1.691394437s per scheduling     3.382788ms per pod
400 instances 1000 pods  166 nodes  1.72736325s per scheduling      1.727363ms per pod
400 instances 1500 pods  250 nodes  4.13193325s per scheduling      2.754622ms per pod
400 instances 2000 pods  333 nodes  9.276838084s per scheduling     4.638419ms per pod
400 instances 5000 pods  833 nodes  1m53.193394792s per scheduling  22.638678ms per pod
--- PASS: TestSchedulingProfile (209.84s)
PASS

You can see the extremely long scheduling loops that appear when running this against 5000 pods.

How was this change tested?

go test -bench=.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@k8s-ci-robot
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 25, 2025
@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jan 25, 2025
@jonathan-innis jonathan-innis changed the title fix: Fix scheduling benchmarking to not bail due to limits fix: Fix scheduling benchmarking to not bail due to limits or same UUID Jan 25, 2025
@coveralls

Pull Request Test Coverage Report for Build 12960791270

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.04%) to 81.141%

Totals Coverage Status
Change from base Build 12954850842: 0.04%
Covered Lines: 9078
Relevant Lines: 11188

💛 - Coveralls

@jonathan-innis jonathan-innis marked this pull request as ready for review January 26, 2025 17:37
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 26, 2025
@rschalo
Contributor

rschalo commented Jan 27, 2025

Nice fix.
/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 27, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jonathan-innis, rschalo

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit 07e3d6e into kubernetes-sigs:main Jan 27, 2025
13 checks passed