Add advance settings to fine tune DRS imbalance calculation #8521

Merged: 12 commits, Feb 13, 2024

vishesh92
Member

@vishesh92 vishesh92 commented Jan 17, 2024

Description

Doc PR: apache/cloudstack-documentation#374

This PR addresses two issues:

  1. Allow DRS to work as expected in clusters with hosts of different capacities. We do this by calculating imbalance from the ratio of the free metric to the total instead of from the absolute free value. This is controlled by a new setting, drs.metric.use.ratio.
  2. Imbalance can be greater than 1, and when it is, the condensed algorithm performs no migrations. This happens when the distribution of the metric is skewed. We fix this by filtering out values whose ratio is less than drs.imbalance.condensed.skip.threshold when drs.metric.type is free, or greater than drs.imbalance.condensed.skip.threshold when drs.metric.type is used. This removes the skew before we check whether DRS is needed.
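
A minimal Python sketch of the two changes described above, for illustration only (the actual change is in Java, in ClusterDrsAlgorithm.java). It assumes that imbalance is the coefficient of variation (population standard deviation divided by mean) of the per-host metric, which this description does not state, and that the used ratio is used/total. The function names host_metric, imbalance and filter_for_condensed are hypothetical; 0.95 is the SkipThreshold value that appears in the logs later in this conversation.

```python
from statistics import mean, pstdev


def host_metric(free, total, metric_type="free", use_ratio=True):
    """Value a single host contributes to the imbalance calculation."""
    value = free if metric_type == "free" else total - free
    # With drs.metric.use.ratio enabled, divide by capacity so that hosts of
    # different sizes become comparable.
    return value / total if use_ratio else value


def imbalance(values):
    """Assumed definition: population standard deviation / mean."""
    if not values:
        return 0.0
    avg = mean(values)
    return pstdev(values) / avg if avg else 0.0


def filter_for_condensed(ratios, metric_type, skip_threshold):
    """Drop the values that skew the distribution, as item 2 describes."""
    if metric_type == "free":
        # Filter out ratios below the threshold.
        return [r for r in ratios if r >= skip_threshold]
    # Metric type "used": filter out ratios above the threshold.
    return [r for r in ratios if r <= skip_threshold]


# Two hosts of different capacity, both half full (memory in GB).
hosts = [(16, 32), (32, 64)]  # (free, total)
absolute = [host_metric(f, t, use_ratio=False) for f, t in hosts]
ratio = [host_metric(f, t, use_ratio=True) for f, t in hosts]
print(imbalance(absolute), imbalance(ratio))

# A skewed "used" distribution: three nearly empty hosts and one nearly full.
used_ratios = [0.05, 0.05, 0.05, 0.97]
kept = filter_for_condensed(used_ratios, "used", 0.95)
print(imbalance(used_ratios), imbalance(kept))
```

Under these assumptions, the two half-full hosts look imbalanced when absolute free memory is compared (about 0.33) and perfectly balanced when ratios are compared (0). For the skewed list, the imbalance is above 1 before filtering and 0 once the nearly full host is filtered out.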

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)
  • build/CI

Feature/Enhancement Scale or Bug Severity

Feature/Enhancement Scale

  • Major
  • Minor

Bug Severity

  • BLOCKER
  • Critical
  • Major
  • Minor
  • Trivial

Screenshots (if appropriate):

How Has This Been Tested?

How did you try to break this feature and the system with this change?

@vishesh92
Member Author

@blueorangutan package

@blueorangutan

@vishesh92 a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

codecov bot commented Jan 17, 2024

Codecov Report

Attention: 15 lines in your changes are missing coverage. Please review.

Comparison is base (b0ac787) 30.80% compared to head (b917ecf) 30.94%.
Report is 4 commits behind head on 4.19.

Files                                                   Patch %   Lines
...apache/cloudstack/cluster/ClusterDrsAlgorithm.java   86.44%    3 Missing and 5 partials ⚠️
.../apache/cloudstack/metrics/MetricsServiceImpl.java   76.92%    2 Missing and 1 partial ⚠️
...he/cloudstack/response/ClusterMetricsResponse.java   50.00%    1 Missing and 1 partial ⚠️
ui/src/config/section/infra/clusters.js                 0.00%     1 Missing ⚠️
ui/src/views/infra/ClusterDRSTab.vue                    0.00%     1 Missing ⚠️

Additional details and impacted files
@@             Coverage Diff              @@
##               4.19    #8521      +/-   ##
============================================
+ Coverage     30.80%   30.94%   +0.14%     
- Complexity    34043    34220     +177     
============================================
  Files          5346     5346              
  Lines        375421   375496      +75     
  Branches      54598    54607       +9     
============================================
+ Hits         115647   116215     +568     
+ Misses       244489   243993     -496     
- Partials      15285    15288       +3     
Flag                     Coverage Δ
simulator-marvin-tests   24.83% <77.61%> (+0.19%) ⬆️
uitests                  4.39% <0.00%> (-0.01%) ⬇️
unit-tests               16.55% <80.59%> (+0.01%) ⬆️

@blueorangutan

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 8342

@vishesh92
Member Author

@blueorangutan package

@blueorangutan

@vishesh92 a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 8349

@vishesh92
Member Author

@blueorangutan test

@blueorangutan

@vishesh92 a [SL] Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@vishesh92 changed the title from "Refactoring to allow DRS in a cluster with hosts of different capacities" to "Add advance settings to fine tune DRS imbalance calculation" on Jan 18, 2024

@vishesh92 marked this pull request as ready for review on January 19, 2024 09:13

@kiranchavala
Contributor

@blueorangutan package

@blueorangutan

@kiranchavala a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 8402

@vishesh92
Member Author

@blueorangutan package

@kiranchavala
Contributor

@blueorangutan package

@blueorangutan

@kiranchavala a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@kiranchavala
Contributor

@blueorangutan package

@blueorangutan

@kiranchavala a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 8429

Contributor

@DaanHoogland DaanHoogland left a comment

clgtm

@vishesh92
Member Author

@blueorangutan package

@blueorangutan

@vishesh92 a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 8558

@vishesh92
Member Author

@blueorangutan package

@blueorangutan

@vishesh92 a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 8567

@kiranchavala
Contributor

@blueorangutan package

@blueorangutan

@kiranchavala a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 8569

@kiranchavala
Contributor

@blueorangutan ui

@blueorangutan

@kiranchavala a Jenkins job has been kicked to build UI QA env. I'll keep you posted as I make progress.

@blueorangutan

UI build: ✔️
Live QA URL: https://qa.cloudstack.cloud/simulator/pr/8521 (QA-JID-274)

@kiranchavala
Contributor

@blueorangutan package

@blueorangutan

@kiranchavala a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 8584

@kiranchavala
Contributor

@blueorangutan test

@blueorangutan

@kiranchavala a [SL] Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@blueorangutan

[SF] Trillian test result (tid-9128)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 53316 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr8521-t9128-kvm-centos7.zip
Smoke tests completed. 127 look OK, 2 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test                                         Result   Time (s)  Test File
test_01_secure_vm_migration                  Error    275.94    test_vm_life_cycle.py
test_02_unsecure_vm_migration                Error    226.46    test_vm_life_cycle.py
test_03_secured_to_nonsecured_vm_migration   Error    155.55    test_vm_life_cycle.py
test_04_nonsecured_to_secured_vm_migration   Error    155.48    test_vm_life_cycle.py
test_02_redundant_VPC_default_routes         Failure  395.79    test_vpc_redundant.py
test_05_rvpc_multi_tiers                     Failure  463.75    test_vpc_redundant.py
test_05_rvpc_multi_tiers                     Error    463.77    test_vpc_redundant.py

Contributor

@kiranchavala kiranchavala left a comment

Manually tested the new advanced DRS settings and the UI improvements:

drs.metric.type
drs.metric.use.ratio
drs.imbalance.condensed.skip.threshold

Scenarios

drs.imbalance = 0.9 and drs.algorithm = condensed >> the VMs were migrated onto a single host

drs.imbalance = 0.4 and drs.algorithm = balanced >> the VMs were distributed across the hosts

Screenshots

[Screenshots: resize, drsvalues1, drsvalues2]

Logs

Balanced

2024-02-08 05:35:54,311 DEBUG [o.a.c.c.Balanced] (qtp1491755116-22:ctx-ecfb388b ctx-9739cec5) (logid:8253e2f5) Cluster 2 needs DRS. Imbalance: 0.7158188976374373 Threshold: 0.5 Algorithm: balanced DRS metric: memory Metric Type: used Use ratio: false

2024-02-08 05:35:54,405 DEBUG [o.a.c.c.Balanced] (qtp1491755116-22:ctx-ecfb388b ctx-9739cec5) (logid:8253e2f5) Cluster 2 pre-imbalance: 0.7158188976374373 post-imbalance: 0.7820295697311479 Algorithm: balanced VM: f659a6f7-2915-4a18-9217-11cf8a03dce1 srcHost: 4 destHost: a16c16d2-a2bb-4dbf-93fb-838568f3a2fd

2024-02-08 05:35:54,413 DEBUG [o.a.c.c.Balanced] (qtp1491755116-22:ctx-ecfb388b ctx-9739cec5) (logid:8253e2f5) Cluster 2 pre-imbalance: 0.7158188976374373 post-imbalance: 0.34015067152490375 Algorithm: balanced VM: f659a6f7-2915-4a18-9217-11cf8a03dce1 srcHost: 4 destHost: b1f62d2b-fee5-4fab-b3c1-40ab3ad68a45

2024-02-08 05:35:54,413 DEBUG [o.a.c.c.ClusterDrsServiceImpl] (qtp1491755116-22:ctx-ecfb388b ctx-9739cec5) (logid:8253e2f5) Plan for VM e1096698-c0f3-4c75-8c9a-fc3526c938dd to migrate from host 722009bc-e65f-465c-a55f-1e26b70b93ae to host b1f62d2b-fee5-4fab-b3c1-40ab3ad68a45

2024-02-08 05:35:54,417 DEBUG [o.a.c.c.Balanced] (qtp1491755116-22:ctx-ecfb388b ctx-9739cec5) (logid:8253e2f5) Cluster 2 does not need DRS. Imbalance: 0.34015067152490375 Threshold: 0.5 Algorithm: balanced DRS metric: memory Metric Type: used Use ratio: false

Condensed

2024-02-08 05:49:16,865 DEBUG [o.a.c.c.Condensed] (qtp1491755116-22:ctx-dbbaeaf5 ctx-975fb795) (logid:d6ccccf6) Cluster 2 needs DRS. Imbalance: 0.34015067152490375 Threshold: 0.8999999761581421 Algorithm: condensed DRS metric: memory Metric Type: used Use ratio: false SkipThreshold: 0.95

2024-02-08 05:49:17,055 DEBUG [o.a.c.c.Condensed] (qtp1491755116-22:ctx-dbbaeaf5 ctx-975fb795) (logid:d6ccccf6) Cluster 2 pre-imbalance: 0.7820295697311479 post-imbalance: 1.0523488093445659 Algorithm: condensed VM: f659a6f7-2915-4a18-9217-11cf8a03dce1 srcHost: 4 destHost: a16c16d2-a2bb-4dbf-93fb-838568f3a2fd

2024-02-08 05:49:17,060 DEBUG [o.a.c.c.Condensed] (qtp1491755116-22:ctx-dbbaeaf5 ctx-975fb795) (logid:d6ccccf6) Cluster 2 pre-imbalance: 0.7820295697311479 post-imbalance: 0.6428243465332251 Algorithm: condensed VM: f659a6f7-2915-4a18-9217-11cf8a03dce1 srcHost: 4 destHost: b1f62d2b-fee5-4fab-b3c1-40ab3ad68a45

2024-02-08 05:49:16,965 DEBUG [o.a.c.c.ClusterDrsServiceImpl] (qtp1491755116-22:ctx-dbbaeaf5 ctx-975fb795) (logid:d6ccccf6) Plan for VM e1096698-c0f3-4c75-8c9a-fc3526c938dd to migrate from host b1f62d2b-fee5-4fab-b3c1-40ab3ad68a45 to host a16c16d2-a2bb-4dbf-93fb-838568f3a2fd

2024-02-08 05:49:17,060 DEBUG [o.a.c.c.ClusterDrsServiceImpl] (qtp1491755116-22:ctx-dbbaeaf5 ctx-975fb795) (logid:d6ccccf6) Plan for VM 0381cd82-f9dd-417d-836f-99dd300720e7 to migrate from host 722009bc-e65f-465c-a55f-1e26b70b93ae to host a16c16d2-a2bb-4dbf-93fb-838568f3a2fd

2024-02-08 05:49:17,064 DEBUG [o.a.c.c.Condensed] (qtp1491755116-22:ctx-dbbaeaf5 ctx-975fb795) (logid:d6ccccf6) Cluster 2 does not need DRS. Imbalance: 1.0523488093445659 Threshold: 0.8999999761581421 Algorithm: condensed DRS metric: memory Metric Type: used Use ratio: false SkipThreshold: 0.95
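
The "needs DRS" and "does not need DRS" lines above suggest how the threshold is compared in each mode: the balanced runs trigger when the imbalance is above the threshold, and the condensed runs trigger when it is below. A small Python sketch of that reading follows. It is inferred only from these four log lines, not from the code; the function name needs_drs is hypothetical, and the behaviour at exact equality is an assumption.

```python
def needs_drs(algorithm, imbalance, threshold):
    """Decision rule as it appears from the log lines (an inference)."""
    if algorithm == "balanced":
        # Balanced spreads load, so it acts when imbalance is too high.
        return imbalance > threshold
    if algorithm == "condensed":
        # Condensed packs VMs together, so it acts when imbalance is too low.
        return imbalance < threshold
    raise ValueError(f"unknown algorithm: {algorithm}")


# (algorithm, imbalance, threshold) copied from the log lines.
log_cases = [
    ("balanced", 0.7158188976374373, 0.5),
    ("balanced", 0.34015067152490375, 0.5),
    ("condensed", 0.34015067152490375, 0.8999999761581421),
    ("condensed", 1.0523488093445659, 0.8999999761581421),
]
results = [needs_drs(*case) for case in log_cases]
print(results)
```

The four results match the log messages in order: needs DRS, does not need DRS, needs DRS, does not need DRS.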

@rohityadavcloud
Member

LGTM, thanks for the fixes and testing @vishesh92 @kiranchavala

@rohityadavcloud merged commit 1955d8f into apache:4.19 on Feb 13, 2024 (26 checks passed)
@rohityadavcloud deleted the drs-fixes branch on February 13, 2024 05:48

dhslove pushed a commit to ablecloud-team/ablestack-cloud that referenced this pull request on Feb 20, 2024:

* Use free/total instead of free metric to calculate imbalance

* Filter out hosts for condensed while checking imbalance

* Make DRS more configurable

* code refactor

* Add unit tests

* fixup

* Fix validation for drs.imbalance.condensed.skip.threshold

* Add logging and other minor changes for drs

* Add some logging for drs

* Change format for drs imbalance to string

* Show drs imbalance as percentage

* Fixup label for memorytotal in en.json

6 participants