DRS failed for cluster if have another cluster in zone #8629

Closed
yashi4engg opened this issue Feb 8, 2024 · 11 comments · Fixed by #9245

Comments

@yashi4engg

ISSUE TYPE
  • Bug Report
COMPONENT NAME

CLOUDSTACK VERSION

CONFIGURATION
OS / ENVIRONMENT
SUMMARY

We have zone1 with two clusters, e.g. cluster01 and cluster02.
cluster01 has 3 hypervisors with the same h/w model, and cluster02 has one hypervisor with a different h/w model.
I enabled the DRS settings in the global settings, then disabled DRS for cluster02 in its cluster settings ("drs.automatic.enable" = false) while keeping it enabled for cluster01.

In the above scenario, the DRS plan failed with the logs below:

2024-02-08 10:26:33,532 DEBUG [c.c.s.ManagementServerImpl] (VMSchedulerPollTask:ctx-0bba1590) (logid:f0f7966b) Hosts having capacity and suitable for migration: [Host {"id":25,"name":"node-cluster01","type":"Routing","uuid":"5d145861-e4ad-4f94-a805-266711321d59"}, Host {"id":40,"name":"node-cluster02","type":"Routing","uuid":"66b47d2a-b047-452c-b5d3-65c160666b50"}, Host {"id":48,"name":"node-cluster01","type":"Routing","uuid":"da1be373-cd08-4ae1-948f-7592daabb3fc"}]
2024-02-08 10:26:33,535 ERROR [o.a.c.c.ClusterDrsServiceImpl] (VMSchedulerPollTask:ctx-0bba1590) (logid:f0f7966b) Unable to generate DRS plans for cluster Cluster-Z01 [id=5366c5fb-0ed0-4caf-b2c7-93ebea15a717]

If I disable the host in cluster02, DRS works as expected and migrates VMs based on load. But when both clusters and all their nodes are enabled, it fails to generate a plan.

Below are the cluster-level settings from the DB, for reference.
| 621 | 1 | drs.automatic.enable | true |
| 622 | 1 | drs.automatic.interval | 10 |
| 623 | 1 | drs.imbalance | 0.4 |
| 624 | 16 | drs.automatic.enable | false |
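
For readers following the settings rows above, here is a minimal sketch of how a cluster-scoped value can take precedence over a global one. It assumes the second column of each row is the cluster ID and that a cluster-level row overrides the global value; `GLOBAL_SETTINGS`, `CLUSTER_OVERRIDES` and `effective_setting` are hypothetical names for illustration, not CloudStack APIs.

```python
from typing import Optional

# Global value as described in the report (DRS enabled globally).
GLOBAL_SETTINGS = {"drs.automatic.enable": "true"}

# Cluster-level rows from the DB excerpt, keyed by the second column,
# which is ASSUMED here to be the cluster ID.
CLUSTER_OVERRIDES = {
    1: {
        "drs.automatic.enable": "true",
        "drs.automatic.interval": "10",
        "drs.imbalance": "0.4",
    },
    16: {"drs.automatic.enable": "false"},
}


def effective_setting(cluster_id: int, name: str) -> Optional[str]:
    """Return the cluster-level value if one exists, else the global value."""
    override = CLUSTER_OVERRIDES.get(cluster_id, {})
    if name in override:
        return override[name]
    return GLOBAL_SETTINGS.get(name)


if __name__ == "__main__":
    for cid in (1, 16):
        print(cid, effective_setting(cid, "drs.automatic.enable"))
```

Under these assumptions, automatic DRS would resolve to enabled for the cluster with ID 1 and disabled for the cluster with ID 16, which matches the configuration described in the report.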

I even tried keeping it disabled in the global settings and enabling it for just one cluster in the cluster settings.

Our use case is multiple clusters in the same zone with different types of h/w.

STEPS TO REPRODUCE
EXPECTED RESULTS

@rohityadavcloud
Member

cc @vishesh92

@vishesh92 vishesh92 self-assigned this Feb 8, 2024
@vishesh92
Member

@yashi4engg I am not able to reproduce the issue. DRS runs independently for each cluster and this shouldn't be happening.

Can you share the complete logs including the error as well as more details about your clusters?

@yashi4engg
Author

yashi4engg commented Feb 8, 2024

@vishesh92 -- While we try to reproduce the above error, below is our configuration:

Primary storage - We are using zone-wide primary storage for both clusters.

Cluster01 - has 3 hypervisors and is able to run DRS if I disable the hosts in cluster02.

Cluster02 - has 1 hypervisor, which is different hardware from the hosts in cluster01. DRS is disabled in its cluster settings.

@vishesh92
Member

@yashi4engg The error log you have shared seems to be truncated. Can you share the complete error log?
Also, a CloudStack cluster supports only one hypervisor type. It's possible the error is happening because of your current setup. The complete error logs might help in identifying the issue.

@yashi4engg
Author

We were able to fix the above errors, so this is good to close.

DRS is now working as expected for us. The problem was due to the zone-level storage pool.

@yashi4engg
Author

@vishesh92 -- It works fine where the primary storage is NFS, but it is not working where we use an OCFS2 file system as the primary storage pool in the cluster. Even when generating a DRS plan manually I see no error: it shows success, but no plan is generated.

Below are the logs:
[root@test01 ~]# cat /var/log/cloudstack/management/management-server.log |grep -i edcf30ae
2024-04-16 10:43:57,568 DEBUG [o.a.c.c.ClusterDrsServiceImpl] (VMSchedulerPollTask:ctx-cf16f762) (logid:edcf30ae) ClusterDRS.poll is being called at 2024-04-16 14:44:00 GMT
2024-04-16 10:43:57,579 DEBUG [o.a.c.c.ClusterDrsServiceImpl] (VMSchedulerPollTask:ctx-cf16f762) (logid:edcf30ae) Removed 0 old drs migration plans

Whereas the VM counts in my cluster are as follows (I also set drs.imbalance to 1.0):
node1 - 150 VMs
node2 - 150 VMs
node3 - 1 VM

@vishesh92
Member

@yashi4engg Are you able to manually migrate a VM to another host in your cluster?

@vishesh92 vishesh92 moved this from Todo to Discuss in Apache CloudStack BugFest - Issues Jun 13, 2024
@vishesh92
Member

vishesh92 commented Jun 13, 2024

@yashi4engg It's good to know that you are interested in DRS. But please open a different issue or a discussion if the problem is different. It makes it easier for the community to manage the issues.

I have created PR #9245 to ensure that the destination host is part of the same cluster. This should fix the issue you previously described.
There were some changes in #8521 which will be released as part of the 4.19.1.0 release. I am not sure if this will fix your issue or not. You can try out the current 4.19 branch to check.
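
To illustrate the same-cluster check mentioned for #9245, here is a rough sketch, not the actual CloudStack code. The host IDs and names come from the "Hosts having capacity and suitable for migration" log line in the original report; the `cluster_id` values, the `Host` class and `same_cluster_hosts` are assumptions made up for this example (cluster IDs are guessed from the host names and the settings rows).

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Host:
    id: int
    name: str
    cluster_id: int  # ASSUMED mapping; the log line does not include it


def same_cluster_hosts(candidates: List[Host], cluster_id: int) -> List[Host]:
    """Keep only candidate destination hosts that belong to the given cluster."""
    return [h for h in candidates if h.cluster_id == cluster_id]


# Candidates as listed in the log line; host 40 sits in the other cluster.
candidates = [
    Host(25, "node-cluster01", 1),
    Host(40, "node-cluster02", 16),
    Host(48, "node-cluster01", 1),
]

eligible = same_cluster_hosts(candidates, cluster_id=1)
print([h.id for h in eligible])  # prints [25, 48]
```

With a filter like this, a host from cluster02 would no longer be considered as a migration destination when a plan is generated for cluster01.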

Can you please create another issue with the following details:

  • Which drs.algorithm are you using?
  • Are all the hosts in the cluster homogeneous? Details about the host & VMs CPU/memory deployed in the cluster.
  • Is the storage pool cluster wide or zone wide?

@DaanHoogland
Contributor

@yashi4engg do you have the possibility to test PR #9243?

@DaanHoogland DaanHoogland moved this from ready for Review to ready for Testing in Apache CloudStack BugFest - Issues Jun 18, 2024
@sureshanaparti sureshanaparti moved this from Todo to In Progress in Apache CloudStack 4.19.1 Jun 19, 2024
@sureshanaparti
Contributor

Closing this, fixed in #9245. @yashi4engg you can create another issue with any other DRS issues/improvements.

@github-project-automation github-project-automation bot moved this from In Progress to Done in Apache CloudStack 4.19.1 Jun 25, 2024
@github-project-automation github-project-automation bot moved this from ready for Testing to Done in Apache CloudStack BugFest - Issues Jun 25, 2024
@yashi4engg
Author

@DaanHoogland - Sorry, we don't have any VMware cluster imported/added in CloudStack as of now to test this.
