Crucible controller hits some limits when operating at scale #464
At first glance, it appears that the big difference between what is being run here and the scalability work that we previously did is the way that endpoints are being used. In the previous scalability work, the emphasis was on getting as many engines running as possible and stress testing the various things that relate to that (the https://github.com/perftool-incubator/roadblock code, individual roadblock invocations, i.e. timeouts, etc.). In that work we ran as many as 10,000 separate engines and as many as 40 endpoints using 10 bare-metal hosts. The fact that we were able to synchronize 10,000+ roadblock participants (the engines + endpoints + controller) gives us fairly high confidence that roadblock is in good shape for this degree of scaling, and maybe even much more. Since there are 400 separate VMs being used here and each one is a separate endpoint, it seems logical to initially focus on endpoint scaling as the likely problem. Each endpoint is going to be a separate process on the Crucible controller that is spawned by ...
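As a hypothetical way to check that theory, a diagnostic sketch along these lines could show how many endpoint-related processes exist on the controller and how many file descriptors each one holds; the "rickshaw-run" name comes from the reported error output, and everything else here is an assumption, not an established Crucible procedure:

    # Hypothetical diagnostic: count endpoint-related processes on the
    # controller and the file descriptors each one currently holds open.
    # Adjust the pgrep pattern for the actual endpoint process names.
    for pid in $(pgrep -f rickshaw-run); do
        fds=$(ls /proc/"$pid"/fd 2>/dev/null | wc -l)
        echo "pid=$pid open_fds=$fds"
    done

    # Compare against the per-process limit of one such process:
    grep 'open files' /proc/$(pgrep -f rickshaw-run | head -n1)/limits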
#469 is the first PR based on work to resolve this issue.
A major piece of the work to address this issue is taking place here: #483 |
We are using the remotehost endpoint in Crucible to benchmark a set of OpenStack VMs distributed across different provider networks. Initially we started with a small number of VMs and then kept increasing the VM count, with the VMs acting as servers and clients. Up to 80 VMs (40 servers and 40 clients) we did not see any concerns. After that we tried 400 VMs (200 servers and 200 clients). The test itself ran well, but during post-processing we hit errors like the ones below:
    [2024-02-09 07:06:29.184][STDOUT] sh: fork: retry: Resource temporarily unavailable
    [2024-02-09 07:06:29.187][STDOUT] sh: fork: retry: Resource temporarily unavailable
Since the Crucible services run as containers, there are PID limits associated with them. As a workaround for the above error, we increased the PID limits for the containers.
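For reference, a minimal sketch of that kind of workaround, assuming the services run under podman; the container name "crucible-controller" is a placeholder and the limit value is illustrative, not the exact configuration used here:

    # Inspect the PID limit currently configured for a running container:
    podman inspect --format '{{.HostConfig.PidsLimit}}' crucible-controller

    # Start a container with a higher PID limit:
    podman run --pids-limit 16384 ...

    # Or raise the default for every container via containers.conf:
    #   [containers]
    #   pids_limit = 16384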
Later, we tried 1400 VMs (700 servers and 700 clients) to understand the Crucible controller limits. We could not go forward because of a "too many open files" error on the Crucible controller:

    Deploying endpoints
    endpoint-deploy-timeout adjusted to 190440 seconds
    engine-script-timeout adjusted to 190440 seconds
    Can't exec "/bin/sh": Too many open files at /opt/crucible/subprojects/core/rickshaw/rickshaw-run line 2170.
    Can't exec "/bin/sh": Too many open files at /opt/crucible/subprojects/core/rickshaw/rickshaw-run line 2170.
After some time, roadblock timed out and the test failed. We have tried increasing the open-file limits on the Crucible controller, and still it seems that Crucible is opening too many files, or there is some limitation on the containers.
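A minimal sketch of the kind of open-file-limit changes involved here, assuming a podman-based setup; all values are illustrative rather than the exact ones we used:

    # Raise the soft open-file limit for the current shell:
    ulimit -n 1048576

    # Persist higher limits system-wide via /etc/security/limits.conf:
    #   *  soft  nofile  1048576
    #   *  hard  nofile  1048576

    # Containers carry their own nofile limit, settable at run time:
    podman run --ulimit nofile=1048576:1048576 ...

Note that raising the host limit alone may not help if the containers are started with a lower nofile limit of their own, which may explain why increasing the controller limits did not resolve the error.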