Load test with 100+ CR instances #55
Experiment Setup: On a GKE cluster with the following specs:

I deployed a stripped-down version of the `VirtualNetwork` managed resource, like the following:
```yaml
apiVersion: virtual.azure.tf.crossplane.io/v1alpha1
kind: VirtualNetwork
metadata:
  annotations:
    crossplane.io/external-name: /subscriptions/038f2b7c-3265-43b8-8624-c9ad5da610a8/resourceGroups/alper/providers/Microsoft.Network/virtualNetworks/test-40
    tf.crossplane.io/state: ...
  creationTimestamp: "2021-09-17T12:41:34Z"
  finalizers:
  - finalizer.managedresource.crossplane.io
  generation: 1
  name: test-40
  resourceVersion: "622141"
  uid: d6b7a148-7b1d-432d-b665-8550647e5c8f
spec:
  deletionPolicy: Delete
  forProvider:
    addressSpace:
    - 10.0.0.0/16
    dnsServers:
    - 10.0.0.1
    - 10.0.0.2
    - 10.0.0.3
    location: East US
    name: test-40
    resourceGroupName: alper
    tags:
      experiment: "2"
  providerConfigRef:
    name: example
status:
  atProvider:
    guid: 8b4aa7b0-8256-4ea2-b6b9-c6f86d6e2857
  conditions:
  - lastTransitionTime: "2021-09-17T13:00:48Z"
    reason: Available
    status: "True"
    type: Ready
  - lastTransitionTime: "2021-09-17T12:47:09Z"
    reason: ReconcileSuccess
    status: "True"
    type: Synced
```

I have done a set of experiments with .... During these experiments, I also discovered a bug in the ....
Please note that I have not repeated these experiments and these results do not represent averages; however, the p99 workqueue wait times (exceeding ~8 min) look consistent with these measurements. One notable issue is that, because we are currently only capable of using a single worker, we cannot utilize the CPU and memory resources available to terrajet-based providers.

I'm planning to perform some further experiments to see whether we can utilize CPU & memory resources more efficiently to decrease transition-to-Ready times, and to check some other metrics.
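As a point of reference for the single-worker limitation mentioned above, here is a minimal, hedged controller-runtime sketch of the knob involved. `MaxConcurrentReconciles` is the real option name, but the watched type and the wiring below are placeholders, not the provider's actual code:

```go
package controllers

import (
	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/controller"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// setupWithManager registers a reconciler with more than one worker. The
// watched type (ConfigMap) is only a placeholder; a real provider would
// watch its managed resource kind.
func setupWithManager(mgr ctrl.Manager, r reconcile.Reconciler) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&corev1.ConfigMap{}).
		WithOptions(controller.Options{
			// With the default of 1, long-running external calls serialize
			// behind a single queue worker; raising this is what would let
			// the provider use the otherwise idle CPU/memory headroom.
			MaxConcurrentReconciles: 10,
		}).
		Complete(r)
}
```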
I wonder how this compares to ....

We were concerned about the multiple gRPC servers eating up resources. From the graphs you shared, it doesn't look like that's the case, since the memory & CPU usage doesn't seem to increase with the number of CRs.
Also, I think we need similar experiments for multiple kinds. As discussed in standup, we need to know more about the behavior of terrajet with resources whose creation takes much longer than ....
I'm planning to do another set of experiments with higher concurrency using the new .... I suspect that, because of the many process forks done by the Terraform CLI, the Terraform-based providers might become CPU-bound as we increase concurrency. That's something to be explored. If this turns out to be the case, and if we are not satisfied with our level of scaling, we might want to give shared gRPC servers a try. Thank you for taking a look into this @muvaf, very much appreciated!
@ulucinar FYI, @negz let me know that we have a global rate limiter in most providers that limits the number of reconciles to 1 per second. I have a hunch that this rate limiter affects the queue more than the concurrency number does. Broadly though, I think this issue was opened to see whether we can work with 100+ instances with varying ....
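For context on what such a global rate limiter usually looks like, here is a small sketch built on client-go's `workqueue` package. The 1-reconcile-per-second figure mirrors the number mentioned above, but the exact limiter composition is an assumption, not the providers' verbatim code:

```go
package controllers

import (
	"time"

	"golang.org/x/time/rate"
	"k8s.io/client-go/util/workqueue"
)

// newGlobalRateLimiter composes per-item exponential backoff with a shared
// token bucket. The bucket is the "global" part: it caps the whole
// controller at roughly 1 reconcile/sec (with a small burst), no matter how
// many workers are configured.
func newGlobalRateLimiter() workqueue.RateLimiter {
	return workqueue.NewMaxOfRateLimiter(
		workqueue.NewItemExponentialFailureRateLimiter(1*time.Second, 60*time.Second),
		&workqueue.BucketRateLimiter{Limiter: rate.NewLimiter(rate.Limit(1), 10)},
	)
}
```

In controller-runtime a limiter like this is typically plugged in via `controller.Options.RateLimiter`, which is why it can cap throughput globally even when `MaxConcurrentReconciles` is raised.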
@ulucinar it'd be great if you could share your script and provide a baseline performance statement so that we can reproduce the tests, but that's not in the original scope of this issue, so feel free to close this issue after the ....
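Until the original script is shared, here is a minimal, hypothetical sketch of what such a load generator could look like (not the script used for the experiments above). It emits N stripped-down `VirtualNetwork` manifests, reusing the group/version, kind, and field values from the example MR earlier in this thread:

```go
// loadgen.go: emits N stripped-down VirtualNetwork manifests on stdout so
// they can be piped into `kubectl apply -f -`. Field values are copied from
// the example MR earlier in this thread.
package main

import (
	"flag"
	"fmt"
)

const manifest = `---
apiVersion: virtual.azure.tf.crossplane.io/v1alpha1
kind: VirtualNetwork
metadata:
  name: test-%d
spec:
  forProvider:
    addressSpace:
    - 10.0.0.0/16
    location: East US
    name: test-%d
    resourceGroupName: alper
  providerConfigRef:
    name: example
`

func main() {
	count := flag.Int("count", 100, "number of managed resources to emit")
	flag.Parse()
	for i := 1; i <= *count; i++ {
		fmt.Printf(manifest, i, i)
	}
}
```

For example, `go run loadgen.go -count 100 | kubectl apply -f -` to provision, and the same output piped to `kubectl delete -f -` to tear the MRs down again.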
Here are the results from another set of experiments involving our target of 100 MRs in this issue.

Experiment Setup: On a GKE cluster with the following specs:

Previous experiments have shown that Terraform-based providers are CPU-bound, and hence in this setup we are using nodes with higher CPU capacity in order to be able to scale up to 100 MRs. I deployed a stripped-down version of the `Lb` managed resource, like the following:
```yaml
apiVersion: lb.azure.tf.crossplane.io/v1alpha1
kind: Lb
metadata:
  annotations:
    crossplane.io/external-name: /subscriptions/038f2b7c-3265-43b8-8624-c9ad5da610a8/resourceGroups/alper/providers/Microsoft.Network/loadBalancers/test-1
    tf.crossplane.io/state: ...
  creationTimestamp: "2021-09-27T13:38:48Z"
  finalizers:
  - finalizer.managedresource.crossplane.io
  generation: 1
  name: test-1
  resourceVersion: "5862396"
  uid: 2b97bbb4-791b-4449-8326-ef1ee5f13eb7
spec:
  deletionPolicy: Delete
  forProvider:
    location: East US
    name: test-1
    resourceGroupName: alper
  providerConfigRef:
    name: example
status:
  atProvider: {}
  conditions:
  - lastTransitionTime: "2021-09-27T13:43:56Z"
    reason: Available
    status: "True"
    type: Ready
  - lastTransitionTime: "2021-09-27T13:38:49Z"
    reason: ReconcileSuccess
    status: "True"
    type: Synced
```

As also observed in our previous experiments, CPU utilization shows a sharp increase in parallel with the increasing number of CRs in the cluster during the provisioning phase, and again during the de-provisioning phase. After all of the 100 MRs transition to the Ready state, ....

First, CPU utilization climbs to just over 92% as we have many asynchronous concurrent Create calls; however, the limits we have imposed on concurrency prevent saturation. We also observe increasing workqueue wait times for both kinds of resources.

The following graph shows the time-to-readiness periods, i.e. the time it takes for an MR to become Ready. Please note that these are not averages. The maximum time-to-readiness interval for an MR in these experiments has been ....

Another interesting observation is that when all the MRs are deleted at 14:07:10 UTC (they acquired a non-zero ...), ....
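For clarity on how a per-MR time-to-readiness figure can be derived, here is a hedged sketch that takes the gap between an object's `creationTimestamp` and the `lastTransitionTime` of its `Ready=True` condition; this is an assumption about how such an interval could be computed, not the actual tooling behind these graphs:

```go
package metrics

import (
	"time"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

// timeToReady returns the interval between an MR's creationTimestamp and the
// lastTransitionTime of its Ready=True condition, or false if the resource
// has not (yet) reported such a condition.
func timeToReady(u *unstructured.Unstructured) (time.Duration, bool) {
	conditions, found, err := unstructured.NestedSlice(u.Object, "status", "conditions")
	if err != nil || !found {
		return 0, false
	}
	for _, c := range conditions {
		cond, ok := c.(map[string]interface{})
		if !ok || cond["type"] != "Ready" || cond["status"] != "True" {
			continue
		}
		ts, _ := cond["lastTransitionTime"].(string)
		transition, err := time.Parse(time.RFC3339, ts)
		if err != nil {
			return 0, false
		}
		return transition.Sub(u.GetCreationTimestamp().Time), true
	}
	return 0, false
}
```

Applied to the `Lb` instance shown above, this yields roughly five minutes (13:38:48Z to 13:43:56Z).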
Here are results from another set of experiments, this time with the native provider.

Experiment Setup: On a GKE cluster with the following specs:
I deployed .... We will clearly benefit from increased concurrency here. The time-to-readiness intervals are distributed as follows for these ...:

When all of the ....

We need to incorporate the recently proposed ....
We will continue scale testing of Terrajet-based providers with the latest improvements.
What problem are you facing?
Common controller works for our prototypes but we need to do more experiments to see its scalability. There are certain issues that need some data before taking action, like #38
How could Terrajet help solve your problem?
We can have a complex composition with 10+ resources and create 10 XRs using that. Then check the resource usage and errors across the board to see if we hit any limit. We can define some `limits` in the deployment of the controller and see at what point we start to see `context deadline exceeded` errors, meaning we can't complete a single reconciliation pass with those limits.