
Load test with 100+ CR instances #55

Closed
muvaf opened this issue Sep 9, 2021 · 9 comments
muvaf (Member) commented Sep 9, 2021

What problem are you facing?

The common controller works for our prototypes, but we need to run more experiments to see how it scales. There are certain issues, like #38, that need some data before we take action.

How could Terrajet help solve your problem?

We can build a complex composition with 10+ resources and create 10 XRs from it, then check resource usage and errors across the board to see whether we hit any limits. We can also define resource limits in the controller's deployment and see at what point we start getting context deadline exceeded errors, meaning a single reconciliation pass can't complete within those limits.
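
As a rough illustration of where those errors come from (this is not terrajet's actual code), here is a minimal Go sketch of a reconciliation pass bounded by a per-pass deadline; the durations are hypothetical and chosen only so the example runs quickly:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// reconcileOnce stands in for a single reconciliation pass that waits on a
// slow external operation (e.g. a Terraform CLI pipeline). If the pass cannot
// finish before the context's deadline, it surfaces ctx.Err().
func reconcileOnce(ctx context.Context, slowOp time.Duration) error {
	select {
	case <-time.After(slowOp):
		return nil // the pass finished within its budget
	case <-ctx.Done():
		return ctx.Err() // "context deadline exceeded"
	}
}

func main() {
	// Hypothetical per-reconcile budget; a starved controller whose external
	// calls take longer than this keeps reporting this error.
	ctx, cancel := context.WithTimeout(context.Background(), 1*time.Second)
	defer cancel()

	if err := reconcileOnce(ctx, 2*time.Second); err != nil {
		fmt.Println("reconcile failed:", err) // reconcile failed: context deadline exceeded
	}
}
```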

ulucinar (Collaborator) commented Sep 17, 2021

Experiment Setup:

On a GKE cluster with the following specs:

Machine family: General purpose e2-medium (2 vCPU, 4 GB memory)
3 workers
Control plane version: 1.20.9-gke.701

I deployed a stripped-down version of provider-tf-azure (with 22 MRs, including the VirtualNetwork resource), since we want to focus on the #-of-CRs scalability dimension; the #-of-CRDs dimension has already been investigated here. The VirtualNetwork MRs are provisioned via a simple shell script, with the MR name and infrastructure object name suffixes passed from the command line. An example invocation of the generator script and an example generated MR manifest look like the following:

$ ./manage-virtualnetworks.sh create $(seq 31 40)
apiVersion: virtual.azure.tf.crossplane.io/v1alpha1
kind: VirtualNetwork
metadata:
  annotations:
    crossplane.io/external-name: /subscriptions/038f2b7c-3265-43b8-8624-c9ad5da610a8/resourceGroups/alper/providers/Microsoft.Network/virtualNetworks/test-40
    tf.crossplane.io/state: ...
  creationTimestamp: "2021-09-17T12:41:34Z"
  finalizers:
  - finalizer.managedresource.crossplane.io
  generation: 1
  name: test-40
  resourceVersion: "622141"
  uid: d6b7a148-7b1d-432d-b665-8550647e5c8f
spec:
  deletionPolicy: Delete
  forProvider:
    addressSpace:
    - 10.0.0.0/16
    dnsServers:
    - 10.0.0.1
    - 10.0.0.2
    - 10.0.0.3
    location: East US
    name: test-40
    resourceGroupName: alper
    tags:
      experiment: "2"
  providerConfigRef:
    name: example
status:
  atProvider:
    guid: 8b4aa7b0-8256-4ea2-b6b9-c6f86d6e2857
  conditions:
  - lastTransitionTime: "2021-09-17T13:00:48Z"
    reason: Available
    status: "True"
    type: Ready
  - lastTransitionTime: "2021-09-17T12:47:09Z"
    reason: ReconcileSuccess
    status: "True"
    type: Synced

I have done a set of experiments with provider-tf-azure using the free Azure VirtualNetwork resource. After an initial batch of 10 VirtualNetworks was provisioned and successfully transitioned to the Ready state, another batch of 10 more VirtualNetwork MRs was provisioned. I observed that the MRs added in the latter batch were failing to transition to the Ready state, and the workqueue depth for the resource approached ~30:

[graph: queue-metrics]

The VirtualNetwork controller has a default worker count of 1 in this setup, and as can be observed in the Workqueue depth for VirtualNetworks graph, the workqueue of the VirtualNetwork controller quickly fills up with these newcomer MRs. At each reconciliation we run Terraform pipelines using the Terraform CLI, and each pipeline potentially forks multiple Terraform Azurerm provider plugins to communicate with Azure. Please note that we are not running the Terraform provider plugins as shared gRPC servers in this setup. As can be observed in the Reconciliation Times for VirtualNetworks graph, the reconciliation time of the VirtualNetwork controller, measured as the 99th percentile over the last 5m (a common SLI used for Kubernetes API latency SLOs), is above 20 s due to the Terraform pipelines we run in each reconciliation loop. The 5-minute average wait time that reconciliation requests spend in the workqueue also increases with the number of VirtualNetworks in the cluster, since we have a single worker routine processing the slow Terraform pipelines, including the synchronous Observation pipeline.

During these experiments, I also discovered a bug in the tfcli library: an already available Terraform pipeline result was not being consumed even when the pipeline had produced one. After this bug was fixed with #67, and without increasing the maximum concurrent reconciler count for the VirtualNetwork controller, all 40 VirtualNetwork MRs provisioned in the cluster successfully transitioned to the Ready state. I took rough measurements to give an idea of the time it took for the last batch of 10 resources to become Ready:

NAME      READY   SYNCED   EXTERNAL-NAME   AGE
test-31   True    True     /subscriptions/038f2b7c-3265-43b8-8624-c9ad5da610a8/resourceGroups/alper/providers/Microsoft.Network/virtualNetworks/test-31   15m
test-32   True    True     /subscriptions/038f2b7c-3265-43b8-8624-c9ad5da610a8/resourceGroups/alper/providers/Microsoft.Network/virtualNetworks/test-32   16m
test-33   True    True     /subscriptions/038f2b7c-3265-43b8-8624-c9ad5da610a8/resourceGroups/alper/providers/Microsoft.Network/virtualNetworks/test-33   16m
test-34   True    True     /subscriptions/038f2b7c-3265-43b8-8624-c9ad5da610a8/resourceGroups/alper/providers/Microsoft.Network/virtualNetworks/test-34   16m
test-35   True    True     /subscriptions/038f2b7c-3265-43b8-8624-c9ad5da610a8/resourceGroups/alper/providers/Microsoft.Network/virtualNetworks/test-35   17m
test-36   True    True     /subscriptions/038f2b7c-3265-43b8-8624-c9ad5da610a8/resourceGroups/alper/providers/Microsoft.Network/virtualNetworks/test-36   17m
test-37   True    True     /subscriptions/038f2b7c-3265-43b8-8624-c9ad5da610a8/resourceGroups/alper/providers/Microsoft.Network/virtualNetworks/test-37   17m
test-38   True    True     /subscriptions/038f2b7c-3265-43b8-8624-c9ad5da610a8/resourceGroups/alper/providers/Microsoft.Network/virtualNetworks/test-38   18m
test-39   True    True     /subscriptions/038f2b7c-3265-43b8-8624-c9ad5da610a8/resourceGroups/alper/providers/Microsoft.Network/virtualNetworks/test-39   18m
test-40   True    True     /subscriptions/038f2b7c-3265-43b8-8624-c9ad5da610a8/resourceGroups/alper/providers/Microsoft.Network/virtualNetworks/test-40   19m

Please note that I have not repeated these experiments and these results do not represent averages; however, the p99 workqueue wait times (exceeding ~8 min) look consistent with these measurements.

One notable issue is that, because we can only use a single worker, terrajet-based providers currently cannot utilize the CPU and memory resources available to them:
[graph: cpu-mem-metrics]
This can become an issue if a Terraform-based provider is used to provision multiple objects of the same kind. crossplane-contrib/provider-jet-azure#4 adds an option to increase the maximum concurrent reconcilers for a resource without modifying the default value, so that we will be able to utilize CPU & memory when needed.
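
For illustration, here is a minimal controller-runtime sketch (not the actual provider-jet-azure wiring) of what raising the maximum concurrent reconciles looks like; the reconciler type, the watched kind, and the value of 3 are placeholders:

```go
package main

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller"
)

// reconciler is a stand-in for a generated managed-resource controller; the
// real providers plug in their own MR types and Terraform-backed logic.
type reconciler struct {
	client client.Client
}

func (r *reconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// The slow, blocking work (e.g. a Terraform CLI pipeline) happens here.
	return ctrl.Result{}, nil
}

func main() {
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{})
	if err != nil {
		panic(err)
	}

	// MaxConcurrentReconciles is what a flag such as --concurrent-reconciles
	// would feed in. With the default of 1, a single in-flight Terraform
	// pipeline blocks every other item queued for this kind.
	if err := ctrl.NewControllerManagedBy(mgr).
		For(&corev1.ConfigMap{}). // placeholder kind; a provider would use its MR type
		WithOptions(controller.Options{MaxConcurrentReconciles: 3}).
		Complete(&reconciler{client: mgr.GetClient()}); err != nil {
		panic(err)
	}

	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		panic(err)
	}
}
```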

I'm planning to perform some further experiments to see whether we can utilize CPU & memory resources more efficiently to decrease transition-to-Ready times and to check some other metrics.

muvaf (Member, Author) commented Sep 17, 2021

As can be observed in the Reconciliation Times for VirtualNetworks graph, the reconciliation time of the VirtualNetwork controller, measured as the 99th percentile over the last 5m (a common SLI used for Kubernetes API latency SLOs), is above 20 s due to the Terraform pipelines we run in each reconciliation loop.

I wonder how this compares to provider-azure and to a plain terraform apply -refresh-only. I'd expect most of the 20 s to be taken by the Terraform operation itself.

At each reconciliation we run Terraform pipelines using the Terraform CLI, and each pipeline potentially forks multiple Terraform Azurerm provider plugins to communicate with Azure. Please note that we are not running the Terraform provider plugins as shared gRPC servers in this setup.

We were concerned about multiple gRPC servers eating up resources. From the graphs you shared, it doesn't look like that's the case, since memory & CPU usage don't seem to increase with the number of VirtualNetworks. Do you think we need the shared gRPC server?

muvaf (Member, Author) commented Sep 17, 2021

Also, I think we need similar experiments with multiple kinds. As discussed in standup, we need to know more about how terrajet behaves with resources whose creation takes much longer than a VirtualNetwork, like databases or Kubernetes clusters.

ulucinar (Collaborator) commented Sep 17, 2021

We were concerned about multiple gRPC servers eating up resources. From the graphs you shared, it doesn't look like that's the case, since memory & CPU usage don't seem to increase with the number of VirtualNetworks. Do you think we need the shared gRPC server?

I'm planning to do another set of experiments with higher concurrency using the new -c option. I expect we will utilize both CPU & memory better and decrease time-to-completion, i.e. the time to move a CR to the Ready state. Another thing to look into is introducing updates: the experiments described above did not involve asynchronous Update operations, which do not block the workers and should therefore yield shorter queue wait times and utilize more resources.

I suspect that, because of the many process forks done by the Terraform CLI, the Terraform-based providers might become CPU-bound as we increase concurrency. That's something to be explored. If this turns out to be the case, and if we are not satisfied with our level of scaling, we might want to give shared gRPC servers a try.

Thank you for taking a look into this @muvaf, very much appreciated!

muvaf (Member, Author) commented Sep 21, 2021

@ulucinar FYI, @negz let me know that we have a global rate limiter in most providers that limits the number of reconciles to 1 per second. I have a hunch that this rate limiter affects the queue more than the concurrency number does.
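
For reference, here is a minimal sketch of that kind of limiter using client-go's workqueue helpers; the exact limiter and values used by the providers may differ, and the 1 rps / burst 10 figures are illustrative:

```go
package main

import (
	"fmt"
	"time"

	"golang.org/x/time/rate"
	"k8s.io/client-go/util/workqueue"
)

// newGlobalRateLimiter combines per-item exponential backoff with a token
// bucket shared by all items, so the controller as a whole cannot exceed
// roughly one reconcile per second once the burst is spent.
func newGlobalRateLimiter() workqueue.RateLimiter {
	return workqueue.NewMaxOfRateLimiter(
		// Back off individual items that keep failing.
		workqueue.NewItemExponentialFailureRateLimiter(5*time.Millisecond, 1000*time.Second),
		// Global bucket: ~1 reconcile/sec with a burst of 10, no matter how
		// many MRs are waiting in the queue.
		&workqueue.BucketRateLimiter{Limiter: rate.NewLimiter(rate.Limit(1), 10)},
	)
}

func main() {
	rl := newGlobalRateLimiter()
	// The first items fit into the burst and are delayed only by the tiny
	// per-item backoff; after that, delays grow by roughly one second per
	// item, which is why a long queue translates into long waits.
	for i := 0; i < 15; i++ {
		fmt.Printf("item %2d delayed by %v\n", i, rl.When(i))
	}
}
```

A limiter like this can be handed to a controller via the RateLimiter field of controller-runtime's controller.Options, alongside MaxConcurrentReconciles.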

Broadly though, I think this issue was opened to see whether we can work with 100+ instances of varying kinds without crashing or exceeding the deadline to the point where the provider is useless. So the answer to that question should be enough to close the issue.

muvaf (Member, Author) commented Sep 22, 2021

@ulucinar it'd be great if you could share your script and provide a baseline performance statement so that we can reproduce the tests, but that's not in the original scope of this issue. Feel free to close it once we have the "100+ instances of varying kinds without crashing or exceeding the deadline" result, or some other performance statement that we can reuse repeatedly in the future to measure the effects of big changes.

@muvaf muvaf added alpha and removed post-alpha labels Sep 24, 2021
@luebken luebken added this to the Terrajet-Alpha milestone Sep 27, 2021
ulucinar (Collaborator) commented Sep 28, 2021

Here are the results from another set of experiments involving our target of 100 MRs in this issue:

Experiment Setup:

On a GKE cluster with the following specs:

Machine family: General purpose e2-standard-4 (4 vCPU, 16 GB memory)
3 workers
Control plane version: 1.20.9-gke.1001

Previous experiments have shown that Terraform-based providers are CPU-bound, and hence in this setup we are using nodes with higher CPU capacity in order to be able to scale up to 100 MRs.

I deployed a stripped-down version of provider-tf-azure (with 32 MRs, including the VirtualNetwork and Lb resources; Docker image: ulucinar/provider-tf-azure-controller:98a23918b69a778e4910f81483b7767c56cf41e5) containing the --concurrent-reconciles command-line option, which allows the provider to better utilize node resources. After experimenting with several values for the maximum concurrent reconciles, I chose a value of 3 for this cluster. A total of 45 VirtualNetwork MRs and 55 Lb MRs were provisioned simultaneously, with the MR name and infrastructure object name suffixes passed from the command line. An example invocation of the generator script and an example generated MR manifest look like the following:

$ ./manage-mr.sh create ./loadbalancer.yaml $(seq 1 55)
apiVersion: lb.azure.tf.crossplane.io/v1alpha1
kind: Lb
metadata:
  annotations:
    crossplane.io/external-name: /subscriptions/038f2b7c-3265-43b8-8624-c9ad5da610a8/resourceGroups/alper/providers/Microsoft.Network/loadBalancers/test-1
    tf.crossplane.io/state: ...
  creationTimestamp: "2021-09-27T13:38:48Z"
  finalizers:
  - finalizer.managedresource.crossplane.io
  generation: 1
  name: test-1
  resourceVersion: "5862396"
  uid: 2b97bbb4-791b-4449-8326-ef1ee5f13eb7
spec:
  deletionPolicy: Delete
  forProvider:
    location: East US
    name: test-1
    resourceGroupName: alper
  providerConfigRef:
    name: example
status:
  atProvider: {}
  conditions:
  - lastTransitionTime: "2021-09-27T13:43:56Z"
    reason: Available
    status: "True"
    type: Ready
  - lastTransitionTime: "2021-09-27T13:38:49Z"
    reason: ReconcileSuccess
    status: "True"
    type: Synced

As also observed in our previous experiments, CPU utilization shows a sharp increase in parallel with the increasing #-of-CRs in the cluster during the provisioning phase, and again during the de-provisioning phase. After all 100 MRs transitioned to the Ready state and the system stabilized, all of them were deleted simultaneously at 14:07:10 UTC. In the graphs below, the first CPU utilization peak at around 13:44 UTC is caused by the provisioning phase, and the second peak at around 14:13 UTC by the de-provisioning phase.

[graphs: azure-object-count, node-provider, provider-cpu-usage, nodes-cpu-mem-utilizations]

CPU utilization first climbs to just over 92% as many asynchronous Create calls run concurrently, but the limits we have imposed on concurrency prevent saturation. We also observe increasing workqueue wait times for both kinds of resources:
[graph: avg-queue-wait-times]

The following graph shows the time-to-readiness periods, i.e. the time it takes for an MR to become Ready (acquire the Ready status condition with status == True), measured from the time it is created:

[graph: ttr]

Please note that these are not averages. The maximum time-to-readiness interval observed for an MR in these experiments was 670 s and the minimum was 221 s.
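
As a worked example of how this metric can be read off a manifest, here is a short Go snippet using the sample Lb shown earlier, taking the Ready condition's lastTransitionTime as the readiness instant:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Timestamps copied from the example Lb manifest above:
	// metadata.creationTimestamp and the Ready condition's lastTransitionTime.
	created, _ := time.Parse(time.RFC3339, "2021-09-27T13:38:48Z")
	ready, _ := time.Parse(time.RFC3339, "2021-09-27T13:43:56Z")

	// Time to readiness for that MR: 5m8s (308 s), within the 221-670 s range.
	fmt.Println(ready.Sub(created))
}
```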

Another interesting observation is that when all the MRs were deleted at 14:07:10 UTC (i.e. acquired a non-zero metadata.deletionTimestamp), CPU utilization started to increase, reaching a maximum of ~94% at 14:13:30, and it took ~1020 s (~17 min) to remove all 100 MRs from the cluster. Even after the corresponding external resource is deleted via the Cloud API, it takes provider-tf-azure additional time to dequeue the request from the workqueue, make an observation for the deleted resource, and remove the finalizer.

ulucinar (Collaborator) commented Sep 28, 2021

Results from another set of experiments with the native provider provider-azure:

Experiment Setup:

On a GKE cluster with the following specs:

Machine family: General purpose e2-standard-4 (4 vCPU, 16 GB memory)
3 workers
Control plane version - 1.20.9-gke.1001

I deployed provider-azure v0.17.0. Please note that the maximum concurrent reconciler count is 1 for this version. The same script used in the previous experiments was employed with different template manifests to provision a total of 45 VirtualNetworks and 205 Subnets. We cannot efficiently utilize node resources even with 250 MRs:

[graph: native-various-metrics]

We will clearly benefit from increased concurrency here.

The time to readiness intervals are distributed as follows for these 250 MRs:

[graph: native-ttr]

When all 250 MRs were deleted simultaneously at ~23:31:46 UTC, we observed a surge in workqueue lengths but only a slight increase in CPU utilization; by ~23:34:30 UTC all MRs had been removed (in ~3 min). Please note that these deletion measurements are skewed because all 205 deleted subnets belonged to the same virtual network: when that virtual network was deleted, all observations for the subnets probably returned 404. Nevertheless, provider-azure still needs to observe each subnet before it removes the associated finalizer from the Subnet MR:

[graph: native-deletion-metrics]

We need to incorporate the recently proposed --max-reconcile-rate command-line option for provider-azure (by @negz in crossplane/crossplane#2595) to make a fair comparison with provider-tf-azure, which already benefits from increased concurrency & better utilization of the CPU resources as described in the above experiments.

ulucinar (Collaborator) commented Oct 7, 2021

We will continue scale testing of Terrajet-based providers with the latest improvements.
