This document describes recommendations for performance tuning of the Diego Data Store.
- Component scaling guidelines
- Foundation Level Tuning
- Locket Performance Tuning
- SQL Performance Tuning
- Compensating for Envoy memory overhead
- VM Size Recommendation
The following components must be scaled vertically (more CPU cores and/or memory) after having scaled to at least 3 instances horizontally. After 3 instances, scaling them horizontally does not make sense since there is only one instance active at any given point in time.
- `auctioneer`
  - metrics to monitor
- `bbs`
  - metrics to monitor
- `route_emitter` (only when running in global mode as opposed to cell-local mode)
The following components can be scaled horizontally as well as vertically:
- `file_server`
- `locket`
  - metrics to monitor
- `rep`
  - metrics to monitor
- `rep_windows`
  - metrics to monitor
- `route_emitter` (only when running in cell-local mode)
- `route_emitter_windows` (only when running in cell-local mode)
- `ssh_proxy`
The following jobs require more considered planning:
- `bbs`:
  - It is NOT recommended to use burstable performance VMs, such as AWS `t2`-family instances.
  - The performance of the BBS depends significantly on the performance of its SQL database. A less performant SQL backend could reduce the throughput and increase the latency of BBS requests.
  - BBS activity, from both API request load and internal activity, is directly proportional to the total number of running app instances (or running ActualLRPs, in pure Diego terms). If the number of instances that the deployment supports increases without a corresponding increase in VM resources, BBS API response times may increase.
- `rep`:
  - Although the `rep` is a horizontally scalable component, the resources available to each `rep` on its VM (typically called a "Diego cell") limit the total number of app instance and task containers that can run on that VM. For example, if the `rep` is running on a VM with 20 GB of memory, it can run only 20 app instances that each have a 1-GB memory limit. This constraint also applies to available disk capacity.
  - If an operator cannot deploy larger cell VMs or increase the number of cell VMs, the operator can overcommit memory and disk by setting the following properties on the `rep` job:
    - `diego.executor.memory_capacity_mb`
    - `diego.executor.disk_capacity_mb`

    Operators that overcommit cell capacity should be extremely careful not to run out of physical memory or disk capacity on the cells.
- `locket`:
  - It is NOT recommended to use burstable performance VMs, such as AWS `t2`-family instances.
  - The performance of the Locket instances depends significantly on the performance of their SQL database. A less performant SQL backend could reduce the throughput and increase the latency of Locket requests, which may in turn affect the availability of services such as the BBS, the auctioneer, and the cell reps that maintain locks and presences in Locket.
  - Note: Although `locket` is a horizontally scalable job, in cf-deployment it is deployed on the `diego-api` instance group along with the `bbs` job. In that case, we still recommend scaling the instance group vertically.
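
The `rep` capacity example above (a 20 GB cell running 1-GB app instances) can be sketched as simple arithmetic. This is only an illustration: the helper name `max_instances` and the 1.5x overcommit factor are assumptions made for this example, not values recommended by the Diego team, and the calculation ignores any memory used by the cell's own processes.

```python
def max_instances(capacity_mb: int, per_instance_mb: int) -> int:
    """Return how many containers of a given memory limit fit in the capacity."""
    if per_instance_mb <= 0:
        raise ValueError("per_instance_mb must be positive")
    return capacity_mb // per_instance_mb


physical_mb = 20 * 1024  # 20 GB cell VM, as in the example above
instance_mb = 1024       # each app instance has a 1-GB memory limit

# Without overcommit, the cell can place 20 such instances.
print(max_instances(physical_mb, instance_mb))

# Hypothetical overcommit: advertise 1.5x the physical memory through
# diego.executor.memory_capacity_mb (the factor is an assumption).
overcommit_factor = 1.5
advertised_mb = int(physical_mb * overcommit_factor)
print(max_instances(advertised_mb, instance_mb))
```

With overcommit, the cell accepts more containers than its physical memory can back if every instance uses its full limit, which is why the caution about running out of physical memory or disk applies.
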
The Diego team currently benchmarks the BBS and Locket together on a VM with 16
CPU cores and 60 GB of memory. The MySQL and Postgres backends have the same number
of cores and the same amount of memory. This setup can handle load from 1000 simulated cells
(running `rep` and `route-emitter`) with a total of 250K LRPs.
- Tips for Large Deployments with CF Networking and Silk Release
- ARP Cache Limit for Large foundations
- Known Loggregator Scaling Issues
The maximum number of connections from the active BBS to the SQL database can
be set using the `diego.bbs.sql.max_open_connections` property on the `bbs`
job, and the maximum number of idle connections can be set using
`diego.bbs.sql.max_idle_connections`. By default,
`diego.bbs.sql.max_idle_connections` is set to the same value as
`diego.bbs.sql.max_open_connections` to avoid recreating connections to the
database unnecessarily.
The maximum number of connections from each Locket instance to the database can
be set using the `database.max_open_connections` property on the `locket` job.
Unlike the BBS, the Locket job does not permit the maximum number of idle
connections to be set independently, and always sets it to the same value as
`database.max_open_connections`.
In a cf-deployment-based CF cluster, an operator can calculate the maximum number of connections from Diego components (BBS and Locket) to the SQL backend using the following formula:
```
diego.bbs.sql.max_open_connections + database.max_open_connections * <number of diego-api instances>
```
- The `diego.bbs.sql.max_open_connections` parameter contributes only once
because there is only one active BBS instance.
- The actual number of active connections may be significantly lower than this
maximum, depending on the scale of the app workload that the CF cluster
supports.
- If other components connect to the same SQL database you will need to add
their maximum number of connections to get an accurate figure.
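
The notes above translate into a small calculation: the BBS limit counts once (only one BBS is active), while the Locket limit counts once per `diego-api` instance. The sketch below assumes that reading; the function name and the sample values (500, 200, 3 instances, 150 extra connections) are hypothetical and are not defaults from any release.

```python
def max_diego_sql_connections(
    bbs_max_open: int,
    locket_max_open: int,
    diego_api_instances: int,
    other_components_max: int = 0,
) -> int:
    """Upper bound on SQL connections opened by Diego components.

    bbs_max_open        -- diego.bbs.sql.max_open_connections (counted once)
    locket_max_open     -- database.max_open_connections on the locket job
    diego_api_instances -- number of diego-api instances, each running locket
    other_components_max -- connections from other clients of the same database
    """
    if diego_api_instances < 1:
        raise ValueError("at least one diego-api instance is required")
    return bbs_max_open + locket_max_open * diego_api_instances + other_components_max


# Hypothetical values, chosen only to illustrate the arithmetic.
total = max_diego_sql_connections(bbs_max_open=500, locket_max_open=200, diego_api_instances=3)
print(total)

# Same deployment, plus 150 connections from other components on the same database.
print(max_diego_sql_connections(500, 200, 3, other_components_max=150))
```

An operator could compare the result against the connection limit configured on the database to confirm that the database can accept the worst-case load.
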
### SQL deployment configuration
Operators can use the following
[cf-deployment](https://github.com/cloudfoundry/cf-deployment)-compatible
[operations files](http://bosh.io/docs/cli-ops-files.html) to tune their MySQL
or Postgres databases to support a large CF cluster:
- MySQL: [mysql.yml](../operations/benchmarks/mysql.yml)
- Postgres: [postgres.yml](../operations/benchmarks/postgres.yml)
These operations files are the ones used in the Diego team's 250K-instance
benchmark tests, and operators may freely change the sizing and scaling
parameters in them to match the resource needs of their own CF clusters.
- 10K app instances: minimum 7.5 GB memory and 2 vCPUs
- 20K app instances: minimum 15 GB memory and 4 vCPUs

- 10K app instances: minimum 3.75 GB memory and 1 vCPU