Some users have larger deployments than we regularly test with. We have heard of large deployments with 500-1000 diego cells. These deployments have specific considerations that smaller deployments don't need to worry about.
Please submit a PR or create an issue if you have come across other large deployment considerations.
Symptom: The silk daemon on some diego cells fails because it cannot get a lease.

Solution: Increase the size of the silk-controller.network CIDR in the silk-controller spec.
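For example, a minimal sketch of the manifest change (the instance group name, release name, and the widened CIDR are assumptions; adjust them for your deployment):

```yaml
instance_groups:
- name: diego-api                # assumed: the instance group that runs the silk-controller
  jobs:
  - name: silk-controller
    release: silk                # assumed release name
    properties:
      network: 10.255.0.0/14     # widened from a /16 so that more cells can obtain a lease
```

Assuming the default /24 per-cell lease size, a /16 network only yields around 256 leases, so deployments in the 500-1000 cell range need a wider CIDR (a /14 yields roughly 1024).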
Symptom: The silk daemon begins using too much CPU on the cells. This causes the app health checks to fail, which in turn causes the apps to be evacuated from the cell.

Cause: The silk daemon is deployed on every cell. It is in charge of getting the IP leases for every other cell from the silk controller. The silk daemon calls out to the silk controller every 5 seconds (by default) to get updated lease information. Every time it gets new information, the silk daemon makes Linux system calls to set up the networking. This can take a relatively long time and gets expensive when there are a lot of cells with new leases, which causes the silk daemons to use a lot of CPU.
Solution: Increase the lease_poll_interval_seconds property on the silk-daemon job to a value greater than 5 seconds. This causes the silk-daemon to poll the silk-controller less frequently and thus make Linux system calls less frequently. However, increasing this property means that when a cell gets a new lease (which happens when a cell is rolled, recreated, or for whatever reason doesn't renew its lease properly), it will take longer for the other cells to learn how to route container-to-container traffic to it. To start with, we suggest setting this property to 300 seconds (5 minutes) and then tweaking accordingly.
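For example, a minimal sketch of the manifest change (the instance group and release names shown here are assumptions; adjust them for your deployment):

```yaml
instance_groups:
- name: diego-cell
  jobs:
  - name: silk-daemon
    release: silk                        # assumed release name
    properties:
      lease_poll_interval_seconds: 300   # poll the silk-controller every 5 minutes instead of the 5-second default
```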
Symptom: The silk daemon fails to converge leases. Errors in the silk-daemon logs might look like this:
{
"timestamp": "TIME",
"source": "cfnetworking.silk-daemon",
"message": "cfnetworking.silk-daemon.poll-cycle",
"log_level": 2,
"data": {
"error":"converge leases: del neigh with ip/hwaddr 10.255.21.2 : no such file or directory"
}
}
The kernel logs might also look like this:

    neighbour: arp_cache: neighbor table overflow
Cause: The ARP cache on the diego cell is not large enough to handle the number of entries the silk-daemon is trying to write.

Solution: Increase the ARP cache size on the diego cells.
- Look at the current size of your ARP cache:
  - SSH onto a diego cell and become root.
  - Inspect the following kernel variables:

        sysctl net.ipv4.neigh.default.gc_thresh1
        sysctl net.ipv4.neigh.default.gc_thresh2
        sysctl net.ipv4.neigh.default.gc_thresh3
- Manually increase the ARP cache size on the cell. This is good for fixing the issue in the moment, but it isn't a good long-term solution because the values will be reset when the cell is recreated.
  - Set new, larger values for the kernel variables. These sizes were used successfully for a deployment of ~800 cells:

        sudo sysctl -w net.ipv4.neigh.default.gc_thresh3=8192
        sudo sysctl -w net.ipv4.neigh.default.gc_thresh2=4096
        sudo sysctl -w net.ipv4.neigh.default.gc_thresh1=2048
- For a more permanent solution, set these variables by adding the os-conf-release sysctl job to the diego-cell instance group. A conf file will be autogenerated into /etc/sysctl.d/71-bosh-os-conf-sysctl.conf.
  - The manifest changes will look similar to this:

        instance_groups:
        - name: diego-cell
          jobs:
          - name: sysctl
            properties:
              sysctl:
              - net.ipv4.neigh.default.gc_thresh3=8192
              - net.ipv4.neigh.default.gc_thresh2=4096
              - net.ipv4.neigh.default.gc_thresh1=2048
            release: os-conf
        ...
        releases:
        - name: "os-conf"
          version: "20.0.0"
          url: "https://bosh.io/d/github.com/cloudfoundry/os-conf-release?v=20.0.0"
          sha1: "a60187f038d45e2886db9df82b72a9ab5fdcc49d"
To our knowledge no one has actually run into the upper limit on network policies, even in the largest of deployments. However, our team gets asked about this, so it seems important to cover it.
The quick answer is that you are limited to 65,535 apps used in network policies, which works out to at least 32,767 network policies.
Container networking policies are implemented using Linux marks. Each source and destination app in a networking policy is assigned a mark at policy creation time. If the source or destination app already has a mark assigned to it from a different policy, then the app reuses that mark and does not get a new one. The overlay network for container networking uses VXLAN, which limits marks to 16 bits. With 16 bits there are 2^16 (or 65,536) distinct mark values. The first mark is reserved and not given to apps, which leaves 65,535 marks available for apps.
Let's imagine that there are 65,535 different apps. A user could create 32,767 network policies from appA --> appB, where appA and appB are each used in only ONE network policy. Each of the 32,767 policies includes two apps (the source and the destination), and each of those apps needs a mark. This would use 65,534 marks and would reach the upper limit of network policies.
Now let's imagine that there are 5 apps, and a user wants each of the 5 apps to be able to talk to every app, including itself. This would result in 25 network policies. However, this would only use up 5 marks (one per app), leaving 65,530 marks available for other apps. This scenario shows that the more the policies "overlap", the more policies you can have.
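As a rough summary of the arithmetic (a sketch only; it assumes one mark value is reserved and that each policy consumes one mark per app that does not already have one):

$$
\begin{aligned}
\text{available marks} &= 2^{16} - 1 = 65{,}535 \\
\text{disjoint policies (2 new apps each)} &= \lfloor 65{,}535 / 2 \rfloor = 32{,}767 \\
\text{fully meshed policies among } n \text{ apps} &= n^2 \text{ policies, using only } n \text{ marks}
\end{aligned}
$$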