-
Notifications
You must be signed in to change notification settings - Fork 33
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Multiple network overlay cidrs #148
Conversation
e0d4d1a
to
d809803
Compare
0644b39
to
f0ec2b1
Compare
b93c8eb
to
1ed48b0
Compare
Now the silk-controller network property can be an array of CIDRs. Previously it was a single string of one CIDR. ow the template creates a config with an array for this property, whether or not the operator provided an array or a single string. Previously allowed example: * "10.255.0.0/16" Now allowed examples: * "10.255.0.0/16" (still allowed for backwards compatibility) * ["10.255.0.0/16"] * ["10.50.0.0/23", "10.100.0.0/22", "10.250.0.0/23"]
Now the silk-daemon network property can be an array of CIDRs. Previously it was a single string of one CIDR. Now the template creates a config with an array for this property, whether or not the operator provided an array or a single string. Previously allowed example: * "10.255.0.0/16" Now allowed examples: * "10.255.0.0/16" (still allowed for backwards compatibility) * ["10.255.0.0/16"] * ["10.50.0.0/23", "10.100.0.0/22", "10.250.0.0/23"]
1ed48b0
to
14c5ca5
Compare
3bb4ebc
to
ebad4fb
Compare
src/code.cloudfoundry.org/silk/controller/integration/integration_test.go
Outdated
Show resolved
Hide resolved
src/code.cloudfoundry.org/silk/controller/integration/integration_test.go
Outdated
Show resolved
Hide resolved
src/code.cloudfoundry.org/silk/controller/leaser/cidrpool_test.go
Outdated
Show resolved
Hide resolved
src/code.cloudfoundry.org/silk/controller/leaser/cidrpool_test.go
Outdated
Show resolved
Hide resolved
src/code.cloudfoundry.org/silk/controller/leaser/cidrpool_test.go
Outdated
Show resolved
Hide resolved
src/code.cloudfoundry.org/silk/controller/leaser/cidrpool_test.go
Outdated
Show resolved
Hide resolved
src/code.cloudfoundry.org/silk/controller/leaser/cidrpool_test.go
Outdated
Show resolved
Hide resolved
if s == "" { | ||
break | ||
} | ||
|
||
results[s]++ | ||
taken = append(taken, s) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same question about taken
src/code.cloudfoundry.org/silk/controller/leaser/cidrpool_test.go
Outdated
Show resolved
Hide resolved
src/code.cloudfoundry.org/lib/multiple-cidr-network/multiple_cidr_network.go
Show resolved
Hide resolved
src/code.cloudfoundry.org/silk/daemon/integration/integration_test.go
Outdated
Show resolved
Hide resolved
src/code.cloudfoundry.org/cni-wrapper-plugin/integration/cni_wrapper_plugin_test.go
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks great! I mentiond some typos + requests for test/code clarity, and maybe a couple new test cases if you think it's worth adding them.
The silk-cni-(wrapper) NO LONGER uses the silk-controller network property via a bosh link. In a future commit, the silk-cni-wrapper will get the local VM's lease from within the code itself. It does not need to know the entire overlay network.
The vxlan-policy-agent uses the silk-controller network property via a bosh link. Now the silk-controller network property can be an array of CIDRs. Previously it was a single string of one CIDR. Now the template creates a config with an array for this property, whether or not the operator provided an array or a single string. Previously allowed example: * "10.255.0.0/16" Now allowed examples: * "10.255.0.0/16" (still allowed for backwards compatibility) * ["10.255.0.0/16"] * ["10.50.0.0/23", "10.100.0.0/22", "10.250.0.0/23"]
This new package abstracts away the issue of having an overlaynetwork that is made of multiple network CIDRs. This package includes 4 functions: * NewMultipleCIDRNetwork(cidrs []string) - this takes an array of CIDR strings, turns them into net.IPnet's and returns a new MultipleCIDRNetwork. This function will error if any CIDR provided is invalid. * Contains(ip net.IP) - this function returns a boolean if an IP is in any of the CIDRs. * Length() - this function returns the number of CIDRs in this network. * WhichNetworkContains(ip net.IP) - this function returns the net.IPnet of the specific CIDR that contains a specific IP.
Now the silk-controller will give out leases from all of the CIDRs provided in the overlay network. The first block of the first CIDR is reserved for SingleIP usage. The first block of the other CIDRs are also reserved to setup the networking between CIDRs. For example, the bosh properties... * network: ["10.255.0.0/23", "10.255.10.0/23", "10.255.20.0/22"] * subnet_prefix_length: 24 Will result in... * 10.255.0.0/24 --> reserved for single IP + networking setup * 10.255.1.0/24 --> possible lease for a Diego Cell * 10.255.10.0/24 --> reserved for networking setup * 10.255.11.0/24 --> possible lease for a Diego Cell * 10.255.20.0/24 --> reserved for networking setup * 10.255.21.0/24 --> possible lease for a Diego Cell * 10.255.22.0/24 --> possible lease for a Diego Cell * 10.255.23.0/24 --> possible lease for a Diego Cell
e257065
to
79d651d
Compare
Run the following commands to see these changes on a real machine * routes: ip route show * addresses: ip addr show silk-vtep * arp: ip neigh show * bridge: bridge fdb show silk-vtep For example with the following setup: * network: ["10.255.0.0/23", "10.255.10.0/23", "10.255.20.0/22"] * subnet_prefix_length: 24 * 4 Diego Cells 1. Routes - added for every subnet lease except the VM the daemon is running on. ``` 10.255.1.0/24 via 10.255.1.0 dev silk-vtep 10.255.21.0/24 via 10.255.21.0 dev silk-vtep 10.255.23.0/24 via 10.255.23.0 dev silk-vtep ``` 2. Addresses - added for every CIDR in the overlay network. The CIDR that the daemon's VM's lease is in will use the first IP from the VM's lease. In this case that is 10.255.22.0. ``` inet 10.255.0.0/23 brd 10.255.1.255 scope link silk-vtep inet 10.255.10.0/23 brd 10.255.11.255 scope link silk-vtep inet 10.255.22.0/22 brd 10.255.23.255 scope link silk-vtep ``` 3. ARP - added for every subnet lease except the VM that the daemon is running on. ``` 10.255.23.0 dev silk-vtep lladdr ee:ee:0a:ff:17:00 PERMANENT 10.255.1.0 dev silk-vtep lladdr ee:ee:0a:ff:01:00 PERMANENT 10.255.21.0 dev silk-vtep lladdr ee:ee:0a:ff:15:00 PERMANENT ``` 4. Bridge - added for every VM with a lease except the VM that the daemon is running on. ``` ee:ee:0a:ff:17:00 dev silk-vtep dst 10.0.1.13 self permanent ee:ee:0a:ff:01:00 dev silk-vtep dst 10.0.1.18 self permanent ee:ee:0a:ff:15:00 dev silk-vtep dst 10.0.1.19 self permanent ```
The vxlan-policy-agent only uses the overlay network when c2c network policies is turned off. In that case the vxlan-policy-agent creates iptables rules to ALLOW traffic where is the src and dst are both within the overlay network.
The silk-cni-wrapper writes MASQUERADE rules to the Diego Cell so that non-c2c traffic will be MASQUERADEd. Before this change the MASQUERADE rule would look like... -A POSTROUTING -s <APP-OVERLAY-IP> ! -d <OVERLAY-NETWORK-CIDR> ! -o silk-vtep -j MASQUERADE This means... * During POSTROUTING * When the source of the traffic is an app on the diego cell * And the traffic is NOT going to anywhere in the overlay network * And the traffic is NOT going to the silk-vtep interface (which only happens when going to the overlay network on other cells) * Then MASQUERADE this traffic After this change the MASQUERADE rule now looks like... -A POSTROUTING -s <APP-OVERLAY-IP> ! -d <OVERLAY-NETWORK-SUBNET-LEASE-FOR-THIS-CELL> ! -o silk-vtep -j MASQUERADE This means... * During POSTROUTING * When the source of the traffic is an app on the diego cell * And the traffic is NOT going to anywhere on the overlay network on this cell * And the traffic is NOT going to the silk-vtep interface (which only happens when going to the overlay network on other cells) * Then MASQUERADE this traffic I think there are two futher refactors that could be done here. * I think silk-cni and silk-cni-wrapper could be combined. * I think there could be one MASQUERADE rule per cell instead of one per app. That rule would be... * -A POSTROUTING -s <OVERLAY-NETWORK-SUBNET-LEASE-FOR-THIS-CELL> ! -d <OVERLAY-NETWORK-SUBNET-LEASE-FOR-THIS-CELL> ! -o silk-vtep -j MASQUERADE
78f5246
to
d0909db
Compare
👉 There is a lot of info in each individual commit message
High level summary
As a CF operator
I want to provide multiple smaller cidrs for my overlay network instead of one big cidr
So that I don't have to find a huge amount of contiguous IPs to give to CF
Overlay network context
By default the overlay network is
10.255.0.0/16
. This IP range is reserved for private networks and this range is never publicly routable on the global internet.Every time a container is created in CF it gets a unique overlay IP associated with it. This allows internal routes to resolve to individual IPs so that c2c traffic can work.
🥧 Overlay network bosh properties - a pie metaphor
There are 2 bosh properties concerning the overlay network. Before going into the details, I am going to talk about them metaphorically.
Imagine the entire overlay network is a pie. One bosh property describes the type and size of the pie.
The purpose of this pie is to serve the hungry hungry Diegos. Each Diego Cell is hungry (hungry) and needs a slice of pie. Without a slice of pie, the Diego Cell cannot start (it has no energy!). This is an egalitarian kitchen, so each Diego Cell must get an equal slice of pie.
The second bosh property determines the pie slice size that each Diego Cell gets. The smaller the slice of pie that each Diego Cell gets, the less “energy” the Cell has, and the fewer apps can run on each Cell. The larger the slice of pie is that each Diego Cell gets, the more “energy” the cell has, and more apps can run on each Cell.
If the size of the pie stays constant, smaller slices of pie means there can be more Diego Cells (who are fed less each) and larger slices means there is a smaller number of possible Diego Cells (who are fed more each).
Overlay network bosh properties (for real, no metaphors)
An operator provides a cidr for the overlay network via this
network
bosh property on the silk controller job. By default this value is10.255.0.0/16
.An operator provides a subnet_prefix_length via this bosh property on the silk controller job. This determines the number of IPs each Diego Cell gets. This determines how many apps can run on a single Diego Cell. A Diego Cell can only run as many apps at one time as it has IPs. By default the subnet_prefix_length is
/24
.Let’s go through some examples of different values for these properties
network
issubnet_prefix_length
isnetwork
issubnet_prefix_length
isnetwork
issubnet_prefix_length
isProblem Summary - more 🥧 metaphors
You have 1000 hungry hungry Diegos. How do you feed them? Can you make one pie big enough to serve all 1000 of them? That sounds hard. Do they even make ovens big enough? This is the conundrum that CF operators currently face. Currently the only option is to bake one pie large enough to serve all their Diegos.
Wouldn’t it be great if our operators could bake 3 smaller pies to feed those hungry hungry Diegos? That is what this feature is all about.
Solution Summary
Currently an operator needs to provide a continuous cidr for the overlay network. This PR will let operators provide multiple non-consecutive CIDRs for their overlay network. For example, this PR makes:
["10.255.0.0/23", "10.255.10.0/23", "10.255.20.0/22"]
a valid overlay network property value.How big does my multi-CIDR network need to be?
The overlay network CIDRs will still be cut up into smaller subnet "leases" for the Diego Cells. One "lease" from each CIDR in the overlay network is reserved to setup the networking and will not be distributed to Diego Cells. Because of this the overlay network CIDRs must have a larger prefix length than the
subnet_prefix_length
.Let's go through more examples...
network
issubnet_prefix_length
isnetwork
issubnet_prefix_length
issubnet_prefix_length
network
issubnet_prefix_length
isInvalid examples
network
subnet_prefix_length
"10.255.220.0/24"
.network
subnet_prefix_length
"10.255.220.0/32"
.Acceptance Criteria
These are the scenarios that I am going to manually test:
10.255.0.0/16
["10.255.0.0/23", "10.255.10.0/23", "10.255.20.0/22"]
Backward Compatibility
Breaking Change? No. I did a lot of work to make sure this wouldn't break any current deployments.
What's left
[x] Complete manual QA testing of all of the above acceptance criteria
[x] Acceptance tests
[x] Docs