Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Multiple network overlay cidrs #148

Merged
merged 11 commits into from
Jan 29, 2025
Merged

Conversation

ameowlia
Copy link
Member

@ameowlia ameowlia commented Nov 14, 2024

👉 There is a lot of info in each individual commit message

High level summary

As a CF operator
I want to provide multiple smaller cidrs for my overlay network instead of one big cidr
So that I don't have to find a huge amount of contiguous IPs to give to CF

Overlay network context

By default the overlay network is 10.255.0.0/16. This IP range is reserved for private networks and this range is never publicly routable on the global internet.

Every time a container is created in CF it gets a unique overlay IP associated with it. This allows internal routes to resolve to individual IPs so that c2c traffic can work.

🥧 Overlay network bosh properties - a pie metaphor

There are 2 bosh properties concerning the overlay network. Before going into the details, I am going to talk about them metaphorically.

Imagine the entire overlay network is a pie. One bosh property describes the type and size of the pie.

The purpose of this pie is to serve the hungry hungry Diegos. Each Diego Cell is hungry (hungry) and needs a slice of pie. Without a slice of pie, the Diego Cell cannot start (it has no energy!). This is an egalitarian kitchen, so each Diego Cell must get an equal slice of pie.

The second bosh property determines the pie slice size that each Diego Cell gets. The smaller the slice of pie that each Diego Cell gets, the less “energy” the Cell has, and the fewer apps can run on each Cell. The larger the slice of pie is that each Diego Cell gets, the more “energy” the cell has, and more apps can run on each Cell.

If the size of the pie stays constant, smaller slices of pie means there can be more Diego Cells (who are fed less each) and larger slices means there is a smaller number of possible Diego Cells (who are fed more each).

Metaphorical bosh property name Real bosh property name Description
size_of_network_overlay_pie network Defines the entire “pie” (network overlay). A range of all of the IPs in the entire overlay network.
size_of_network_overlay_pie_slice subnet_prefix_length The size of the “pie slice” (subnet) each Diego Cell gets to allocate to apps.

Overlay network bosh properties (for real, no metaphors)

An operator provides a cidr for the overlay network via this network bosh property on the silk controller job. By default this value is10.255.0.0/16.

An operator provides a subnet_prefix_length via this bosh property on the silk controller job. This determines the number of IPs each Diego Cell gets. This determines how many apps can run on a single Diego Cell. A Diego Cell can only run as many apps at one time as it has IPs. By default the subnet_prefix_length is /24.

Let’s go through some examples of different values for these properties

Example 1 Default Property Values
when network is "10.255.0.0/16"
when subnet_prefix_length is 24
Total number of IPs in the network (A) /16 cidr has 65,536 IPs
Number of IPs per Diego Cell (B) /24 cidr has 256 IPs
Max number of Diego Cells supported A/B - 1 = 255 Diego Cells
Example 2 Fewer apps per cell
when network is "10.255.0.0/16"
when subnet_prefix_length is 25
Total number of IPs in the network (A) /16 cidr has 65,536 IPs
Number of IPs per Diego Cell (B) /25 cidr has 128 IPs
Max number of Diego Cells supported A/B -1 = 511 Diego Cells
Example 3 fewer apps per cell and smaller network
when network is "10.255.0.0/17"
when subnet_prefix_length is 25
Total number of IPs in the network (A) /17 cidr has 32,768 IPs
Number of IPs per Diego Cell (B) /25 cidr has 128 IPs
Max number of Diego Cells supported A/B - 1= 255 Diego Cells

Problem Summary - more 🥧 metaphors

You have 1000 hungry hungry Diegos. How do you feed them? Can you make one pie big enough to serve all 1000 of them? That sounds hard. Do they even make ovens big enough? This is the conundrum that CF operators currently face. Currently the only option is to bake one pie large enough to serve all their Diegos.

Wouldn’t it be great if our operators could bake 3 smaller pies to feed those hungry hungry Diegos? That is what this feature is all about.

Solution Summary

Currently an operator needs to provide a continuous cidr for the overlay network. This PR will let operators provide multiple non-consecutive CIDRs for their overlay network. For example, this PR makes: ["10.255.0.0/23", "10.255.10.0/23", "10.255.20.0/22"]a valid overlay network property value.

How big does my multi-CIDR network need to be?

The overlay network CIDRs will still be cut up into smaller subnet "leases" for the Diego Cells. One "lease" from each CIDR in the overlay network is reserved to setup the networking and will not be distributed to Diego Cells. Because of this the overlay network CIDRs must have a larger prefix length than the subnet_prefix_length.

Let's go through more examples...

Example 4 2 CIDRs in the overlay network
when network is ["10.255.0.0/20", "10.255.96.0/21"]
when subnet_prefix_length is 24
Total number of IPs in the first network CIDR (A) /20 cidr has 4,096 IPs
Total number of IPs in the first network CIDR (A') /21 cidr has 2,048 IPs
Number of IPs per Diego Cell (B) /24 cidr has 256 IPs
Number of CIDRs in the overlay network (C) 2
Max number of Diego Cells supported ((A+A')/B) - (C) = 22 Diego Cells
Example 5 3 CIDRs in the overlay network
when network is ["10.255.0.0/20", "10.255.96.0/21", "10.255.200.0/23"]
when subnet_prefix_length is 24
Total number of IPs in the first network CIDR (A) /20 cidr has 4,096 IPs
Total number of IPs in the first network CIDR (A') /21 cidr has 2,048 IPs
Total number of IPs in the first network CIDR (A'') /23 cidr has 512 IPs
Number of IPs per Diego Cell (B) /24 cidr has 256 IPs
Number of CIDRs in the overlay network (C) 3
Max number of Diego Cells supported ((A+A'+A'')/B) - (C) = 23 Diego Cells
Example 6 4 CIDRs in the overlay network and smaller subnet_prefix_length
when network is ["10.255.0.0/20", "10.255.96.0/21", "10.255.200.0/23", "10.255.220.0/24"]
when subnet_prefix_length is 25
Total number of IPs in the first network CIDR (A) /20 cidr has 4,096 IPs
Total number of IPs in the first network CIDR (A') /21 cidr has 2,048 IPs
Total number of IPs in the first network CIDR (A'') /23 cidr has 512 IPs
Total number of IPs in the first network CIDR (A''') /24 cidr has 256 IPs
Number of IPs per Diego Cell (B) /25 cidr has 128 IPs
Number of CIDRs in the overlay network (C) 4
Max number of Diego Cells supported ((A+A'+A''+A''')/B) - (C) = 50 Diego Cells

Invalid examples

🚫 Example 7 Wrong sized CIDRs
network ["10.255.0.0/20", "10.255.96.0/21", "10.255.200.0/23", "10.255.220.0/24"]
subnet_prefix_length 24
invalid reason subnet_prefix_length must be smaller than all CIDRs in the overlay network. In this case it is the same size as "10.255.220.0/24".
🚫 Example 8 Wrong sized CIDRs
network ["10.255.0.0/20", "10.255.96.0/21", "10.255.200.0/23", "10.255.220.0/32"]
subnet_prefix_length 24
invalid reason subnet_prefix_length must be smaller than all CIDRs in the overlay network. In this case it larger than "10.255.220.0/32".

Acceptance Criteria

These are the scenarios that I am going to manually test:

  1. Backwards compatibility: when network is set to 10.255.0.0/16
    • Test c2c without network policies
    • Test c2c with network policies
    • Test masquerading
    • Test single IPs
    • Test disable_container_networking_policies == true
  2. New stuff: when network is set to ["10.255.0.0/23", "10.255.10.0/23", "10.255.20.0/22"]
    • Test c2c without network policies
    • Test c2c with network policies (same cell, same cidr, and across cidrs)
    • Test masquerading
    • Test disable_container_networking_policies == true

Backward Compatibility

Breaking Change? No. I did a lot of work to make sure this wouldn't break any current deployments.

What's left

[x] Complete manual QA testing of all of the above acceptance criteria
[x] Acceptance tests
[x] Docs

@ameowlia ameowlia force-pushed the multiple-network-overlay-cidrs branch 2 times, most recently from e0d4d1a to d809803 Compare November 14, 2024 21:18
@ameowlia ameowlia force-pushed the multiple-network-overlay-cidrs branch from 0644b39 to f0ec2b1 Compare January 6, 2025 19:26
@ameowlia ameowlia force-pushed the multiple-network-overlay-cidrs branch 2 times, most recently from b93c8eb to 1ed48b0 Compare January 16, 2025 16:17
Now the silk-controller network property can be an array of CIDRs.
Previously it was a single string of one CIDR.  ow the template creates
a config with an array for this property, whether or not the operator
provided an array or a single string.

Previously allowed example:
* "10.255.0.0/16"

Now allowed examples:
* "10.255.0.0/16" (still allowed for backwards compatibility)
* ["10.255.0.0/16"]
* ["10.50.0.0/23", "10.100.0.0/22", "10.250.0.0/23"]
Now the silk-daemon network property can be an array of CIDRs.
Previously it was a single string of one CIDR.  Now the template creates
a config with an array for this property, whether or not the operator
provided an array or a single string.

Previously allowed example:
* "10.255.0.0/16"

Now allowed examples:
* "10.255.0.0/16" (still allowed for backwards compatibility)
* ["10.255.0.0/16"]
* ["10.50.0.0/23", "10.100.0.0/22", "10.250.0.0/23"]
@ameowlia ameowlia force-pushed the multiple-network-overlay-cidrs branch from 1ed48b0 to 14c5ca5 Compare January 16, 2025 16:42
@ameowlia ameowlia force-pushed the multiple-network-overlay-cidrs branch 3 times, most recently from 3bb4ebc to ebad4fb Compare January 17, 2025 16:49
jobs/silk-cni/spec Outdated Show resolved Hide resolved
if s == "" {
break
}

results[s]++
taken = append(taken, s)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same question about taken

Copy link
Contributor

@geofffranks geofffranks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks great! I mentiond some typos + requests for test/code clarity, and maybe a couple new test cases if you think it's worth adding them.

The silk-cni-(wrapper) NO LONGER uses the silk-controller network
property via a bosh link.

In a future commit, the silk-cni-wrapper will get the local VM's lease
from within the code itself. It does not need to know the entire overlay
network.
The vxlan-policy-agent uses the silk-controller network property via a
bosh link.

Now the silk-controller network property can be an array of CIDRs.
Previously it was a single string of one CIDR. Now the template creates
a config with an array for this property, whether or not the operator
provided an array or a single string.

Previously allowed example:
* "10.255.0.0/16"

Now allowed examples:
* "10.255.0.0/16" (still allowed for backwards compatibility)
* ["10.255.0.0/16"]
* ["10.50.0.0/23", "10.100.0.0/22", "10.250.0.0/23"]
This new package abstracts away the issue of having an overlaynetwork that is made of
multiple network CIDRs.

This package includes 4 functions:

* NewMultipleCIDRNetwork(cidrs []string) - this takes an array of CIDR
  strings, turns them into net.IPnet's and returns a new
  MultipleCIDRNetwork. This function will error if any CIDR provided is
  invalid.
* Contains(ip net.IP) - this function returns a boolean if an IP is in
  any of the CIDRs.
* Length() - this function returns the number of CIDRs in this network.
* WhichNetworkContains(ip net.IP) - this function returns the net.IPnet
  of the specific CIDR that contains a specific IP.
Now the silk-controller will give out leases from all of the CIDRs
provided in the overlay network. The first block of the first CIDR is
reserved for SingleIP usage. The first block of the other CIDRs are also
reserved to setup the networking between CIDRs.

For example, the bosh properties...
* network: ["10.255.0.0/23", "10.255.10.0/23", "10.255.20.0/22"]
* subnet_prefix_length: 24

Will result in...
* 10.255.0.0/24 --> reserved for single IP + networking setup
* 10.255.1.0/24 --> possible lease for a Diego Cell

* 10.255.10.0/24 --> reserved for networking setup
* 10.255.11.0/24 --> possible lease for a Diego Cell

* 10.255.20.0/24 --> reserved for networking setup
* 10.255.21.0/24 --> possible lease for a Diego Cell
* 10.255.22.0/24 --> possible lease for a Diego Cell
* 10.255.23.0/24 --> possible lease for a Diego Cell
@ameowlia ameowlia force-pushed the multiple-network-overlay-cidrs branch from e257065 to 79d651d Compare January 22, 2025 18:41
Run the following commands to see these changes on a real machine
* routes: ip route show
* addresses: ip addr show silk-vtep
* arp: ip neigh show
* bridge: bridge fdb show silk-vtep

For example with the following setup:
* network: ["10.255.0.0/23", "10.255.10.0/23", "10.255.20.0/22"]
* subnet_prefix_length: 24
* 4 Diego Cells

1. Routes - added for every subnet lease except the VM the daemon is
   running on.

```
10.255.1.0/24 via 10.255.1.0 dev silk-vtep
10.255.21.0/24 via 10.255.21.0 dev silk-vtep
10.255.23.0/24 via 10.255.23.0 dev silk-vtep
```

2. Addresses - added for every CIDR in the overlay network. The CIDR
   that the daemon's VM's lease is in will use the first IP from the
   VM's lease. In this case that is 10.255.22.0.

```
inet 10.255.0.0/23 brd 10.255.1.255 scope link silk-vtep
inet 10.255.10.0/23 brd 10.255.11.255 scope link silk-vtep
inet 10.255.22.0/22 brd 10.255.23.255 scope link silk-vtep
```

3. ARP - added for every subnet lease except the VM that the daemon is
   running on.

```
10.255.23.0 dev silk-vtep lladdr ee:ee:0a:ff:17:00 PERMANENT
10.255.1.0 dev silk-vtep lladdr ee:ee:0a:ff:01:00 PERMANENT
10.255.21.0 dev silk-vtep lladdr ee:ee:0a:ff:15:00 PERMANENT
```

4. Bridge - added for every VM with a lease except the VM that the
   daemon is running on.

```
ee:ee:0a:ff:17:00 dev silk-vtep dst 10.0.1.13 self permanent
ee:ee:0a:ff:01:00 dev silk-vtep dst 10.0.1.18 self permanent
ee:ee:0a:ff:15:00 dev silk-vtep dst 10.0.1.19 self permanent
```
The vxlan-policy-agent only uses the overlay network when c2c network
policies is turned off. In that case the vxlan-policy-agent creates
iptables rules to ALLOW traffic where is the src and dst are both within
the overlay network.
The silk-cni-wrapper writes MASQUERADE rules to the Diego Cell so that
non-c2c traffic will be MASQUERADEd.

Before this change the MASQUERADE rule would look like...
-A POSTROUTING -s <APP-OVERLAY-IP> ! -d <OVERLAY-NETWORK-CIDR> ! -o silk-vtep -j MASQUERADE

This means...
* During POSTROUTING
* When the source of the traffic is an app on the diego cell
* And the traffic is NOT going to anywhere in the overlay network
* And the traffic is NOT going to the silk-vtep interface (which only
  happens when going to the overlay network on other cells)
* Then MASQUERADE this traffic

After this change the MASQUERADE rule now looks like...
-A POSTROUTING -s <APP-OVERLAY-IP> ! -d <OVERLAY-NETWORK-SUBNET-LEASE-FOR-THIS-CELL> ! -o silk-vtep -j MASQUERADE

This means...
* During POSTROUTING
* When the source of the traffic is an app on the diego cell
* And the traffic is NOT going to anywhere on the overlay network on
  this cell
* And the traffic is NOT going to the silk-vtep interface (which only
  happens when going to the overlay network on other cells)
* Then MASQUERADE this traffic

I think there are two futher refactors that could be done here.
* I think silk-cni and silk-cni-wrapper could be combined.
* I think there could be one MASQUERADE rule per cell instead of one per
  app. That rule would be...
* -A POSTROUTING -s <OVERLAY-NETWORK-SUBNET-LEASE-FOR-THIS-CELL> ! -d <OVERLAY-NETWORK-SUBNET-LEASE-FOR-THIS-CELL> ! -o silk-vtep -j MASQUERADE
@geofffranks geofffranks merged commit 712942e into develop Jan 29, 2025
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Development

Successfully merging this pull request may close these issues.

2 participants