Support for private subnet instance groups with NAT Gateway #428

Closed
7 tasks
tazjin opened this issue Sep 11, 2016 · 42 comments

@tazjin
Contributor

tazjin commented Sep 11, 2016

After some discussions with @chrislovecnm I'm using this issue to summarise what we need to do to support instances in private subnets with NAT gateways.

Problem

Currently all instance groups created by kops are placed in public subnets. This may not be desirable in all use-cases. There are related open issues about this (#232, #266 which should maybe be closed, #220, #196).

As the simplest use-case, kops should support launching instance groups into private subnets with Amazon's managed NAT gateways as the default route.

In addition, a feature to specify a default route may be desirable for use-cases where NAT is handled differently, as suggested by @ProTip.

AWS resources

In order to set this up several resources are required. We need:

  1. At least one public subnet (can be a subnet in which we have public instance groups with nodes / masters)
  2. An Elastic IP to associate with each NAT gateway
  3. At least one NAT gateway resource in a public subnet, associated with an elastic IP.
  4. A route table and IGW entry for the public subnet (kops currently creates this).
  5. A route table and entry for sending traffic to the NAT gateway from the private subnet.
  6. Correct route table associations for each subnet.
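A minimal sketch of the resource list above, assuming one NAT gateway per availability zone (the list only requires "at least one"). The `Resource` type, the `kind` labels and the naming scheme are hypothetical and are not kops task names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    kind: str  # illustrative label, e.g. "nat_gateway"
    name: str


def plan_private_topology(zones):
    """Enumerate the resources from the list above for the given zones."""
    plan = [
        Resource("internet_gateway", "igw"),
        Resource("route_table", "public"),  # item 4: shared public route table
    ]
    for zone in zones:
        plan += [
            Resource("subnet", f"public-{zone}"),        # item 1
            Resource("elastic_ip", f"nat-{zone}"),       # item 2
            Resource("nat_gateway", f"nat-{zone}"),      # item 3
            Resource("subnet", f"private-{zone}"),
            Resource("route_table", f"private-{zone}"),  # item 5
            Resource("route_table_association", f"public-{zone}"),   # item 6
            Resource("route_table_association", f"private-{zone}"),  # item 6
        ]
    return plan
```

With three zones this yields three NAT gateways and three Elastic IPs, which makes the per-AZ cost of the topology easy to see before anything is created.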

Open questions

  • Right now kops creates a single route-table and internet gateway for all subnets. We may need to split this up into at least a public and private route-table. Which routes must exist in both tables?
  • AWS NAT gateways are redundant by default, but only within their specified availability zone. Is it possible to specify multiple default routes as a sort of "load-balanced" NAT setup? Or should we have a route table per AZ with private subnets, each with a corresponding NAT gateway?

Implementation

After the open questions are answered (and unless somebody else comes up with any new ones!) I think the implementation steps are roughly these:

  • Split out the current single route table into something more dynamic.
  • Set up dependent resources if an instance group (IG) is set to private.
  • Add awstask for NAT gateway creation.
  • ???
  • Profit!

Dump your thoughts in the comments and I'll update this as we go along. I'm willing to spend time on this but have limited Go experience, so if someone familiar with the code base has time to answer questions that may come up I'd be grateful.

@chrislovecnm
Contributor

cc: @kris-nova

@oviis

oviis commented Sep 13, 2016

Hi guys,
thanks a lot for managing this.
First of all, I love kops, Kubernetes and AWS. :-)

I created a sample kops cluster named "k8s-test-evironment-com" in 3 AWS AZs (eu-west), output it to Terraform, and then started managing the routing in an extra file, "isolated_cluster_sample.tf".
I then had to answer the same questions as you did, and in the end I decided to create a NAT gateway and EIP per AZ.
The tricky part was to tag the route tables WITHOUT THE "KubernetesCluster" tag!!!
With this tag and 2 routes, the k8s services network doesn't work!!!
That cost me 2 days ;-)

My working sample Terraform code for isolated nodes is an addition to the subnets generated by kops. The NAT gateways are placed in the generated public subnets.
This sample can be used as a template for changing the Go code that does the generation afterwards.

For testing, follow these steps:

  • Add this sample file to your generated kops directory, next to "kubernetes.tf".
  • Change the cluster name to your generated cluster name (isolated_cluster_sample.tf).
  • Change the nodes ASG "vpc_zone_identifier" to the private subnets (kubernetes.tf).
  • Change the nodes launch_configuration to "associate_public_ip_address = false" (kubernetes.tf).

Start of file "isolated_cluster_sample.tf":

```hcl
#----------------------------------------
# begin private subnets
# generated VPC CIDR in this case "10.10.0.0/16"
# AWS-AZS="eu-west-1a,eu-west-1b,eu-west-1c"
#----------------------------------------

resource "aws_subnet" "eu-west-1a-k8s-test-evironment-com_private" {
  vpc_id            = "${aws_vpc.k8s-test-evironment-com.id}"
  cidr_block        = "10.10.128.0/19"
  availability_zone = "eu-west-1a"
  tags = {
    KubernetesCluster = "k8s.test.evironment.com"
    Name              = "eu-west-1a.k8s.test.evironment.com"
  }
}

resource "aws_subnet" "eu-west-1b-k8s-test-evironment-com_private" {
  vpc_id            = "${aws_vpc.k8s-test-evironment-com.id}"
  cidr_block        = "10.10.160.0/19"
  availability_zone = "eu-west-1b"
  tags = {
    KubernetesCluster = "k8s.test.evironment.com"
    Name              = "eu-west-1b.k8s.test.evironment.com"
  }
}

resource "aws_subnet" "eu-west-1c-k8s-test-evironment-com_private" {
  vpc_id            = "${aws_vpc.k8s-test-evironment-com.id}"
  cidr_block        = "10.10.192.0/19"
  availability_zone = "eu-west-1c"
  tags = {
    KubernetesCluster = "k8s.test.evironment.com"
    Name              = "eu-west-1c.k8s.test.evironment.com"
  }
}

#-----------------------------------------
# end private subnets
#-----------------------------------------

#-------------------------------------------------------
# private NAT begin
# one EIP and one NAT gateway per AZ; the gateways live in
# the public subnets generated by kops
#-------------------------------------------------------

resource "aws_eip" "nat-1a" {
  vpc = true
  lifecycle { create_before_destroy = true }
}

resource "aws_eip" "nat-1b" {
  vpc = true
  lifecycle { create_before_destroy = true }
}

resource "aws_eip" "nat-1c" {
  vpc = true
  lifecycle { create_before_destroy = true }
}

resource "aws_nat_gateway" "gw-1a" {
  allocation_id = "${aws_eip.nat-1a.id}"
  subnet_id     = "${aws_subnet.eu-west-1a-k8s-test-evironment-com.id}"
}

resource "aws_nat_gateway" "gw-1b" {
  allocation_id = "${aws_eip.nat-1b.id}"
  subnet_id     = "${aws_subnet.eu-west-1b-k8s-test-evironment-com.id}"
}

resource "aws_nat_gateway" "gw-1c" {
  allocation_id = "${aws_eip.nat-1c.id}"
  subnet_id     = "${aws_subnet.eu-west-1c-k8s-test-evironment-com.id}"
}
#-------------------------------------------------------
# private NAT end
#-------------------------------------------------------

#-------------------------------------------------------
# private routing begin
# each private route table sends 0.0.0.0/0 to the NAT gateway of its AZ
#-------------------------------------------------------

resource "aws_route" "0-0-0-0--nat-1a" {
  route_table_id         = "${aws_route_table.k8s-test-evironment-com_private_1a.id}"
  destination_cidr_block = "0.0.0.0/0"
  nat_gateway_id         = "${aws_nat_gateway.gw-1a.id}"
}

resource "aws_route" "0-0-0-0--nat-1b" {
  route_table_id         = "${aws_route_table.k8s-test-evironment-com_private_1b.id}"
  destination_cidr_block = "0.0.0.0/0"
  nat_gateway_id         = "${aws_nat_gateway.gw-1b.id}"
}

resource "aws_route" "0-0-0-0--nat-1c" {
  route_table_id         = "${aws_route_table.k8s-test-evironment-com_private_1c.id}"
  destination_cidr_block = "0.0.0.0/0"
  nat_gateway_id         = "${aws_nat_gateway.gw-1c.id}"
}

# NOTE: the private route tables deliberately carry no "KubernetesCluster" tag
resource "aws_route_table" "k8s-test-evironment-com_private_1a" {
  vpc_id = "${aws_vpc.k8s-test-evironment-com.id}"
  tags = {
    Name = "k8s.test.evironment.com_private"
  }
}

resource "aws_route_table" "k8s-test-evironment-com_private_1b" {
  vpc_id = "${aws_vpc.k8s-test-evironment-com.id}"
  tags = {
    Name = "k8s.test.evironment.com_private"
  }
}

resource "aws_route_table" "k8s-test-evironment-com_private_1c" {
  vpc_id = "${aws_vpc.k8s-test-evironment-com.id}"
  tags = {
    Name = "k8s.test.evironment.com_private"
  }
}

resource "aws_route_table_association" "eu-west-1a-k8s-test-evironment-com_private" {
  subnet_id      = "${aws_subnet.eu-west-1a-k8s-test-evironment-com_private.id}"
  route_table_id = "${aws_route_table.k8s-test-evironment-com_private_1a.id}"
}

resource "aws_route_table_association" "eu-west-1b-k8s-test-evironment-com_private" {
  subnet_id      = "${aws_subnet.eu-west-1b-k8s-test-evironment-com_private.id}"
  route_table_id = "${aws_route_table.k8s-test-evironment-com_private_1b.id}"
}

resource "aws_route_table_association" "eu-west-1c-k8s-test-evironment-com_private" {
  subnet_id      = "${aws_subnet.eu-west-1c-k8s-test-evironment-com_private.id}"
  route_table_id = "${aws_route_table.k8s-test-evironment-com_private_1c.id}"
}

#-------------------------------------------------------
# private routing end
#-------------------------------------------------------
```
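The three private CIDR blocks in the sample are consecutive /19 slices of the 10.10.0.0/16 VPC. As a rough sketch, they can be derived with Python's standard `ipaddress` module rather than typed by hand. Which slices are free is an assumption here: the sketch simply takes the fifth to seventh /19 blocks, matching the sample.

```python
import ipaddress

vpc = ipaddress.ip_network("10.10.0.0/16")
blocks = list(vpc.subnets(new_prefix=19))  # a /16 splits into eight /19 blocks

zones = ["eu-west-1a", "eu-west-1b", "eu-west-1c"]
# Assumption: blocks 4..6 are unused by the subnets kops generated.
private_cidrs = {zone: str(block) for zone, block in zip(zones, blocks[4:7])}

for zone, cidr in private_cidrs.items():
    print(zone, cidr)
```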

@MrTrustor
Contributor

MrTrustor commented Sep 13, 2016

Hello!

Thank you for the work done so far. This is far better than anything else I've seen for spinning up K8s clusters on AWS.

I would very much like to see this implemented. My company uses a predefined, well-designed network topology that more or less matches this. AWS also recommends this kind of setup.

AWS published a CloudFormation stack that implements a neat network topology:

To sum it up here:

  • For each AZ: one small public subnet, one large private subnet, one NAT gateway in the public subnet
  • The public subnets all use the same route table with 0.0.0.0/0 routed to the Internet gateway
  • Each private subnet has its own route table with 0.0.0.0/0 routed to the NAT gateway of its AZ.
  • The public subnets are only used for the NAT Gateways and the public ELBs.
  • Everything else is hosted in the private subnets (i.e., no EC2 instances in the public subnets).
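The routing rules in this summary can be sketched as a small lookup: one shared public table pointing at the Internet gateway, and one private table per AZ pointing at that AZ's NAT gateway. The function names, table names and IDs below are hypothetical.

```python
def build_route_tables(zones, igw_id, nat_ids):
    """Return {route_table_name: target of its 0.0.0.0/0 route}."""
    tables = {"public": igw_id}  # shared by all public subnets
    for zone in zones:
        tables[f"private-{zone}"] = nat_ids[zone]  # NAT gateway of the same AZ
    return tables


def route_table_for(subnet_kind, zone):
    """Name of the route table a subnet of the given kind and zone uses."""
    return "public" if subnet_kind == "public" else f"private-{zone}"
```

For example, a private subnet in zone `b` resolves to the table `private-b`, whose default route is the NAT gateway in zone `b`, so losing another AZ does not affect its outbound traffic.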

I would also like to contribute to this project, so I'd be happy to take on a part of the work linked to this issue.

@justinsb justinsb added this to the 1.3.1 milestone Sep 24, 2016
@chrislovecnm
Contributor

@tazjin - few questions for you

  • Can we get a PR in with a design doc? This is complex enough network wise that I would like to see a rough pattern outlined. Let us know if you need help with this. I already have a diagram of how kops works now.
  • N00b question - do we have to have a nat gw?
  • How do we design ingress and loadbalancer with private IP space?
  • Do we need a bastion server as well with the same ssh pem?
  • How do we route the API server? Right now we have round-robin DNS, and the API endpoints are not behind an ELB.
  • We are putting in a validate option into kops, how would someone validate the cluster? ssh port forward?

Regarding the route table: because we are not using an overlay network, we rely on that routing to communicate between AZs, which limits us to 50 servers total.

We need to think full HA with 3+ masters and multiple AZ. Only way to roll ;)

@tazjin
Contributor Author

tazjin commented Sep 25, 2016

@chrislovecnm Hi!

Can we get a PR in with a design doc?

Yes, I'm hoping to find some time for that this week.

N00b question - do we have to have a nat gw?

We need something in a public subnet that can NAT the private traffic to the internet (this is assuming people want their clusters to be able to access the internet!)

How do we design ingress and loadbalancer with private IP space?

Not sure about ingress (depends on the ingress type I suppose?) but normal type: LoadBalancer should be able to handle target instances in private subnets.

How do we route the API server?

There are several options and this needs discussing. For example:

  1. Public API server (like right now). Potentially configured to enforce valid client certificates (I believe it isn't right now!)
  2. kops is hands-off - it's up to the user to forward their traffic into the private subnet.
  3. Some sort of tunnel (e.g. SSH-based) into where the master is running.

We are putting in a validate option into kops, how would someone validate the cluster?

I'm not familiar with what that option will do, so no clue!

@MrTrustor
Contributor

Just to answer this question:

Do we need a bastion server as well with the same ssh pem?

If the machine you want to SSH into is in a private subnet, you can set up an ELB to forward port 22 traffic to this server.

@chrislovecnm
Contributor

cc: @ajayamohan

@jkemp101

jkemp101 commented Oct 4, 2016

I propose that as the first step kops is made to work in an existing environment (shared VPC) using existing NGWs. As a minimum first set of requirements I would think we need to:

  1. Create a route table for each zone so we are HA.
  2. Specify an existing NAT Gateway(s) to be used in each route table.
  3. Change existing route management to manage multiple route tables instead of the single table used today when nodes are added/removed.

I think having kops spin up an entire environment including public and private subnets with IGWs, NAT Gateways, bastions, proper Security Groups, etc. is asking a lot of kops. Making this work with existing infrastructure as the first step and then adding onto it if necessary is potentially a cleaner path than doing it the other way around. For instance, I already have 3 NGWs and wouldn't want to have to pay for 3 more if kops created them automatically.

@chulkilee

I'm new to kops and don't know how kops handles infrastructure (as opposed to k8s itself), but I think it would be nice if I could use kops to deploy a k8s cluster onto any infrastructure (set up with other tools) by giving it all the required information:

  • bastion connection: public ip, ssh credentials
  • private subnet, public subnet

For example, if kops had separate steps for infrastructure setup and k8s cluster creation, it would be easier to test this.

Also, I think it would be better to start with a single-AZ setup (no cross-AZ cluster / no HA on top of multiple AZs).

@chrislovecnm
Contributor

chrislovecnm commented Oct 6, 2016

@tazjin one of the interesting things that @justinsb just brought up was that this may need to use overlay networking for HA to function. We would need a bastion connection in each AZ otherwise, which seems a tad complicated. Thoughts?

@chulkilee I understand that it would be simpler, but non-HA is not usable for us in production. We also probably need to address overlay networking.

@chulkilee

@chrislovecnm kops should support HA eventually. What I'm saying is that for this new feature, kops could support the simple use case first. I don't know what the most used deployment scenario for HA is (e.g. HA over multiple AZs, HA in a single AZ, or leveraging federation), or which options kops supports.

@MrTrustor
Contributor

@chulkilee The advertised goal of kops is to set up a production cluster. IMHO, a cluster cannot be production-ready if it is not HA. On AWS, if every service is not hosted concurrently on at least 2 AZs, you don't have HA.

@chrislovecnm @justinsb I'm not sure I understand why the overlay networking would be mandatory for HA to function: routing between AZs and subnets, in a given VPC, is pretty transparent in AWS.

@jkemp101

jkemp101 commented Oct 6, 2016

I drew a picture of what I am currently testing. It may help with the discussion.

  • Supports multiple independent clusters in same VPC. Diagram shows two clusters A & B. Each cluster gets:
    • 3 Private Subnets - Internal ELBs end up in these subnets
    • 3 Public Subnets - These are required to support Internet facing ELBs.
  • 3 Utility subnets that hold the NAT Gateways. These are shared by all private subnets in the VPC. Will also hold 1+ Bastions
  • Right now I manually create the public k8s subnets and add the KubernetesCluster tag after the cluster is brought up with kops update cluster.
  • I also manually change the IGW route in the K8s route tables to point to NGW A.
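The manual route change in the last bullet amounts to repointing the default route of the kops-created table. A minimal sketch, modelling a route table as a plain `{destination_cidr: target}` dict (a simplification, not an AWS API shape); the IDs are placeholders.

```python
DEFAULT_ROUTE = "0.0.0.0/0"


def replace_default_route(routes, new_target):
    """Return a copy of `routes` whose default route points at `new_target`."""
    if DEFAULT_ROUTE not in routes:
        raise KeyError("route table has no default route to replace")
    updated = dict(routes)  # leave the caller's table untouched
    updated[DEFAULT_ROUTE] = new_target
    return updated


before = {"10.10.0.0/16": "local", DEFAULT_ROUTE: "igw-123"}
after = replace_default_route(before, "nat-abc")
```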

Everything seems to be working so far but there is one HA deficiency. Internally sourced outbound traffic is routed through a single NGW, NGW A in the diagram. To fix this we would need to:

  • Replace the single K8s Route Table Cluster A/B route table with 3 independent tables, one for each subnet. Each table should point to the NGW in the matching zone.
  • Route changes would need to be updated in all three tables instead of just the one.
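The second bullet, applying every node route to all three tables, can be sketched as a diff between the desired node routes and what each table currently holds. Route tables are modelled as `{destination_cidr: target}` dicts; the table IDs, pod CIDRs and instance IDs in the usage are illustrative only.

```python
def missing_routes(route_tables, node_routes):
    """Return {table_id: {pod_cidr: instance_id}} for routes not yet present.

    route_tables: {table_id: {destination_cidr: target}}
    node_routes:  {pod_cidr: instance_id} wanted in every table
    """
    missing = {}
    for table_id, routes in route_tables.items():
        absent = {
            cidr: instance
            for cidr, instance in node_routes.items()
            if routes.get(cidr) != instance
        }
        if absent:
            missing[table_id] = absent
    return missing


tables = {
    "rtb-a": {"0.0.0.0/0": "nat-a", "100.96.1.0/24": "i-1"},
    "rtb-b": {"0.0.0.0/0": "nat-b"},
}
todo = missing_routes(tables, {"100.96.1.0/24": "i-1"})
```

Here only `rtb-b` needs an update, because `rtb-a` already carries the node's route.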

I tried creating a second route table with the right tag. Unfortunately it just causes all route table updates to stop. I was hoping it would magically get updated.

[Diagram: k8s-priv-pub-network]

@chrislovecnm
Contributor

@jkemp101 you mention that public-facing nodes are required for a public ELB. Can you not force an ELB to use a public IP, and connect it to a node that is in private IP space?

@jkemp101

jkemp101 commented Oct 7, 2016

@chrislovecnm That is correct. AWS will stop you because it detects that the route table for a private subnet does not have an Internet Gateway set as the default route. This is the error message in the AWS console:
"This is an Internet-facing ELB, but there is no Internet Gateway attached to the subnet you have just selected: subnet-0653921d." When you create an ELB you first connect it to a subnet and then assign it to instances.

And k8s is also smart enough to know it needs a public subnet to create the ELB, so it will refuse if it can't find one. But @justinsb suggested labeling manually created subnets after the kops run. That worked fine. I can create services in k8s and it attaches Internet ELBs to the 3 public subnets (which I created and labelled manually) and Internal ELBs to the 3 private subnets (which kops created automatically).
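A simplified model of the selection behaviour described here, not the actual Kubernetes cloud provider code: a subnet counts as public when its default route targets an Internet Gateway, and only subnets carrying the cluster's `KubernetesCluster` tag are considered. The dict layout for subnets is an assumption made for the sketch.

```python
def is_public(routes):
    """True when the 0.0.0.0/0 route targets an Internet Gateway (igw-...)."""
    return routes.get("0.0.0.0/0", "").startswith("igw-")


def elb_subnets(subnets, cluster, internet_facing):
    """IDs of subnets tagged for `cluster` that suit the requested ELB type."""
    return sorted(
        subnet["id"]
        for subnet in subnets
        if subnet["tags"].get("KubernetesCluster") == cluster
        and is_public(subnet["routes"]) == internet_facing
    )
```

An empty result for `internet_facing=True` corresponds to the refusal mentioned above: no tagged public subnet, so no Internet-facing ELB.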

@MrTrustor
Contributor

@jkemp101 Nice work. 2 questions:

  • why do you need so many public subnets? Can't you use only one per AZ in which you place the ELBs, the NGW and the bastions?
  • According to your diagram, all the private subnets for a given cluster use the same route table and the same NGW. If the AZ hosting this NGW becomes unavailable, you will lose outbound connectivity entirely. Each subnet should use an NGW in its own AZ.

@jkemp101

jkemp101 commented Oct 7, 2016

@MrTrustor Hope this clarifies. Keep the questions coming.

  • One of my requirements is to be able to support multiple k8s clusters per VPC and have them interoperate with existing infrastructure that will remain out of the k8s cluster. So:
    • I need a set of public subnets for each cluster because they currently cannot be shared between clusters. Each set gets either a cluster-a.example.com or cluster-b.example.com value for the KubernetesCluster tag. This allows them to be found when the public ELB needs to be created. BTW: I show 3 ELBs in the diagram but it is really 1 ELB attached to 3 subnets for HA.
    • I could put a NGW in each of k8s public subnets but that would increase the number and cost of NGWs needed. For two clusters I would need 6 NGWs instead of 3 to meet my HA requirements. I also share these NGWs in the Util subnets with non-k8s infrastructure.
    • @justinsb had suggested an additional enhancement to support a new tag on subnets like k8s.io/cluster/cluster-a.example.com: use-for-public-elbs. This would allow a single set of public subnets to be shared across multiple k8s clusters.
  • The NGW HA deficiency is the one issue I don't think can be solved without code changes. Below is an updated diagram that breaks the single route table into three so that each zone can use its own NGW. I removed Cluster A from the diagram to make room but I would still need this to scale to multiple clusters per VPC.

[Diagram: k8s-priv-pub-network-fixed]
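The shared-subnet tag suggested in this comment could be checked roughly as follows. This is a sketch of the proposal as quoted in the thread, not of any released Kubernetes behaviour, and the function name is hypothetical.

```python
def usable_for_public_elb(tags, cluster):
    """True if a subnet's tags allow `cluster` to place public ELBs in it."""
    # Existing behaviour described in the thread: one owning cluster per subnet.
    if tags.get("KubernetesCluster") == cluster:
        return True
    # Proposed: one tag per cluster, so several clusters can share a subnet.
    return tags.get(f"k8s.io/cluster/{cluster}") == "use-for-public-elbs"


shared_subnet_tags = {
    "k8s.io/cluster/cluster-a.example.com": "use-for-public-elbs",
    "k8s.io/cluster/cluster-b.example.com": "use-for-public-elbs",
}
```

Because the cluster name moves into the tag key, one subnet can list any number of clusters, which a single `KubernetesCluster` value cannot express.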

@chrislovecnm
Contributor

@jkemp101 much thanks for the help btw, I think if you are at kubecon, I owe you libation. Anyways...

Have you been able to do this with kops in its current state or how are you doing this? You mentioned a problem with HA. Please elaborate.

@jkemp101

jkemp101 commented Oct 7, 2016

@chrislovecnm I am currently running/testing the configuration depicted in my first diagram. Everything is working well so far. The only HA issue at the moment is that the clusters rely on a single NGW for outbound Internet connections. So if the NGW zone goes down the clusters can no longer do outbound connections to the Internet. Inbound through the Public ELB should still work fine.

I've automated the cluster build and delete process so a single command brings up the cluster and applies all modifications. All settings (public subnet IDs, NGW IDs, IGW ID, VPC ID, zones, etc.) are in a custom yaml file. Here are the 11 steps for a cluster create.

  1. Load my yaml configuration file
  2. Confirm state folder does not exist in S3 for this cluster (Paranoid step 1)
  3. Check whether any instances are already running with this cluster tag (Paranoid step 2)
  4. Run kops create cluster with appropriate flags.
  5. Fix up state files in S3 (e.g., add my tags, change IP subnets, etc.).
  6. Run kops update cluster
  7. Fix up infrastructure by finding the route table created by kops and replacing the route to the IGW with a default route pointing to my NGW (IDs for both are set in my config file). I have plenty of time as the ASGs start bringing up instances. This is technically the step that turns the cluster into a private configuration.
  8. Using pykube, poll the cluster and wait for it to return the right number of nodes. I know how many make a complete cluster based on settings in my configuration file.
  9. Deploy the Dashboard addon
  10. Apply a more restrictive master and node IAM policy.
  11. Label the manually created public subnets with this cluster's name. The subnet IDs are configured in my yaml configuration file. I never delete the public subnets; I just untag them before deleting a cluster and retag them after creating one.
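Step 8 (polling until the expected number of nodes has registered) might look roughly like this. The `count_ready` callable is a stand-in for whatever queries the cluster (pykube in the script described above); its name, the timeout and the interval are assumptions, not taken from that script.

```python
import time


def wait_for_nodes(count_ready, expected, timeout=600.0, interval=10.0,
                   clock=time.monotonic, sleep=time.sleep):
    """Poll `count_ready()` until it reports at least `expected` nodes.

    Returns True once enough nodes are seen, False if `timeout` seconds pass.
    `clock` and `sleep` are injectable so the loop can be tested without waiting.
    """
    deadline = clock() + timeout
    while True:
        if count_ready() >= expected:
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

For a cluster of 3 masters and 3 nodes, `expected` would be 6. Returning False instead of raising lets the calling script decide whether a timeout should abort the remaining steps.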

This script brings up a complete cluster as depicted in the diagram in about 8 minutes with a cluster of 3 masters/3 nodes.

@chrislovecnm
Contributor

I am working on testing weave with kops, and once that is done I would like to see how to incorporate this using an external networking provider. With an external networking provider, I don't think K8s will have to manage the three route tables. I will probably set up a hangout with you to determine where the product gaps are specifically.

@chulkilee

@jkemp101 glad to hear about the progress, but shouldn't each cluster have its own NAT, so that clusters are more isolated from each other?

@chrislovecnm
Contributor

@jkemp101 do you want to set up a hangout to review this? I have work items scheduled to knock this out, and would like to get the requirements clear. clove at datapipe.com is a good email for me.

@rbtcollins
Contributor

Hi, pointed here from kubernetes/kubernetes#34430

Firstly, I want to second #428 (comment) - nat gw per AZ + subnets in that AZ have their routes (default or otherwise) pointed at it.

Secondly, I have a couple of variations on the stock use case.

The first one is that I really want to be able to reuse my NAT gateways' EIPs. I don't particularly care about the NAT gateway or subnet, though reusing those could be a minor cost saving. Reusing the EIPs avoids 10-working-day lead times on some APIs I need, which apply source IP ACLs :).

The second one is that I don't particularly care about private vs public subnets - as long as I can direct traffic for those APIs out via a nat gw with a long lived IP address, I'm happy :) - which may mean that my use case should be a separate bug, but I was pointed here :P.

@justinsb asked about DHCP and routing - I don't think that's feasible in AWS, since their DHCP servers don't support the options needed - https://ercpe.de/blog/advanced-dhcp-options-pushing-static-routes-to-clients - covers the two options, but neither is supported by DHCP Option Set objects in the AWS VPC API.

That said, since a NAT GW is as resilient as an AZ, treat the combination of NAT GW + private subnet as a scaling unit - to run in three AZs, run three NAT GWs and three private subnets, and each subnet will have one and only one NAT GW route.

@ajohnstone
Contributor

@jkemp101 possible to share what you've done so far in a gist/git repo? Sounds like a custom bit of python wrapped over kops.

@jkemp101

@ajohnstone Yup. I'll share a git repo shortly with the Python script I'm using. This weekend at the latest.

@jkemp101

@ajohnstone Here it is https://github.com/closeio/devops/tree/master/scripts/k8s. Let me know if you have any questions.

@chrislovecnm chrislovecnm added this to the backlog milestone Oct 15, 2016
@erutherford

erutherford commented Oct 17, 2016

Allowing for additional routes or existing network infrastructure would be great. We're using VPC Peering for environment interconnectivity. This is also how we're currently accessing our kubernetes API via our VPN client. I'm also using a single TCP Load Balancer in front of our HA Kubernetes Backplane to alleviate any DNS stickiness.

krisnova added a commit to DualSpark/kops that referenced this issue Oct 19, 2016
- Publishing documentation to grow with the PR
- Defining command line flags
@krisnova
Contributor

Code changes for the PR coming soon #694

@starkers

Thanks Kris, I'll test soon also

@krisnova
Contributor

@starkers - It's still a WIP - let me hammer on it this weekend a bit more before testing. Was just adding the pointer last night, some people were asking about it.

krisnova added a commit to DualSpark/kops that referenced this issue Oct 22, 2016
- Publishing documentation to grow with the PR
- Defining command line flags
@chrislovecnm
Contributor

A little bird is telling me that we may have our first demo on Friday .. no promises, but @kris-nova is kicking some butt!!

@druidsbane

@kris-nova @chrislovecnm How is that demo going? :) This would be super-useful for creating a simple kubernetes cluster without having to modify the kops terraform output to create the private subnets. Also, hoping for user-supplied security-groups on instance groups soon as well!

@hsyed

hsyed commented Nov 1, 2016

I have put together almost the exact same architecture as listed above. We are trying to get this to a state for production usage. I generate the VPC in Terraform and then graft the output of kops onto the VPC output.

Weavenet doesn't stay stable for very long. When I begin populating services into the cluster, it ends with all sorts of networking weirdness I can't diagnose (kubedns becoming blind, certain nodes having 3 second network latency, etc.). Flannel / Calico doesn't work either (out of the box).

I'm happy to battle test the changes. Is there anything I could do to get the egress route tables populating before Friday?

@chrislovecnm
Contributor

chrislovecnm commented Nov 1, 2016

@hsyed need more details on weave. Can you provide details on an open cni issue?

Work still in progress with private networking

@jschneiderhan

I'm very excited to see all of the progress on this issue! Thanks for all the hard work!

I have been running a cluster in AWS using a setup similar to the "private" topology mentioned in #694, and pretty much a spot-on match to the diagram @jkemp101 created above, where each private subnet has a public "utility" subnet with its own NAT gateway and corresponding route tables which send 0.0.0.0/0 through the AZ's NAT. It all works fine except for one thing: kubernetes stops updating routes because multiple route tables are found (@jkemp101 also mentioned seeing this behavior). I've had to manually add the routes to all the route tables every time my set of nodes changes.

It looks as though kubernetes itself does not currently support multiple route tables (https://github.com/kubernetes/kubernetes/blob/master/pkg/cloudprovider/providers/aws/aws_routes.go#L45). I could definitely be missing something (I'm new to Go and so my investigation speed is slow), but it seems to me that having kubernetes support multiple route tables would be a prerequisite to supporting multiple private subnets with dedicated NAT gateways, right? I tried searching kubernetes for an existing issue about supporting multiple route tables, but can't find one (perhaps I'm not using the correct keywords).

@hsyed

hsyed commented Nov 2, 2016

@jschneiderhan I do not know what approach is being taken by the work being done by @kris-nova. I assumed it was updating multiple route tables.

I had a realisation that there is an alternative architecture that could work. We would need a second network interface on each node. This would be 9 subnets per cluster: 3 subnets for kubenet connected to the route table it manages, 3 additional subnets (NAT routing subnets) for the nodes, where each subnet is connected to a route table which is connected to a shared NAT gateway in its AZ, and finally 3 public subnets for ELBs.

The NAT routing subnet would mean dynamically attaching elastic network interfaces as auto scaling groups do not support these.

@jschneiderhan

@kris-nova I'd be happy to add an issue over in the kubernetes project, but before I do I could use another pair of eyes to make sure I'm not just being stupid:

I think for all of this (really awesome) work to function in AWS without an overlay network, an improvement needs to be made to kubernetes itself. If a subnet-per-AZ is being created, we will end up with multiple VPC route tables that need to be updated with the CIDR ranges assigned to each node. When kubernetes goes to create the Route, it's going to find multiple tables with the cluster name tag and return an error on this line https://github.com/kubernetes/kubernetes/blob/master/pkg/cloudprovider/providers/aws/aws_routes.go#L45. At least that's what I'm seeing with multiple manually created route tables. It just logs that line multiple times and never updates the route.

So I think this PR does everything right, but in order for kubernetes to do its thing properly it needs to be improved to iterate over all the route tables and create a route for each one. Again, if that makes sense to you I'm happy to create a kubernetes issue and give a shot at an implementation, but my confidence is pretty low since I'm new to just about every technology involved here :).
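The difference between the reported behaviour and the proposed one can be sketched in a few lines. This models what is described in this comment and is not a transcription of aws_routes.go; the function names and the dict layout for route tables are hypothetical.

```python
def find_route_tables(tables, cluster):
    """All route tables carrying the cluster's KubernetesCluster tag."""
    return [t for t in tables if t["tags"].get("KubernetesCluster") == cluster]


def create_route_single(tables, cluster, cidr, instance_id):
    """Reported behaviour: give up unless exactly one tagged table is found."""
    matches = find_route_tables(tables, cluster)
    if len(matches) != 1:
        raise RuntimeError(f"found {len(matches)} route tables for {cluster}, expected 1")
    matches[0]["routes"][cidr] = instance_id


def create_route_all(tables, cluster, cidr, instance_id):
    """Proposed behaviour: add the node route to every tagged table."""
    matches = find_route_tables(tables, cluster)
    if not matches:
        raise RuntimeError(f"no route tables found for {cluster}")
    for table in matches:
        table["routes"][cidr] = instance_id
    return len(matches)
```

With one tagged table per AZ, `create_route_single` fails on every node change, which matches the repeated log line, while `create_route_all` keeps the per-AZ tables in step.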

@chrislovecnm
Contributor

@justinsb thoughts about @jschneiderhan's comment? cc: @kubernetes/sig-network ~ can someone give @jschneiderhan any guidance?

@jschneiderhan we are only initially going to be running with CNI in private mode btw.

@krisnova
Contributor

krisnova commented Nov 3, 2016

If you are interested in testing the current branch you are welcome to run it. More information can be found in the PR #694

@krisnova
Contributor

krisnova commented Nov 9, 2016

Closed with #694
