Allow workers to sit inside private subnets within an existing VPC #44

cknowles · 2016-11-08T07:02:36Z

Use case proposal:

Use an existing VPC, e.g. has an existing RDS deployment inside
Private subnets used for at least workers, perhaps controllers too if we use a bastion (first use case we want public controllers so we can SSH into the cluster)
Use existing NAT Gateway per AZ for HA with the appropriate route tables used per worker AZ

It was part of what coreos/coreos-kubernetes#716 was intended to cover. Pre-filing this issue so we can move discussion out of #35.

cknowles · 2016-11-08T07:08:24Z

@pieterlange @mumoshu the original pull request does two things:

Allows existing route tables to be associated with kube-aws created subnets
Separates out subnets of the controller(s) so their route tables can be different

The reason using the existing routeTableId is less useable for this case is because unless we are very careful with the route table we defeat the HA aspects as we need to route traffic on very specific CIDRs to the right AZs.

cknowles · 2016-11-08T07:09:13Z

I have a feeling we should separate out the various use cases as I've kind of mixed them together since the change was relatively small to cope with all of this.

pieterlange · 2016-11-08T07:20:49Z

The reason using the existing routeTableId is less useable for this case is because unless we are very careful with the route table we defeat the HA aspects as we need to route traffic on very specific CIDRs to the right AZs.

Are you referring to the route to 0.0.0.0/0 via NAT gateway here? As the NAT gateway can only live in on AZ at a time, there is an inherent dependency here. There is more than one way to skin this cat.

Maintain multiple route tables, one per AZ (your suggestion)
Keep using one routetable and manually / lamba failover the NAT gateway to a different NAT gateway on AZ downtime

I believe that ultimately this problem can be solved without constraining operator choices by implementing node pools. The controllers would be in a different pool. Workers in each AZ could be in different pools, one pool per AZ. Or they can be in one pool and use one routetable.

cknowles · 2016-11-08T07:52:36Z

Yeah, our private subnet route tables look something like:

10.0.0.0/16 -> local
0.0.0.0/0 - > nat-UNIQUEID

Having looked at it again just now I think there is no way to switch traffic based on the source CIDR in a single route table. So in other words, we can't use a single route table while still routing the worker traffic from each individual AZ to the appropriate NAT Gateway in the same AZ.

I think generally we'd want to have a route table per AZ to mirror a typical multi-AZ NAT Gateway setup.

The original pull was a simple version on the basis of a whole AZ going down, however if we are talking about service level outages per AZ then if a single NAT Gateway is down I'm not sure of how the failover could/should work. Probably something along the lines of what you mention with lambda/manual.

pieterlange · 2016-11-08T08:36:15Z

if a single NAT Gateway is down

The NAT gateways themselves are HA (or so amazon says). The issue currently is if the gateway lives in AZ-a and that entire AZ goes down, AZ-b and AZ-c lose connectivity to the internet until a new NAT gateway is booted in one of those AZ's and the route table is updated.

I think generally we'd want to have a route table per AZ to mirror a typical multi-AZ NAT Gateway setup.

👍

cknowles · 2016-11-08T08:38:53Z

Yep agreed. From the docs:
NAT gateways in each Availability Zone are implemented with redundancy. Create a NAT gateway in each Availability Zone to ensure zone-independent architecture.

Do you have more details on the node pools? I'd like to have a look over it.

pieterlange · 2016-11-08T09:11:04Z

@mumoshu wrote something about this in #46 just now :)

mumoshu · 2016-11-17T04:17:43Z

@c-knowles I've announced my POC for node pools in #46 (comment) yesterday!

Could it be something that can be a foundation to address this issue?

With the poc, we now have separate subnet(s) for each node pool(=a set of worker nodes).
Is that enough or do we need ways to differentiate route table per subnet/AZ as you've discussed so far?

P.S. I don't recommend reading each commit in the node-pools branch for now, mainly because those are really dirty and not ready for reviews 😆

cknowles · 2016-11-17T07:58:42Z

Cool, thanks. I was going to review it soon for this use case. Prior to that, what I would say is the current setup we are using from code at coreos/coreos-kubernetes#716 uses a different subnet per AZ because each AZ has it's own NAT Gateway. I think control over different AZs within a tool like kube-aws will usually come down wanting control over what gets set per AZs (with sensible defaults).

cknowles · 2016-12-05T14:29:54Z

@mumoshu After reviewing node pools further, I think this use case should be a matter of:

For workers, specify the appropriate routeTableId in each AZ via a node pool for each AZ which allows us to then use the autoscaler as per Feature: Node pools #46.
For controllers, specify the appropriate routeTableId depending on whether public/private controllers are needed.

I think that's it. I'm going to try it out tomorrow. My only reservation would be that it's a quite complex setup for what I think is a very common scenario. I also could not see a way to disable the new separate etcd node creation, I don't really want/need one in the dev cluster but I guess that's not supported any more?

mumoshu · 2016-12-06T04:32:27Z

@c-knowles Really looking forward to your results!

Regarding the separate etcd nodes: Yes, collocating etcd and control plane in a single node is not supported.

I also believe that the uniform architectures for the dev and the prod envs would be good in several points including e.g. dev-prod parity, less user confusion/support, less code complexity.
I'm a bit afraid of being obtrusive here but basically,I recommend adapting to etcd node separation, regardless of dev/prod.

If you're suffering from cost, I guess you can use smaller instance types for etcd nodes in a dev env and then wasted cost would be minimum.

cknowles · 2016-12-06T05:35:41Z

I tend to agree to a certain level. Our dev cluster is where we development cluster changes/upgrades and we run automated tests against it for our apps and deployment mechanisms. We may even have a few of those at any time. We actually have a staging cluster after dev as well. The cost currently went up 50% minimum for any of those clusters (min was 2 nodes, now it's 3). I'd like to support external etcd as well such as compose.io.

Updates/feedback so far:

The model above only really works if we can create the initial stack with zero workers. We want the initial stack to be only controllers with a chosen route table. Currently not possible due to hard coded min/max as count-1 and count+1:
CREATE_FAILED AWS::AutoScaling::AutoScalingGroup AutoScaleWorker Min bound, -1, must be non-negative Allow more control over ASG definition #142
When creating a node-pool, it may be nice to also copy across the values for worker config like workerInstanceType, workerRootVolumeSize etc from the main cluster.yaml. stackTags also. Easier to configure, easier to version-control, more tightly integrated node pools #315
Node pool default cluster.yaml differences from main cluster.yaml - useCalico is commented out for main but explicit in node pool (my preference is go with all explicit defaults in the cluster.yaml and remove defaults from the go code). Same with amiId, node pool default has empty YAML property. Easier to configure, easier to version-control, more tightly integrated node pools #315
Given the auto scaler can only work in a single AZ, should we allow node pools to be multi AZ? Currently the node pool cluster.yaml contains subnets commented out (I didn't touch it though). Setup would be more complex for some use cases but it could cut what we need to support if we provide standard examples in the docs. Easier to configure, easier to version-control, more tightly integrated node pools #315 covers making node pools generally easier to configure, multi-AZ node pools are still allowed but perhaps there are uses cases for that too.
Some inconsistency in command naming. kube-aws says node-poolsas does init output but render output says nodepool. It also mentions --pool-name instead of --node-pool-name and actually we cannot validate yet? Correct node pool command inconsistencies #174
How are we going to manage upgrades to the main stack with the node pools in place? I'm thinking about if I have to go and rename the stacks each time or whether we can prefix the node pool stack names with the main cluster name when we render/up. Node Pool Nested Stack Name #187
How to update node pool stacks? The Stack Policy is set to {} so updates are currently disallowed. Make kube-aws node-pools update not to fail #140
Definitely need more instructions for node pools! Happy to contribute to this based on my usage. Documentation for v0.9.2 #162
Prefix node pool subnets with node pool name - Name the node pool subnets after the node pool #129

mumoshu · 2016-12-06T06:38:56Z

@c-knowles 50% increase is serious. I'd definitely like to make it more cost effective in another ways.
I know it would sound a bit adventurous but how about utilizing spot fleets not only for worker nodes but also for controller nodes?

cknowles · 2016-12-06T07:47:50Z

Could do. Is there any simple way to provide at least one controller? e.g. some ASG/fleet automated scaling rule+action like "if all spots are going to shutdown, increment on demand ASG from min 0 to min 1". Or alternatively leave that for now and assume at least one spot will come back?

mumoshu · 2016-12-06T08:01:19Z

@c-knowles AFAIK there's no way to coordinate an ASG and a spot fleet like that.

However, I guess setting very low TargetCapacity(maybe 1) for a spot fleet and enabling the "diversified" strategy to diversify your instance types will work like that.

For example, if you've chosen 1 unit = 3.75GB of memory:

worker:
  spotFleet:
    targetCapacity: 1
    launchSpecifications:
    - weightedCapacity: 1
       instanceType: m3.medium
    - weightedCapacity: 1
       instanceType: c3.large
    - weightedCapacity: 1
       instanceType: c4.large
    - weightedCapacity: 2
       instanceType: m3.large
    - weightedCapacity: 2
       instanceType: m4.large

would ideally bring up only 1 m3.medium in peaceful days and if and only if the spot fleet loses to bids, a larger instance would be brought up.

c3.large, c4.large, m3.large, m4.large is approximately 2x more expensive than m3.medium.
However, they're spot instances hence discounts are 80~90%.
High availability plus reduced cost is achieved.

Not tested though 😉

cknowles · 2016-12-06T08:10:06Z

That may be ok to start with. Possible future changes could be termination notices linked to an ASG action/lambda. BTW, if you are available in the #sig-aws k8s Slack it may be useful to chat so we can keep it out of the issues.

mumoshu · 2016-12-06T08:44:28Z

@c-knowles I guess in such case my kube-spot-termination-notice-handler in combination with cluster-autoscaler would fit.

My kube-spot-termination-notice-handler basically kubectl drain the node almost immediately after it recognizes a termination notice(~2min before the actual termination).

kubectl drain would end up making several pending pods over the cluster.
cluster-autoscaler notices pending pods and scales up an ASG.

However, I'm not yet sure if cluster-autoscaler supports scaling down an ASG to zero nodes, which I believe required for your use-case.

cknowles · 2016-12-06T11:09:41Z

I have good news! It's possible to fulfil this use case with only minor modifications to kube-aws using node pools. I used one of our shared VPCs and managed to create the main cluster for just the controller(s) in the public subnets while putting a node pool in a private subnet of one AZ. I also double checked rolling out a k8s Service with ELB to double check it attached ok to the workers and that works too with a basic pod responding to HTTP calls.

Assuming adding a couple more node pools works, we only need minor mods for this.

I have a few feedback items which I will start to put some pull requests in for. I may need some initial feedback on those items to start on them.

cknowles · 2016-12-07T02:35:39Z

@mumoshu on my feedback above, I want to check a few things before doing the pulls if you could provide a very quick comment on each:

Do you know why worker count min/max are hard coded to count-1 and count+1? I'd like to change to explicitly support min/max config and deprecate the existing config. We can keep supporting it for at least one release if we do that and just set min/max to the count (which was what it was previously before 1f37a9b)
I will copy some more values from the main stack to the node pool like the tags and worker instance type.
I will align the main cluster.yaml default to the node pool one (useCalico explicitly set to false)
I'm not sure whether we should allow multi AZ node pools, thoughts?
Is there a plan to add node pools validate? If so I will leave the inconsistencies in command naming for that task otherwise we should remove the instructions from the render output which state a validation command is available.
Where can I add instructions for node pools? Is it ok inside Documentation?

mumoshu · 2016-12-07T03:22:55Z

@c-knowles First of all, thanks for your feedbacks!

1: No. Anyways, It would be nice to make them configurable in cluster.yaml if and only if we keep backward-compatiblity in stack-template.json i.e. compute default values for {Worker,Controller}{Min,Max}Count according to {Worker,Controller}Count if they're omitted. So 👍

2: I'm not yet sure it copying instance type is good. I guess you should explicitly select appropriate instance type regardless of what is selected in the main cluster.

Copying stack tags would be basically good as I assume stack tags are used to identify whole the cluster resources not only main cluster or node pool.
However, I guess you should'nt put the exactly same set of tags to both main cluster and node pools. Doing that results in inability to identify only part of the whole cluster e.g. you can't identify only the resources from node pool if all the stacks are tagged identically.

Not 👍 for now but definitely like to discuss more!

3: I'd like to keep cluster.yaml simple. Only keys for required configuration with no default values would be commented out in cluster.yaml. I'd rather like to fix node pools' cluster.yaml so that it properly comments out useCalico: if useCalico is false in main cluster's cluster.yaml.

Basically 👍 but I'd rather like to fix node pools' cluster.yaml instead of main cluster's.

4: Indeed. For now, I guess we should at least error out when different AZs are specified in multiple subnets, instead of just removing all the subnets: from node pools' cluster.yaml(just for minimum delta from current code hence easier pull request plus extra period of time to receive feedbacks from users if subnets itself are required or not.

5: Yes! I'd appreciate it if you could add it but I'm willing to do myself.

For example, kube-aws node-pools update could be added like this.

6: Thanks for your attention to documentation! I'd greatly appreciate it.

Adding the brand-new documentation named like kubernetes-on-aws-node-pools.md and linking it from existing kubernetes-on-aws.md which is the entrance of CoreOS's doc would be nice.

mumoshu · 2016-12-08T08:23:59Z

@c-knowles

How to update node pool stacks? The Stack Policy is set to {} so updates are currently disallowed. #130?

Thanks for spotting it! Fixed via #140

cknowles · 2016-12-08T12:05:42Z

Submitted #142 for the ASG definitions. Getting the defaults to work in golang is more tricky than I'd hoped but it works, will see what you think.

mumoshu · 2017-02-16T05:46:18Z

@c-knowles I believe that almost all the TODOs related to this issue are addressed via recently merged PRs. Can we close this now?

mumoshu · 2017-02-24T07:20:30Z

ping @c-knowles

cknowles · 2017-02-26T11:31:33Z

@mumoshu I'm happy this has been covered off now, the remaining items were mostly in #315. For anyone who ends up here, I've ended up using the additions in #315 to have kube-aws manage the subnets and route tables while referencing an existing set of NAT Gateways (node pools set to private).

cknowles mentioned this issue Nov 8, 2016

Allow ECR pull from controller IAM role #35

Merged

cknowles changed the title ~~Allow workers to sit inside a private subnet with a NAT Gateway per AZ~~ Allow workers to sit inside private subnets within an existing VPC Nov 8, 2016

mumoshu added the help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. label Nov 9, 2016

mumoshu mentioned this issue Dec 6, 2016

KubernetesCluster tag is not applied to existing objects #135

Closed

This was referenced Dec 13, 2016

Create etcd and workers in private subnets, controllers in public subnet #152

Closed

Correct node pool command inconsistencies #174

Merged

cknowles mentioned this issue Jan 15, 2017

Support the use-case to manage multiple kube-aws clusters' configurations optionally inheriting organization-specific customizations with a version control system like Git #238

Closed

cknowles mentioned this issue Jan 30, 2017

Add support for customization of network topologies #284

Merged

cknowles closed this as completed Feb 26, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Allow workers to sit inside private subnets within an existing VPC #44

Allow workers to sit inside private subnets within an existing VPC #44

cknowles commented Nov 8, 2016 •

edited

Loading

cknowles commented Nov 8, 2016

cknowles commented Nov 8, 2016

pieterlange commented Nov 8, 2016 •

edited

Loading

cknowles commented Nov 8, 2016

pieterlange commented Nov 8, 2016 •

edited

Loading

cknowles commented Nov 8, 2016 •

edited

Loading

pieterlange commented Nov 8, 2016 •

edited

Loading

mumoshu commented Nov 17, 2016 •

edited

Loading

cknowles commented Nov 17, 2016

cknowles commented Dec 5, 2016

mumoshu commented Dec 6, 2016

cknowles commented Dec 6, 2016 •

edited

Loading

mumoshu commented Dec 6, 2016

cknowles commented Dec 6, 2016 •

edited

Loading

mumoshu commented Dec 6, 2016

cknowles commented Dec 6, 2016

mumoshu commented Dec 6, 2016

cknowles commented Dec 6, 2016 •

edited

Loading

cknowles commented Dec 7, 2016

mumoshu commented Dec 7, 2016 •

edited

Loading

mumoshu commented Dec 8, 2016

cknowles commented Dec 8, 2016

mumoshu commented Feb 16, 2017

mumoshu commented Feb 24, 2017

cknowles commented Feb 26, 2017

Allow workers to sit inside private subnets within an existing VPC #44

Allow workers to sit inside private subnets within an existing VPC #44

Comments

cknowles commented Nov 8, 2016 • edited Loading

cknowles commented Nov 8, 2016

cknowles commented Nov 8, 2016

pieterlange commented Nov 8, 2016 • edited Loading

cknowles commented Nov 8, 2016

pieterlange commented Nov 8, 2016 • edited Loading

cknowles commented Nov 8, 2016 • edited Loading

pieterlange commented Nov 8, 2016 • edited Loading

mumoshu commented Nov 17, 2016 • edited Loading

cknowles commented Nov 17, 2016

cknowles commented Dec 5, 2016

mumoshu commented Dec 6, 2016

cknowles commented Dec 6, 2016 • edited Loading

mumoshu commented Dec 6, 2016

cknowles commented Dec 6, 2016 • edited Loading

mumoshu commented Dec 6, 2016

cknowles commented Dec 6, 2016

mumoshu commented Dec 6, 2016

cknowles commented Dec 6, 2016 • edited Loading

cknowles commented Dec 7, 2016

mumoshu commented Dec 7, 2016 • edited Loading

mumoshu commented Dec 8, 2016

cknowles commented Dec 8, 2016

mumoshu commented Feb 16, 2017

mumoshu commented Feb 24, 2017

cknowles commented Feb 26, 2017

cknowles commented Nov 8, 2016 •

edited

Loading

pieterlange commented Nov 8, 2016 •

edited

Loading

pieterlange commented Nov 8, 2016 •

edited

Loading

cknowles commented Nov 8, 2016 •

edited

Loading

pieterlange commented Nov 8, 2016 •

edited

Loading

mumoshu commented Nov 17, 2016 •

edited

Loading

cknowles commented Dec 6, 2016 •

edited

Loading

cknowles commented Dec 6, 2016 •

edited

Loading

cknowles commented Dec 6, 2016 •

edited

Loading

mumoshu commented Dec 7, 2016 •

edited

Loading