Production Quality Deployment #340
All good points. Very happy to see you working on making it easier to deploy to existing VPCs! I'm currently using https://github.com/MonsantoCo/etcd-aws-cluster/ to bootstrap a dedicated etcd cluster (discovery happens by specifying the ASG for the etcd cluster and assigning appropriate IAM describe roles). I'm not too sure about automatically provisioning an AWS Elasticsearch cluster; the AWS native cluster is stuck on a very old ES version. Maybe this'll become a whole lot easier once Kubernetes EBS support matures a bit, and we could just host it in the provisioned kube cluster.
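As background on the approach that script takes, discovery amounts to asking the etcd ASG for its member instances and assembling the peer list from their private IPs. Below is a minimal shell sketch of that idea, not the actual MonsantoCo script; the ASG name, etcd port, and peer naming are illustrative, and it assumes the instance profile grants `autoscaling:DescribeAutoScalingGroups` and `ec2:DescribeInstances`.

```bash
#!/usr/bin/env bash
# Illustrative sketch of ASG-based etcd peer discovery (not the MonsantoCo script).
set -euo pipefail

ASG_NAME="etcd-cluster"   # hypothetical autoscaling group name

# Instance IDs currently registered in the etcd ASG
INSTANCE_IDS=$(aws autoscaling describe-auto-scaling-groups \
  --auto-scaling-group-names "$ASG_NAME" \
  --query 'AutoScalingGroups[0].Instances[].InstanceId' --output text)

# Build an etcd initial-cluster string from each instance's private IP
PEERS=""
for id in $INSTANCE_IDS; do
  ip=$(aws ec2 describe-instances --instance-ids "$id" \
    --query 'Reservations[0].Instances[0].PrivateIpAddress' --output text)
  PEERS="${PEERS:+$PEERS,}${id}=http://${ip}:2380"
done

echo "ETCD_INITIAL_CLUSTER=${PEERS}"
```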
Very eager for this work and happy to help if I can. I can't advocate for deploying a k8s+coreos cluster in AWS at work until I have a good answer for many of the items on this list, especially the upgrade path and high availability.
@bfallik do you want to work on any of the bullet points in particular?
@colhom nothing in particular, though I suppose I'm most interested in the cluster upgrades and the ELB+ASG work.
@colhom if you like the discovery method used for etcd, I think I can help with that.
@pieterlange putting etcd in an autoscaling group worries me as of now. The MonsantoCo script seems kind of rickety: for example, it does not support scaling down the cluster as far as I can tell.
This list is fantastic. It represents exactly what we need in order to consider Kubernetes+CoreOS production ready for our use. I can't wait to see these executed!
This is just what I have always wanted!
Currently it's on the user to create a record, via Route53 or otherwise, in order to make the controller IP accessible via externalDNSName. This commit adds an option to automatically create a Route53 record in a given hosted zone. Related to: coreos#340, coreos#257
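As a rough illustration of what that option automates, here is a hedged AWS CLI equivalent; the commit presumably does this via the generated CloudFormation template rather than the CLI, and the hosted zone ID, record name, and controller IP below are placeholders.

```bash
# "Name" plays the role of externalDNSName; the A record points at the controller IP.
# All values are placeholders.
aws route53 change-resource-record-sets \
  --hosted-zone-id Z123EXAMPLE \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "kube.example.com",
        "Type": "A",
        "TTL": 300,
        "ResourceRecords": [{"Value": "203.0.113.10"}]
      }
    }]
  }'
```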
@colhom I suggest adding #420 to the list as well, as even the deployment guidelines point it out as a production deficiency. You are right about having etcd in an autoscaling group, of course. I'm running a dedicated etcd cluster across all availability zones, which feels a little bit safer but is still a hazard, as I'm depending on a majority of the etcd cluster to stay up and reachable. Not sure what the answer is here. I'm spending some time on HA controllers myself; I'll try to make whatever adjustments I make mergeable.
@colhom Hi, thanks for maintaining this project :)
Would you mind sharing with me what you think about the requirements for, and how to do, this? If so, I guess I can contribute on that (auto scaling lifecycle hooks + SQS + a tiny golang app container which runs kubectl drain).
I was thinking that nodes would trigger kubectl drain via a systemd service
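A rough sketch of that idea follows: a unit whose ExecStop runs kubectl drain, so that any clean shutdown (including ASG-initiated termination) cordons the node and evicts pods first. The unit name, kubeconfig path, and drain flags are illustrative guesses, not taken from an actual PR.

```bash
# Illustrative only; adjust unit name, kubeconfig path, and flags to taste.
cat <<'EOF' >/etc/systemd/system/node-drainer.service
[Unit]
Description=Drain this node from the cluster on shutdown

[Service]
Type=oneshot
RemainAfterExit=true
ExecStart=/usr/bin/true
# %H expands to the machine's hostname; adjust if node names differ from it.
ExecStop=/usr/bin/kubectl --kubeconfig=/etc/kubernetes/kubeconfig \
  drain %H --ignore-daemonsets --force --grace-period=60

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable --now node-drainer.service
```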
@colhom Sounds much better than my idea in regard to simplicity! I'd like to contribute to that, but I'm not sure what to include in the … (Also, we may want to create a separate issue for this.)
I believe this is solved for workers by PR #439.
@mumoshu that excerpt is referring to the fact that our controllers are not in an autoscaling group, and if the instance is killed the control plane will be down pending human intervention. I do believe the worker pool ASG should recover from an instance failure on its own, though. Will edit that line to just reference the controller.
@colhom I have just submitted #465 for #340 (comment)
FYI, regarding this:
In addition to the MonsantoCo/etcd-aws-cluster that @pieterlange mentioned, I have recently looked into crewjam/etcd-aws and the accompanying blog post. It seems to be great work.
We've been working on https://github.com/Capgemini/kubeform, which is based on Terraform, Ansible + CoreOS, and it's in line with some of the thinking here. Happy to help contribute to something here.
When multi-az support was announced, combined with the checked-off #346 in the list mentioned above, we got excited and tried to deploy a kube-aws cluster without actually verifying that existing subnets are supported. Obviously we ran into issues. What we ended up doing was to take the CF template output after running kube-aws. Here are a few things IMHO that would make the cluster launch more "production ready":
Maybe this list should be split into must-haves vs nice-to-haves? Or better, layers of CloudFormation templates? (I might be over-simplifying things here, but you get the idea.)
When we initially launched our k8s clusters last year, there were very few solutions that solved some of the requirements we had. So we went ahead and wrote a lengthy but working CloudFormation template, and that solved most of our requirements. But we ended up with a template that was hard to maintain and a cluster that needed to be replaced whenever we wanted to upgrade/patch, which doesn't really work well when you're running production workloads unless you have some serious orchestration around the cluster. The current toolset (kargo/kube-aws) around CoreOS/Kubernetes still leaves much to be desired.
@harsha-y Thank you for this info.
Hi @igalbk
And in cluster.go (quick & dirty, I'm sorry):
It's good for the stack creation, but I have an error with the kubernetes-wrapper (need to investigate).
@igalbk I can send you patches if you want. Thanks for your support.
Thank you @sdouche |
Update: I can't create an ELB (see kubernetes/kubernetes#29298) with an existing subnet. EDIT: Must add …
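The EDIT above is cut off. One likely candidate, stated here purely as an assumption since the thread doesn't say, is tagging the pre-existing subnets so the AWS cloud provider will consider them when placing the ELB, along these lines:

```bash
# Assumption: the truncated "Must add" refers to the KubernetesCluster tag on the
# pre-existing subnets. Subnet ID and cluster name below are placeholders.
aws ec2 create-tags \
  --resources subnet-0123abcd \
  --tags Key=KubernetesCluster,Value=my-cluster
```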
@dgoodwin not sure if you've seen this.
Update here on work that is closing in on being ready for review:
The cluster upgrade PR is #608.
Heapster now fully supports the Elasticsearch sink (including hosted ES clusters on AWS): Documentation
kubernetes-retired/heapster#1313: this PR will fix the ES sink compatibility. However, since AWS doesn't allow "scripted fields", it's still impossible to calculate the usage rate of resources as a percentage of capacity.
The kube-aws tool has been moved to its own top-level repository at https://github.com/coreos/kube-aws. If this issue still needs to be addressed, please re-open it under the new repository.
No worries, nobody cares about production quality deploys. That'd be ridiculous...
@drewblas the project has simply moved to a new repo, where significant progress has been made in merging functionality towards these goals in the last few weeks.
The goal is to offer a "production ready solution" for provisioning a CoreOS Kubernetes cluster. These are the major functionality blockers that I can think of:
- controller and worker AutoscalingGroups to recover from ec2 instance failures
- kube-aws up
- DONE kube-aws: add option to create a record for externalDNSName automatic… #389 (requires that the hosted zone already exist)