Best practices in configuring and running production clusters #1050
Comments
@c-knowles Just noticed the high availability part is covered in your recent PR 😄
@c-knowles Thanks for the correction! Yes - we really should set it to false.
@mumoshu, would you be able to clarify or provide references to kubeDns.nodeLocalResolver.enabled?
@Vincemd Probably. For me, kube-dns occasionally failed to resolve AWS-managed DNS names. I suspect it was due to temporary failures in Amazon DNS and/or in communication between your node and Amazon DNS. kubeDns.nodeLocalResolver would be a solution if it were a communication issue - as long as nodeLocalResolver has a cached DNS entry available for your query, it will "hide" the failure.
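For reference, enabling it is a small cluster.yaml change. A minimal sketch, assuming the nesting implied by the dotted key name (the exact layout can differ between kube-aws versions, so check the cluster.yaml reference for yours):

```yaml
# Sketch only: run a DNS cache on each node so lookups can still be answered
# locally when upstream resolution is briefly unavailable.
kubeDns:
  nodeLocalResolver:
    enabled: true
```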
This issue has been mostly about configuring and managing the "source code" of your cluster.
@mumoshu kubeDns.nodeLocalResolver.enabled - what does that do if I enable it? I mean, what configuration or deployment does it alter? Is it possible to make that change without doing kube-aws update, by simply editing a k8s resource, or is it more complicated than that? I'm getting a lot of Unable to execute HTTP request: MYBUCKET.s3.amazonaws.com: Temporary failure in name resolution on my pods running the AWS S3 CLI on kube-aws 0.9.8. I'm going to retire this cluster in favour of a 0.9.9 one, but wanted to try a quick fix first since the impact is significant.
@Vincemd I understand that it should be possible to introduce nodeLocalResolver without downtime. But I'd prefer creating another cluster for the migration, to protect your production service with maximum care :) Also, we should be very good at creating/deleting k8s clusters, so that we don't need to fear any kind of cluster failure too much. Anyway, the name resolution error looks like what I have seen before due to kube-dns instability under higher load. I just made my apps tolerate transient DNS lookup failures by retrying. If you don't have retries in your apps, I suggest you implement them. One more thing: I scaled kube-dns by adding more replicas, which greatly reduced such errors. So do that, even if kube-dns doesn't seem to be very overloaded. The move to nodeLocalResolver is the last thing you should try.
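If your cluster runs the standard DNS horizontal autoscaler (cluster-proportional-autoscaler), one way to keep more kube-dns replicas around is to raise the floor in its ConfigMap. A sketch, assuming the stock kube-dns-autoscaler ConfigMap in kube-system - verify the names in your cluster before applying:

```yaml
# Assumes the upstream DNS horizontal autoscaler is deployed; "min" raises the
# replica floor so kube-dns keeps headroom even on small clusters.
apiVersion: v1
kind: ConfigMap
metadata:
  name: kube-dns-autoscaler
  namespace: kube-system
data:
  linear: '{"coresPerReplica":256,"nodesPerReplica":8,"min":3,"preventSinglePointFailure":true}'
```

Without the autoscaler, simply scaling the kube-dns Deployment works too, e.g. kubectl -n kube-system scale deployment kube-dns --replicas=3.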
Reserve compute resources for kubelet and system daemons #1356
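That referenced issue is about reserving node capacity for the kubelet and OS daemons so workloads cannot starve them. A generic, non-kube-aws-specific sketch using the kubelet's own config file; older setups expose the same knobs as the --kube-reserved and --system-reserved kubelet flags:

```yaml
# KubeletConfiguration sketch: carve out CPU/memory for Kubernetes and system
# daemons, and evict pods before the node itself runs out of memory.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
kubeReserved:
  cpu: "200m"
  memory: "256Mi"
systemReserved:
  cpu: "200m"
  memory: "256Mi"
evictionHard:
  memory.available: "200Mi"
```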
@mumoshu thanks. Indeed, we have the concept of blue/green, so most of the time we make changes to an inactive cluster. We are getting better at creating clusters. The hard bit used to be Prometheus Operator, but since kube-aws has opened the ports, it's easier. Anyway, I tried adding more DNS pods without success. nodeLocalResolver is now enabled and no issues so far.
@Vincemd Awesome! You're a very experienced k8s admin 🎉 I probably have a similar sentiment about Prometheus. I basically wanted to migrate a k8s monitoring system across k8s clusters without downtime, but gave up and went a different route: metricbeat + Kinesis + AWS ES for multi-cluster monitoring. Managing an AWS ES cluster seems not that easy compared to Prometheus Operator, but I thought migrating stateful services across k8s clusters would be way harder.
Good to know. Thanks for sharing your experience!
Hello all, I reference this issue from time to time when building out new clusters (with new kube-aws versions) and wanted to add a note since Prometheus was mentioned here. I just came across Thanos today and am planning on testing/evaluating it to help with our Prometheus infrastructure:
Blog/overview: thanos-prometheus-at-scale
Following this thread. 😁
Regarding best practices, it would be great to have concrete, working examples in the kube-aws docs showing exactly how to configure clusters for the desired functionality. On that note, I understand there is the baseline provisioning testing tool. I have the following scrubbed cluster.yaml file, which has been upgraded for the latest release. I am providing it for review and encouraging feedback, with the goal of giving everyone coming to kube-aws a working cluster config, ready for deployment, which provides:
This could give new users (and existing ones, if needed) configurations that they can start using immediately. Perhaps this might be something that could go under the kube-aws/tree/master/docs/advanced-topics location? Another thought/proposal on this subject would be to create template (command-line) options for kube-aws that would be called/used during the cluster.yaml initialization/asset creation process and would configure (and activate) the specified options, such as multi-AZ nodepools, autoscaler, kiam, etc.
I understand this is a complex ask and is difficult to maintain. But, if possible, this could greatly enhance our ability to manage our kube clusters with kube-aws using a more desirable IaC methodology. I think something like the above options would help gather more kube-aws adoption moving forward, as the criticisms I have personally faced from other kube deployment consultants/companies/engineers can essentially be summarized as the current lack of fully-IaC management of our clusters' configuration and deployment code, plus the requisite intermediate steps during initial (i.e. new-version) deployments. On that note, I see that there is a nice existing project (similar to what I've done internally, with the exception that ours is not interactive) which provides scaffolding around the kube-aws deployment process: camilb's kube-aws-secure. I would appreciate thoughts and feedback on these ideas.
One thing I think people would find useful on kube-aws best practices is how to confirm that their
@fejta-bot: Closing this issue.
This is a quick, incomplete write-up to start a discussion towards documenting best practices that help users configure their production clusters.
As far as I remember, we don't have all of these in a single page as of today, right?
cc @c-knowles
Availability
etcd.count
controller.count
affinity.podAntiAffinity to prefer or require that pods not be collocated on the same node
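A minimal cluster.yaml sketch of the two counts above (nesting inferred from the dotted key names; the exact schema can differ per kube-aws version):

```yaml
# Spread the control plane so a single instance failure is survivable.
etcd:
  count: 3        # odd member count so etcd keeps quorum through one failure
controller:
  count: 2        # more than one controller node
```

And a plain Kubernetes pod-spec stanza for the anti-affinity point, preferring that replicas of the same app not land on the same node (the app label is a placeholder):

```yaml
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        topologyKey: kubernetes.io/hostname
        labelSelector:
          matchLabels:
            app: my-app   # placeholder label
```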
Security
Something like what was recently published in the GCP blog would be nice:
Harden kube-dashboard
Enable Calico for Network Policies
RBAC
Enabled by default since v0.9.9-rc.1
User Authentication
experimental.authentication.webhook.*
Node Authn/Authz
experimental.tlsBootstrap.enabled
experimental.nodeAuthorizer.enabled
Auditing
experimental.auditLog.enabled
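A sketch of how the security-related keys above might sit together in cluster.yaml (key names taken from this list, nesting inferred from the dotted paths; subkeys such as those under experimental.authentication.webhook.* are version-specific, so verify against the cluster.yaml reference before copying):

```yaml
experimental:
  tlsBootstrap:
    enabled: true      # kubelet TLS bootstrapping (node authentication)
  nodeAuthorizer:
    enabled: true      # Node authorization mode
  auditLog:
    enabled: true      # API server audit logging
  # experimental.authentication.webhook.* additionally needs a webhook
  # kubeconfig and related settings, omitted here as deployment-specific.
```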
Misc
kubeDns.nodeLocalResolver.enabled
addons.clusterAutoscaler.enabled
controller.clusterAutoscalerSupport.enabled
worker.nodePools[].autoscaling.clusterAutoscaler.enabled
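A sketch of the Misc keys above in cluster.yaml form (nesting inferred from the dotted paths; the node pool name is a placeholder and the exact schema depends on the kube-aws version):

```yaml
kubeDns:
  nodeLocalResolver:
    enabled: true        # node-local DNS cache (see the discussion earlier in this thread)
addons:
  clusterAutoscaler:
    enabled: true        # deploy cluster-autoscaler as a managed addon
controller:
  clusterAutoscalerSupport:
    enabled: true        # controller-side support (e.g. IAM) for cluster-autoscaler
worker:
  nodePools:
  - name: pool1          # placeholder pool name
    autoscaling:
      clusterAutoscaler:
        enabled: true    # allow cluster-autoscaler to scale this pool
```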