
Clean up the HA docs #9387

Merged 3 commits into kubernetes:master on Jun 20, 2020

Conversation

@olemarkus (Member) commented Jun 17, 2020

Fixes #8769

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jun 17, 2020

    --networking weave \
    --cloud-labels "Team=Dev,Owner=John Doe" \
    --image 293135079892/k8s-1.4-debian-jessie-amd64-hvm-ebs-2016-11-16 \
    --networking cilium \
Member:

I wouldn't mention Cilium here. If you want, maybe Calico or Weave, which seem simpler for beginners to digest (fewer options).

Member Author:

Weave is listed as experimental on our networking page, which is why I am replacing weave with something else where I can. But I am happy with using calico.

@hakman (Member) commented Jun 17, 2020:

Should be better, as I think it is the default in other places, like Docker Desktop and Enterprise. Thanks!

Member:

Not sure how experimental Weave is. I don't use it, but I have kept it up to date and it seems pretty stable in tests.

Member Author:

Weave is experimental per https://kops.sigs.k8s.io/networking/ :) I think I actually had weave as stable initially, but someone said it probably wasn't and I decided to be conservative.

Member:

After getting a ten-hour 3AM call about Weave, I'm reluctant to call it stable. I'm given to understand it has at its core an algorithm with polynomial complexity.

Member:

OK 😄
Do we all agree that Calico is stable and the easiest one to get started with?

Member Author:

I don't :)
After some sleep, I am wondering if our examples should just use a placeholder instead of a specific one. Or we need to have an "official" opinion on what users should go with. Most of the docs try to be neutral here.
If we do go for e.g. calico as the recommended provider, we should probably also use that one as the default instead of kubenet, since you typically don't want to use kubenet.
This is probably worth an issue on its own though.

Member:

Neutral is good enough for me :)
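
For context, the snippet under review is part of a `kops create cluster` example. A minimal sketch of the "neutral placeholder" form suggested above could look like the following; the cluster name, zone, and the `<networking-provider>` placeholder are illustrative and not taken from the PR diff:

```sh
# Sketch only: replace <networking-provider> with a concrete CNI (e.g. calico).
kops create cluster \
    --name my-cluster.example.com \
    --zones us-east-1a \
    --networking <networking-provider> \
    --cloud-labels "Team=Dev,Owner=John Doe" \
    --yes
```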


Kubernetes has two strategies for high availability:
For testing purposes, kubernetes works just fine with a single master. However, when the master becomes unavailable, for example due to upgrade or instance failure, the kubernetes API will be unavailable. Pods and services that are running on the continues to operate as long as they do not depend on interacting with the API, but operations such as adding nodes, scaling pods, replacing terminated pods will not work. Running kubectl will also not work.
Member:

Missing word (maybe "nodes"?): "running on the continues to operate"


* Run multiple independent clusters and combine them behind one management plane: [federation](https://kubernetes.io/docs/user-guide/federation/)
* Run a single cluster in multiple cloud zones, with redundant components
kops runs each master in a dedicated autoscaling groups (ASG) and stores data on ESB volumes. That way, if a master node is terminated the ASG will launch a new master instance with the master's volume. Because of the dedicated ESB volumes, each master is bound to a fixed Availability Zone (AZ). If the AZ becomes unavailable, the master instance in that AZ will also become unavailable.
Member:

Nit: s/groups/group (or remove "a" from "a dedicated autoscaling group")

Nit: s/ESB/EBS

We should probably try to call them "control plane nodes" instead of "master" also.

Nits aside, love this paragraph - it explains a tricky subject well!

Member Author:

Thanks for the nits.
I am all for changing "master" to "control plane nodes" (something a bit less verbose would be nice though) if that is what k/k is doing as well. But we should have a plan for changing this everywhere. It would probably be a good idea to do this as part of #9178
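
To make the "one ASG per control-plane node" layout described above concrete, here is a hedged sketch; the cluster name is an assumption for illustration, and the `master-<zone>` names assume kops' default instance group naming:

```sh
# Illustration only: the cluster name is made up.
# For a cluster created with --master-zones us-east-1a,us-east-1b,us-east-1c,
# kops defines one control-plane instance group per zone (e.g. master-us-east-1a),
# each backed by its own single-AZ ASG and its own EBS data volumes.
kops get instancegroups --name ha.example.com

# Inspect one group to see the zone it is pinned to:
kops get instancegroup master-us-east-1a --name ha.example.com -o yaml
```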

kops has good support for a cluster that runs
with redundant components. kops is able to create multiple kubernetes masters, so in the event of
a master instance failure, the kubernetes API will continue to operate.
For production use, you therefor want to run kubernetes in a HA setup with multiple masters. With multiple master nodes, you will be able both to do graceful, zero-down time upgrades, and you will be able to survive AZ failures.
Member:

Nit: therefor -> therefore


When you first call `kops create cluster`, you specify the `--master-zones` flag listing the zones you want your masters
to run in, for example:
The simplest way to get started with a HA cluster is to run `kops create cluster` as shown below. The `--master-zones` flag listing the zones you want your masters
Member:

s/listing/lists
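
The excerpt above is cut off just before the example command it introduces. A representative sketch of that kind of invocation (the cluster name and zone names are placeholders, not the actual values from the docs) might be:

```sh
# Sketch only: cluster name and zones are placeholders.
# One control-plane node is created in each zone listed in --master-zones.
# kops create cluster also accepts --master-count for running more
# control-plane nodes than zones.
kops create cluster \
    --name ha.example.com \
    --zones us-east-1a,us-east-1b,us-east-1c \
    --master-zones us-east-1a,us-east-1b,us-east-1c
```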

docs/operations/high_availability.md (review thread resolved)
@justinsb (Member):

A few nits, and we should probably prefer the term "control plane" over "master", but I guess that is confusing because of the flags also.

@hakman (Member) commented Jun 18, 2020:

I would also add somewhere that one should be mindful of how many AZs are used for HA. Transferring data between AZs can be expensive, which is why it may make sense to limit worker nodes to 2 AZs.

@olemarkus (Member Author):

Do we really want to recommend running workers in two AZs? That would at least need a "you had better know what you are doing" disclaimer, because if you run apps that require quorum, you'll have downtime should the wrong AZ fail.

I am considering a warning about running in e.g. 5 AZs though. It gives you higher fault tolerance, but in most cases it is a bit too much.

@olemarkus (Member Author):

/retest

@hakman (Member) commented Jun 18, 2020:

Apps with quorum are a totally different story. To get to those you actually have to get past beginner status. You also have to understand how pod scheduling works, because you may only think it works while in fact all your quorum pods are in the same AZ.

I'm not saying we should recommend 2 AZs. I just mean to phrase it in a way that explains that inter-AZ traffic costs money in most cases. You need 2+ AZs for HA, but how many depends on the use case.

@olemarkus (Member Author):

Something like this?

docs/operations/high_availability.md: three outdated review threads, resolved
@hakman (Member) commented Jun 18, 2020:

> Something like this?

Yes, sounds pretty good.
There are a few more small nits, but other than that it's a nice change.

@hakman (Member) commented Jun 18, 2020:

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 18, 2020
@hakman (Member) commented Jun 18, 2020:

/retest

@olemarkus (Member Author):

/assign @zetaab

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 20, 2020
@hakman (Member) commented Jun 20, 2020:

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 20, 2020
@rifelpet (Member):

/approve

@k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: olemarkus, rifelpet

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 20, 2020
@k8s-ci-robot k8s-ci-robot merged commit 13ad625 into kubernetes:master Jun 20, 2020
@k8s-ci-robot k8s-ci-robot added this to the v1.19 milestone Jun 20, 2020
Labels: approved, area/documentation, cncf-cla: yes, lgtm, size/M

Successfully merging this pull request may close this issue: Not clear when/how to configure more than 3 masters
7 participants