
Clean up the HA docs #9387

Merged 3 commits into kubernetes:master on Jun 20, 2020

Conversation

@olemarkus (Member) commented Jun 17, 2020

Fixes #8769

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jun 17, 2020

    --networking weave \
    --cloud-labels "Team=Dev,Owner=John Doe" \
    --image 293135079892/k8s-1.4-debian-jessie-amd64-hvm-ebs-2016-11-16 \
    --networking cilium \
Member:

I wouldn't mention Cilium here. If you want, maybe Calico or Weave, which seem simpler for beginners to digest (fewer options).

Member Author:

Weave is listed as experimental on our networking page, which is why I am replacing weave with something else where I can. But I am happy with using calico.

@hakman (Member) commented Jun 17, 2020:

Should be better, as I think it is the default in other places, like Docker Desktop and Enterprise. Thanks!

Member:

Not sure how experimental Weave is. I don't use it, but I have kept it up to date and it seems pretty stable in tests.

Member Author:

Weave is experimental per https://kops.sigs.k8s.io/networking/ :) I think I actually had weave as stable initially, but someone said it probably wasn't and I decided to be conservative.

Member:

After getting a ten-hour 3AM call about Weave, I'm reluctant to call it stable. I'm given to understand it has at its core an algorithm with polynomial complexity.

Member:

OK 😄
Do we all agree that Calico is stable and the easiest one to get started with?

Member Author:

I don't :)
After some sleep, I am wondering if our examples should just use a placeholder instead of a specific one. Or we need to have an "official" opinion on what users should go with. Most of the docs try to be neutral here.
If we do go for e.g. calico as the recommended provider, we should probably also use that one as the default instead of kubenet, since you typically don't want to use kubenet.
This is probably worth an issue on its own though.

Member:

Neutral is good enough for me :)
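
For context, the snippet under review is part of a `kops create cluster` example. A minimal sketch of the "neutral placeholder" form suggested above could look like the following; the cluster name, zone, and the `<networking-provider>` placeholder are illustrative and not taken from the PR diff:

```sh
# Sketch only: replace <networking-provider> with a concrete CNI (e.g. calico).
kops create cluster \
    --name my-cluster.example.com \
    --zones us-east-1a \
    --networking <networking-provider> \
    --cloud-labels "Team=Dev,Owner=John Doe" \
    --yes
```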


Kubernetes has two strategies for high availability:
For testing purposes, kubernetes works just fine with a single master. However, when the master becomes unavailable, for example due to upgrade or instance failure, the kubernetes API will be unavailable. Pods and services that are running on the continues to operate as long as they do not depend on interacting with the API, but operations such as adding nodes, scaling pods, replacing terminated pods will not work. Running kubectl will also not work.
Member:

Missing word (maybe "nodes"?): "running on the continues to operate"


* Run multiple independent clusters and combine them behind one management plane: [federation](https://kubernetes.io/docs/user-guide/federation/)
* Run a single cluster in multiple cloud zones, with redundant components
kops runs each master in a dedicated autoscaling groups (ASG) and stores data on ESB volumes. That way, if a master node is terminated the ASG will launch a new master instance with the master's volume. Because of the dedicated ESB volumes, each master is bound to a fixed Availability Zone (AZ). If the AZ becomes unavailable, the master instance in that AZ will also become unavailable.
Member:

Nit: s/groups/group (or remove "a" from "a dedicated autoscaling group")

Nit: s/ESB/EBS

We should probably try to call them "control plane nodes" instead of "master" also.

Nits aside, love this paragraph - it explains a tricky subject well!

Member Author:

Thanks for the nits.
I am all for changing "master" to "control plane nodes" (something a bit less verbose would be nice though) if that is what k/k is doing as well. But we should have a plan for changing this everywhere. It would probably be a good idea to do this as part of #9178
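
To make the "one ASG per control-plane node" layout described above concrete, here is a hedged sketch; the cluster name is an assumption for illustration, and the `master-<zone>` names assume kops' default instance group naming:

```sh
# Illustration only: the cluster name is made up.
# For a cluster created with --master-zones us-east-1a,us-east-1b,us-east-1c,
# kops defines one control-plane instance group per zone (e.g. master-us-east-1a),
# each backed by its own single-AZ ASG and its own EBS data volumes.
kops get instancegroups --name ha.example.com

# Inspect one group to see the zone it is pinned to:
kops get instancegroup master-us-east-1a --name ha.example.com -o yaml
```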

kops has good support for a cluster that runs
with redundant components. kops is able to create multiple kubernetes masters, so in the event of
a master instance failure, the kubernetes API will continue to operate.
For production use, you therefor want to run kubernetes in a HA setup with multiple masters. With multiple master nodes, you will be able both to do graceful, zero-down time upgrades, and you will be able to survive AZ failures.
Member:

Nit: therefor -> therefore


When you first call `kops create cluster`, you specify the `--master-zones` flag listing the zones you want your masters
to run in, for example:
The simplest way to get started with a HA cluster is to run `kops create cluster` as shown below. The `--master-zones` flag listing the zones you want your masters
Member:

s/listing/lists
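
The excerpt above is cut off just before the example command it introduces. A representative sketch of that kind of invocation (the cluster name and zone names are placeholders, not the actual values from the docs) might be:

```sh
# Sketch only: cluster name and zones are placeholders.
# One control-plane node is created in each zone listed in --master-zones.
# kops create cluster also accepts --master-count for running more
# control-plane nodes than zones.
kops create cluster \
    --name ha.example.com \
    --zones us-east-1a,us-east-1b,us-east-1c \
    --master-zones us-east-1a,us-east-1b,us-east-1c
```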

docs/operations/high_availability.md (review thread resolved)
@justinsb (Member):

A few nits, and we should probably prefer the term "control plane" over "master", but I guess that is confusing because of the flags also.

@hakman (Member) commented Jun 18, 2020:

I would also add somewhere that one should be mindful of how many AZs are used for HA. Transferring data between AZs can be expensive, which is why it may make sense to limit worker nodes to 2 AZs.

@olemarkus (Member Author):

Do we really want to recommend running workers in two AZs? That would at least need a "you had better know what you are doing" disclaimer, because if you run apps that require quorum, you'll have downtime should the wrong AZ fail.

I am considering a warning about running in e.g. 5 AZs though. It gives you higher fault tolerance, but in most cases it is a bit too much.

@olemarkus (Member Author):

/retest

@hakman (Member) commented Jun 18, 2020:

Apps with quorum are a totally different story. To get to those you actually have to get past beginner status. You also have to understand how pod scheduling works, because you may only think it works while in fact all your quorum pods are in the same AZ.

I'm not saying we should recommend 2 AZs. I just mean to phrase it in a way that explains that inter-AZ traffic costs money in most cases. You need 2+ AZs for HA, but how many depends on the use case.

@olemarkus (Member Author):

Something like this?

docs/operations/high_availability.md: three outdated review threads, resolved
@hakman (Member) commented Jun 18, 2020:

> Something like this?

Yes, sounds pretty good.
There are a few more small nits, but other than that it's a nice change.

@hakman (Member) commented Jun 18, 2020:

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 18, 2020
@hakman (Member) commented Jun 18, 2020:

/retest

@olemarkus (Member Author):

/assign @zetaab

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 20, 2020
@hakman (Member) commented Jun 20, 2020:

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 20, 2020
@rifelpet (Member):

/approve

@k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: olemarkus, rifelpet

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 20, 2020
@k8s-ci-robot k8s-ci-robot merged commit 13ad625 into kubernetes:master Jun 20, 2020
@k8s-ci-robot k8s-ci-robot added this to the v1.19 milestone Jun 20, 2020
Labels: approved, area/documentation, cncf-cla: yes, lgtm, size/M

Successfully merging this pull request may close this issue: Not clear when/how to configure more than 3 masters
7 participants