# openshift4.11 acm with hypershift on baremetal

本文介绍,在openshift4.11上,装好 ACM 组件以后,通过hypershift的方式,来部署一个单worker节点的openshift4.11控制面托管集群。在部署的过程中,我们模拟离线的网络环境,并且禁止DHCP,只用静态IP。

> This document describes how to deploy a single-worker-node cluster using hypershift, on an ocp 4.11 hub cluster with ACM. During the deployment, we simulate an offline network environment, disable DHCP, and use only static IPs.

控制面托管(hypershift)模式,之所以诱人,是因为他能够让控制面变成一个namespace,然后托管到中心控制面集群上,这样就能把多个集群的控制面集中到一个中心集群上,能大大提高master节点的计算密度,节约master节点的成本。并且能够把集群master节点的运行维护工作,交给专业团队运维的控制面集群,作为最终用户,只要关心worker节点的运行和维护,而worker节点的运行维护相对来说,是非常简单的。

> The control-plane hosting (hypershift) mode is attractive because it turns the control plane into a namespace hosted on a central cluster, so the control planes of multiple clusters can be concentrated on one central cluster. This greatly increases the computing density of the master nodes and saves their cost. The operation and maintenance of the master nodes can also be handed over to the control-plane cluster run by a professional team; as an end user, you only need to care about the worker nodes, whose operation and maintenance is relatively simple.

对比SNO,compact cluster这种master/worker混合部署的方案,hypershift通过剥离控制面业务负载到中心集群,防止业务负载对master产生不利影响。比如用户部署了一个UPF这种极度消耗CPU的应用,就会无意间影响master,从而让整个集群垮掉,而hypershift就从方案层面避免了这种情况。而从中心集群的角度来说,他的业务负载种类比较单一,就能更好地有针对性地优化和运维。

> Compared with the master/worker combined deployment mode of SNO and compact clusters, hypershift strips the control-plane workload off to the central cluster, preventing workloads from adversely affecting the masters. For example, if a user deploys an extremely CPU-hungry application such as a UPF, it may inadvertently affect the master and bring down the entire cluster; hypershift avoids this situation at the architecture level. From the perspective of the central cluster, its workload types are relatively uniform, so it can be optimized and operated in a targeted way.

![](../4.11/dia/hypershift.s10.drawio.svg)

本次实验,整个流程如下:

1. 在openshift4上安装ACM组件
2. 在ACM上配置cluster, infra env等配置。
3. MCE通过网络 redfish 协议启动kvm
4. kvm自动开始集群安装,但是由于kvm+redfish的限制,安装过程中的重启,需要手动停止kvm,配置由硬盘启动,然后再手动启动kvm。
5. 集群安装完成,保存集群登录信息

> In this experiment, the whole process is as follows:
> 1. Install the ACM component on openshift4.
> 2. Configure the cluster, infra env and other settings on ACM.
> 3. MCE boots the kvm through the redfish protocol over the network.
> 4. The kvm starts the cluster installation automatically, but due to the limitations of kvm+redfish, the reboot during installation requires manually stopping the kvm, configuring it to boot from hard disk, and then starting the kvm again by hand.
> 5. After the cluster installation completes, save the cluster login information.

本次实验的部署架构图:

> The deployment architecture diagram of this experiment:

![](../4.11/dia/hypershift.s20.drawio.svg)

本次实验的网络架构,和服务器、kvm部署架构,是依托之前的一个未完成的实验,[工厂模式](../4.10/4.10.factory.md)。虽然工厂模式实验的网络模型比较复杂,但是我们就不重复配置环境了。如果想了解IPI模式如何部署集群,可以参考上述文档。

> The network architecture of this experiment, as well as the server and kvm deployment architecture, are based on a previous unfinished experiment, [Factory Mode](../4.10/4.10.factory.md). Although the network model of the factory-mode experiment is rather complicated, we will not repeat the environment setup here. If you want to know how to deploy a cluster in IPI mode, you can refer to the document above.
参考资料:

> Reference:

- https://cloud.redhat.com/blog/how-to-build-bare-metal-hosted-clusters-on-red-hat-advanced-cluster-management-for-kubernetes
- https://cloud.redhat.com/blog/a-guide-to-red-hat-hypershift-on-bare-metal

- [bilibili](https://www.bilibili.com/video/bv1F3411n7tT)
- [youtube](https://youtu.be/tX2iozE2Rn0) -->

# 静态变量 / static variable

根据factory的安装过程,我们装了一个 3 node IPI 模式的 openshift,是一个 ipi 的 compact cluster。我们把这个集群作为hub集群,里面要装ACM组件。

> Following the factory installation process, we installed openshift in 3-node IPI mode, which gives an ipi compact cluster. We use this cluster as the hub cluster, and the ACM components will be installed on it.

以下的参数,是我们用这个hub集群,通过hypershift创建出来新集群的参数,新集群只有1个worker节点。

> The following are the parameters of the new cluster that we create from this hub cluster through hypershift. The new cluster has only one worker node.
```bash
# on helper

SNO_CORE_PWD=redhat

```
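
从后文的命令看,上面的变量块里至少用到了类似这些变量(示意,取值是假设的占位符,以原文为准):

> Judging from the commands later in this document, the variable block above uses at least variables like these (a sketch; the values are assumed placeholders):

```bash
# assumed sketch of the variables referenced later; values are placeholders
BASE_DIR='/home/wzh'              # working directory on the helper node
ACM_DEMO_CLUSTER='edge01'         # name of the new hosted cluster
SNO_HOSTNAME='edge-worker-01'     # hostname of the single worker node
SNO_CORE_PWD='redhat'             # password for the core user
PULL_SECRET='{"auths":{...}}'     # image pull secret, placeholder
```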

另外,要说明的是,我们发现参考材料里面,对dns的配置不需要那么严格,至少对于单一worker节点来说,apps都指向这个worker节点就可以,api、api-int的域名指向并不重要,因为我们的实验,通过nodeport暴露API server,然后ip地址和端口号被静态的写入了kubelet的配置。

> In addition, it should be noted that the DNS configuration in the reference materials does not need to be followed strictly. At least for a single worker node, it is enough to point the apps records at this worker node; where the api and api-int domain names point does not matter, because in our experiment the API server is exposed through a nodeport, and the ip address and port are statically written into the kubelet configuration.
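
下面给一个dns配置的示意(这里假设用dnsmasq,域名和IP都是占位符,不是本实验的原始配置):

> A sketch of such a DNS setup, assuming dnsmasq is used; the domain and the worker IP below are placeholders, not the original configuration of this lab:

```bash
# illustrative dnsmasq records: point the wildcard apps domain of the
# hosted cluster at the single worker node's static IP (placeholder values)
cat << 'EOF' > /etc/dnsmasq.d/edge01.conf
address=/apps.edge01.wzhlab.top/192.168.7.23
# api / api-int records are not critical in this setup, because the API
# server is exposed through a nodeport whose ip:port is written statically
# into the kubelet configuration
address=/api.edge01.wzhlab.top/192.168.7.23
address=/api-int.edge01.wzhlab.top/192.168.7.23
EOF
systemctl restart dnsmasq
```
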
# 部署ACM / deploy ACM

接下来,我们就部署ACM,我们用最简单的部署模式。

> Next, we deploy ACM, using the simplest deployment mode.
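
用命令行安装ACM operator的一个最简化示意(channel、namespace等是假设值,完整命令见下面的代码块):

> A minimal sketch of installing the ACM operator from the command line (the channel, namespace and similar values are assumptions; the full commands are in the code block below):

```bash
# a minimal sketch: subscribe to the ACM operator, then create the hub
# (channel and namespace are assumptions, adjust to your environment)
cat << EOF | oc create -f -
---
apiVersion: v1
kind: Namespace
metadata:
  name: open-cluster-management
---
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: open-cluster-management
  namespace: open-cluster-management
spec:
  targetNamespaces:
  - open-cluster-management
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: advanced-cluster-management
  namespace: open-cluster-management
spec:
  channel: release-2.6
  installPlanApproval: Automatic
  name: advanced-cluster-management
  source: redhat-operators
  sourceNamespace: openshift-marketplace
EOF

# after the operator is ready, create the hub with default settings
cat << EOF | oc create -f -
apiVersion: operator.open-cluster-management.io/v1
kind: MultiClusterHub
metadata:
  name: multiclusterhub
  namespace: open-cluster-management
spec: {}
EOF
```
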
```bash
# install operator Advanced Cluster Management for Kubernetes

Expand Down Expand Up @@ -148,33 +175,63 @@ oc get managedclusteraddons -A

```

装好了是这样,我们能看到装了2个operator,ACM和MCE。

> After installation it looks like this; we can see that 2 operators, ACM and MCE, are installed.

![](imgs/2023-01-16-19-12-47.png)

我们可以通过webUI访问ACM:

> We can access ACM through the webUI:

https://console-openshift-console.apps.factory.wzhlab.top/multicloud/infrastructure/clusters/managed

可以看到,默认有一个local-cluster,类型是hub,这个就是我们这个装了ACM的集群。

> As you can see, there is a local-cluster by default, whose type is hub; this is our cluster with ACM installed.

![](imgs/2023-01-16-19-55-49.png)

点击进去,就能看到这个cluster的详细信息。

> Click into it, and you can see the detailed information of this cluster.

![](imgs/2023-01-16-19-56-31.png)

<!-- ![](imgs/2023-01-16-20-01-07.png) -->
以及这个cluster包含的节点。

> And the nodes contained in this cluster.

![](imgs/2023-01-16-20-01-35.png)

这个集群装的ACM插件。

> The ACM addons installed in this cluster.

![](imgs/2023-01-16-20-02-50.png)

新版本的ACM还有一个cluster set的概念,用来分类cluster。

> The new version of ACM also has the concept of a cluster set, which is used to classify clusters.

![](imgs/2023-01-16-20-03-59.png)

在ACM概览页面,能看到这个ACM管理的多云环境。

> On the ACM overview page, you can see the multi-cloud environment managed by this ACM.

![](imgs/2023-01-16-20-04-49.png)

其他的链接,都没有内容,页面是空的。

> Other links have no content and the page is empty.

# 用hypershift模式部署集群 / Deploy the cluster using hypershift

如果有过部署assisted install service,并通过AIS来部署SNO的经验,那么通过ACM,用hypershift的模式来部署,就容易理解了。整个过程一样,都是配置ACM里面的assisted install service,然后定义infra env,调用BMC API,来直接挂载iso,并启动主机。不同的地方是,以前的实验,之后是定义一个 ClusterDeployment,现在要定义一个 HostedCluster,这个hosted cluster会帮助我们创建 cluster deployment。

> If you have experience deploying the assisted install service and using AIS to deploy an SNO, then deploying through ACM in hypershift mode is easy to understand. The overall process is the same: configure the assisted install service in ACM, define an infra env, and call the BMC API to mount the iso directly and boot the host. The difference is that in the previous experiment the next step was to define a ClusterDeployment, while now we define a HostedCluster, and this hosted cluster creates the cluster deployment for us.
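
给一个 HostedCluster 和 NodePool 的最简化示意(字段值都是假设的占位符,真实配置以后文的命令为准):

> A minimal sketch of a HostedCluster and its NodePool (field values are assumed placeholders; the real configuration is in the commands later in this document):

```yaml
---
apiVersion: hypershift.openshift.io/v1alpha1
kind: HostedCluster
metadata:
  name: edge01
  namespace: edge01
spec:
  release:
    image: quay.io/openshift-release-dev/ocp-release:4.11.21-x86_64
  pullSecret:
    name: pullsecret-cluster-edge01
  sshKey:
    name: sshkey-cluster-edge01
  dns:
    baseDomain: wzhlab.top
  networking:
    machineCIDR: 192.168.7.0/24
    podCIDR: 10.132.0.0/14
    serviceCIDR: 172.31.0.0/16
    networkType: OVNKubernetes
  platform:
    type: Agent
    agent:
      agentNamespace: edge01
  services:
  # expose the api server on a nodeport, matching the dns note earlier;
  # entries for OAuthServer, Konnectivity, Ignition etc. are omitted here
  - service: APIServer
    servicePublishingStrategy:
      type: NodePort
      nodePort:
        address: 192.168.7.13
---
apiVersion: hypershift.openshift.io/v1alpha1
kind: NodePool
metadata:
  name: edge01
  namespace: edge01
spec:
  clusterName: edge01
  replicas: 1
  management:
    autoRepair: false
    upgradeType: InPlace
  platform:
    type: Agent
    agent:
      agentLabelSelector: {}
  release:
    image: quay.io/openshift-release-dev/ocp-release:4.11.21-x86_64
```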

## setup ACM for agent service

ACM 2.6 UI 是完全支持hypershift的,但是,我们现在的实验,是为了项目上能定制,所以有些配置要用命令行完成。

> The ACM 2.6 UI fully supports hypershift, but our current experiment is meant to be customizable for real projects, so some of the configuration has to be done on the command line.
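
这一步的核心,是给 assisted service 配置一个 AgentServiceConfig,也就是下面命令里的 acm.cm.asc.yaml。先给一个最简化的示意(存储大小、镜像地址等都是假设值):

> The core of this step is the AgentServiceConfig for the assisted service, which is the acm.cm.asc.yaml in the commands below. Here is a minimal sketch first (the storage sizes and image urls are assumed values):

```yaml
apiVersion: agent-install.openshift.io/v1beta1
kind: AgentServiceConfig
metadata:
  name: agent
spec:
  databaseStorage:
    accessModes:
    - ReadWriteOnce
    resources:
      requests:
        storage: 10Gi
  filesystemStorage:
    accessModes:
    - ReadWriteOnce
    resources:
      requests:
        storage: 20Gi
  imageStorage:
    accessModes:
    - ReadWriteOnce
    resources:
      requests:
        storage: 10Gi
  # in an offline lab, osImages should point at a local mirror of the
  # coreos live iso and rootfs (urls below are placeholders)
  osImages:
  - openshiftVersion: "4.11"
    version: "411.86.202210041459-0"
    url: "http://192.168.7.11:8080/rhcos-live.x86_64.iso"
    rootFSUrl: "http://192.168.7.11:8080/rhcos-live-rootfs.x86_64.img"
    cpuArchitecture: x86_64
```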

```bash
EOF
oc create -f ${BASE_DIR}/data/install/acm.cm.asc.yaml
# oc delete -f ${BASE_DIR}/data/install/acm.cm.asc.yaml

openshift-install version
# openshift-install 4.11.21
# built from commit d3fb15afdbf1558344ea88a1e134c8e9a011440f
oc get pod -n multicluster-engine | grep assisted
```

## create the infra env

infra env这个概念比较古怪,他的意思是,一组相同的主机共享的配置。共享什么配置呢?主要是网络参数配置,启动盘ISO的定制化配置等等。

> The concept of an infra env is a bit odd. It means the configuration shared by a group of identical hosts. What is shared? Mainly the network parameters, the customization of the boot ISO, and so on.
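
下面是一个 InfraEnv 的最简化示意(字段值是假设的占位符,完整配置见下面的命令):

> A minimal sketch of an InfraEnv (field values are assumed placeholders; the full configuration is in the commands below):

```yaml
apiVersion: agent-install.openshift.io/v1beta1
kind: InfraEnv
metadata:
  name: edge01
  namespace: edge01
spec:
  # no clusterRef here: with hypershift the hosts are late-bound, and the
  # agents are claimed later by the nodepool
  sshAuthorizedKey: 'ssh-rsa AAAA...placeholder...'
  pullSecretRef:
    name: pullsecret-infraenv
  # select the per-host static-ip definitions (NMStateConfig CRs),
  # which is how we avoid DHCP in this lab
  nmStateConfigLabelSelector:
    matchLabels:
      infraenv: edge01
```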

```bash

oc create ns ${ACM_DEMO_CLUSTER}
oc get infraenv/${ACM_DEMO_CLUSTER} -n ${ACM_DEMO_CLUSTER} -o json | jq .status

```

定义好了infra env,我们就能在ACM的web界面上看到啦。

> Once the infra env is defined, we can see it in the ACM web console.

![](imgs/2023-01-18-23-05-13.png)

infra env的详细信息,似乎没什么有用的,就是一些普通的配置。

> The infra env detail page does not seem to have much useful information, just some ordinary settings.

![](imgs/2023-01-18-23-06-04.png)

在infra env的host配置里面,我们看到,现在还没有一个主机添加进来。

> In the host list of the infra env, we can see that no host has been added yet.

![](imgs/2023-01-18-23-06-30.png)

## add host to infra env

我们接下来要做的,就是给infra env添加主机。从web界面上看,大概有3种添加方法:一是手动挂载discovery ISO,然后在infra env里面自动发现;二是通过web界面,配置BMC等参数,来添加host;最后一种,是通过上传yaml配置文件来完成导入host的操作。

> What we do next is add a host to the infra env. From the web console, there are roughly 3 ways to do this: one is to mount the discovery ISO manually and let the infra env auto-discover the host; another is to add the host through the web console by filling in the BMC parameters; the last one is to import the host by uploading a yaml configuration file.

![](imgs/2023-01-17-19-30-49.png)

本文是通过命令行的方式来添加,那么就类似界面上最后一种,通过上传yaml的方式来导入host。

> In this document we add the host from the command line, which is similar to the last method in the console: importing the host by submitting yaml.
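
导入host的yaml,核心是一个 BareMetalHost(和它引用的BMC secret)。下面是一个示意,BMC地址、MAC等都是假设的占位符,完整配置见下面的命令:

> The yaml that imports a host is essentially a BareMetalHost (plus the BMC secret it references). Below is a sketch; the BMC address, MAC and similar values are assumed placeholders, and the full configuration is in the commands below:

```yaml
---
apiVersion: v1
kind: Secret
metadata:
  name: edge-worker-01-bmc-secret
  namespace: edge01
type: Opaque
stringData:
  username: admin
  password: password
---
apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: edge-worker-01
  namespace: edge01
  labels:
    # bind this host to the infra env of the same name
    infraenvs.agent-install.openshift.io: edge01
  annotations:
    # skip ironic inspection; the assisted discovery iso does it instead
    inspect.metal3.io: disabled
    # set the hostname reported by the agent
    bmac.agent-install.openshift.io/hostname: edge-worker-01
spec:
  online: true
  automatedCleaningMode: disabled
  bootMACAddress: 52:54:00:12:34:56
  bmc:
    # kvm + sushy-tools exposes a redfish virtual-media endpoint
    address: redfish-virtualmedia+http://192.168.7.1:8000/redfish/v1/Systems/11111111-1111-1111-1111-111111111111
    credentialsName: edge-worker-01-bmc-secret
    disableCertificateVerification: true
```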

```bash
# lets confirm that the metal3 component is ready
# then we can use ocp to manage the baremetal
oc get BareMetalHost/${ACM_DEMO_CLUSTER}-${SNO_HOSTNAME} -n ${ACM_DEMO_CLUSTER}
# online: true
# ......


```

配置完成以后,在web界面上,就能看到这个主机啦。其实在openshift的界面里面,也能看到这个baremetal。我们看到系统正在试图配置这个主机。

> After the configuration is done, we can see this host in the web console. In fact, the baremetal host is also visible in the openshift console. We can see that the system is trying to provision this host.

![](imgs/2023-01-17-19-33-16.png)

其实在目标kvm上,是启动了一个定制的coreos live cd。启动了以后,运行了一个服务,他会搜集本机的信息,然后上报。上述操作顺利的话,我们就能在界面上看到主机信息更新了。

> What actually happens is that a customized coreos live cd boots on the target kvm. After booting, it runs a service that collects the local host information and reports it back. If everything goes well, we can see the host information updated in the console.

![](imgs/2023-01-17-19-37-05.png)

这里面的host,在后台对应的是agent的配置,我们可以通过命令行查看agent对应的详细信息。

> Behind the scenes, each host here corresponds to an agent configuration, and we can inspect the details of the agent from the command line.

```bash

oc get agent -n ${ACM_DEMO_CLUSTER}
oc get pod -n ${ACM_DEMO_CLUSTER}-${ACM_DEMO_CLUSTER} | tail -n +2 | wc -l

```
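
顺便一提,如果agent没有被自动approve,也可以手动patch一下(示意,假设agent还处于未批准状态):

> As a side note, if an agent is not auto-approved, it can also be approved manually with a patch like this (a sketch, assuming the agent is still unapproved):

```bash
# list the discovered agents
oc get agent -n ${ACM_DEMO_CLUSTER}

# approve one by hand (<agent-uuid> is a placeholder)
oc -n ${ACM_DEMO_CLUSTER} patch agent <agent-uuid> --type merge \
  -p '{"spec":{"approved":true}}'
```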

配置导入以后,我们就能看到多了一个集群edge01,类型是hosted。

> After the configuration is imported, we can see a new cluster edge01, whose type is hosted.

![](imgs/2023-01-17-19-25-03.png)

安装过程稍微有一点时间,期间,我们能看到集群状态,nodepool状态有所变化。

> The installation takes a little while; during it, we can see the cluster status and the nodepool status changing.

![](imgs/2023-01-17-19-24-12.png)

我们还能看到hub集群上,有了一个edge01-edge01的namespace,里面有集群控制面的pod,其中就有我们熟悉的etcd,api-server。

> We can also see that the hub cluster now has an edge01-edge01 namespace containing the control-plane pods of the new cluster, including the familiar etcd and api-server.

![](imgs/2023-01-18-23-24-22.png)

## import the hosted cluster

经过一段时间,新集群就安装成功了,但是页面上提示,需要手动导入。我们复制页面上的命令,并到helper上,运行这2个命令,他们是登录到hosted control plane,然后配置一些CR进去。

> After a while, the new cluster is installed successfully, but the console prompts that it needs to be imported manually. We copy the commands from the page and run these 2 commands on helper; they log into the hosted control plane and apply some CRs into it.

![](imgs/2023-01-18-23-50-05.png)

```bash
echo "Ci0tLQphc............" | base64 -d | oc create -f - || test $? -eq 0 && sl
# lets decode the first 2 base64 content, the 3rd one is just a message.
```

我们很好奇到底导入了什么东西,那让我们解码看看。第一个导入hosted control plane的yaml是一个CRD。

> We are curious about what exactly is imported, so let's decode it. The first yaml imported into the hosted control plane is a CRD.

```yaml
---
status:
storedVersions: []

```
第二个yaml是这样的:它配置了一个新的namespace,然后部署了一个klusterlet应用和配置。这个到底是啥,作者暂时也说不出。

> The second yaml looks like this: it creates a new namespace and then deploys a klusterlet application and its configuration. What exactly this is, the author cannot yet say.

```yaml

---
spec:

```

导入配置以后,我们就能看到集群导入成功啦。

> After the configuration is applied, we can see that the cluster is imported successfully.

![](imgs/2023-01-18-23-56-11.png)

cluster set页面上也都是正常的标志。

> The cluster set page also shows everything as healthy.

![](imgs/2023-01-18-23-56-37.png)

集群详细页面上也都是正常的标志。

> The cluster detail page also shows everything as healthy.

![](imgs/2023-01-18-23-57-36.png)

新集群的host页面,也有了一个新的worker节点。

> The host page of the new cluster also shows a new worker node.

![](imgs/2023-01-18-23-57-57.png)

集群详细信息的插件页面上,也都是正常的标志。

> The add-on page of the cluster detail view also shows everything as healthy.

![](imgs/2023-01-18-23-58-13.png)

我们登录到新装的edge01集群的管理页面看看。

> Let's log into the console of the newly installed edge01 cluster and have a look.

![](imgs/2023-01-18-23-59-01.png)

新的edge01集群,是不能自行升级的,提示这是一个特殊的hosted集群。

> The new edge01 cluster cannot upgrade itself; the console notes that this is a special hosted cluster.

![](imgs/2023-01-16-11-43-31.png)

回想一下,在ACM界面里面,edge01是hosted类型。

> Recall that in the ACM console, edge01 is of type hosted.

![](imgs/2023-01-16-11-40-30.png)

我们简单的看看,这个hosted control plane的资源消耗。

> Let's take a quick look at the resource consumption of this hosted control plane.

![](imgs/2023-01-16-11-42-55.png)

我们看一下这个control plane里面都有些什么pod。

> Let's see what pods are in this control plane.

![](imgs/2023-01-16-11-44-58.png)

## cli login into the hosted cluster

接下来,我们通过命令行来登录到新的edge01集群,看看命令行上,这个新的集群有什么特殊的地方。

> Next, we log into the new edge01 cluster from the command line, to see what is special about this new cluster from the cli.

```bash
oc extract -n ${ACM_DEMO_CLUSTER} secret/${ACM_DEMO_CLUSTER}-admin-kubeconfig --to=- > ${BASE_DIR}/data/install/kubeconfig-${ACM_DEMO_CLUSTER}

oc --kubeconfig=${BASE_DIR}/data/install/kubeconfig-${ACM_DEMO_CLUSTER} get co
# storage 4.11.21 True False False 6h25m

oc --kubeconfig=${BASE_DIR}/data/install/kubeconfig-${ACM_DEMO_CLUSTER} get node
# NAME STATUS ROLES AGE VERSION
# edge-worker-01 Ready worker 17h v1.24.6+5658434

oc --kubeconfig=${BASE_DIR}/data/install/kubeconfig-${ACM_DEMO_CLUSTER} get mcp
# error: the server doesn't have a resource type "mcp"
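
# note: a hosted cluster has no machine-config-operator on the worker side,
# so there is no mcp; node configuration is driven by the nodepool on the hub,
# e.g. (a sketch, run against the hub cluster):
# oc get nodepool -n ${ACM_DEMO_CLUSTER}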
oc --kubeconfig=${BASE_DIR}/data/install/kubeconfig-${ACM_DEMO_CLUSTER} get clusterversion
# version 4.11.21 True False 6h35m Cluster version is 4.11.21


```

## post operation