diff --git a/docs/installation/bootstrap.rst b/docs/installation/bootstrap.rst
index beff806fe9..079b5918fd 100644
--- a/docs/installation/bootstrap.rst
+++ b/docs/installation/bootstrap.rst
@@ -27,8 +27,11 @@ Configuration
 
       root@bootstrap $ mkdir /etc/metalk8s
 
-#. Create the :file:`/etc/metalk8s/bootstrap.yaml` file. Change the networks,
-   IP address, and hostname to conform to your infrastructure.
+#. Create the :file:`/etc/metalk8s/bootstrap.yaml` file.
+   This file contains the initial configuration settings that are mandatory
+   for setting up a MetalK8s :term:`Bootstrap node`.
+   Change the networks, IP address, and hostname fields to conform to your
+   infrastructure.
 
    .. code-block:: yaml
 
@@ -37,6 +40,8 @@ Configuration
       networks:
         controlPlane:
         workloadPlane:
+        pods:
+        services:
       proxies:
         http:
         https:
@@ -48,20 +53,58 @@ Configuration
       archives:
         -
 
-The ``archives`` field is a list of absolute paths to MetalK8s ISO files. When
-the bootstrap script is executed, those ISOs are automatically mounted and the
-system is configured to re-mount them automatically after a reboot.
+The ``networks`` field specifies a range of IP addresses written in CIDR
+notation for its various subfields.
+
+  The ``controlPlane`` and ``workloadPlane`` entries are **mandatory**.
+  These values specify the range of IP addresses that will be used at the
+  host level for each member of the cluster.
+
+  .. code-block:: yaml
+
+     networks:
+       controlPlane: 10.200.1.0/28
+       workloadPlane: 10.200.1.0/28
+
+  All nodes within the cluster **must** connect to both the control plane
+  and workload plane networks. If the same network range is chosen for both
+  the control plane and workload plane networks, then the same interface
+  may be used.
+
+  The ``pods`` and ``services`` fields are not mandatory, but can be changed
+  to match the constraints of existing networking infrastructure (for
+  example, if all or part of these default subnets is already routed).
+  If omitted, ``pods`` and ``services`` are set to the following default
+  values during installation.
+
+  For **production clusters**, we advise users to anticipate future
+  expansion and use sufficiently large networks for pods and services.
+
+  .. code-block:: yaml
+
+     networks:
+       pods: 10.233.0.0/16
+       services: 10.96.0.0/12
 
 The ``proxies`` field can be omitted if there is no proxy to configure.
 The 2 entries ``http`` and ``https`` are used to configure the containerd
 daemon proxy to fetch extra container images from outside the MetalK8s
 cluster.
 The ``no_proxy`` entry specifies IPs that should be excluded from proxying,
-it must a list of hosts, IP addresses or IP ranges in CIDR format.
+it must be a list of hosts, IP addresses or IP ranges in CIDR format.
+For example:
 
-.. todo::
+  .. code-block:: yaml
+
+     no_proxy:
+       - localhost
+       - 127.0.0.1
+       - 10.10.0.0/16
+       - 192.168.0.0/16
 
-   - Explain the role of this config file and its values
+The ``archives`` field is a list of absolute paths to MetalK8s ISO files. When
+the bootstrap script is executed, those ISOs are automatically mounted and the
+system is configured to re-mount them automatically after a reboot.
 
 .. _Bootstrap SSH Provisioning:
 
@@ -105,6 +148,8 @@ SSH Provisioning
 
       user@host $ ssh-copy-id -i /tmp/salt-bootstrap.pub root@
 
+.. _Bootstrap installation:
+
 Installation
 ------------
 
@@ -123,29 +168,18 @@ Bootstrap node.
    destination fields of IP packets to correspond to the MAC address(es)),
    :ref:`IP-in-IP needs to be enabled`.
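+
+Before validating the installation, it can help to double-check your
+configuration file against a complete example. Here is a sketch of a
+filled-in :file:`/etc/metalk8s/bootstrap.yaml` combining the fields
+described in the Configuration section above (every value, including the
+proxy URL and ISO path, is an illustrative placeholder to adapt to your
+infrastructure):
+
+.. code-block:: yaml
+
+   # All values below are illustrative placeholders.
+   networks:
+     controlPlane: 10.200.1.0/28
+     workloadPlane: 10.200.1.0/28
+     pods: 10.233.0.0/16
+     services: 10.96.0.0/12
+   proxies:
+     http: http://proxy.example.com:3128
+     https: http://proxy.example.com:3128
+     no_proxy:
+       - localhost
+       - 127.0.0.1
+   archives:
+     - /root/metalk8s.iso
+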
-Provision Storage for Prometheus Services
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-After bootstrapping the cluster, the Prometheus and AlertManager services used
-to monitor the system will not be running (their respective :term:`Pods `
-will remain in a **Pending** state), because they require persistent storage to
-be available. You can either provision these storage volumes on the bootstrap
-node, or later on other nodes joining the cluster. Templates for the required
-volumes are available in :download:`examples/prometheus-sparse.yaml
-<../../examples/prometheus-sparse.yaml>`. Note, however, these templates use
-the `sparseLoopDevice` *Volume* type, which is not suitable for production
-installations. Refer to :ref:`volume-management` for more information on how to
-provision persistent storage.
+Validate the install
+^^^^^^^^^^^^^^^^^^^^
+- Check that all :term:`Pods ` on the Bootstrap node are in the
+  **Running** state. Note that Prometheus and Alertmanager pods will remain in
+  a **Pending** state until their respective persistent storage volumes are
+  provisioned.
 
 .. note::
 
-   When deploying using Vagrant, persistent volumes for Prometheus and
-   AlertManager are already provisioned.
-
-Validate the Installation
-^^^^^^^^^^^^^^^^^^^^^^^^^
-Check if all :term:`Pods ` on the Bootstrap node are running.
-
-.. note::
+   The administrator :term:`kubeconfig` file is used to configure access to
+   Kubernetes with :term:`kubectl`, as shown below. This file contains
+   sensitive information and should be kept secure.
 
    On all subsequent :term:`kubectl` commands, you may omit the
    ``--kubeconfig`` argument if you have exported the ``KUBECONFIG``
@@ -162,42 +196,51 @@ Check if all :term:`Pods ` on the Bootstrap node are running.
 
    root@bootstrap $ kubectl get nodes --kubeconfig /etc/kubernetes/admin.conf
    NAME        STATUS   ROLES                         AGE   VERSION
-   bootstrap   Ready    bootstrap,etcd,infra,master   17m   v1.11.7
+   bootstrap   Ready    bootstrap,etcd,infra,master   17m   v1.15.5
 
    root@bootstrap $ kubectl get pods --all-namespaces -o wide --kubeconfig /etc/kubernetes/admin.conf
-   NAMESPACE             NAME                                             READY   STATUS    RESTARTS   AGE   IP              NODE        NOMINATED NODE
-   kube-system           calico-kube-controllers-b7bc4449f-6rh2q          1/1     Running   0          4m    10.233.132.65   bootstrap
-   kube-system           calico-node-r2qxs                                1/1     Running   0          4m    172.21.254.12   bootstrap
-   kube-system           coredns-7475f8d796-8h4lt                         1/1     Running   0          4m    10.233.132.67   bootstrap
-   kube-system           coredns-7475f8d796-m5zz9                         1/1     Running   0          4m    10.233.132.66   bootstrap
-   kube-system           etcd-bootstrap                                   1/1     Running   0          4m    172.21.254.12   bootstrap
-   kube-system           kube-apiserver-bootstrap                         2/2     Running   0          4m    172.21.254.12   bootstrap
-   kube-system           kube-controller-manager-bootstrap                1/1     Running   0          4m    172.21.254.12   bootstrap
-   kube-system           kube-proxy-vb74b                                 1/1     Running   0          4m    172.21.254.12   bootstrap
-   kube-system           kube-scheduler-bootstrap                         1/1     Running   0          4m    172.21.254.12   bootstrap
-   kube-system           repositories-bootstrap                           1/1     Running   0          4m    172.21.254.12   bootstrap
-   kube-system           salt-master-bootstrap                            2/2     Running   0          4m    172.21.254.12   bootstrap
-   metalk8s-ingress      nginx-ingress-controller-46lxd                   1/1     Running   0          4m    10.233.132.73   bootstrap
-   metalk8s-ingress      nginx-ingress-default-backend-5449d5b699-8bkbr   1/1     Running   0          4m    10.233.132.74   bootstrap
-   metalk8s-monitoring   alertmanager-main-0                              2/2     Running   0          4m    10.233.132.70   bootstrap
-   metalk8s-monitoring   alertmanager-main-1                              2/2     Running   0          3m    10.233.132.76   bootstrap
-   metalk8s-monitoring   alertmanager-main-2                              2/2     Running   0          3m    10.233.132.77   bootstrap
-   metalk8s-monitoring   grafana-5cb4945b7b-ltdrz                         1/1     Running   0          4m    10.233.132.71   bootstrap
-   metalk8s-monitoring   kube-state-metrics-588d699b56-d6crn              4/4     Running   0          3m    10.233.132.75   bootstrap
-   metalk8s-monitoring   node-exporter-4jdgv                              2/2     Running   0          4m    172.21.254.12   bootstrap
-   metalk8s-monitoring   prometheus-k8s-0                                 3/3     Running   1          4m    10.233.132.72   bootstrap
-   metalk8s-monitoring   prometheus-k8s-1                                 3/3     Running   1          3m    10.233.132.78   bootstrap
-   metalk8s-monitoring   prometheus-operator-64477d4bff-xxjw2             1/1     Running   0          4m    10.233.132.68   bootstrap
-
-Check that you can access the MetalK8s GUI, following
-:ref:`this procedure `.
-
-.. todo::
-
-   Troubleshooting section
-
-   - Mention ``/var/log/metalk8s-bootstrap.log`` and the command-line options
-     for verbosity.
-   - Add Salt master/minion logs, and explain how to run a specific state from
-     the Salt master.
-   - Then refer to a troubleshooting section in the installation guide.
+   NAMESPACE             NAME                                                      READY   STATUS    RESTARTS   AGE     IP               NODE        NOMINATED NODE   READINESS GATES
+   kube-system           calico-kube-controllers-7c9944c5f4-h9bsc                  1/1     Running   0          6m29s   10.233.220.129   bootstrap
+   kube-system           calico-node-v4qhb                                         1/1     Running   0          6m29s   10.200.3.152     bootstrap
+   kube-system           coredns-ff46db798-k54z9                                   1/1     Running   0          6m29s   10.233.220.134   bootstrap
+   kube-system           coredns-ff46db798-nvmjl                                   1/1     Running   0          6m29s   10.233.220.132   bootstrap
+   kube-system           etcd-bootstrap                                            1/1     Running   0          5m45s   10.200.3.152     bootstrap
+   kube-system           kube-apiserver-bootstrap                                  1/1     Running   0          5m57s   10.200.3.152     bootstrap
+   kube-system           kube-controller-manager-bootstrap                         1/1     Running   0          7m4s    10.200.3.152     bootstrap
+   kube-system           kube-proxy-n6zgk                                          1/1     Running   0          6m32s   10.200.3.152     bootstrap
+   kube-system           kube-scheduler-bootstrap                                  1/1     Running   0          7m4s    10.200.3.152     bootstrap
+   kube-system           repositories-bootstrap                                    1/1     Running   0          6m20s   10.200.3.152     bootstrap
+   kube-system           salt-master-bootstrap                                     2/2     Running   0          6m10s   10.200.3.152     bootstrap
+   kube-system           storage-operator-7567748b6d-hp7gq                         1/1     Running   0          6m6s    10.233.220.138   bootstrap
+   metalk8s-ingress      nginx-ingress-control-plane-controller-5nkkx              1/1     Running   0          6m6s    10.233.220.137   bootstrap
+   metalk8s-ingress      nginx-ingress-controller-shg7x                            1/1     Running   0          6m7s    10.233.220.135   bootstrap
+   metalk8s-ingress      nginx-ingress-default-backend-7d8898655c-jj7l6            1/1     Running   0          6m7s    10.233.220.136   bootstrap
+   metalk8s-monitoring   alertmanager-prometheus-operator-alertmanager-0           0/2     Pending   0          6m1s
+   metalk8s-monitoring   prometheus-operator-grafana-775fbb5b-sgngh                2/2     Running   0          6m17s   10.233.220.130   bootstrap
+   metalk8s-monitoring   prometheus-operator-kube-state-metrics-7587b4897c-tt79q   1/1     Running   0          6m17s   10.233.220.131   bootstrap
+   metalk8s-monitoring   prometheus-operator-operator-7446d89644-zqdlj             1/1     Running   0          6m17s   10.233.220.133   bootstrap
+   metalk8s-monitoring   prometheus-operator-prometheus-node-exporter-rb969        1/1     Running   0          6m17s   10.200.3.152     bootstrap
+   metalk8s-monitoring   prometheus-prometheus-operator-prometheus-0               0/3     Pending   0          5m50s
+   metalk8s-ui           metalk8s-ui-6f74ff4bc-fgk86                               1/1     Running   0          6m4s    10.233.220.139   bootstrap
+
+- From the console output above, :term:`Prometheus` and :term:`Alertmanager`
+  pods are in a ``Pending`` state because their persistent storage volumes
+  have not yet been provisioned. To provision them, follow
+  :ref:`this procedure <Provision Prometheus storage>`.
+
+- Check that you can access the MetalK8s GUI after the
+  :ref:`installation <Bootstrap installation>` is completed by following
+  :ref:`this procedure `.
+
+- At this stage, the MetalK8s GUI should be up and ready for you to
+  explore.
+
+  .. note::
+
+     Monitoring through the MetalK8s GUI will not be available until
+     persistent storage volumes for both Prometheus and Alertmanager have
+     been successfully provisioned.
+
+- If you encounter an error during installation or have difficulties
+  validating a fresh MetalK8s installation, visit our
+  :ref:`Troubleshooting guide <Troubleshooting Guide>`.
diff --git a/docs/installation/post-install.rst b/docs/installation/post-install.rst
index 793f56ec30..046f4beb3f 100644
--- a/docs/installation/post-install.rst
+++ b/docs/installation/post-install.rst
@@ -1,6 +1,29 @@
 Post-Installation Procedure
 ===========================
 
+.. _Provision Prometheus storage:
+
+Provision storage for Prometheus services
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+After bootstrapping the cluster, the Prometheus and AlertManager services used
+to monitor the system will not be running (their respective :term:`Pods `
+will remain in a *Pending* state), because they require persistent storage to
+be available.
+
+You can either provision these storage volumes on the bootstrap node, or
+later on other nodes joining the cluster. Templates for the required volumes
+are available in :download:`examples/prometheus-sparse.yaml
+<../../examples/prometheus-sparse.yaml>`.
+
+Note, however, that these templates use the `sparseLoopDevice` *Volume* type,
+which is not suitable for production installations. Refer to
+:ref:`volume-management` for more information on how to provision persistent
+storage.
+
+.. note::
+
+   When deploying using Vagrant, persistent volumes for Prometheus and
+   AlertManager are already provisioned.
+
 .. todo::
 
    - Explain in one sentence why it is needed
diff --git a/docs/operation/account_administration.rst b/docs/operation/account_administration.rst
index ae5c2fe227..305c62092c 100644
--- a/docs/operation/account_administration.rst
+++ b/docs/operation/account_administration.rst
@@ -51,8 +51,10 @@ perform the following procedures:
 Administering MetalK8s GUI, Kubernetes API and Salt API
 *******************************************************
 
-During installation, MetalK8s configures the Kubernetes API to accept
-authentication via OpenID Connect(OIDC).
+.. _Administering-MetalK8s-GUI-Kubernetes-API-and-Salt-API:
+
+During installation, MetalK8s configures the Kubernetes API to accept Basic
+authentication, with default credentials ``admin`` / ``admin``.
 
 Services exposed by MetalK8s, such as :ref:`its GUI ` or
diff --git a/docs/operation/index.rst b/docs/operation/index.rst
index 1e563097d6..8d31587496 100644
--- a/docs/operation/index.rst
+++ b/docs/operation/index.rst
@@ -18,3 +18,4 @@ do not have a working MetalK8s_ setup.
    changing_node_hostname
    volume_management/index
    account_administration
+   troubleshooting
diff --git a/docs/operation/troubleshooting.rst b/docs/operation/troubleshooting.rst
new file mode 100644
index 0000000000..4fe48358a5
--- /dev/null
+++ b/docs/operation/troubleshooting.rst
@@ -0,0 +1,202 @@
+
+.. _Troubleshooting Guide:
+
+Troubleshooting Guide
+^^^^^^^^^^^^^^^^^^^^^
+
+This section highlights some of the common problems users face during and
+after a MetalK8s installation. If you do not find a solution to a problem you
+are facing, please reach out to **Scality support** or create a
+`Github issue <https://github.com/scality/metalk8s/issues/new>`_.
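+
+As a first triage step for most of the problems below, check the overall
+cluster state with :term:`kubectl` (a quick sketch, not an exhaustive health
+check; the ``grep`` simply hides pods that are already in the ``Running``
+state):
+
+.. code-block:: shell
+
+   # List nodes, then every pod that is not in the Running state
+   root@bootstrap $ kubectl get nodes --kubeconfig /etc/kubernetes/admin.conf
+   root@bootstrap $ kubectl get pods --all-namespaces --kubeconfig /etc/kubernetes/admin.conf | grep -v Running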
+
+Bootstrap Installation Errors
++++++++++++++++++++++++++++++
+
+Bootstrap Installation fails with no straightforward reason
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+If you encounter a failure during a MetalK8s installation and the console
+output does not provide enough information to pinpoint the cause of the
+failure, re-run the installation with the verbose flag (``--verbose``).
+
+.. parsed-literal::
+
+   root@bootstrap $ /srv/scality/metalk8s-|release|/bootstrap.sh --verbose
+
+Errors after restarting the Bootstrap node
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+If you reboot the Bootstrap node and, for some reason, some containers
+(especially the salt-master container) refuse to start, perform the
+following checks:
+
+- Check that the **MetalK8s ISO** is mounted properly.
+
+  .. parsed-literal::
+
+     [root@bootstrap vagrant]# mount | grep /srv/scality/metalk8s-|release|
+     /home/centos/metalk8s.iso on /srv/scality/metalk8s-|release| type iso9660 (ro,relatime)
+
+- If the ISO is unmounted, run the following command, which checks the
+  status of the ISO file and remounts it automatically.
+
+  .. parsed-literal::
+
+     [root@bootstrap vagrant]# salt-call state.sls metalk8s.archives.mounted saltenv=metalk8s-|release|
+     Summary for local
+     ------------
+     Succeeded: 3
+     Failed:    0
+
+Bootstrap fails and console log is unscrollable
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+If the Bootstrap process fails during a MetalK8s installation and the console
+output is unscrollable, consult the Bootstrap logs in
+``/var/log/metalk8s-bootstrap.log``.
+
+Account Administration Errors
++++++++++++++++++++++++++++++
+
+Forgot the MetalK8s GUI password
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+If you forget the MetalK8s GUI username and/or password combination, follow
+:ref:`this procedure <Administering-MetalK8s-GUI-Kubernetes-API-and-Salt-API>`
+to reset or change it.
+
+General Kubernetes Resource Errors
+++++++++++++++++++++++++++++++++++
+
+Pod status shows "CrashLoopBackOff"
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+If, after a MetalK8s installation, you notice some Pods are in a
+"CrashLoopBackOff" state, it means the pods start up and then immediately
+exit, so Kubernetes restarts them and the cycle continues. To get possible
+clues about this error, run the following command and inspect the output.
+
+.. code-block:: shell
+
+   [root@bootstrap vagrant]# kubectl -n kube-system describe pods
+   Name:
+   Namespace:            kube-system
+   Priority:             2000000000
+   Priority Class Name:  system-cluster-critical
+
+Persistent Volume Claim (PVC) stuck in "Pending" state
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+If, after provisioning a Volume for a Pod (e.g. Prometheus), the PVC remains
+stuck in a **Pending** state, check the following:
+
+- Check that the volumes have been provisioned and are in a **Ready** state:
+
+  .. code-block:: shell
+
+     kubectl describe volume
+     [root@bootstrap vagrant]# kubectl describe volume test-volume
+     Name:
+     Status:
+       Conditions:
+         Last Transition Time:  2020-01-14T12:57:56Z
+         Last Update Time:      2020-01-14T12:57:56Z
+         Status:                True
+         Type:                  Ready
+
+- Check that a corresponding PersistentVolume exists:
+
+  .. code-block:: shell
+
+     [root@bootstrap vagrant]# kubectl get pv
+     NAME   CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   STORAGECLASS   AGE     CLAIM
+            10Gi       RWO            Retain           Bound                   4d22h
+
+- Check that the PersistentVolume matches the PersistentVolume Claim
+  constraints (size, labels, storage class) by doing the following:
+
+  - Find the name of your PersistentVolume Claim:
+
+    .. code-block:: shell
+
+       [root@bootstrap vagrant]# kubectl get pvc -n
+       NAME   STATUS   VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS   AGE
+              Bound             10Gi       RWO                           24h
+
+  - Then check whether the PersistentVolume Claim constraints match:
+
+    .. code-block:: shell
+
+       [root@bootstrap vagrant]# kubectl describe pvc -n
+       Name:
+       Namespace:
+       StorageClass:
+       Status:        Bound
+       Volume:
+       Capacity:      10Gi
+       Access Modes:  RWO
+       VolumeMode:    Filesystem
+
+- If no PersistentVolume exists, check that the storage operator is up
+  and running.
+
+  .. code-block:: shell
+
+     [root@bootstrap vagrant]# kubectl -n kube-system get deployments storage-operator
+     NAME               READY   UP-TO-DATE   AVAILABLE   AGE
+     storage-operator   1/1     1            1           4d22h
+
+Access to MetalK8s GUI fails with "undefined backend"
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+If, while using the MetalK8s GUI, you encounter an "undefined backend" error,
+perform the following checks:
+
+- Check that the Ingress pods are running:
+
+  .. code-block:: shell
+
+     [root@bootstrap vagrant]# kubectl -n metalk8s-ingress get daemonsets
+     NAME                                     DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                     AGE
+     nginx-ingress-control-plane-controller   1         1         1       1            1           node-role.kubernetes.io/master=   4d22h
+     nginx-ingress-controller                 1         1         1       1            1                                             4d22h
+
+- Check the Ingress controller logs:
+
+  .. code-block:: shell
+
+     [root@bootstrap vagrant]# kubectl logs -n metalk8s-ingress nginx-ingress-control-plane-controller-ftg6v
+     -------------------------------------------------------------------------------
+     NGINX Ingress controller
+       Release:       0.26.1
+       Build:         git-2de5a893a
+       Repository:    https://github.com/kubernetes/ingress-nginx
+       nginx version: openresty/1.15.8.2
+
+Pod and Service CIDR conflicts
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+If, after the installation of a MetalK8s cluster, you notice routing problems
+in Pod-to-Pod communication, perform the following check:
+
+- Check the configured values for the internal Pod and Service networks:
+
+  .. code-block:: shell
+
+     [root@bootstrap vagrant]# salt-call pillar.get networks
+     local:
+         ----------
+         control_plane:
+             172.21.254.0/28
+         pod:
+             10.233.0.0/16
+         service:
+             10.96.0.0/12
+         workload_plane:
+             172.21.254.32/27
+
+  Make sure the configured IP ranges (CIDR notation) do not conflict with
+  your infrastructure.
+
+.. todo::
+
+   - Add Salt master/minion logs, and explain how to run a specific state from
+     the Salt master.
+   - Add troubleshooting for networking issues.
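+
+In the meantime, Salt master logs can be inspected through :term:`kubectl`.
+The following is a minimal sketch: it assumes the Salt master container
+inside the ``salt-master-bootstrap`` pod (shown in the pod listing of the
+installation guide) is named ``salt-master``.
+
+.. code-block:: shell
+
+   # Dump the logs of the Salt master container; the container name
+   # "salt-master" is an assumption - list the pod's containers first
+   # if the name does not match.
+   root@bootstrap $ kubectl logs -n kube-system salt-master-bootstrap \
+       -c salt-master --kubeconfig /etc/kubernetes/admin.conf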